1
Temple SD, Browning SR, Thompson EA. Fast simulation of identity-by-descent segments. Bull Math Biol 2025; 87:84. [PMID: 40410602 PMCID: PMC12102126 DOI: 10.1007/s11538-025-01464-8] [Received: 01/16/2025] [Accepted: 05/08/2025] [Indexed: 05/25/2025]
Abstract
The worst-case runtime complexity to simulate haplotype segments identical by descent (IBD) is quadratic in sample size. We propose two main techniques to reduce the compute time, both of which are motivated by coalescent and recombination processes. We provide mathematical results that explain why our algorithm should outperform a naive implementation with high probability. In our experiments, we observe average compute times to simulate detectable IBD segments around a locus that scale approximately linearly in sample size and take a couple of seconds for sample sizes that are less than 10,000 diploid individuals. In contrast, we find that existing methods to simulate IBD segments take minutes to hours for sample sizes exceeding a few thousand diploid individuals. When using IBD segments to study recent positive selection around a locus, our efficient simulation algorithm makes feasible statistical inferences, e.g., parametric bootstrapping in analyses of large biobanks, that would be otherwise intractable.
Affiliation(s)
- Seth D Temple
  - Department of Statistics, University of Washington, Seattle, WA, USA
  - Department of Statistics, University of Michigan, Ann Arbor, MI, USA
  - Michigan Institute of Data Science, University of Michigan, Ann Arbor, MI, USA
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
2
Shpak M, Lawrence KN, Pool JE. The Precision and Power of Population Branch Statistics in Identifying the Genomic Signatures of Local Adaptation. Genome Biol Evol 2025; 17:evaf080. [PMID: 40326284 PMCID: PMC12095133 DOI: 10.1093/gbe/evaf080] [Received: 05/15/2024] [Revised: 04/21/2025] [Accepted: 04/29/2025] [Indexed: 05/07/2025] Open
Abstract
Population branch statistics, which estimate the degree of genetic differentiation along a focal population's lineage, have been used as an alternative to FST-based genome-wide scans for identifying loci associated with local selective sweeps. Beyond the population branch statistic (PBS), the normalized PBSn1 adjusts focal branch length with respect to outgroup branch lengths at the same locus, whereas population branch excess (PBE) incorporates median branch lengths at other loci. PBSn1 and PBE were proposed to be more specific to local selective sweeps as opposed to geographically ubiquitous selection. However, the accuracy and statistical power of branch statistics have not been systematically assessed. To do so, we simulate genomes in representative large and small populations with varying proportions of sites evolving under genetic drift or (approximated) background selection, with local selective sweeps or geographically parallel selective sweeps. We then assess the probability that local selective sweep loci are correctly identified as outliers by FST and by each of the branch statistics. We find that branch statistics consistently outperform FST at identifying local sweeps. Particularly when parallel sweeps are introduced, PBSn1 and PBE correctly identify local sweeps among their top outliers more frequently than PBS. Additionally, we evaluate versions of these statistics based on maximal site differentiation within a window, finding that site-based PBE and PBSn1 are particularly effective at identifying local soft sweeps. These results validate the greater specificity of the rescaled branch statistics PBE and PBSn1 to detect population-specific positive selection, supporting their use in genomic studies focused on local adaptation.
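The abstract names PBS and PBSn1 without stating their formulas. As a minimal sketch, assuming the commonly cited definitions (divergence time T = -log(1 - FST), PBS as half the excess of the focal branch over the outgroup pair, and PBSn1 as PBS divided by one plus the three branch lengths), the calculation for one locus could look like the following. The function and argument names are hypothetical, and this is not the authors' code; PBE is omitted because it also needs genome-wide medians.

```python
import math


def branch_length(fst):
    """Transform a pairwise FST into a divergence time T = -log(1 - FST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in [0, 1)")
    return -math.log(1.0 - fst)


def pbs(fst_focal_a, fst_focal_b, fst_a_b):
    """Branch length of the focal population in a three-population tree.

    fst_focal_a and fst_focal_b compare the focal population with
    populations A and B; fst_a_b compares A with B.
    """
    t_fa = branch_length(fst_focal_a)
    t_fb = branch_length(fst_focal_b)
    t_ab = branch_length(fst_a_b)
    return (t_fa + t_fb - t_ab) / 2.0


def pbsn1(fst_focal_a, fst_focal_b, fst_a_b):
    """Focal PBS rescaled by the total tree length at the same locus."""
    pbs_focal = pbs(fst_focal_a, fst_focal_b, fst_a_b)
    # The same formula gives the other two branches by rotating the roles.
    pbs_a = pbs(fst_focal_a, fst_a_b, fst_focal_b)
    pbs_b = pbs(fst_focal_b, fst_a_b, fst_focal_a)
    return pbs_focal / (1.0 + pbs_focal + pbs_a + pbs_b)
```

Under these assumed definitions, a locus where the focal population is strongly differentiated from both other populations, while those two remain similar to each other, gets a long focal branch; dividing by the total tree length damps loci where all three branches are long.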
Affiliation(s)
- Max Shpak
  - Laboratory of Genetics, University of Wisconsin–Madison, Madison, WI, USA
- Kadee N Lawrence
  - Laboratory of Genetics, University of Wisconsin–Madison, Madison, WI, USA
- John E Pool
  - Laboratory of Genetics, University of Wisconsin–Madison, Madison, WI, USA
3
Arnab SP, Campelo dos Santos AL, Fumagalli M, DeGiorgio M. Efficient Detection and Characterization of Targets of Natural Selection Using Transfer Learning. Mol Biol Evol 2025; 42:msaf094. [PMID: 40341942 PMCID: PMC12062966 DOI: 10.1093/molbev/msaf094] [Received: 11/04/2024] [Revised: 04/16/2025] [Accepted: 04/17/2025] [Indexed: 05/11/2025] Open
Abstract
Natural selection leaves detectable patterns of altered spatial diversity within genomes, and identifying affected regions is crucial for understanding species evolution. Recently, machine learning approaches applied to raw population genomic data have been developed to uncover these adaptive signatures. Convolutional neural networks (CNNs) are particularly effective for this task, as they handle large data arrays while maintaining element correlations. However, shallow CNNs may miss complex patterns due to their limited capacity, while deep CNNs can capture these patterns but require extensive data and computational power. Transfer learning addresses these challenges by utilizing a deep CNN pretrained on a large dataset as a feature extraction tool for downstream classification and evolutionary parameter prediction. This approach reduces extensive training data generation requirements and computational needs while maintaining high performance. In this study, we developed TrIdent, a tool that uses transfer learning to enhance detection of adaptive genomic regions from image representations of multilocus variation. We evaluated TrIdent across various genetic, demographic, and adaptive settings, in addition to unphased data and other confounding factors. TrIdent demonstrated improved detection of adaptive regions compared to recent methods using similar data representations. We further explored model interpretability through class activation maps and adapted TrIdent to infer selection parameters for identified adaptive candidates. Using whole-genome haplotype data from European and African populations, TrIdent effectively recapitulated known sweep candidates and identified novel cancer, and other disease-associated genes as potential sweeps.
Affiliation(s)
- Sandipan Paul Arnab
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
- Matteo Fumagalli
  - School of Biological and Behavioural Sciences, Queen Mary University of London, London, UK
  - The Alan Turing Institute, London, UK
- Michael DeGiorgio
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
4
Herzig AF, Rubinacci S, Marenne G, Perdry H, Deleuze JF, Dina C, Barc J, Redon R, Delaneau O, Génin E. SURFBAT: a surrogate family based association test building on large imputation reference panels. G3 (Bethesda) 2025; 15:jkae287. [PMID: 39657733 PMCID: PMC12005154 DOI: 10.1093/g3journal/jkae287] [Received: 09/06/2024] [Revised: 11/07/2024] [Accepted: 11/29/2024] [Indexed: 12/12/2024]
Abstract
Genotype-phenotype association tests are typically adjusted for population stratification using principal components that are estimated genome-wide. This lacks resolution when analyzing populations with fine structure and/or individuals with fine levels of admixture. This can affect power and precision, and is a particularly relevant consideration when control individuals are recruited using geographic selection criteria. Such is the case in France where we have recently created reference panels of individuals anchored to different geographic regions. To make correct comparisons against case groups, who would likely be gathered from large urban areas, new methods are needed. We present SURFBAT (a surrogate family based association test), which performs an approximation of the transmission-disequilibrium test. Our method hinges on the application of genotype imputation algorithms to match similar haplotypes between the case and control groups. This permits us to approximate local ancestry informed posterior probabilities of un-transmitted parental alleles of each case individual. This is achieved by assuming haplotypes from the imputation panel are well-matched for ancestry with the case individuals. When the first haplotype of an individual from the imputation panel matches that of a case individual, it is assumed that the second haplotype of the same reference individual can be used as a locally ancestry matched control haplotype and to approximately impute un-transmitted parental alleles. SURFBAT provides an association test that is inherently robust to fine-scale population stratification and opens up the possibility of efficiently using large imputation reference panels as control groups for association testing. In contrast to other methods for association testing that incorporate local-ancestry inference, SURFBAT does not require a set of ancestry groups to be defined, nor for local ancestry to be explicitly estimated. 
We demonstrate the interest of our tool on simulated datasets, as well as on a real-data example for a group of case individuals affected by Brugada syndrome.
Affiliation(s)
- Anthony F Herzig
  - Inserm, Université de Bretagne-Occidentale, EFS, UMR 1078, GGB, Brest F-29200, France
- Simone Rubinacci
  - Institute for Molecular Medicine Finland, University of Helsinki, Helsinki 00290, Finland
- Gaëlle Marenne
  - Inserm, Université de Bretagne-Occidentale, EFS, UMR 1078, GGB, Brest F-29200, France
- Hervé Perdry
  - CESP Inserm U1018, Université Paris-Saclay, Villejuif F-94807, France
- Jean-François Deleuze
  - Université Paris-Saclay, CEA, Centre National de Recherche en Génomique Humaine (CNRGH), Evry F-91000, France
  - CEPH, Fondation Jean Dausset, Paris F-75010, France
- Christian Dina
  - Nantes Université, CNRS, INSERM UMR 1087, L’Institut du Thorax, Nantes F-44000, France
- Julien Barc
  - Nantes Université, CNRS, INSERM UMR 1087, L’Institut du Thorax, Nantes F-44000, France
- Richard Redon
  - Nantes Université, CNRS, INSERM UMR 1087, L’Institut du Thorax, Nantes F-44000, France
- Emmanuelle Génin
  - Inserm, Université de Bretagne-Occidentale, EFS, UMR 1078, GGB, Brest F-29200, France
  - CHU Brest, Brest F-29200, France
5
Strütt S, Excoffier L, Peischl S. A generalized structured coalescent for purifying selection without recombination. Genetics 2025; 229:iyaf013. [PMID: 39862229 DOI: 10.1093/genetics/iyaf013] [Received: 11/07/2024] [Revised: 12/18/2024] [Accepted: 12/30/2024] [Indexed: 01/27/2025] Open
Abstract
Purifying selection is a critical factor in shaping genetic diversity. Current theoretical models mostly address scenarios of either very weak or strong selection, leaving a significant gap in our knowledge. The effects of purifying selection on patterns of genomic diversity remain poorly understood when selection against deleterious mutations is weak to moderate, particularly when recombination is limited or absent. In this study, we extend an existing approach, the fitness-class coalescent, to incorporate arbitrary levels of purifying selection in haploid populations. This model offers a comprehensive framework for exploring the influence of purifying selection in a wide range of demographic scenarios. Moreover, our research reveals potential sources of qualitative and quantitative biases in demographic inference, highlighting the significant risk of attributing genetic patterns to past demographic events rather than purifying selection. This work expands our understanding of the complex interplay between selection, drift, and population dynamics, and how purifying selection distorts demographic inference.
Affiliation(s)
- Stefan Strütt
  - Interfaculty Bioinformatics Unit, University of Bern, Baltzerstrasse 6, Bern 3012, Switzerland
  - Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, Baltzerstrasse 6, Bern 3012, Switzerland
- Laurent Excoffier
  - Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, Baltzerstrasse 6, Bern 3012, Switzerland
- Stephan Peischl
  - Interfaculty Bioinformatics Unit, University of Bern, Baltzerstrasse 6, Bern 3012, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
6
Fan C, Cahoon JL, Dinh BL, Ortega-Del Vecchyo D, Huber CD, Edge MD, Mancuso N, Chiang CWK. A likelihood-based framework for demographic inference from genealogical trees. Nat Genet 2025; 57:865-874. [PMID: 40113903 DOI: 10.1038/s41588-025-02129-x] [Received: 10/23/2023] [Accepted: 02/14/2025] [Indexed: 03/22/2025]
Abstract
The demographic history of a population underlies patterns of genetic variation and is encoded in the gene-genealogical trees of the sampled haplotypes. Here we propose a demographic inference framework called the genealogical likelihood (gLike). Our method uses a graph-based structure to summarize the relationships among all lineages in a gene-genealogical tree with all possible trajectories of population memberships through time and derives the full likelihood across trees under a parameterized demographic model. We show through simulations and empirical applications that for populations that have experienced multiple admixtures, gLike can accurately estimate dozens of demographic parameters, including ancestral population sizes, admixture timing and admixture proportions, and it outperforms conventional demographic inference methods using the site frequency spectrum. Taken together, our proposed gLike framework harnesses underused genealogical information to offer high sensitivity and accuracy in inferring complex demographies for humans and other species.
Affiliation(s)
- Caoqi Fan
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Jordan L Cahoon
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
  - Department of Computer Science, University of Southern California, Los Angeles, CA, USA
- Bryan L Dinh
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Diego Ortega-Del Vecchyo
  - Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Querétaro, México
- Christian D Huber
  - Department of Biology, Penn State University, University Park, PA, USA
- Michael D Edge
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Nicholas Mancuso
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Charleston W K Chiang
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
7
Gower G, Pope NS, Rodrigues MF, Tittes S, Tran LN, Alam O, Cavassim MIA, Fields PD, Haller BC, Huang X, Jeffrey B, Korfmann K, Kyriazis CC, Min J, Rebollo I, Rehmann CT, Small ST, Smith CCR, Tsambos G, Wong Y, Zhang Y, Huber CD, Gorjanc G, Ragsdale AP, Gronau I, Gutenkunst RN, Kelleher J, Lohmueller KE, Schrider DR, Ralph PL, Kern AD. Accessible, realistic genome simulation with selection using stdpopsim. bioRxiv 2025:2025.03.23.644823. [PMID: 40166307 PMCID: PMC11957135 DOI: 10.1101/2025.03.23.644823] [Indexed: 04/02/2025]
Abstract
Selection is a fundamental evolutionary force that shapes patterns of genetic variation across species. However, simulations incorporating realistic selection along heterogeneous genomes in complex demographic histories are challenging, limiting our ability to benchmark statistical methods aimed at detecting selection and to explore theoretical predictions. stdpopsim is a community-maintained simulation library that already provides an extensive catalog of species-specific population genetic models. Here we present a major extension to the stdpopsim framework that enables simulation of various modes of selection, including background selection, selective sweeps, and arbitrary distributions of fitness effects (DFE) acting on annotated subsets of the genome (for instance, exons). This extension maintains stdpopsim's core principles of reproducibility and accessibility while adding support for species-specific genomic annotations and published DFE estimates. We demonstrate the utility of this framework by benchmarking methods for demographic inference, DFE estimation, and selective sweep detection across several species and scenarios. Our results demonstrate the robustness of demographic inference methods to selection on linked sites, reveal the sensitivity of DFE-inference methods to model assumptions, and show how genomic features, like recombination rate and functional sequence density, influence power to detect selective sweeps. This extension to stdpopsim provides a powerful new resource for the population genetics community to explore the interplay between selection and other evolutionary forces in a reproducible, low-barrier framework.
Affiliation(s)
- Graham Gower
  - Section for Molecular Ecology and Evolution, Globe Institute, University of Copenhagen, Denmark
- Nathaniel S Pope
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97402, USA
- Murillo F Rodrigues
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97402, USA
  - Division of Genetics, Oregon National Primate Center, Oregon Health and Science University, Beaverton, OR, 97006, USA
- Silas Tittes
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97402, USA
- Linh N Tran
  - Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ, 85721, USA
- Ornob Alam
  - Center for Genomics & Systems Biology, New York University, New York, NY, 10003, USA
- Maria Izabel A Cavassim
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, CA, USA
- Benjamin C Haller
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Xin Huang
  - Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
  - Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
- Ben Jeffrey
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Kevin Korfmann
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97402, USA
- Christopher C Kyriazis
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, CA, USA
- Jiseon Min
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97402, USA
- Inés Rebollo
  - Department of Agronomy and Plant Genetics, University of Minnesota, St. Paul, MN, 55108, USA
- Clara T Rehmann
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97402, USA
- Scott T Small
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97402, USA
- Chris C R Smith
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97402, USA
- Georgia Tsambos
  - Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Yu Zhang
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Christian D Huber
  - Department of Biology, Pennsylvania State University, University Park, PA, 16802, USA
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Aaron P Ragsdale
  - Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI, USA
- Ilan Gronau
  - Efi Arazi School of Computer Science, Reichman University, Herzliya 4610101, Israel
- Ryan N Gutenkunst
  - Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ, 85721, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Kirk E Lohmueller
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, CA, USA
- Daniel R Schrider
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Peter L Ralph
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97402, USA
  - Department of Data Science, University of Oregon, Eugene, OR, 97402, USA
- Andrew D Kern
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97402, USA
8
Shastry V, Berg JJ. Allele ages provide limited information about the strength of negative selection. Genetics 2025; 229:iyae211. [PMID: 39698825 PMCID: PMC11912868 DOI: 10.1093/genetics/iyae211] [Received: 08/19/2024] [Accepted: 12/12/2024] [Indexed: 12/20/2024] Open
Abstract
For many problems in population genetics, it is useful to characterize the distribution of fitness effects (DFE) of de novo mutations among a certain class of sites. A DFE is typically estimated by fitting an observed site frequency spectrum (SFS) to an expected SFS given a hypothesized distribution of selection coefficients and demographic history. The development of tools to infer gene trees from haplotype alignments, along with ancient DNA resources, provides us with additional information about the frequency trajectories of segregating mutations. Here, we ask how useful this additional information is for learning about the DFE, using the joint distribution on allele frequency and age to summarize information about the trajectory. To this end, we introduce an accurate and efficient numerical method for computing the density on the age of a segregating variant found at a given sample frequency, given the strength of selection and an arbitrarily complex population size history. We then use this framework to show that the unconditional age distribution of negatively selected alleles is very closely approximated by reweighting the neutral age distribution in terms of the negatively selected SFS, suggesting that allele ages provide little information about the DFE beyond that already contained in the present day frequency. To confirm this prediction, we extended the standard Poisson random field method to incorporate the joint distribution of frequency and age in estimating selection coefficients, and test its performance using simulations. We find that when the full SFS is observed and the true allele ages are known, including ages in the estimation provides only small increases in the accuracy of estimated selection coefficients. However, if only sites with frequencies above a certain threshold are observed, then the true ages can provide substantial information about the selection coefficients, especially when the selection coefficient is large. 
When ages are estimated from haplotype data using state-of-the-art tools, uncertainty about the age abrogates most of the additional information in the fully observed SFS case, while the neutral prior assumed in these tools when estimating ages induces a downward bias in the case of the thresholded SFS.
Affiliation(s)
- Vivaswat Shastry
  - Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL 60637, USA
- Jeremy J Berg
  - Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
9
Arnab SP, Dos Santos ALC, Fumagalli M, DeGiorgio M. Efficient detection and characterization of targets of natural selection using transfer learning. bioRxiv 2025:2025.03.05.641710. [PMID: 40093065 PMCID: PMC11908262 DOI: 10.1101/2025.03.05.641710] [Indexed: 03/19/2025]
Abstract
Natural selection leaves detectable patterns of altered spatial diversity within genomes, and identifying affected regions is crucial for understanding species evolution. Recently, machine learning approaches applied to raw population genomic data have been developed to uncover these adaptive signatures. Convolutional neural networks (CNNs) are particularly effective for this task, as they handle large data arrays while maintaining element correlations. However, shallow CNNs may miss complex patterns due to their limited capacity, while deep CNNs can capture these patterns but require extensive data and computational power. Transfer learning addresses these challenges by utilizing a deep CNN pre-trained on a large dataset as a feature extraction tool for downstream classification and evolutionary parameter prediction. This approach reduces extensive training data generation requirements and computational needs while maintaining high performance. In this study, we developed TrIdent, a tool that uses transfer learning to enhance detection of adaptive genomic regions from image representations of multilocus variation. We evaluated TrIdent across various genetic, demographic, and adaptive settings, in addition to unphased data and other confounding factors. TrIdent demonstrated improved detection of adaptive regions compared to recent methods using similar data representations. We further explored model interpretability through class activation maps and adapted TrIdent to infer selection parameters for identified adaptive candidates. Using whole-genome haplotype data from European and African populations, TrIdent effectively recapitulated known sweep candidates and identified novel cancer, and other disease-associated genes as potential sweeps.
Affiliation(s)
- Sandipan Paul Arnab
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
- Matteo Fumagalli
  - School of Biological and Behavioural Sciences, Queen Mary University of London, London, UK
  - The Alan Turing Institute, London, UK
- Michael DeGiorgio
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
10
Lehmann B, Lee H, Anderson-Trocmé L, Kelleher J, Gorjanc G, Ralph PL. On ARGs, pedigrees, and genetic relatedness matrices. bioRxiv 2025:2025.03.03.641310. [PMID: 40093116 PMCID: PMC11908205 DOI: 10.1101/2025.03.03.641310] [Indexed: 03/19/2025]
Abstract
Genetic relatedness is a central concept in genetics, underpinning studies of population and quantitative genetics in human, animal, and plant settings. It is typically stored as a genetic relatedness matrix (GRM), whose elements are pairwise relatedness values between individuals. This relatedness has been defined in various contexts based on pedigree, genotype, phylogeny, coalescent times, and, recently, ancestral recombination graph (ARG). ARG-based GRMs have been found to better capture the structure of a population and improve association studies relative to the genotype GRM. However, calculating GRMs and further operations with them is fundamentally challenging due to inherent quadratic time and space complexity. Here, we first discuss the different definitions of relatedness in a unifying context, making use of the additive model of a quantitative trait to provide a definition of "branch relatedness" and the corresponding "branch GRM". We explore the relationship between branch relatedness and pedigree relatedness through a case study of French-Canadian individuals that have a known pedigree. Through the tree sequence encoding of an ARG, we then derive an efficient algorithm for computing products between the branch GRM and a general vector, without explicitly forming the branch GRM. This algorithm leverages the sparse encoding of genomes with the tree sequence and hence enables large-scale computations with the branch GRM. We demonstrate the power of this algorithm by developing a randomized principal components algorithm for tree sequences that easily scales to millions of genomes. All algorithms are implemented in the open source tskit Python package. Taken together, this work consolidates the different notions of relatedness as branch relatedness and by leveraging the tree sequence encoding of an ARG it provides efficient algorithms that enable computations with the branch GRM that scale to mega-scale genomic datasets.
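The idea of multiplying a GRM by a vector without ever building the n-by-n matrix can be illustrated on the simpler genotype GRM. The sketch below is not the tree-sequence algorithm described in the abstract and does not use tskit; it assumes a hypothetical centred genotype matrix Z (n individuals by m sites) and the definition G = Z Zᵀ / m, and computes G v as Z (Zᵀ v) / m. All names are illustrative.

```python
def grm_vector_product(genotypes, v):
    """Return (Z Z^T / m) v without forming the n-by-n matrix Z Z^T.

    genotypes: list of n rows, each with m centred genotype values.
    v: list of n numbers.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Step 1: w = Z^T v, one entry per site (length m).
    w = [sum(genotypes[i][j] * v[i] for i in range(n)) for j in range(m)]
    # Step 2: Z w / m, one entry per individual (length n).
    return [sum(genotypes[i][j] * w[j] for j in range(m)) / m for i in range(n)]


def explicit_grm(genotypes):
    """Form the full n-by-n genotype GRM, for comparison only."""
    n = len(genotypes)
    m = len(genotypes[0])
    return [
        [sum(genotypes[i][k] * genotypes[j][k] for k in range(m)) / m
         for j in range(n)]
        for i in range(n)
    ]
```

The two-step product costs on the order of n·m operations and stores only vectors, whereas forming the matrix first costs on the order of n²·m operations and n² storage. That gap is the quadratic barrier the abstract refers to.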
Affiliation(s)
- Brieuc Lehmann
  - Department of Statistical Science, University College London, WC1E 7HB, UK
- Hanbin Lee
  - Department of Statistics, University of Michigan, Ann Arbor MI 48109, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, UK
- Peter L Ralph
  - Institute of Ecology and Evolution, University of Oregon, Eugene OR 97402, USA
  - Department of Data Science, University of Oregon, Eugene OR 97402, USA
11
Haag J, Jordan AI, Stamatakis A. Pandora: a tool to estimate dimensionality reduction stability of genotype data. Bioinform Adv 2025; 5:vbaf040. [PMID: 40160475 PMCID: PMC11955236 DOI: 10.1093/bioadv/vbaf040] [Received: 02/24/2025] [Accepted: 02/27/2025] [Indexed: 04/02/2025]
Abstract
Motivation: Genotype datasets typically contain a large number of single-nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual's origin or membership to a population, dimensionality reduction techniques are routinely deployed. However, inherent (technical) difficulties such as missing or noisy data need to be accounted for when analyzing a lower dimensional representation of genotype data, and the intrinsic uncertainty of such analyses should be reported in all studies. However, to date, there exists no stability assessment technique for genotype data that can estimate this uncertainty. Results: Here, we present Pandora, a stability estimation framework for genotype data based on bootstrapping. Pandora computes an overall score to quantify the stability of the entire embedding, infers per-individual support values, and also deploys a k-means clustering approach to assess the uncertainty of assignments to potential cultural groups. Using published empirical and simulated datasets, we demonstrate the usage and utility of Pandora for studies that rely on dimensionality reduction techniques. Availability and implementation: Pandora is available on GitHub: https://github.com/tschuelia/Pandora.
Affiliation(s)
- Julia Haag
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
| | - Alexander I Jordan
- Computational Statistics Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
- Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology—Hellas, Heraklion, Crete 70013, Greece
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe 76131, Germany
|
12
|
Pivirotto A, Peles N, Hey J. Allele age estimators designed for whole genome datasets show only a moderate reduction in performance when applied to whole exome datasets. bioRxiv 2025:2024.02.01.578465. [PMID: 38370640 PMCID: PMC10871225 DOI: 10.1101/2024.02.01.578465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Personalized genomics in the healthcare system is becoming increasingly accessible as the cost of sequencing decreases. With the increase in the number of genomes, larger numbers of rare variants are being discovered, leading to important initiatives in identifying the functional impacts in relation to disease phenotypes. One way to characterize these variants is to estimate the time the mutation entered the population. However, allele age estimators such as those implemented in the programs Relate, Genealogical Estimator of Variant Age (GEVA), and Runtc were developed based on the assumption that datasets include the entire genome. We examined the performance of each of these estimators on simulated exome data under a neutral constant population size model, as well as under population expansion and background selection models. We found that each provides usable estimates of allele age from whole-exome datasets. Relate performs the best of the three estimators, with Pearson coefficients of 0.83 and 0.73 (with respect to true simulated values, for the neutral constant and expansion population models, respectively) and with a 12 percent and 20 percent decrease in correlation between whole-genome and whole-exome estimations. Of the three estimators, Relate is best able to parallelize to yield quick results with few resources; however, Relate is currently only able to scale to thousands of samples, making it unable to match the hundreds of thousands of samples currently being released. While more work is needed to expand the capabilities of current methods of estimating allele age, these methods show a modest decrease in performance in the estimation of the age of mutations.
Affiliation(s)
- Alyssa Pivirotto
- Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA USA
| | - Noah Peles
- Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA USA
| | - Jody Hey
- Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA USA
|
13
|
Fritze H, Pope N, Kelleher J, Ralph P. A forest is more than its trees: haplotypes and ancestral recombination graphs. bioRxiv 2025:2024.11.30.626138. [PMID: 40060605 PMCID: PMC11888177 DOI: 10.1101/2024.11.30.626138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2025]
Abstract
Foreshadowing haplotype-based methods of the genomics era, it is an old observation that the "junction" between two distinct haplotypes produced by recombination is inherited as a Mendelian marker. In a genealogical context, this recombination-mediated information reflects the persistence of ancestral haplotypes across local genealogical trees in which they do not represent coalescences. We show how these non-coalescing haplotypes ("locally-unary nodes") may be inserted into ancestral recombination graphs (ARGs), a compact but information-rich data structure describing the genealogical relationships among recombinant sequences. The resulting ARGs are smaller, faster to compute with, and the additional ancestral information that is inserted is nearly always correct where the initial ARG is correct. We provide efficient algorithms to infer locally-unary nodes within existing ARGs, and explore some consequences for ARGs inferred from real data. To do this, we introduce new metrics of agreement and disagreement between ARGs that, unlike previous methods, consider ARGs as describing relationships between haplotypes rather than just a collection of trees.
Affiliation(s)
- Halley Fritze
- Department of Mathematics, University of Oregon, Eugene, Oregon
| | - Nathaniel Pope
- Institute of Evolution and Ecology and Department of Biology, University of Oregon, Eugene, Oregon
| | - Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford
| | - Peter Ralph
- Institute of Evolution and Ecology and Department of Biology, University of Oregon, Eugene, Oregon
- Department of Mathematics, University of Oregon, Eugene, Oregon
- Department of Data Science, University of Oregon, Eugene, Oregon
|
14
|
DeHaas D, Wei X. IGD: A simple, efficient genotype data format. bioRxiv 2025:2025.02.05.636549. [PMID: 39974956 PMCID: PMC11838554 DOI: 10.1101/2025.02.05.636549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Motivation: While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limited. A file format that is simple to understand and implement, yet fast and small, is helpful for research on highly scalable bioinformatics. Results: We present the Indexable Genotype Data (IGD) file format, a simple uncompressed binary format that can be more than 100 times faster and 3.5 times smaller than vcf.gz on Biobank-scale whole-genome sequence data. The implementation for reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format. Availability: A C++ library for reading and writing IGD, and tooling to convert .vcf.gz files, can be found at https://github.com/aprilweilab/picovcf. A Python library is at https://github.com/aprilweilab/pyigd.
Affiliation(s)
- Drew DeHaas
- Department of Computational Biology, Cornell University, Ithaca, NY
| | - Xinzhu Wei
- Department of Computational Biology, Cornell University, Ithaca, NY
|
15
|
Mowlaei ME, Li C, Jamialahmadi O, Dias R, Chen J, Jamialahmadi B, Rebbeck TR, Carnevale V, Kumar S, Shi X. STICI: Split-Transformer with integrated convolutions for genotype imputation. Nat Commun 2025; 16:1218. [PMID: 39890780 PMCID: PMC11785734 DOI: 10.1038/s41467-025-56273-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Accepted: 01/08/2025] [Indexed: 02/03/2025] Open
Abstract
Despite advances in sequencing technologies, genome-scale datasets often contain missing bases and genomic segments, hindering downstream analyses. Genotype imputation addresses this issue and has been a cornerstone pre-processing step in genetic and genomic studies. Although various methods have been widely adopted for genotype imputation, it remains challenging to impute certain genomic regions and large structural variants. Here, we present a transformer-based framework, named STICI, for accurate genotype imputation. STICI models automatically learn genome-wide patterns of linkage disequilibrium, evidenced by much higher imputation accuracy in regions with highly linked variants. Our imputation results on the human 1000 Genomes Project and non-human genomes show that STICI can achieve high imputation accuracy comparable to the state-of-the-art genotype imputation methods, with the additional capability to impute multi-allelic variants and various types of genetic variants. STICI can be trained for any collection of genomes automatically using self-supervision. Moreover, STICI shows excellent performance without needing any special presuppositions about the underlying patterns in collections of non-human genomes, pointing to adaptability and applications of STICI to impute missing genotypes in any species.
Affiliation(s)
- Mohammad Erfan Mowlaei
- Computer & Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA, USA
| | - Chong Li
- Computer & Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA, USA
| | - Oveis Jamialahmadi
- Department of Molecular and Clinical Medicine, Institute of Medicine, Sahlgrenska Academy, Wallenberg Laboratory, University of Gothenburg, Gothenburg, Sweden
| | - Raquel Dias
- Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
| | - Junjie Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Benyamin Jamialahmadi
- David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
| | - Timothy Richard Rebbeck
- Division of Population Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Vincenzo Carnevale
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
- Institute for Computational Molecular Science, Temple University, Philadelphia, PA, USA
| | - Sudhir Kumar
- Computer & Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA, USA
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
- Department of Biology, Temple University, Philadelphia, PA, USA
| | - Xinghua Shi
- Computer & Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA, USA.
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA.
|
16
|
Dabi A, Schrider DR. Population size rescaling significantly biases outcomes of forward-in-time population genetic simulations. Genetics 2025; 229:1-57. [PMID: 39503241 PMCID: PMC11708920 DOI: 10.1093/genetics/iyae180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Accepted: 10/18/2024] [Indexed: 11/13/2024] Open
Abstract
Simulations are an essential tool in all areas of population genetic research, used in tasks such as the validation of theoretical analysis and the study of complex evolutionary models. Forward-in-time simulations are especially flexible, allowing for various types of natural selection, complex genetic architectures, and non-Wright-Fisher dynamics. However, their intense computational requirements can be prohibitive to simulating large populations and genomes. A popular method to alleviate this burden is to scale down the population size by some scaling factor while scaling up the mutation rate, selection coefficients, and recombination rate by the same factor. However, this rescaling approach may in some cases bias simulation results. To investigate the manner and degree to which rescaling impacts simulation outcomes, we carried out simulations with different demographic histories and distributions of fitness effects using several values of the rescaling factor, Q, and compared the deviation of key outcomes (fixation times, allele frequencies, linkage disequilibrium, and the fraction of mutations that fix during the simulation) between the scaled and unscaled simulations. Our results indicate that scaling introduces substantial biases to each of these measured outcomes, even at small values of Q. Moreover, the nature of these effects depends on the evolutionary model and scaling factor being examined. While increasing the scaling factor tends to increase the observed biases, this relationship is not always straightforward; thus, it may be difficult to know the impact of scaling on simulation outcomes a priori. However, it appears that for most models, only a small number of replicates was needed to accurately quantify the bias produced by rescaling for a given Q. 
In summary, while rescaling forward-in-time simulations may be necessary in many cases, researchers should be aware of the rescaling procedure's impact on simulation outcomes and consider investigating its magnitude in smaller scale simulations of the desired model(s) before selecting an appropriate value of Q.
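The rescaling procedure the abstract describes (divide the population size by Q, multiply the mutation rate, recombination rate, and selection coefficients by Q) can be written out as a small worked example. This is a minimal sketch under stated assumptions: the `SimParams` container and `rescale` function are hypothetical names, not part of any simulator's API, and the numeric values are illustrative only.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class SimParams:
    """Hypothetical container for forward-simulation parameters."""
    pop_size: int
    mutation_rate: float       # per site per generation
    recombination_rate: float  # per site per generation
    selection_coeff: float

def rescale(params: SimParams, q: float) -> SimParams:
    """Scale N down by q and scale mu, r, and s up by q.

    The products N*mu, N*r, and N*s are (approximately) preserved,
    which is the rationale for the procedure; the abstract reports
    that simulation outcomes can still be biased.
    """
    if q < 1:
        raise ValueError("rescaling factor q must be >= 1")
    return SimParams(
        pop_size=max(1, round(params.pop_size / q)),
        mutation_rate=params.mutation_rate * q,
        recombination_rate=params.recombination_rate * q,
        selection_coeff=params.selection_coeff * q,
    )

# Illustrative values only.
original = SimParams(pop_size=10_000, mutation_rate=1e-8,
                     recombination_rate=1e-8, selection_coeff=0.01)
scaled = rescale(original, q=10)
```

With Q = 10 the scaled selection coefficient here is already 0.1, which shows how quickly rescaling can move per-generation parameters out of the regime the unscaled model describes.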
Affiliation(s)
- Amjad Dabi
- Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Daniel R Schrider
- Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
|
17
|
Temple SD, Browning SR, Thompson EA. Fast simulation of identity-by-descent segments. bioRxiv 2025:2024.12.13.628449. [PMID: 39829821 PMCID: PMC11741331 DOI: 10.1101/2024.12.13.628449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2025]
Abstract
The worst-case runtime complexity to simulate haplotype segments identical by descent (IBD) is quadratic in sample size. We propose two main techniques to reduce the compute time, both of which are motivated by coalescent and recombination processes. We provide mathematical results that explain why our algorithm should outperform a naive implementation with high probability. In our experiments, we observe average compute times to simulate detectable IBD segments around a locus that scale approximately linearly in sample size and take a couple of seconds for sample sizes that are less than ten thousand diploid individuals. In contrast, we find that existing methods to simulate IBD segments take minutes to hours for sample sizes exceeding a few thousand diploid individuals. When using IBD segments to study recent positive selection around a locus, our efficient simulation algorithm makes feasible statistical inferences, e.g., parametric bootstrapping in analyses of large biobanks, that would be otherwise intractable.
Affiliation(s)
- Seth D. Temple
- Department of Statistics, University of Washington, Seattle, WA, USA
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA
- Michigan Institute of Data Science, University of Michigan, Ann Arbor, MI, USA
|
18
|
Zhao H, Alachiotis N. Data preprocessing methods for selective sweep detection using convolutional neural networks. Methods 2025; 233:19-29. [PMID: 39550020 DOI: 10.1016/j.ymeth.2024.11.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2024] [Revised: 10/28/2024] [Accepted: 11/04/2024] [Indexed: 11/18/2024] Open
Abstract
The identification of positive selection has been framed as a classification task, with Convolutional Neural Networks (CNNs) already outperforming summary statistics and likelihood-based approaches in accuracy. Despite the prevalence of CNN-based methods that manipulate the pixels of images representing raw genomic data as a preprocessing step to improve classification accuracy, the efficacy of these pixel-rearrangement techniques remains inadequately examined, particularly in the presence of confounding factors like population bottlenecks, migration, and recombination hotspots. We introduce a set of pixel rearrangement algorithms aimed at enhancing CNN classification accuracy in detecting selective sweeps. These algorithms are employed to assess the performance of four CNN models for selective sweep detection. Our findings illustrate that the judicious application of rearrangement algorithms notably enhances the overall classification accuracy of a CNN across various datasets simulating confounding factors. We observed that sorting the columns of the genomic matrices has a greater effect on CNN performance than rearranging the sequences. To some extent, these rearrangement algorithms are more robust to misspecified demographic models than the default preprocessing algorithm suggested by the respective authors of each CNN architecture. We provide the data rearrangement algorithms as a distinct package available for download at: https://github.com/Zhaohq96/Genetic-data-rearrangement.
Affiliation(s)
- Hanqing Zhao
- University of Twente, Drienerlolaan 5, Enschede, 7522 NB, Overijssel, the Netherlands.
| | - Nikolaos Alachiotis
- University of Twente, Drienerlolaan 5, Enschede, 7522 NB, Overijssel, the Netherlands.
|
19
|
Osmond M, Coop G. Estimating dispersal rates and locating genetic ancestors with genome-wide genealogies. eLife 2024; 13:e72177. [PMID: 39589398 DOI: 10.7554/elife.72177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2021] [Accepted: 11/24/2024] [Indexed: 11/27/2024] Open
Abstract
Spatial patterns in genetic diversity are shaped by individuals dispersing from their parents and larger-scale population movements. It has long been appreciated that these patterns of movement shape the underlying genealogies along the genome, leading to geographic patterns of isolation-by-distance in contemporary population genetic data. However, extracting the enormous amount of information contained in genealogies along recombining sequences has, until recently, not been computationally feasible. Here, we capitalize on important recent advances in genome-wide gene-genealogy reconstruction and develop methods to use thousands of trees to estimate per-generation dispersal rates and to locate the genetic ancestors of a sample back through time. We take a likelihood approach in continuous space using a simple approximate model (branching Brownian motion) as our prior distribution of spatial genealogies. After testing our method with simulations, we apply it to Arabidopsis thaliana. We estimate a dispersal rate of roughly 60 km²/generation, slightly higher across latitude than across longitude, potentially reflecting a northward post-glacial expansion. Locating ancestors allows us to visualize major geographic movements, alternative geographic histories, and admixture. Our method highlights the huge amount of information about past dispersal events and population movements contained in genome-wide genealogies.
Affiliation(s)
- Matthew Osmond
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
| | - Graham Coop
- Department of Evolution & Ecology and Center for Population Biology, University of California, Davis, Davis, United States
|
20
|
Temple SD, Waples RK, Browning SR. Modeling recent positive selection using identity-by-descent segments. Am J Hum Genet 2024; 111:2510-2529. [PMID: 39362217 PMCID: PMC11568764 DOI: 10.1016/j.ajhg.2024.08.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 08/29/2024] [Accepted: 08/30/2024] [Indexed: 10/05/2024] Open
Abstract
Recent positive selection can result in an excess of long identity-by-descent (IBD) haplotype segments overlapping a locus. The statistical methods that we propose here address three major objectives in studying selective sweeps: scanning for regions of interest, identifying possible sweeping alleles, and estimating a selection coefficient s. First, we implement a selection scan to locate regions with excess IBD rates. Second, we estimate the allele frequency and location of an unknown sweeping allele by aggregating over variants that are more abundant in an inferred outgroup with excess IBD rate versus the rest of the sample. Third, we propose an estimator for the selection coefficient and quantify uncertainty using the parametric bootstrap. Comparing against state-of-the-art methods in extensive simulations, we show that our methods are more precise at estimating s when s≥0.015. We also show that our 95% confidence intervals contain s in nearly 95% of our simulations. We apply these methods to study positive selection in European ancestry samples from the Trans-Omics for Precision Medicine project. We analyze eight loci where IBD rates are more than four standard deviations above the genome-wide median, including LCT where the maximum IBD rate is 35 standard deviations above the genome-wide median. Overall, we present robust and accurate approaches to study recent adaptive evolution without knowing the identity of the causal allele or using time series data.
Affiliation(s)
- Seth D Temple
- Department of Statistics, University of Washington, Seattle, WA, USA.
| | - Ryan K Waples
- Department of Biostatistics, University of Washington, Seattle, WA, USA
| | - Sharon R Browning
- Department of Biostatistics, University of Washington, Seattle, WA, USA.
|
21
|
Nieuwoudt C, Farooq FB, Brooks-Wilson A, Bureau A, Graham J. Statistics to prioritize rare variants in family-based sequencing studies with disease subtypes. Genet Epidemiol 2024; 48:324-343. [PMID: 38940260 DOI: 10.1002/gepi.22579] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 03/26/2024] [Accepted: 06/13/2024] [Indexed: 06/29/2024]
Abstract
Family-based sequencing studies are increasingly used to find rare genetic variants of high risk for disease traits with familial clustering. In some studies, families with multiple disease subtypes are collected and the exomes of affected relatives are sequenced for shared rare variants (RVs). Since different families can harbor different causal variants and each family harbors many RVs, tests to detect causal variants can have low power in this study design. Our goal is rather to prioritize shared variants for further investigation by, for example, pathway analyses or functional studies. The transmission-disequilibrium test prioritizes variants based on departures from Mendelian transmission in parent-child trios. Extending this idea to families, we propose methods to prioritize RVs shared in affected relatives with two disease subtypes, with one subtype more heritable than the other. Global approaches condition on a variant being observed in the study and assume a known probability of carrying a causal variant. In contrast, local approaches condition on a variant being observed in specific families to eliminate the carrier probability. Our simulation results indicate that global approaches are robust to misspecification of the carrier probability and prioritize more effectively than local approaches even when the carrier probability is misspecified.
Affiliation(s)
- Christina Nieuwoudt
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia, Canada
| | - Fabiha Binte Farooq
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia, Canada
| | - Angela Brooks-Wilson
- Department of Biomedical Physiology and Kinesiology, Simon Fraser University, Burnaby, British Columbia, Canada
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Alexandre Bureau
- Département de Médecine Sociale et Préventive, Université Laval, Québec City, Québec, Canada
- Centre de recherche CERVO, Québec City, Québec, Canada
| | - Jinko Graham
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia, Canada
|
22
|
Wei Y, Zhi D, Zhang S. Fast and accurate local ancestry inference with Recomb-Mix. bioRxiv 2024:2023.11.17.567650. [PMID: 38014185 PMCID: PMC10680832 DOI: 10.1101/2023.11.17.567650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
The availability of large genotyped cohorts brings new opportunities for revealing the high-resolution genetic structure of admixed populations via local ancestry inference (LAI), the process of identifying the ancestry of each segment of an individual haplotype. Though current methods achieve high accuracy in standard cases, LAI is still challenging when reference populations are more similar (e.g., intra-continental), when the reference populations are too numerous, or when the admixture events are deep in time, all of which are increasingly unavoidable in large biobanks. Here, we present a new LAI method, Recomb-Mix. Recomb-Mix integrates elements of existing methods based on the site-based Li and Stephens model and introduces a new graph collapsing trick to simplify counting paths with the same ancestry label readout. Through comprehensive benchmarking on various simulated datasets, we show that Recomb-Mix is more accurate than existing methods in diverse sets of scenarios while being competitive in terms of resource efficiency. We expect that Recomb-Mix will be a useful method for advancing genetics studies of admixed populations.
Affiliation(s)
- Yuan Wei
- Department of Computer Science, University of Central Florida, Orlando, FL, USA
| | - Degui Zhi
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Shaojie Zhang
- Department of Computer Science, University of Central Florida, Orlando, FL, USA
|
23
|
Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. Genetics 2024; 228:iyae100. [PMID: 39013109 PMCID: PMC11373519 DOI: 10.1093/genetics/iyae100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Accepted: 06/05/2024] [Indexed: 07/18/2024] Open
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. However, this approach is out of step with some modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalizes these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
Affiliation(s)
- Yan Wong
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
| | - Anastasia Ignatieva
- School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8TA, UK
- Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
| | - Jere Koskela
- School of Mathematics, Statistics and Physics, Newcastle University, Newcastle NE1 7RU, UK
- Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
| | - Gregor Gorjanc
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
| | - Anthony W Wohns
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5101, USA
| | - Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
|
24
|
Williams MP, Flegontov P, Maier R, Huber CD. Testing times: disentangling admixture histories in recent and complex demographies using ancient DNA. Genetics 2024; 228:iyae110. [PMID: 39013011 PMCID: PMC11373510 DOI: 10.1093/genetics/iyae110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Revised: 04/08/2024] [Accepted: 06/11/2024] [Indexed: 07/18/2024] Open
Abstract
Our knowledge of human evolutionary history has been greatly advanced by paleogenomics. Since the 2020s, the study of ancient DNA has increasingly focused on reconstructing the recent past. However, the accuracy of paleogenomic methods in resolving questions of historical and archaeological importance amidst the increased demographic complexity and decreased genetic differentiation remains an open question. We evaluated the performance and behavior of two commonly used methods, qpAdm and the f3-statistic, on admixture inference under a diversity of demographic models and data conditions. We performed two complementary simulation approaches-firstly exploring a wide demographic parameter space under four simple demographic models of varying complexities and configurations using branch-length data from two chromosomes-and secondly, we analyzed a model of Eurasian history composed of 59 populations using whole-genome data modified with ancient DNA conditions such as SNP ascertainment, data missingness, and pseudohaploidization. We observe that population differentiation is the primary factor driving qpAdm performance. Notably, while complex gene flow histories influence which models are classified as plausible, they do not reduce overall performance. Under conditions reflective of the historical period, qpAdm most frequently identifies the true model as plausible among a small candidate set of closely related populations. To increase the utility for resolving fine-scaled hypotheses, we provide a heuristic for further distinguishing between candidate models that incorporates qpAdm model P-values and f3-statistics. Finally, we demonstrate a significant performance increase for qpAdm using whole-genome branch-length f2-statistics, highlighting the potential for improved demographic inference that could be achieved with future advancements in f-statistic estimations.
Affiliation(s)
- Matthew P Williams
- Department of Biology, Pennsylvania State University, University Park, PA 16802, USA
| | - Pavel Flegontov
- Department of Biology and Ecology, University of Ostrava, Ostrava 701 03, Czechia
- Department of Human Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
| | - Robert Maier
- Department of Human Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
| | - Christian D Huber
- Department of Biology, Pennsylvania State University, University Park, PA 16802, USA
|
25
|
Dabi A, Schrider DR. Population size rescaling significantly biases outcomes of forward-in-time population genetic simulations. bioRxiv 2024:2024.04.07.588318. [PMID: 38645049 PMCID: PMC11030438 DOI: 10.1101/2024.04.07.588318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/23/2024]
Abstract
Simulations are an essential tool in all areas of population genetic research, used in tasks such as the validation of theoretical analysis and the study of complex evolutionary models. Forward-in-time simulations are especially flexible, allowing for various types of natural selection, complex genetic architectures, and non-Wright-Fisher dynamics. However, their intense computational requirements can be prohibitive to simulating large populations and genomes. A popular method to alleviate this burden is to scale down the population size by some scaling factor while scaling up the mutation rate, selection coefficients, and recombination rate by the same factor. However, this rescaling approach may in some cases bias simulation results. To investigate the manner and degree to which rescaling impacts simulation outcomes, we carried out simulations with different demographic histories and distributions of fitness effects using several values of the rescaling factor, Q, and compared the deviation of key outcomes (fixation times, allele frequencies, linkage disequilibrium, and the fraction of mutations that fix during the simulation) between the scaled and unscaled simulations. Our results indicate that scaling introduces substantial biases to each of these measured outcomes, even at small values of Q. Moreover, the nature of these effects depends on the evolutionary model and scaling factor being examined. While increasing the scaling factor tends to increase the observed biases, this relationship is not always straightforward, so it may be difficult to know the impact of scaling on simulation outcomes a priori. However, it appears that for most models, only a small number of replicates was needed to accurately quantify the bias produced by rescaling for a given Q. In summary, while rescaling forward-in-time simulations may be necessary in many cases, researchers should be aware of the rescaling procedure's impact on simulation outcomes and consider investigating its magnitude in smaller scale simulations of the desired model(s) before selecting an appropriate value of Q.
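The rescaling procedure described in this abstract can be written out as simple arithmetic. The sketch below is an illustration only: the function name, parameter names, and numeric values are hypothetical, and it covers just the four quantities the abstract mentions (population size divided by Q; mutation rate, recombination rate, and selection coefficient multiplied by Q).

```python
def rescale_parameters(pop_size, mutation_rate, recombination_rate,
                       selection_coeff, q):
    """Return simulation parameters rescaled by a factor q >= 1.

    Population size is divided by q; per-generation rates and the
    selection coefficient are multiplied by q, so products such as
    N * mu, N * r and N * s stay (approximately) unchanged.
    """
    if q < 1:
        raise ValueError("rescaling factor q must be >= 1")
    return {
        "pop_size": round(pop_size / q),
        "mutation_rate": mutation_rate * q,
        "recombination_rate": recombination_rate * q,
        "selection_coeff": selection_coeff * q,
    }


# Hypothetical unscaled parameters (illustrative values only).
unscaled = {
    "pop_size": 10_000,
    "mutation_rate": 1e-8,
    "recombination_rate": 1e-8,
    "selection_coeff": 0.001,
}
scaled = rescale_parameters(**unscaled, q=10)
print(scaled["pop_size"])  # 1000
```

Keeping the compound parameters constant is what motivates the approach. The abstract's point is that the simulated outcomes can still differ between scaled and unscaled runs.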
Affiliation(s)
- Amjad Dabi
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA
| | - Daniel R. Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA
|
26
|
Xu P, Liang S, Hahn A, Zhao V, Lo WT, Haller BC, Sobkowiak B, Chitwood MH, Colijn C, Cohen T, Rhee KY, Messer PW, Wells MT, Clark AG, Kim J. e3SIM: epidemiological-ecological-evolutionary simulation framework for genomic epidemiology. bioRxiv 2024:2024.06.29.601123. [PMID: 39005464 PMCID: PMC11244936 DOI: 10.1101/2024.06.29.601123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Infectious disease dynamics are driven by the complex interplay of epidemiological, ecological, and evolutionary processes. Accurately modeling these interactions is crucial for understanding pathogen spread and informing public health strategies. However, existing simulators often fail to capture the dynamic interplay between these processes, resulting in oversimplified models that do not fully reflect real-world complexities in which the pathogen's genetic evolution dynamically influences disease transmission. We introduce the epidemiological-ecological-evolutionary simulator (e3SIM), an open-source framework that concurrently models the transmission dynamics and molecular evolution of pathogens within a host population while integrating environmental factors. Using an agent-based, discrete-generation, forward-in-time approach, e3SIM incorporates compartmental models, host-population contact networks, and quantitative-trait models for pathogens. This integration allows for realistic simulations of disease spread and pathogen evolution. Key features include a modular and scalable design, flexibility in modeling various epidemiological and population-genetic complexities, incorporation of time-varying environmental factors, and a user-friendly graphical interface. We demonstrate e3SIM's capabilities through simulations of realistic outbreak scenarios with SARS-CoV-2 and Mycobacterium tuberculosis, illustrating its flexibility for studying the genomic epidemiology of diverse pathogen types.
Affiliation(s)
- Peiyu Xu
- Department of Molecular Biology & Genetics, Cornell University, Ithaca, NY, USA
| | - Shenni Liang
- Department of Computational Science, Cornell University, Ithaca, NY, USA
| | - Andrew Hahn
- Department of Computational Science, Cornell University, Ithaca, NY, USA
| | - Vivian Zhao
- Department of Computational Science, Cornell University, Ithaca, NY, USA
| | - Wai Tung ‘Jack’ Lo
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
| | - Benjamin C. Haller
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
| | - Benjamin Sobkowiak
- Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
| | - Melanie H. Chitwood
- Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
| | - Caroline Colijn
- Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
| | - Ted Cohen
- Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
| | - Kyu Y. Rhee
- Department of Medicine, Weill Cornell Medicine, New York, NY, USA
| | - Philipp W. Messer
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
| | - Martin T. Wells
- Department of Statistics and Data Science, Cornell University, Ithaca, NY, USA
| | - Andrew G. Clark
- Department of Molecular Biology & Genetics, Cornell University, Ithaca, NY, USA
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
| | - Jaehee Kim
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
|
27
|
Aktürk Ş, Mapelli I, Güler MN, Gürün K, Katırcıoğlu B, Vural KB, Sağlıcan E, Çetin M, Yaka R, Sürer E, Atağ G, Çokoğlu SS, Sevkar A, Altınışık NE, Koptekin D, Somel M. Benchmarking kinship estimation tools for ancient genomes using pedigree simulations. Mol Ecol Resour 2024; 24:e13960. [PMID: 38676702 DOI: 10.1111/1755-0998.13960] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2023] [Revised: 03/19/2024] [Accepted: 03/28/2024] [Indexed: 04/29/2024]
Abstract
There is growing interest in uncovering genetic kinship patterns in past societies using low-coverage palaeogenomes. Here, we benchmark four tools for kinship estimation with such data: lcMLkin, NgsRelate, KIN, and READ, which differ in their input, IBD estimation methods, and statistical approaches. We used pedigree and ancient genome sequence simulations to evaluate these tools when only a limited number (1 to 50 K, with minor allele frequency ≥0.01) of shared SNPs are available. The performance of all four tools was comparable using ≥20 K SNPs. We found that first-degree related pairs can be accurately classified even with 1 K SNPs, with 85% F1 scores using READ and 96% using NgsRelate or lcMLkin. Distinguishing third-degree relatives from unrelated pairs or second-degree relatives was also possible with high accuracy (F1 > 90%) with 5 K SNPs using NgsRelate and lcMLkin, while READ and KIN showed lower success (69 and 79% respectively). Meanwhile, noise in population allele frequencies and inbreeding (first-cousin mating) led to deviations in kinship coefficients, with different sensitivities across tools. We conclude that using multiple tools in parallel might be an effective approach to achieve robust estimates on ultra-low-coverage genomes.
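As background for the relatedness degrees discussed in this abstract, the sketch below lists the expected kinship coefficients for outbred pairs (1/4, 1/8, and 1/16 for first-, second-, and third-degree relatives) and assigns an estimate to the nearest expected value. The nearest-value rule and the function name are assumptions made for illustration; none of the four benchmarked tools is claimed to classify pairs this way.

```python
# Expected kinship coefficients for outbred pairs: 1 / 2**(degree + 1).
EXPECTED_KINSHIP = {
    "first-degree": 0.25,
    "second-degree": 0.125,
    "third-degree": 0.0625,
    "unrelated": 0.0,
}


def classify_kinship(theta):
    """Assign an estimated kinship coefficient to the nearest expected value.

    Hypothetical rule for illustration; real tools use their own
    statistics, thresholds, and uncertainty estimates.
    """
    return min(EXPECTED_KINSHIP,
               key=lambda label: abs(EXPECTED_KINSHIP[label] - theta))


for estimate in (0.24, 0.11, 0.07, 0.01):
    print(estimate, classify_kinship(estimate))
```

The gap between adjacent expected values halves with each degree, which is consistent with the abstract's finding that third-degree relatives need more SNPs to separate from neighbouring classes than first-degree relatives do.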
Affiliation(s)
- Şevval Aktürk
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
| | - Igor Mapelli
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
| | - Merve N Güler
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
| | - Kanat Gürün
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
| | - Büşra Katırcıoğlu
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
| | - Kıvılcım Başak Vural
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
| | - Ekin Sağlıcan
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
| | - Mehmet Çetin
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
| | - Reyhan Yaka
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Centre for Palaeogenetics, Stockholm, Sweden
- Department of Archaeology and Classical Studies, Stockholm University, Stockholm, Sweden
| | - Elif Sürer
- Department of Modeling and Simulation, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
| | - Gözde Atağ
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
| | - Sevim Seda Çokoğlu
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
| | - Arda Sevkar
- Department of Anthropology, Hacettepe University, Ankara, Turkey
| | - N Ezgi Altınışık
- Department of Anthropology, Hacettepe University, Ankara, Turkey
| | - Dilek Koptekin
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
| | - Mehmet Somel
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
|
28
|
Tagami D, Bisschop G, Kelleher J. tstrait: a quantitative trait simulator for ancestral recombination graphs. Bioinformatics 2024; 40:btae334. [PMID: 38796683 PMCID: PMC11784591 DOI: 10.1093/bioinformatics/btae334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/24/2024] [Indexed: 05/28/2024] Open
Abstract
Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators, users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait).
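To make the idea of quantitative trait simulation concrete, here is a minimal sketch of the standard additive model with a target narrow-sense heritability, written in plain Python on an explicit genotype matrix. It does not use the tstrait API and does not operate on ARGs; the function name, genotypes, and effect sizes are hypothetical.

```python
import random
import statistics


def simulate_phenotypes(genotypes, effect_sizes, h2, seed=1):
    """Additive trait: phenotype = genetic value + environmental noise.

    The noise variance is chosen so that Var(G) / (Var(G) + Var(E))
    equals the target heritability h2 (0 < h2 <= 1) in expectation.
    """
    if not 0 < h2 <= 1:
        raise ValueError("h2 must be in (0, 1]")
    rng = random.Random(seed)
    genetic = [sum(g * b for g, b in zip(row, effect_sizes))
               for row in genotypes]
    var_g = statistics.pvariance(genetic)
    sd_e = (var_g * (1 - h2) / h2) ** 0.5
    phenotypes = [g + rng.gauss(0.0, sd_e) for g in genetic]
    return genetic, phenotypes


# Hypothetical data: 500 individuals, 5 causal sites, genotypes in {0, 1, 2}.
data_rng = random.Random(42)
genotypes = [[data_rng.choice((0, 1, 2)) for _ in range(5)]
             for _ in range(500)]
effect_sizes = [0.5, -0.3, 0.8, 0.1, -0.6]

genetic, phenotypes = simulate_phenotypes(genotypes, effect_sizes, h2=0.5)
print(len(phenotypes))  # 500
```

The abstract's argument is that building such a genotype matrix becomes prohibitive at biobank scale, which is the motivation for computing genetic values directly on the ARG instead.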
Affiliation(s)
- Daiki Tagami
- Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom
| | - Gertjan Bisschop
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom
| | - Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, United Kingdom
|
29
|
Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. bioRxiv 2024:2023.11.03.565466. [PMID: 37961279 PMCID: PMC10635123 DOI: 10.1101/2023.11.03.565466] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Indexed: 11/15/2023]
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. This approach is out of step with modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalises these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
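The "genomes and their intervals of genetic inheritance" view in this abstract can be illustrated with a small data structure: each edge records that a child genome inherits a half-open genomic interval from a parent genome. The sketch below is an assumption-laden toy (the `Edge` type, node numbering, and coordinates are all hypothetical) and is not the paper's formal definition or any library's API.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """Child genome inherits the interval [left, right) from parent genome."""
    left: float
    right: float
    parent: int
    child: int


def local_tree(edges, position):
    """Return {child: parent} for the genealogy at one genomic position."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


# Hypothetical ARG on a genome of length 10: samples 0-2, ancestors 3-5.
# Genome 1 inherits [0, 5) from ancestor 3 and [5, 10) from ancestor 4,
# so the local tree changes at position 5.
edges = [
    Edge(0, 10, parent=3, child=0),
    Edge(0, 5, parent=3, child=1),
    Edge(5, 10, parent=4, child=1),
    Edge(0, 10, parent=4, child=2),
    Edge(0, 10, parent=5, child=3),
    Edge(0, 10, parent=5, child=4),
]

print(local_tree(edges, 2.0))
print(local_tree(edges, 7.0))
```

No node in this toy is labelled as a coalescence or recombination event. The change in genome 1's parent across position 5 is carried entirely by the intervals on its two edges.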
Affiliation(s)
- Yan Wong
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
| | - Anastasia Ignatieva
- School of Mathematics and Statistics, University of Glasgow, UK
- Department of Statistics, University of Oxford, UK
| | - Jere Koskela
- School of Mathematics, Statistics and Physics, Newcastle University, UK
- Department of Statistics, University of Warwick, UK
| | - Gregor Gorjanc
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, UK
| | - Anthony W. Wohns
- Broad Institute of MIT and Harvard, Cambridge, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, USA
| | - Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
|
30
|
Murga-Moreno J, Casillas S, Barbadilla A, Uricchio L, Enard D. An efficient and robust ABC approach to infer the rate and strength of adaptation. G3 (Bethesda) 2024; 14:jkae031. [PMID: 38365205 PMCID: PMC11090462 DOI: 10.1093/g3journal/jkae031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 10/10/2023] [Accepted: 01/29/2024] [Indexed: 02/18/2024]
Abstract
Inferring the effects of positive selection on genomes remains a critical step in characterizing the ultimate and proximate causes of adaptation across species, and quantifying positive selection remains a challenge due to the confounding effects of many other evolutionary processes. Robust and efficient approaches for adaptation inference could help characterize the rate and strength of adaptation in nonmodel species for which demographic history, mutational processes, and recombination patterns are not currently well-described. Here, we introduce an efficient and user-friendly extension of the McDonald-Kreitman test (ABC-MK) for quantifying long-term protein adaptation in specific lineages of interest. We characterize the performance of our approach with forward simulations and find that it is robust to many demographic perturbations and positive selection configurations, demonstrating its suitability for applications to nonmodel genomes. We apply ABC-MK to the human proteome and a set of known virus interacting proteins (VIPs) to test the long-term adaptation in genes interacting with viruses. We find substantially stronger signatures of positive selection on RNA-VIPs than DNA-VIPs, suggesting that RNA viruses may be an important driver of human adaptation over deep evolutionary time scales.
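For context on the McDonald-Kreitman framework that ABC-MK extends, the sketch below computes the classical point estimate of the proportion of adaptive nonsynonymous substitutions, α = 1 − (Ds · Pn) / (Dn · Ps). This is the simple estimator, not the ABC-MK method described in the abstract, and the counts are hypothetical.

```python
def mk_alpha(dn, ds, pn, ps):
    """Classical MK estimate of the adaptive fraction of substitutions.

    dn, ds: nonsynonymous and synonymous divergence counts.
    pn, ps: nonsynonymous and synonymous polymorphism counts.
    """
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero")
    return 1 - (ds * pn) / (dn * ps)


# Hypothetical counts (illustrative values only).
alpha = mk_alpha(dn=100, ds=200, pn=50, ps=200)
print(alpha)  # 0.5
```

With these counts the nonsynonymous-to-synonymous ratio is 0.5 for divergence and 0.25 for polymorphism, so half of the nonsynonymous substitutions are attributed to adaptation. The simple estimator is known to be sensitive to weakly deleterious polymorphism and demography.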
Affiliation(s)
- Jesús Murga-Moreno
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85719, USA
| | - Sònia Casillas
- Department of Genetics and Microbiology, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
- Institute of Biotechnology and Biomedicine, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
| | - Antonio Barbadilla
- Department of Genetics and Microbiology, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
- Institute of Biotechnology and Biomedicine, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
| | | | - David Enard
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85719, USA
|
31
|
Riley R, Mathieson I, Mathieson S. Interpreting generative adversarial networks to infer natural selection from genetic data. Genetics 2024; 226:iyae024. [PMID: 38386895 PMCID: PMC10990424 DOI: 10.1093/genetics/iyae024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2023] [Revised: 01/15/2024] [Accepted: 01/19/2024] [Indexed: 02/24/2024] Open
Abstract
Understanding natural selection and other forms of non-neutrality is a major focus for the use of machine learning in population genetics. Existing methods rely on computationally intensive simulated training data. Unlike efficient neutral coalescent simulations for demographic inference, realistic simulations of selection typically require slow forward simulations. Because there are many possible modes of selection, a high dimensional parameter space must be explored, with no guarantee that the simulated models are close to the real processes. Finally, it is difficult to interpret trained neural networks, leading to a lack of understanding about what features contribute to classification. Here we develop a new approach to detect selection and other local evolutionary processes that requires relatively few selection simulations during training. We build upon a generative adversarial network trained to simulate realistic neutral data. This consists of a generator (fitted demographic model), and a discriminator (convolutional neural network) that predicts whether a genomic region is real or fake. As the generator can only generate data under neutral demographic processes, regions of real data that the discriminator recognizes as having a high probability of being "real" do not fit the neutral demographic model and are therefore candidates for targets of selection. To incentivize identification of a specific mode of selection, we fine-tune the discriminator with a small number of custom non-neutral simulations. We show that this approach has high power to detect various forms of selection in simulations, and that it finds regions under positive selection identified by state-of-the-art population genetic methods in three human populations. Finally, we show how to interpret the trained networks by clustering hidden units of the discriminator based on their correlation patterns with known summary statistics.
Affiliation(s)
- Rebecca Riley
- Department of Computer Science, Haverford College, Haverford, PA 19041, USA
| | - Iain Mathieson
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Sara Mathieson
- Department of Computer Science, Haverford College, Haverford, PA 19041, USA
|
32
|
Johnson OL, Tobler R, Schmidt JM, Huber CD. Population genetic simulation: Benchmarking frameworks for non-standard models of natural selection. Mol Ecol Resour 2024; 24:e13930. [PMID: 38247258 PMCID: PMC10932895 DOI: 10.1111/1755-0998.13930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 12/21/2023] [Accepted: 01/09/2024] [Indexed: 01/23/2024]
Abstract
Population genetic simulation has emerged as a common tool for investigating increasingly complex evolutionary and demographic models. Software capable of handling high-level model complexity has recently been developed, and the advancement of tree sequence recording now allows simulations to merge the efficiency and genealogical insight of coalescent simulations with the flexibility of forward simulations. However, frameworks utilizing these features have not yet been compared and benchmarked. Here, we evaluate various simulation workflows using the coalescent simulator msprime and the forward simulator SLiM, to assess resource efficiency and determine an optimal simulation framework. Three aspects were evaluated: (1) the burn-in, to establish an equilibrium level of neutral diversity in the population; (2) the forward simulation, in which temporally fluctuating selection is acting; and (3) the final computation of summary statistics. We provide typical memory and computation time requirements for each step. We find that the fastest framework, a combination of coalescent and forward simulation with tree sequence recording, increases simulation speed by over twenty times compared to classical forward simulations without tree sequence recording, although it does require six times more memory. Overall, using efficient simulation workflows can lead to a substantial improvement when modelling complex evolutionary scenarios, although the optimal framework ultimately depends on the available computational resources.
Affiliation(s)
| | - Raymond Tobler
- Evolution of Cultural Diversity Initiative, The Australian National University, Australia
| | - Joshua M. Schmidt
- Department of Ophthalmology, College of Medicine and Public Health, Flinders University, Australia
| | - Christian D. Huber
- School of Biological Sciences, University of Adelaide, Australia
- Department of Biology, Pennsylvania State University, University Park, PA, USA
|
33
|
Song H, Chu J, Li W, Li X, Fang L, Han J, Zhao S, Ma Y. A Novel Approach Utilizing Domain Adversarial Neural Networks for the Detection and Classification of Selective Sweeps. Adv Sci (Weinh) 2024; 11:e2304842. [PMID: 38308186 PMCID: PMC11005742 DOI: 10.1002/advs.202304842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Revised: 01/10/2024] [Indexed: 02/04/2024]
Abstract
The identification and classification of selective sweeps are of great significance for improving the understanding of biological evolution and exploring opportunities for precision medicine and genetic improvement. Here, a domain adaptation sweep detection and classification (DASDC) method is presented to balance the alignment of two domains and the classification performance through a domain-adversarial neural network and its adversarial learning modules. DASDC effectively addresses the issue of mismatch between training data and real genomic data in deep learning models, leading to a significant improvement in its generalization capability, prediction robustness, and accuracy. The DASDC method demonstrates improved identification performance compared to existing methods and excels in classification performance, particularly in scenarios where there is a mismatch between application data and training data. The successful implementation of DASDC in real data of three distinct species highlights its potential as a useful tool for identifying crucial functional genes and investigating adaptive evolutionary mechanisms, particularly with the increasing availability of genomic data.
Affiliation(s)
- Hui Song
- Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
| | - Jinyu Chu
- Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
| | - Wangjiao Li
- Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
| | - Xinyun Li
- Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Hubei Hongshan Laboratory, Wuhan 430070, China
| | - Lingzhao Fang
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus 8000, Denmark
| | - Jianlin Han
- Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- CAAS-ILRI Joint Laboratory on Livestock and Forage Genetic Resources, Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), Beijing 100193, China
- Livestock Genetics Program, International Livestock Research Institute (ILRI), Nairobi 00100, Kenya
| | - Shuhong Zhao
- Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Hubei Hongshan Laboratory, Wuhan 430070, China
- Lingnan Modern Agricultural Science and Technology Guangdong Laboratory, Guangzhou 510642, China
| | - Yunlong Ma
- Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Hubei Hongshan Laboratory, Wuhan 430070, China
- Lingnan Modern Agricultural Science and Technology Guangdong Laboratory, Guangzhou 510642, China
|
34
|
Tagami D, Bisschop G, Kelleher J. tstrait: a quantitative trait simulator for ancestral recombination graphs. bioRxiv 2024:2024.03.13.584790. [PMID: 38559118 PMCID: PMC10980058 DOI: 10.1101/2024.03.13.584790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators, users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait). Contact: daiki.tagami@hertford.ox.ac.uk.
Affiliation(s)
- Daiki Tagami
- Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford OX1 3LB, United Kingdom
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
| | - Gertjan Bisschop
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
| | - Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
|
35
|
Ray DD, Flagel L, Schrider DR. IntroUNET: Identifying introgressed alleles via semantic segmentation. PLoS Genet 2024; 20:e1010657. [PMID: 38377104 PMCID: PMC10906877 DOI: 10.1371/journal.pgen.1010657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Revised: 03/01/2024] [Accepted: 01/29/2024] [Indexed: 02/22/2024] Open
Abstract
A growing body of evidence suggests that gene flow between closely related species is a widespread phenomenon. Alleles that introgress from one species into a close relative are typically neutral or deleterious, but sometimes confer a significant fitness advantage. Given the potential relevance to speciation and adaptation, numerous methods have therefore been devised to identify regions of the genome that have experienced introgression. Recently, supervised machine learning approaches have been shown to be highly effective for detecting introgression. One especially promising approach is to treat population genetic inference as an image classification problem, and feed an image representation of a population genetic alignment as input to a deep neural network that distinguishes among evolutionary models (i.e. introgression or no introgression). However, if we wish to investigate the full extent and fitness effects of introgression, merely identifying genomic regions in a population genetic alignment that harbor introgressed loci is insufficient; ideally we would be able to infer precisely which individuals have introgressed material and at which positions in the genome. Here we adapt a deep learning algorithm for semantic segmentation, the task of correctly identifying the type of object to which each individual pixel in an image belongs, to the task of identifying introgressed alleles. Our trained neural network is thus able to infer, for each individual in a two-population alignment, which of that individual's alleles were introgressed from the other population. We use simulated data to show that this approach is highly accurate, and that it can be readily extended to identify alleles that are introgressed from an unsampled "ghost" population, performing comparably to a supervised learning method tailored specifically to that task. Finally, we apply this method to data from Drosophila, showing that it is able to accurately recover introgressed haplotypes from real data. This analysis reveals that introgressed alleles are typically confined to lower frequencies within genic regions, suggestive of purifying selection, but are found at much higher frequencies in a region previously shown to be affected by adaptive introgression. Our method's success in recovering introgressed haplotypes in challenging real-world scenarios underscores the utility of deep learning approaches for making richer evolutionary inferences from genomic data.
Affiliation(s)
- Dylan D. Ray
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Lex Flagel
  - Division of Data Science, Gencove Inc., New York, New York, United States of America
  - Department of Plant and Microbial Biology, University of Minnesota, Saint Paul, Minnesota, United States of America
- Daniel R. Schrider
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America

36
Link V, Schraiber JG, Fan C, Dinh B, Mancuso N, Chiang CWK, Edge MD. Tree-based QTL mapping with expected local genetic relatedness matrices. Am J Hum Genet 2023; 110:2077-2091. [PMID: 38065072 PMCID: PMC10716520 DOI: 10.1016/j.ajhg.2023.10.017] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 04/08/2023] [Revised: 10/26/2023] [Accepted: 10/27/2023] [Indexed: 12/18/2023]
Abstract
Understanding the genetic basis of complex phenotypes is a central pursuit of genetics. Genome-wide association studies (GWASs) are a powerful way to find genetic loci associated with phenotypes. GWASs are widely and successfully used, but they face challenges related to the fact that variants are tested for association with a phenotype independently, whereas in reality variants at different sites are correlated because of their shared evolutionary history. One way to model this shared history is through the ancestral recombination graph (ARG), which encodes a series of local coalescent trees. Recent computational and methodological breakthroughs have made it feasible to estimate approximate ARGs from large-scale samples. Here, we explore the potential of an ARG-based approach to quantitative-trait locus (QTL) mapping, echoing existing variance-components approaches. We propose a framework that relies on the conditional expectation of a local genetic relatedness matrix (local eGRM) given the ARG. Simulations show that our method is especially beneficial for finding QTLs in the presence of allelic heterogeneity. By framing QTL mapping in terms of the estimated ARG, we can also facilitate the detection of QTLs in understudied populations. We use local eGRM to analyze two chromosomes containing known body size loci in a sample of Native Hawaiians. Our investigations can provide intuition about the benefits of using estimated ARGs in population- and statistical-genetic methods in general.
Affiliation(s)
- Vivian Link
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Joshua G Schraiber
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Caoqi Fan
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Bryan Dinh
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Nicholas Mancuso
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Charleston W K Chiang
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Michael D Edge
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.

37
Schrider DR. Allelic gene conversion softens selective sweeps. bioRxiv 2023:2023.12.05.570141. [PMID: 38106127 PMCID: PMC10723294 DOI: 10.1101/2023.12.05.570141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
The prominence of positive selection, in which beneficial mutations are favored by natural selection and rapidly increase in frequency, is a subject of intense debate. Positive selection can result in selective sweeps, in which the haplotype(s) bearing the adaptive allele "sweep" through the population, thereby removing much of the genetic diversity from the region surrounding the target of selection. Two models of selective sweeps have been proposed: classical sweeps, or "hard sweeps", in which a single copy of the adaptive allele sweeps to fixation, and "soft sweeps", in which multiple distinct copies of the adaptive allele leave descendants after the sweep. Soft sweeps can be the outcome of recurrent mutation to the adaptive allele, or the presence of standing genetic variation consisting of multiple copies of the adaptive allele prior to the onset of selection. Importantly, soft sweeps will be common when populations can rapidly adapt to novel selective pressures, either because of a high mutation rate or because adaptive alleles are already present. The prevalence of soft sweeps is especially controversial, and it has been noted that selection on standing variation or recurrent mutations may not always produce soft sweeps. Here, we show that the inverse is true: selection on single-origin de novo mutations may often result in an outcome that is indistinguishable from a soft sweep. This is made possible by allelic gene conversion, which "softens" hard sweeps by copying the adaptive allele onto multiple genetic backgrounds, a process we refer to as a "pseudo-soft" sweep. We carried out a simulation study examining the impact of gene conversion on sweeps from a single de novo variant in models of human, Drosophila, and Arabidopsis populations. The fraction of simulations in which gene conversion had produced multiple haplotypes with the adaptive allele upon fixation was appreciable. 
Indeed, under realistic demographic histories and gene conversion rates, even if selection always acts on a single-origin mutation, sweeps involving multiple haplotypes are more likely than hard sweeps in large populations, especially when selection is not extremely strong. Thus, even when the mutation rate is low or there is no standing variation, hard sweeps are expected to be the exception rather than the rule in large populations. These results also imply that the presence of signatures of soft sweeps does not necessarily mean that adaptation has been especially rapid or is not mutation limited.
Affiliation(s)
- Daniel R Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27599

38
Mwima R, Hui TYJ, Nanteza A, Burt A, Kayondo JK. Potential persistence mechanisms of the major Anopheles gambiae species complex malaria vectors in sub-Saharan Africa: a narrative review. Malar J 2023; 22:336. [PMID: 37936194 PMCID: PMC10631165 DOI: 10.1186/s12936-023-04775-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/22/2023] [Accepted: 10/30/2023] [Indexed: 11/09/2023]
Abstract
The source of malaria vector populations that re-establish at the beginning of the rainy season is still unclear, yet knowledge of mosquito behaviour is required to institute control measures effectively. Alternative hypotheses, namely aestivation, local refugia, migration between neighbouring sites, and long-distance migration (LDM), have been proposed to explain mosquito persistence. This work assessed malaria vector persistence dynamics and examined studies of vector survival under these hypotheses across sub-Saharan Africa, explored the range of methods and ecological parameters used, and highlighted knowledge trends and gaps. The results about a particular persistence mechanism that supports the re-establishment of Anopheles gambiae, Anopheles coluzzii or Anopheles arabiensis in sub-Saharan Africa were not conclusive, given that each method used had its limitations. For example, the Mark-Release-Recapture (MRR) method suffers from a low recapture rate that affects its accuracy, and time series analysis of field collections cannot establish whether the failure to find mosquitoes during the dry season reflects a weakness of the conventional sampling methods or the existence of hidden shelters. This calls for further investigations emphasizing ecological experiments under controlled laboratory or semi-field conditions, together with genetic approaches, as the two are known to complement each other. This review therefore unveils and assesses the uncertainties that influence the different malaria vector persistence mechanisms and provides recommendations for future studies.
Affiliation(s)
- Rita Mwima
  - Department of Entomology, Uganda Virus Research Institute (UVRI), Entebbe, Uganda
  - Department of Biotechnical and Diagnostic Sciences, College of Veterinary Medicine, Animal Resources and Biosecurity (COVAB), Makerere University, Kampala, Uganda
- Tin-Yu J Hui
  - Silwood Park Campus, Department of Life Sciences, Imperial College London, Ascot, UK
- Ann Nanteza
  - Department of Biotechnical and Diagnostic Sciences, College of Veterinary Medicine, Animal Resources and Biosecurity (COVAB), Makerere University, Kampala, Uganda
- Austin Burt
  - Silwood Park Campus, Department of Life Sciences, Imperial College London, Ascot, UK
- Jonathan K Kayondo
  - Department of Entomology, Uganda Virus Research Institute (UVRI), Entebbe, Uganda.

39
Spence JP, Zeng T, Mostafavi H, Pritchard JK. Scaling the discrete-time Wright-Fisher model to biobank-scale datasets. Genetics 2023; 225:iyad168. [PMID: 37724741 PMCID: PMC10627256 DOI: 10.1093/genetics/iyad168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 06/01/2023] [Accepted: 09/08/2023] [Indexed: 09/21/2023]
Abstract
The discrete-time Wright-Fisher (DTWF) model and its diffusion limit are central to population genetics. These models can describe the forward-in-time evolution of allele frequencies in a population resulting from genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large samples or in the presence of strong selection. Existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here, we present a scalable algorithm that approximates the DTWF model with provably bounded error. Our approach relies on two key observations about the DTWF model. The first is that transition probabilities under the model are approximately sparse. The second is that transition distributions for similar starting allele frequencies are extremely close as distributions. Together, these observations enable approximate matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the tens of millions, paving the way for rigorous biobank-scale inference. Finally, we use our results to estimate the impact of larger samples on estimating selection coefficients for loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
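The first observation in the abstract above, that DTWF transition probabilities are approximately sparse, can be illustrated with a small sketch. This is not the authors' algorithm: it assumes the simplest neutral case (a haploid population of `n` gene copies, no mutation or selection), in which the next-generation allele count is Binomial(n, i/n). The function names `dtwf_row` and `support_size` and the tolerance `eps` are hypothetical choices made for this illustration.

```python
import math

def dtwf_row(n, i):
    """Neutral DTWF transition probabilities out of allele count i.

    Assumes a haploid population of n gene copies with no mutation or
    selection, so the next-generation count is Binomial(n, i / n).
    """
    p = i / n
    if p in (0.0, 1.0):  # absorbing states: loss or fixation
        return [1.0 if k == i else 0.0 for k in range(n + 1)]
    log_p, log_q = math.log(p), math.log1p(-p)
    row = []
    for k in range(n + 1):
        # log of the binomial coefficient, computed via log-gamma to avoid overflow
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        row.append(math.exp(log_binom + k * log_p + (n - k) * log_q))
    return row

def support_size(row, eps=1e-8):
    """Smallest number of entries whose total probability is at least 1 - eps."""
    total, count = 0.0, 0
    for prob in sorted(row, reverse=True):
        total += prob
        count += 1
        if total >= 1.0 - eps:
            break
    return count

row = dtwf_row(1000, 100)
print(support_size(row), "of", len(row), "entries carry almost all of the mass")
```

In this toy setting only a small fraction of each row is non-negligible, which is the kind of structure a sparse matrix-vector product can exploit; the provable error bounds described in the abstract are a separate matter and are not reproduced here.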
Affiliation(s)
- Jeffrey P Spence
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Tony Zeng
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Jonathan K Pritchard
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
  - Department of Biology, Stanford University, Stanford, CA 94305, USA

40
Zhang Y, Zhu Q, Shao Y, Jiang Y, Ouyang Y, Zhang L, Zhang W. Inferring Historical Introgression with Deep Learning. Syst Biol 2023; 72:1013-1038. [PMID: 37257491 DOI: 10.1093/sysbio/syad033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 05/28/2023] [Accepted: 05/30/2023] [Indexed: 06/02/2023]
Abstract
Resolving phylogenetic relationships among taxa remains a challenge in the era of big data due to the presence of genetic admixture in a wide range of organisms. Rapidly developing sequencing technologies and statistical tests enable evolutionary relationships to be disentangled at a genome-wide level, yet many of these tests are computationally intensive and rely on phased genotypes, large sample sizes, restricted phylogenetic topologies, or hypothesis testing. To overcome these difficulties, we developed a deep learning-based approach, named ERICA, for inferring genome-wide evolutionary relationships and local introgressed regions from sequence data. ERICA accepts sequence alignments of both population genomic data and multiple genome assemblies, and efficiently identifies discordant genealogy patterns and exchanged regions across genomes when compared with other methods. We further tested ERICA using real population genomic data from Heliconius butterflies that have undergone adaptive radiation and frequent hybridization. Finally, we applied ERICA to characterize hybridization and introgression in wild and cultivated rice, revealing the important role of introgression in rice domestication and adaptation. Taken together, our findings demonstrate that ERICA provides an effective method for teasing apart evolutionary relationships using whole genome data, which can ultimately facilitate evolutionary studies on hybridization and introgression.
Affiliation(s)
- Yubo Zhang
  - State Key Laboratory of Protein and Plant Gene Research, Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Qingjie Zhu
  - Chinese Institute for Brain Research, Beijing 102206, China
- Yi Shao
  - Chinese Institute for Brain Research, Beijing 102206, China
- Yanchen Jiang
  - State Key Laboratory of Protein and Plant Gene Research, Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing 100871, China
- Yidan Ouyang
  - National Key Laboratory of Crop Genetic Improvement and National Centre of Plant Gene Research (Wuhan), Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan 430070, China
- Li Zhang
  - Chinese Institute for Brain Research, Beijing 102206, China
- Wei Zhang
  - State Key Laboratory of Protein and Plant Gene Research, Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing 100871, China

41
Mo Z, Siepel A. Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data. PLoS Genet 2023; 19:e1011032. [PMID: 37934781 PMCID: PMC10655966 DOI: 10.1371/journal.pgen.1011032] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 03/17/2023] [Revised: 11/17/2023] [Accepted: 10/23/2023] [Indexed: 11/09/2023]
Abstract
Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods: SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain-adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.
Affiliation(s)
- Ziyi Mo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
  - School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Adam Siepel
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
  - School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America

42
Fan C, Cahoon JL, Dinh BL, Ortega-Del Vecchyo D, Huber C, Edge MD, Mancuso N, Chiang CWK. A likelihood-based framework for demographic inference from genealogical trees. bioRxiv 2023:2023.10.10.561787. [PMID: 37873208 PMCID: PMC10592779 DOI: 10.1101/2023.10.10.561787] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/25/2023]
Abstract
The demographic history of a population drives the pattern of genetic variation and is encoded in the gene-genealogical trees of the sampled alleles. However, existing methods to infer demographic history from genetic data tend to use relatively low-dimensional summaries of the genealogy, such as allele frequency spectra. As a step toward capturing more of the information encoded in the genome-wide sequence of genealogical trees, here we propose a novel framework called the genealogical likelihood (gLike), which derives the full likelihood of a genealogical tree under any hypothesized demographic history. Employing a graph-based structure, gLike summarizes across independent trees the relationships among all lineages in a tree with all possible trajectories of population memberships through time and efficiently computes the exact marginal probability under a parameterized demographic model. Through extensive simulations and empirical applications on populations that have experienced multiple admixtures, we showed that gLike can accurately estimate dozens of demographic parameters when the true genealogy is known, including ancestral population sizes, admixture timing, and admixture proportions. Moreover, when using genealogical trees inferred from genetic data, we showed that gLike outperformed conventional demographic inference methods that leverage only the allele-frequency spectrum and yielded parameter estimates that align with established historical knowledge of the past demographic histories for populations like Latino Americans and Native Hawaiians. Furthermore, our framework can trace ancestral histories by analyzing a sample from the admixed population without proxies for its source populations, removing the need to sample ancestral populations that may no longer exist. 
Taken together, our proposed gLike framework harnesses underutilized genealogical information to offer exceptional sensitivity and accuracy in inferring complex demographies for humans and other species, particularly as estimation of genome-wide genealogies improves.
Affiliation(s)
- Caoqi Fan
  - Department of Quantitative and Computational Biology, University of Southern California
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
- Jordan L Cahoon
  - Department of Quantitative and Computational Biology, University of Southern California
  - Department of Computer Science, University of Southern California
- Bryan L Dinh
  - Department of Quantitative and Computational Biology, University of Southern California
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
- Diego Ortega-Del Vecchyo
  - Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Juriquilla, Querétaro, México
- Christian Huber
  - Department of Biology, Penn State University, University Park, PA, USA
- Michael D Edge
  - Department of Quantitative and Computational Biology, University of Southern California
- Nicholas Mancuso
  - Department of Quantitative and Computational Biology, University of Southern California
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
- Charleston W K Chiang
  - Department of Quantitative and Computational Biology, University of Southern California
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California

43
Pivirotto AM, Platt A, Patel R, Kumar S, Hey J. Analyses of allele age and fitness impact reveal human beneficial alleles to be older than neutral controls. bioRxiv 2023:2023.10.09.561569. [PMID: 37873438 PMCID: PMC10592680 DOI: 10.1101/2023.10.09.561569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
A classic population genetic prediction is that alleles experiencing directional selection should swiftly traverse allele frequency space, leaving detectable reductions in genetic variation in linked regions. However, despite this expectation, identifying clear footprints of beneficial allele passage has proven to be surprisingly challenging. We addressed the basic premise underlying this expectation by estimating the ages of large numbers of beneficial and deleterious alleles in a human population genomic data set. Deleterious alleles were found to be young, on average, given their allele frequency. However, beneficial alleles were older on average than non-coding, non-regulatory alleles of the same frequency. This finding is not consistent with directional selection and instead indicates some type of balancing selection. Among derived beneficial alleles, those fixed in the population show higher local recombination rates than those still segregating, consistent with a model in which new beneficial alleles experience an initial period of balancing selection due to linkage disequilibrium with deleterious recessive alleles. Alleles that ultimately fix following a period of balancing selection will leave a modest 'soft' sweep impact on the local variation, consistent with the overall paucity of species-wide 'hard' sweeps in human genomes.
Affiliation(s)
- Alexander Platt
  - Temple University, Department of Biology, Philadelphia PA 19122, USA
  - University of Pennsylvania, Department of Genetics, Philadelphia PA 19104, USA
- Ravi Patel
  - Temple University, Department of Biology, Philadelphia PA 19122, USA
  - Institute for Genomics and Evolutionary Medicine, Temple University, PA 19122, USA
- Sudhir Kumar
  - Temple University, Department of Biology, Philadelphia PA 19122, USA
  - Institute for Genomics and Evolutionary Medicine, Temple University, PA 19122, USA
- Jody Hey
  - Temple University, Department of Biology, Philadelphia PA 19122, USA

44
Medina-Muñoz SG, Ortega-Del Vecchyo D, Cruz-Hervert LP, Ferreyra-Reyes L, García-García L, Moreno-Estrada A, Ragsdale AP. Demographic modeling of admixed Latin American populations from whole genomes. Am J Hum Genet 2023; 110:1804-1816. [PMID: 37725976 PMCID: PMC10577084 DOI: 10.1016/j.ajhg.2023.08.015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/07/2023] [Revised: 08/17/2023] [Accepted: 08/23/2023] [Indexed: 09/21/2023]
Abstract
Demographic models of Latin American populations often fail to fully capture their complex evolutionary history, which has been shaped by both recent admixture and deeper-in-time demographic events. To address this gap, we used high-coverage whole-genome data from Indigenous American ancestries in present-day Mexico and existing genomes from across Latin America to infer multiple demographic models that capture the impact of different timescales on genetic diversity. Our approach, which combines analyses of allele frequencies and ancestry tract length distributions, represents a significant improvement over current models in predicting patterns of genetic variation in admixed Latin American populations. We jointly modeled the contribution of European, African, East Asian, and Indigenous American ancestries into present-day Latin American populations. We infer that the ancestors of Indigenous Americans and East Asians diverged ∼30 thousand years ago, and we characterize genetic contributions of recent migrations from East and Southeast Asia to Peru and Mexico. Our inferred demographic histories are consistent across different genomic regions and annotations, suggesting that our inferences are robust to the potential effects of linked selection. In conjunction with published distributions of fitness effects for new nonsynonymous mutations in humans, we show in large-scale simulations that our models recover important features of both neutral and deleterious variation. By providing a more realistic framework for understanding the evolutionary history of Latin American populations, our models can help address the historical under-representation of admixed groups in genomics research and can be a valuable resource for future studies of populations with complex admixture and demographic histories.
Affiliation(s)
- Santiago G Medina-Muñoz
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Advanced Genomics Unit (UGA), CINVESTAV, Irapuato, Guanajuato 36824, Mexico
- Diego Ortega-Del Vecchyo
  - Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de Mexico, Juriquilla, Querétaro 76230, Mexico
- Andrés Moreno-Estrada
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Advanced Genomics Unit (UGA), CINVESTAV, Irapuato, Guanajuato 36824, Mexico.
- Aaron P Ragsdale
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Advanced Genomics Unit (UGA), CINVESTAV, Irapuato, Guanajuato 36824, Mexico; Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI 53706, USA.

45
Murga-Moreno J, Casillas S, Barbadilla A, Uricchio L, Enard D. An efficient and robust ABC approach to infer the rate and strength of adaptation. bioRxiv 2023:2023.08.29.555322. [PMID: 37693550 PMCID: PMC10491248 DOI: 10.1101/2023.08.29.555322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/12/2023]
Abstract
Inferring the effects of positive selection on genomes remains a critical step in characterizing the ultimate and proximate causes of adaptation across species, and quantifying positive selection remains a challenge due to the confounding effects of many other evolutionary processes. Robust and efficient approaches for adaptation inference could help characterize the rate and strength of adaptation in non-model species for which demographic history, mutational processes, and recombination patterns are not currently well-described. Here, we introduce an efficient and user-friendly extension of the McDonald-Kreitman test (ABC-MK) for quantifying long-term protein adaptation in specific lineages of interest. We characterize the performance of our approach with forward simulations and find that it is robust to many demographic perturbations and positive selection configurations, demonstrating its suitability for applications to non-model genomes. We apply ABC-MK to the human proteome and a set of known Virus Interacting Proteins (VIPs) to test the long-term adaptation in genes interacting with viruses. We find substantially stronger signatures of positive selection on RNA-VIPs than DNA-VIPs, suggesting that RNA viruses may be an important driver of human adaptation over deep evolutionary time scales.
Affiliation(s)
- Jesús Murga-Moreno
  - University of Arizona Department of Ecology and Evolutionary Biology, Tucson, USA
- Sònia Casillas
  - Department of Genetics and Microbiology, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
  - Institute of Biotechnology and Biomedicine, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
- Antonio Barbadilla
  - Department of Genetics and Microbiology, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
  - Institute of Biotechnology and Biomedicine, Universitat Autònoma de Barcelona, Bellaterra, Barcelona 08193, Spain
- David Enard
  - University of Arizona Department of Ecology and Evolutionary Biology, Tucson, USA

46
Riley R, Mathieson I, Mathieson S. Interpreting generative adversarial networks to infer natural selection from genetic data. bioRxiv 2023:2023.03.07.531546. [PMID: 36945387 PMCID: PMC10028936 DOI: 10.1101/2023.03.07.531546] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 03/10/2023]
Abstract
Understanding natural selection in humans and other species is a major focus for the use of machine learning in population genetics. Existing methods rely on computationally intensive simulated training data. Unlike efficient neutral coalescent simulations for demographic inference, realistic simulations of selection typically require slow forward simulations. Because there are many possible modes of selection, a high dimensional parameter space must be explored, with no guarantee that the simulated models are close to the real processes. Mismatches between simulated training data and real test data can lead to incorrect inference. Finally, it is difficult to interpret trained neural networks, leading to a lack of understanding about what features contribute to classification. Here we develop a new approach to detect selection that requires relatively few selection simulations during training. We use a Generative Adversarial Network (GAN) trained to simulate realistic neutral data. The resulting GAN consists of a generator (fitted demographic model) and a discriminator (convolutional neural network). For a genomic region, the discriminator predicts whether it is "real" or "fake" in the sense that it could have been simulated by the generator. As the "real" training data includes regions that experienced selection and the generator cannot produce such regions, regions with a high probability of being real are likely to have experienced selection. To further incentivize this behavior, we "fine-tune" the discriminator with a small number of selection simulations. We show that this approach has high power to detect selection in simulations, and that it finds regions under selection identified by state-of-the-art population genetic methods in three human populations. Finally, we show how to interpret the trained networks by clustering hidden units of the discriminator based on their correlation patterns with known summary statistics.
In summary, our approach is a novel, efficient, and powerful way to use machine learning to detect natural selection.
Affiliation(s)
- Rebecca Riley
  - Department of Computer Science, Haverford College, Haverford PA, 19041 USA
- Iain Mathieson
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia PA, 19104 USA
- Sara Mathieson
  - Department of Computer Science, Haverford College, Haverford PA, 19041 USA
|
47
|
Lehmann B, Mackintosh M, McVean G, Holmes C. Optimal strategies for learning multi-ancestry polygenic scores vary across traits. Nat Commun 2023; 14:4023. [PMID: 37419925 PMCID: PMC10328935 DOI: 10.1038/s41467-023-38930-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Accepted: 05/22/2023] [Indexed: 07/09/2023] Open
Abstract
Polygenic scores (PGSs) are individual-level measures that aggregate the genome-wide genetic predisposition to a given trait. As PGS have predominantly been developed using European-ancestry samples, trait prediction using such European ancestry-derived PGS is less accurate in non-European ancestry individuals. Although there has been recent progress in combining multiple PGS trained on distinct populations, the problem of how to maximize performance given a multiple-ancestry cohort is largely unexplored. Here, we investigate the effect of sample size and ancestry composition on PGS performance for fifteen traits in UK Biobank. For some traits, PGS estimated using a relatively small African-ancestry training set outperformed, on an African-ancestry test set, PGS estimated using a much larger European-ancestry only training set. We observe similar, but not identical, results when considering other minority-ancestry groups within UK Biobank. Our results emphasise the importance of targeted data collection from underrepresented groups in order to address existing disparities in PGS performance.
Affiliation(s)
- Brieuc Lehmann
  - Department of Statistical Science, University College London, London, UK
- Gil McVean
  - Big Data Institute, University of Oxford, Oxford, UK
- Chris Holmes
  - The Alan Turing Institute, London, UK
  - Big Data Institute, University of Oxford, Oxford, UK
  - Department of Statistics, University of Oxford, Oxford, UK
|
48
|
Whitehouse LS, Schrider DR. Timesweeper: accurately identifying selective sweeps using population genomic time series. Genetics 2023; 224:iyad084. [PMID: 37157914 PMCID: PMC10324941 DOI: 10.1093/genetics/iyad084] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/25/2022] [Revised: 07/25/2022] [Accepted: 04/25/2023] [Indexed: 05/10/2023] Open
Abstract
Despite decades of research, identifying selective sweeps, the genomic footprints of positive selection, remains a core problem in population genetics. Of the myriad methods that have been developed to tackle this task, few are designed to leverage the potential of genomic time-series data. This is because in most population genetic studies of natural populations, only a single period of time can be sampled. Recent advancements in sequencing technology, including improvements in extracting and sequencing ancient DNA, have made repeated samplings of a population possible, allowing for more direct analysis of recent evolutionary dynamics. Serial sampling of organisms with shorter generation times has also become more feasible due to improvements in the cost and throughput of sequencing. With these advances in mind, here we present Timesweeper, a fast and accurate convolutional neural network-based tool for identifying selective sweeps in data consisting of multiple genomic samplings of a population over time. Timesweeper analyzes population genomic time-series data by first simulating training data under a demographic model appropriate for the data of interest, training a one-dimensional convolutional neural network on said simulations, and inferring which polymorphisms in this serialized data set were the direct target of a completed or ongoing selective sweep. We show that Timesweeper is accurate under multiple simulated demographic and sampling scenarios, identifies selected variants with high resolution, and estimates selection coefficients more accurately than existing methods. 
In sum, we show that more accurate inferences about natural selection are possible when genomic time-series data are available; such data will continue to proliferate in coming years due to both the sequencing of ancient samples and repeated samplings of extant populations with faster generation times, as well as experimentally evolved populations where time-series data are often generated. Methodological advances such as Timesweeper thus have the potential to help resolve the controversy over the role of positive selection in the genome. We provide Timesweeper as a Python package for use by the community.
Affiliation(s)
- Logan S Whitehouse
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27514, USA
- Daniel R Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27514, USA
|
49
|
Naseri A, Yue W, Zhang S, Zhi D. Fast inference of genetic recombination rates in biobank scale data. Genome Res 2023; 33:1015-1022. [PMID: 37349109 PMCID: PMC10538484 DOI: 10.1101/gr.277676.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Accepted: 06/09/2023] [Indexed: 06/24/2023]
Abstract
Although rates of recombination events across the genome (genetic maps) are fundamental to genetic research, the majority of current studies only use one standard map. There is evidence suggesting population differences in genetic maps, and thus estimating population-specific maps is of interest. Although the recent availability of biobank-scale data offers such opportunities, current methods are not efficient at leveraging very large sample sizes. The most accurate methods are still linkage disequilibrium (LD)-based methods that are only tractable for a few hundred samples. In this work, we propose a fast and memory-efficient method for estimating genetic maps from population genotyping data. Our method, FastRecomb, leverages the efficient positional Burrows-Wheeler transform (PBWT) data structure for counting IBD segment boundaries as potential recombination events. We used PBWT blocks to avoid redundant counting of pairwise matches. Moreover, we used a panel-smoothing technique to reduce the noise from errors and recent mutations. Using simulation, we found that FastRecomb achieves state-of-the-art performance at 10-kb resolution, in terms of correlation coefficients between the estimated map and the ground truth. This is mainly because FastRecomb can effectively take advantage of large panels comprising more than hundreds of thousands of haplotypes. At the same time, other methods lack the efficiency to handle such data. We believe further refinement of FastRecomb would deliver more accurate genetic maps for the genetics community.
Affiliation(s)
- Ardalan Naseri
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas 77030, USA
- William Yue
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas 77030, USA
- Shaojie Zhang
  - Department of Computer Science, University of Central Florida, Orlando, Florida 32816, USA
- Degui Zhi
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas 77030, USA
|
50
|
Schweiger R, Durbin R. Ultrafast genome-wide inference of pairwise coalescence times. Genome Res 2023; 33:1023-1031. [PMID: 37562965 PMCID: PMC10538485 DOI: 10.1101/gr.277665.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Accepted: 06/21/2023] [Indexed: 08/12/2023]
Abstract
The pairwise sequentially Markovian coalescent (PSMC) algorithm and its extensions infer the coalescence time of two homologous chromosomes at each genomic position. This inference is used in reconstructing demographic histories, detecting selection signatures, studying genome-wide associations, constructing ancestral recombination graphs, and more. Inference of coalescence times between each pair of haplotypes in a large data set is of great interest, as they may provide rich information about the population structure and history of the sample. Here, we introduce a new method, Gamma-SMC, which is more than 10 times faster than current methods. To obtain this speed-up, we represent the posterior coalescence time distributions succinctly as a gamma distribution with just two parameters; in contrast, PSMC and its extensions hold these in a vector over discrete intervals of time. Thus, Gamma-SMC has constant time-complexity per site, without dependence on the number of discrete time states. Additionally, because of this continuous representation, our method is able to infer times spanning many orders of magnitude and, as such, is robust to parameter misspecification. We describe how this approach works, show its performance on simulated and real data, and illustrate its use in studying recent positive selection in the 1000 Genomes Project data set.
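The contrast between a discretized posterior and a two-parameter gamma summary can be illustrated with a short sketch. The code below collapses a posterior held as one probability per time bin into a gamma (shape, rate) pair by matching the mean and variance. Moment matching is an assumption made here for illustration only; it is not claimed to be how Gamma-SMC propagates its posteriors, and the function names and the five-bin posterior are hypothetical.

```python
def gamma_from_moments(mean, variance):
    """Return (shape, rate) of the gamma distribution with this mean and variance."""
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    # For a gamma distribution: mean = shape / rate, variance = shape / rate**2.
    return mean * mean / variance, mean / variance


def summarize_discrete_posterior(midpoints, probs):
    """Collapse a binned posterior over coalescence time into two gamma parameters."""
    total = sum(probs)
    mean = sum(t * p for t, p in zip(midpoints, probs)) / total
    variance = sum(p * (t - mean) ** 2 for t, p in zip(midpoints, probs)) / total
    return gamma_from_moments(mean, variance)


# Hypothetical posterior over five time bins (bin midpoints in coalescent units).
midpoints = [0.1, 0.5, 1.0, 2.0, 4.0]
probs = [0.05, 0.25, 0.40, 0.25, 0.05]
shape, rate = summarize_discrete_posterior(midpoints, probs)
print(f"shape = {shape:.3f}, rate = {rate:.3f}, mean = {shape / rate:.3f}")
```

The storage cost of the summary is two numbers regardless of how many time bins the discretized version uses, which is the property the abstract ties to constant time complexity per site.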
Affiliation(s)
- Regev Schweiger
  - Department of Genetics, University of Cambridge, Cambridge CB2 1TN, United Kingdom
- Richard Durbin
  - Department of Genetics, University of Cambridge, Cambridge CB2 1TN, United Kingdom
|