1
Temple SD, Browning SR, Thompson EA. Fast simulation of identity-by-descent segments. Bull Math Biol 2025; 87:84. [PMID: 40410602 PMCID: PMC12102126 DOI: 10.1007/s11538-025-01464-8] [Received: 01/16/2025] [Accepted: 05/08/2025] [Indexed: 05/25/2025]
Abstract
The worst-case runtime complexity to simulate haplotype segments identical by descent (IBD) is quadratic in sample size. We propose two main techniques to reduce the compute time, both of which are motivated by coalescent and recombination processes. We provide mathematical results that explain why our algorithm should outperform a naive implementation with high probability. In our experiments, we observe average compute times to simulate detectable IBD segments around a locus that scale approximately linearly in sample size and take a couple of seconds for sample sizes that are less than 10,000 diploid individuals. In contrast, we find that existing methods to simulate IBD segments take minutes to hours for sample sizes exceeding a few thousand diploid individuals. When using IBD segments to study recent positive selection around a locus, our efficient simulation algorithm makes feasible statistical inferences, e.g., parametric bootstrapping in analyses of large biobanks, that would be otherwise intractable.
Affiliation(s)
- Seth D Temple
- Department of Statistics, University of Washington, Seattle, WA, USA.
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA.
- Michigan Institute of Data Science, University of Michigan, Ann Arbor, MI, USA.
- Sharon R Browning
- Department of Biostatistics, University of Washington, Seattle, WA, USA

2
Salles MMA, Domingos FMCB. Towards the next generation of species delimitation methods: an overview of machine learning applications. Mol Phylogenet Evol 2025; 210:108368. [PMID: 40348350 DOI: 10.1016/j.ympev.2025.108368] [Received: 12/21/2023] [Revised: 02/25/2025] [Accepted: 05/04/2025] [Indexed: 05/14/2025]
Abstract
Species delimitation is the process of distinguishing between populations of the same species and distinct species of a particular group of organisms. Various methods exist for inferring species limits, whether based on morphological, molecular, or other types of data. In the case of methods based on DNA sequences, most of them are rooted in the coalescent theory. However, coalescence-based models have limitations, for instance regarding complex evolutionary scenarios and large datasets. In this context, machine learning (ML) can be considered as a promising analytical tool, and provides an effective way to explore dataset structures when species-level divergences are hypothesized. In this review, we examine the use of ML in species delimitation and provide an overview and critical appraisal of existing workflows. We also provide simple explanations on how the main types of ML approaches operate, which should help uninitiated researchers and students interested in the field. Our review suggests that while current ML methods designed to infer species limits are analytically powerful, they also present specific limitations and should not be considered as definitive alternatives to coalescent methods for species delimitation. Future ML enterprises to delimit species should consider the constraints related to the use of simulated data, as in other model-based methods relying on simulations. Conversely, the flexibility of ML algorithms offers a significant advantage by enabling the analysis of diverse data types (e.g., genetic and phenotypic) and handling large datasets effectively. We also propose best practices for the use of ML methods in species delimitation, offering insights into potential future applications. We expect that the proposed guidelines will be useful for enhancing the accessibility, effectiveness, and objectivity of ML in species delimitation.
Affiliation(s)
- Matheus M A Salles
- Departamento de Zoologia, Universidade Federal do Paraná, Curitiba 81531-980, Brazil.

3
Arnab SP, Campelo dos Santos AL, Fumagalli M, DeGiorgio M. Efficient Detection and Characterization of Targets of Natural Selection Using Transfer Learning. Mol Biol Evol 2025; 42:msaf094. [PMID: 40341942 PMCID: PMC12062966 DOI: 10.1093/molbev/msaf094] [Received: 11/04/2024] [Revised: 04/16/2025] [Accepted: 04/17/2025] [Indexed: 05/11/2025]
Abstract
Natural selection leaves detectable patterns of altered spatial diversity within genomes, and identifying affected regions is crucial for understanding species evolution. Recently, machine learning approaches applied to raw population genomic data have been developed to uncover these adaptive signatures. Convolutional neural networks (CNNs) are particularly effective for this task, as they handle large data arrays while maintaining element correlations. However, shallow CNNs may miss complex patterns due to their limited capacity, while deep CNNs can capture these patterns but require extensive data and computational power. Transfer learning addresses these challenges by utilizing a deep CNN pretrained on a large dataset as a feature extraction tool for downstream classification and evolutionary parameter prediction. This approach reduces extensive training data generation requirements and computational needs while maintaining high performance. In this study, we developed TrIdent, a tool that uses transfer learning to enhance detection of adaptive genomic regions from image representations of multilocus variation. We evaluated TrIdent across various genetic, demographic, and adaptive settings, in addition to unphased data and other confounding factors. TrIdent demonstrated improved detection of adaptive regions compared to recent methods using similar data representations. We further explored model interpretability through class activation maps and adapted TrIdent to infer selection parameters for identified adaptive candidates. Using whole-genome haplotype data from European and African populations, TrIdent effectively recapitulated known sweep candidates and identified novel cancer- and other disease-associated genes as potential sweeps.
Affiliation(s)
- Sandipan Paul Arnab
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
- Matteo Fumagalli
- School of Biological and Behavioural Sciences, Queen Mary University of London, London, UK
- The Alan Turing Institute, London, UK
- Michael DeGiorgio
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA

4
Tittes S, Lorant A, McGinty SP, Holland JB, de Jesus Sánchez-González J, Seetharam A, Tenaillon M, Ross-Ibarra J. The population genetics of convergent adaptation in maize and teosinte is not locally restricted. eLife 2025; 12:RP92405. [PMID: 39945053 PMCID: PMC11825130 DOI: 10.7554/elife.92405] [Indexed: 02/17/2025]
Abstract
What is the genetic architecture of local adaptation and what is the geographic scale over which it operates? We investigated patterns of local and convergent adaptation in five sympatric population pairs of traditionally cultivated maize and its wild relative teosinte (Zea mays subsp. parviglumis). We found that signatures of local adaptation based on the inference of adaptive fixations and selective sweeps are frequently exclusive to individual populations, more so in teosinte compared to maize. However, for both maize and teosinte, selective sweeps are also frequently shared by several populations, and often between subspecies. We were further able to infer that selective sweeps were shared among populations most often via migration, though sharing via standing variation was also common. Our analyses suggest that teosinte has been a continued source of beneficial alleles for maize, even after domestication, and that maize populations have facilitated adaptation in teosinte by moving beneficial alleles across the landscape. Taken together, our results suggest local adaptation in maize and teosinte has an intermediate geographic scale, one that is larger than individual populations but smaller than the species range.
Affiliation(s)
- Silas Tittes
- Department of Evolution and Ecology, University of California, Davis, Davis, United States
- Institute of Ecology and Evolution, University of Oregon, Eugene, United States
- Center for Population Biology, University of California, Davis, Davis, United States
- Anne Lorant
- Department of Plant Sciences, University of California, Davis, Davis, United States
- Sean P McGinty
- Department of Integrative Genetics and Genomics, University of California, Davis, Davis, United States
- James B Holland
- United States Department of Agriculture – Agriculture Research Service, Raleigh, United States
- Department of Crop and Soil Sciences, North Carolina State University, Raleigh, United States
- Arun Seetharam
- Department of Ecology, Evolution, and Organismal Biology; Genome Informatics Facility, Iowa State University, Ames, United States
- Maud Tenaillon
- Génétique Quantitative et Evolution - Le Moulon, Université Paris-Saclay, INRAE, CNRS, AgroParisTech, Gif-sur-Yvette, France
- Jeffrey Ross-Ibarra
- Department of Evolution and Ecology, University of California, Davis, Davis, United States
- Center for Population Biology, University of California, Davis, Davis, United States
- Genome Center, University of California, Davis, Davis, United States

5
Temple SD, Browning SR. Multiple-testing corrections in selection scans using identity-by-descent segments. bioRxiv 2025:2025.01.29.635528. [PMID: 39975073 PMCID: PMC11838353 DOI: 10.1101/2025.01.29.635528] [Indexed: 02/21/2025]
Abstract
Failing to correct for multiple testing in selection scans can lead to false discoveries of recent genetic adaptations. The scanning statistics in selection studies are often too complicated to theoretically derive a genome-wide significance level or empirically validate control of the family-wise error rate (FWER). By modeling the autocorrelation of identity-by-descent (IBD) rates, we propose a computationally efficient method to determine genome-wide significance levels in an IBD-based scan for recent positive selection. In whole genome simulations, we show that our method has approximate control of the FWER and can adapt to the spacing of tests along the genome. We also show that these scans can have more than fifty percent power to reject the null model in hard sweeps with a selection coefficient s ≥ 0.01 and a sweeping allele frequency between twenty-five and seventy-five percent. A few human genes and gene complexes have statistically significant excesses of IBD segments in thousands of samples of African, European, and South Asian ancestry groups from the Trans-Omics for Precision Medicine project and the United Kingdom Biobank. Among the significant loci, many signals of recent selection are shared across ancestry groups. One shared selection signal at a skeletal cell development gene is extremely strong in African ancestry samples.
Affiliation(s)
- Seth D. Temple
- Department of Statistics, University of Washington, Seattle, Washington, USA
- Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA
- Michigan Institute for Data Science, University of Michigan, Ann Arbor, Michigan, USA
- Sharon R. Browning
- Department of Biostatistics, University of Washington, Seattle, Washington, USA

6
Dabi A, Schrider DR. Population size rescaling significantly biases outcomes of forward-in-time population genetic simulations. Genetics 2025; 229:1-57. [PMID: 39503241 PMCID: PMC11708920 DOI: 10.1093/genetics/iyae180] [Received: 04/18/2024] [Accepted: 10/18/2024] [Indexed: 11/13/2024]
Abstract
Simulations are an essential tool in all areas of population genetic research, used in tasks such as the validation of theoretical analysis and the study of complex evolutionary models. Forward-in-time simulations are especially flexible, allowing for various types of natural selection, complex genetic architectures, and non-Wright-Fisher dynamics. However, their intense computational requirements can be prohibitive to simulating large populations and genomes. A popular method to alleviate this burden is to scale down the population size by some scaling factor while scaling up the mutation rate, selection coefficients, and recombination rate by the same factor. However, this rescaling approach may in some cases bias simulation results. To investigate the manner and degree to which rescaling impacts simulation outcomes, we carried out simulations with different demographic histories and distributions of fitness effects using several values of the rescaling factor, Q, and compared the deviation of key outcomes (fixation times, allele frequencies, linkage disequilibrium, and the fraction of mutations that fix during the simulation) between the scaled and unscaled simulations. Our results indicate that scaling introduces substantial biases to each of these measured outcomes, even at small values of Q. Moreover, the nature of these effects depends on the evolutionary model and scaling factor being examined. While increasing the scaling factor tends to increase the observed biases, this relationship is not always straightforward; thus, it may be difficult to know the impact of scaling on simulation outcomes a priori. However, it appears that for most models, only a small number of replicates was needed to accurately quantify the bias produced by rescaling for a given Q. 
In summary, while rescaling forward-in-time simulations may be necessary in many cases, researchers should be aware of the rescaling procedure's impact on simulation outcomes and consider investigating its magnitude in smaller scale simulations of the desired model(s) before selecting an appropriate value of Q.
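The rescaling procedure described in this abstract can be illustrated with a minimal sketch. It assumes only what the abstract states: the population size is divided by a factor Q while the mutation rate, recombination rate, and selection coefficient are multiplied by Q. The names `SimParams` and `rescale` are hypothetical and do not come from the paper or from any simulation package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Hypothetical container for forward-simulation parameters."""
    pop_size: int               # number of diploid individuals, N
    mutation_rate: float        # per-site, per-generation mutation rate
    recombination_rate: float   # per-site, per-generation recombination rate
    selection_coeff: float      # selection coefficient, s


def rescale(params: SimParams, q: float) -> SimParams:
    """Divide N by q and multiply the three rates by q.

    The products N*mu, N*r and N*s are left (nearly) unchanged;
    rounding N to an integer can introduce a small discrepancy.
    """
    if q < 1:
        raise ValueError("rescaling factor q must be >= 1")
    return SimParams(
        pop_size=max(1, round(params.pop_size / q)),
        mutation_rate=params.mutation_rate * q,
        recombination_rate=params.recombination_rate * q,
        selection_coeff=params.selection_coeff * q,
    )


if __name__ == "__main__":
    original = SimParams(10_000, 1e-8, 1e-8, 0.01)
    scaled = rescale(original, q=10)
    print(scaled)
    print("N*s before:", original.pop_size * original.selection_coeff)
    print("N*s after: ", scaled.pop_size * scaled.selection_coeff)
```

Keeping the compound parameters fixed is the motivation for rescaling, and the abstract's point is that simulation outcomes can still differ substantially between the scaled and unscaled runs.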
Affiliation(s)
- Amjad Dabi
- Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Daniel R Schrider
- Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

7
Amin MR, Hasan M, DeGiorgio M. Digital Image Processing to Detect Adaptive Evolution. Mol Biol Evol 2024; 41:msae242. [PMID: 39565932 PMCID: PMC11631197 DOI: 10.1093/molbev/msae242] [Received: 03/20/2024] [Revised: 10/28/2024] [Accepted: 11/13/2024] [Indexed: 11/22/2024]
Abstract
In recent years, advances in image processing and machine learning have fueled a paradigm shift in detecting genomic regions under natural selection. Early machine learning techniques employed population-genetic summary statistics as features, which focus on specific genomic patterns expected by adaptive and neutral processes. Though such engineered features are important when training data are limited, the ease with which simulated data can now be generated has led to the recent development of approaches that take in image representations of haplotype alignments and automatically extract important features using convolutional neural networks. Digital image processing methods termed α-molecules are a class of techniques for multiscale representation of objects that can extract a diverse set of features from images. One such α-molecule method, termed wavelet decomposition, lends greater control over high-frequency components of images. Another α-molecule method, termed curvelet decomposition, is an extension of the wavelet concept that considers events occurring along curves within images. We show that application of these α-molecule techniques to extract features from image representations of haplotype alignments yields high true positive rates and accuracy in detecting hard and soft selective sweep signatures from genomic data with both linear and nonlinear machine learning classifiers. Moreover, we find that such models are easy to visualize and interpret, with performance rivaling that of contemporary deep learning approaches for detecting sweeps.
Affiliation(s)
- Md Ruhul Amin
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Mahmudul Hasan
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Michael DeGiorgio
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA

8
Whitehouse LS, Ray DD, Schrider DR. Tree Sequences as a General-Purpose Tool for Population Genetic Inference. Mol Biol Evol 2024; 41:msae223. [PMID: 39460991 PMCID: PMC11600592 DOI: 10.1093/molbev/msae223] [Received: 02/20/2024] [Revised: 10/05/2024] [Accepted: 10/17/2024] [Indexed: 10/28/2024]
Abstract
As population genetic data increase in size, new methods have been developed to store genetic information in efficient ways, such as tree sequences. These data structures are computationally and storage efficient but are not interchangeable with existing data structures used for many population genetic inference methodologies such as the use of convolutional neural networks applied to population genetic alignments. To better utilize these new data structures, we propose and implement a graph convolutional network to directly learn from tree sequence topology and node data, allowing for the use of neural network applications without an intermediate step of converting tree sequences to population genetic alignment format. We then compare our approach to standard convolutional neural network approaches on a set of previously defined benchmarking tasks including recombination rate estimation, positive selection detection, introgression detection, and demographic model parameter inference. We show that tree sequences can be directly learned from using a graph convolutional network approach and can be used to perform well on these common population genetic inference tasks with accuracies roughly matching or even exceeding that of a convolutional neural network-based method. As tree sequences become more widely used in population genetic research, we foresee developments and optimizations of this work to provide a foundation for population genetic inference moving forward.
Affiliation(s)
- Logan S Whitehouse
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
- Dylan D Ray
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
- Daniel R Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA

9
Whitehouse LS, Ray D, Schrider DR. Tree sequences as a general-purpose tool for population genetic inference. bioRxiv 2024:2024.02.20.581288. [PMID: 39185244 PMCID: PMC11343121 DOI: 10.1101/2024.02.20.581288] [Indexed: 08/27/2024]
Abstract
As population genetics data increase in size, new methods have been developed to store genetic information in efficient ways, such as tree sequences. These data structures are computationally and storage efficient, but are not interchangeable with existing data structures used for many population genetic inference methodologies such as the use of convolutional neural networks (CNNs) applied to population genetic alignments. To better utilize these new data structures, we propose and implement a graph convolutional network (GCN) to directly learn from tree sequence topology and node data, allowing for the use of neural network applications without an intermediate step of converting tree sequences to population genetic alignment format. We then compare our approach to standard CNN approaches on a set of previously defined benchmarking tasks including recombination rate estimation, positive selection detection, introgression detection, and demographic model parameter inference. We show that tree sequences can be directly learned from using a GCN approach and can be used to perform well on these common population genetics inference tasks with accuracies roughly matching or even exceeding that of a CNN-based method. As tree sequences become more widely used in population genetics research, we foresee developments and optimizations of this work to provide a foundation for population genetics inference moving forward.
Affiliation(s)
- Logan S. Whitehouse
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA, 120 Mason Farm Rd, Chapel Hill, NC 27514
- Dylan Ray
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA, 120 Mason Farm Rd, Chapel Hill, NC 27514
- Daniel R. Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA, 120 Mason Farm Rd, Chapel Hill, NC 27514

10
Dabi A, Schrider DR. Population size rescaling significantly biases outcomes of forward-in-time population genetic simulations. bioRxiv 2024:2024.04.07.588318. [PMID: 38645049 PMCID: PMC11030438 DOI: 10.1101/2024.04.07.588318] [Indexed: 04/23/2024]
Abstract
Simulations are an essential tool in all areas of population genetic research, used in tasks such as the validation of theoretical analysis and the study of complex evolutionary models. Forward-in-time simulations are especially flexible, allowing for various types of natural selection, complex genetic architectures, and non-Wright-Fisher dynamics. However, their intense computational requirements can be prohibitive to simulating large populations and genomes. A popular method to alleviate this burden is to scale down the population size by some scaling factor while scaling up the mutation rate, selection coefficients, and recombination rate by the same factor. However, this rescaling approach may in some cases bias simulation results. To investigate the manner and degree to which rescaling impacts simulation outcomes, we carried out simulations with different demographic histories and distributions of fitness effects using several values of the rescaling factor, Q, and compared the deviation of key outcomes (fixation times, allele frequencies, linkage disequilibrium, and the fraction of mutations that fix during the simulation) between the scaled and unscaled simulations. Our results indicate that scaling introduces substantial biases to each of these measured outcomes, even at small values of Q. Moreover, the nature of these effects depends on the evolutionary model and scaling factor being examined. While increasing the scaling factor tends to increase the observed biases, this relationship is not always straightforward; thus, it may be difficult to know the impact of scaling on simulation outcomes a priori. However, it appears that for most models, only a small number of replicates was needed to accurately quantify the bias produced by rescaling for a given Q.
In summary, while rescaling forward-in-time simulations may be necessary in many cases, researchers should be aware of the rescaling procedure's impact on simulation outcomes and consider investigating its magnitude in smaller scale simulations of the desired model(s) before selecting an appropriate value of Q.
Affiliation(s)
- Amjad Dabi
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA
- Daniel R. Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA

11
Wang Y, Allen SL, Reddiex AJ, Chenoweth SF. The impacts of positive selection on genomic variation in Drosophila serrata: Insights from a deep learning approach. Mol Ecol 2024; 33:e17499. [PMID: 39188068 DOI: 10.1111/mec.17499] [Received: 03/20/2024] [Revised: 07/22/2024] [Accepted: 08/07/2024] [Indexed: 08/28/2024]
Abstract
This study explores the impact of positive selection on the genetic composition of a Drosophila serrata population in eastern Australia through a comprehensive analysis of 110 whole genome sequences. Utilizing an advanced deep learning algorithm (partialS/HIC) and a range of inferred demographic histories, we identified that approximately 14% of the genome is directly affected by sweeps, with soft sweeps being more prevalent (10.6%) than hard sweeps (2.1%), and partial sweeps being uncommon (1.3%). The algorithm demonstrated robustness to demographic assumptions in classifying complete sweeps but faced challenges in distinguishing neutral regions from partial sweeps and linked regions under demographic misspecification. The findings reveal the indirect influence of sweeps on nearly two-thirds of the genome through linkage, with an over-representation of putatively deleterious variants suggesting that positive selection drags deleterious variants to higher frequency due to hitchhiking with beneficial loci. Gene ontology enrichment analysis further supported our confidence in the accuracy of sweep detection as several traits expected to be under positive selection due to evolutionary arms races (e.g. immunity) were detected in hard sweeps. This study provides valuable insights into the direct and indirect contributions of positive selection in shaping genomic variation in natural populations.
Affiliation(s)
- Yiguan Wang
- School of Biological Sciences, The University of Queensland, St Lucia, Queensland, Australia
- Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK
- Scott L Allen
- School of Biological Sciences, The University of Queensland, St Lucia, Queensland, Australia
- Adam J Reddiex
- School of Biological Sciences, The University of Queensland, St Lucia, Queensland, Australia
- Biological Data Science Institute, The Australian National University, Canberra, Australian Capital Territory, Australia
- Stephen F Chenoweth
- School of Biological Sciences, The University of Queensland, St Lucia, Queensland, Australia

12
Vaughn AH, Nielsen R. Fast and Accurate Estimation of Selection Coefficients and Allele Histories from Ancient and Modern DNA. Mol Biol Evol 2024; 41:msae156. [PMID: 39078618 PMCID: PMC11321360 DOI: 10.1093/molbev/msae156] [Received: 12/16/2023] [Revised: 07/02/2024] [Accepted: 07/10/2024] [Indexed: 07/31/2024]
Abstract
We here present CLUES2, a full-likelihood method to infer natural selection from sequence data that is an extension of the method CLUES. We make several substantial improvements to the CLUES method that greatly increase both its applicability and its speed. We add the ability to use ancestral recombination graphs on ancient data as emissions to the underlying hidden Markov model, which enables CLUES2 to use both temporal and linkage information to make estimates of selection coefficients. We also fully implement the ability to estimate distinct selection coefficients in different epochs, which allows for the analysis of changes in selective pressures through time, as well as selection with dominance. In addition, we greatly increase the computational efficiency of CLUES2 over CLUES using several approximations to the forward-backward algorithms and develop a new way to reconstruct historic allele frequencies by integrating over the uncertainty in the estimation of the selection coefficients. We illustrate the accuracy of CLUES2 through extensive simulations and validate the importance sampling framework for integrating over the uncertainty in the inference of gene trees. We also show that CLUES2 is well-calibrated by showing that under the null hypothesis, the distribution of log-likelihood ratios follows a χ² distribution with the appropriate degrees of freedom. We run CLUES2 on a set of recently published ancient human data from Western Eurasia and test for evidence of changing selection coefficients through time. We find significant evidence of changing selective pressures in several genes correlated with the introduction of agriculture to Europe and the ensuing dietary and demographic shifts of that time. In particular, our analysis supports previous hypotheses of strong selection on lactase persistence during periods of ancient famines and attenuated selection in more modern periods.
Affiliation(s)
- Andrew H Vaughn
- Center for Computational Biology, University of California, Berkeley, CA 94720, USA
- Rasmus Nielsen
- Departments of Integrative Biology and Statistics, University of California, Berkeley, CA 94720, USA
- Center for GeoGenetics, University of Copenhagen, Copenhagen DK-1350, Denmark

13
Belman S, Pesonen H, Croucher NJ, Bentley SD, Corander J. Estimating between-country migration in pneumococcal populations. G3 (Bethesda) 2024; 14:jkae058. [PMID: 38507601 PMCID: PMC11152062 DOI: 10.1093/g3journal/jkae058] [Received: 11/22/2023] [Revised: 02/29/2024] [Accepted: 03/11/2024] [Indexed: 03/22/2024]
Abstract
Streptococcus pneumoniae (the pneumococcus) is a globally distributed, human obligate opportunistic bacterial pathogen which, although often carried commensally, is also a significant cause of invasive disease. Apart from multi-drug resistant and virulent clones, the rate and direction of pneumococcal dissemination between different countries remain largely unknown. The ability of the pneumococcus to take a foothold in a country depends on existing population configuration, the extent of vaccine implementation, as well as human mobility since it is a human obligate bacterium. To shed light on its international movement, we used extensive genome data from the Global Pneumococcal Sequencing project and estimated migration parameters between multiple countries in Africa. Data on allele frequencies of polymorphisms at housekeeping-like loci for multiple different lineages circulating in the populations of South Africa, Malawi, Kenya, and The Gambia were used to calculate the fixation index (Fst) between countries. We then further used these summaries to fit migration coalescent models with the likelihood-free inference algorithms available in the ELFI software package. Synthetic data were additionally used to validate the inference approach. Our results demonstrate country-pair specific migration patterns and heterogeneity in the extent of migration between different lineages. Our approach demonstrates that coalescent models can be effectively used for inferring migration rates for bacterial species and lineages provided sufficiently granular population genomics surveillance data. Further, it can demonstrate the connectivity of respiratory disease agents between countries to inform intervention policy in the longer term.
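As a rough illustration of the fixation index mentioned in this abstract, the sketch below computes Fst for one biallelic locus from the allele frequencies in two populations, using the textbook heterozygosity definition Fst = (H_T - H_S) / H_T with equal population weights. The abstract does not say which estimator the authors used, so this is an assumed, simplified form; the function name `fst_biallelic` is hypothetical.

```python
def fst_biallelic(p1: float, p2: float) -> float:
    """Fst at one biallelic locus for two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    Assumed textbook form: Fst = (H_T - H_S) / H_T.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity of the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within the two populations.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        # The locus is monomorphic overall, so there is no differentiation.
        return 0.0
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    print(fst_biallelic(0.5, 0.5))  # identical frequencies
    print(fst_biallelic(0.2, 0.8))  # moderately differentiated
    print(fst_biallelic(0.0, 1.0))  # fixed for different alleles
```

In practice such per-locus values would be combined across many loci before being used as summary statistics; how that was done in the study is not stated in the abstract.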
Affiliation(s)
- Sophie Belman
  - Parasites and Microbes, Wellcome Sanger Institute, Hinxton, Cambridgeshire, CB10 1SA, UK
- Henri Pesonen
  - Oslo Centre for Biostatistics and Epidemiology, Oslo University Hospital, Oslo, 0372, Norway
- Nicholas J Croucher
  - MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, School of Public Health, White City Campus, Imperial College London, London W12 0BZ, UK
- Stephen D Bentley
  - Parasites and Microbes, Wellcome Sanger Institute, Hinxton, Cambridgeshire, CB10 1SA, UK
- Jukka Corander
  - Department of Biostatistics, University of Oslo, Oslo, 0372, Norway
  - Helsinki Institute for Information Technology HIIT, Department of Mathematics and Statistics, University of Helsinki, Espoo, Helsinki, 02150, Finland
|
14
|
Daron J, Bouafou L, Tennessen JA, Rahola N, Makanga B, Akone-Ella O, Ngangue MF, Longo Pendy NM, Paupy C, Neafsey DE, Fontaine MC, Ayala D. Genomic Signatures of Microgeographic Adaptation in Anopheles coluzzii Along an Anthropogenic Gradient in Gabon. bioRxiv: the preprint server for biology 2024:2024.05.16.594472. [PMID: 38798379 PMCID: PMC11118577 DOI: 10.1101/2024.05.16.594472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Species distributed across heterogeneous environments often evolve locally adapted populations, but understanding how these persist in the presence of homogenizing gene flow remains puzzling. In Gabon, Anopheles coluzzii, a major African malaria mosquito, is found along an ecological gradient, including a sylvatic population, away from any human presence. This study investigates the genomic signatures of local adaptation in populations from distinct environments including the urban area of Libreville, and two proximate sites 10 km apart in the La Lopé National Park (LLP), a village and its sylvatic neighborhood. Whole genome re-sequencing of 96 mosquitoes unveiled ∼5.7 million high-quality single nucleotide polymorphisms. Coalescent-based demographic analyses suggest an ∼8,000-year-old divergence between Libreville and La Lopé populations, followed by a secondary contact (∼4,000 ybp) resulting in asymmetric effective gene flow. The urban population displayed reduced effective size, evidence of inbreeding, and strong selection pressures for adaptation to urban settings, as suggested by the hard selective sweeps associated with genes involved in detoxification and insecticide resistance. In contrast, the two geographically proximate LLP populations showed larger effective sizes, and distinctive genomic differences in selective signals, notably soft-selective sweeps on the standing genetic variation. Although neutral loci and chromosomal inversions failed to discriminate between LLP populations, our findings support that microgeographic adaptation can swiftly emerge through selection on standing genetic variation despite high gene flow. This study contributes to the growing understanding of evolution of populations in heterogeneous environments amid ongoing gene flow and how major malaria mosquitoes adapt to humans.
Significance
Anopheles coluzzii, a major African malaria vector, thrives from humid rainforests to dry savannahs and coastal areas. This ecological success is linked to its close association with domestic settings, with humans playing significant roles in driving the recent urban evolution of this mosquito. Our research explores the assumption that these mosquitoes are strictly dependent on human habitats, by conducting whole-genome sequencing on An. coluzzii specimens from urban, rural, and sylvatic sites in Gabon. We found that urban mosquitoes show de novo genetic signatures of human-driven vector control, while rural and sylvatic mosquitoes exhibit distinctive genetic evidence of local adaptations derived from standing genetic variation. Understanding adaptation mechanisms of this mosquito is therefore crucial to predict evolution of vector control strategies.
|
15
|
Johnson OL, Tobler R, Schmidt JM, Huber CD. Population genetic simulation: Benchmarking frameworks for non-standard models of natural selection. Mol Ecol Resour 2024; 24:e13930. [PMID: 38247258 PMCID: PMC10932895 DOI: 10.1111/1755-0998.13930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 12/21/2023] [Accepted: 01/09/2024] [Indexed: 01/23/2024]
Abstract
Population genetic simulation has emerged as a common tool for investigating increasingly complex evolutionary and demographic models. Software capable of handling high-level model complexity has recently been developed, and the advancement of tree sequence recording now allows simulations to merge the efficiency and genealogical insight of coalescent simulations with the flexibility of forward simulations. However, frameworks utilizing these features have not yet been compared and benchmarked. Here, we evaluate various simulation workflows using the coalescent simulator msprime and the forward simulator SLiM, to assess resource efficiency and determine an optimal simulation framework. Three aspects were evaluated: (1) the burn-in, to establish an equilibrium level of neutral diversity in the population; (2) the forward simulation, in which temporally fluctuating selection is acting; and (3) the final computation of summary statistics. We provide typical memory and computation time requirements for each step. We find that the fastest framework, a combination of coalescent and forward simulation with tree sequence recording, increases simulation speed by over twenty times compared to classical forward simulations without tree sequence recording, although it does require six times more memory. Overall, using efficient simulation workflows can lead to a substantial improvement when modelling complex evolutionary scenarios, although the optimal framework ultimately depends on the available computational resources.
Affiliation(s)
- Raymond Tobler
  - Evolution of Cultural Diversity Initiative, The Australian National University, Australia
- Joshua M. Schmidt
  - Department of Ophthalmology, College of Medicine and Public Health, Flinders University, Australia
- Christian D. Huber
  - School of Biological Sciences, University of Adelaide, Australia
  - Department of Biology, Pennsylvania State University, University Park, PA, USA
|
16
|
Song H, Chu J, Li W, Li X, Fang L, Han J, Zhao S, Ma Y. A Novel Approach Utilizing Domain Adversarial Neural Networks for the Detection and Classification of Selective Sweeps. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2024; 11:e2304842. [PMID: 38308186 PMCID: PMC11005742 DOI: 10.1002/advs.202304842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Revised: 01/10/2024] [Indexed: 02/04/2024]
Abstract
The identification and classification of selective sweeps are of great significance for improving the understanding of biological evolution and exploring opportunities for precision medicine and genetic improvement. Here, a domain adaptation sweep detection and classification (DASDC) method is presented to balance the alignment of two domains and the classification performance through a domain-adversarial neural network and its adversarial learning modules. DASDC effectively addresses the issue of mismatch between training data and real genomic data in deep learning models, leading to a significant improvement in its generalization capability, prediction robustness, and accuracy. The DASDC method demonstrates improved identification performance compared to existing methods and excels in classification performance, particularly in scenarios where there is a mismatch between application data and training data. The successful implementation of DASDC in real data of three distinct species highlights its potential as a useful tool for identifying crucial functional genes and investigating adaptive evolutionary mechanisms, particularly with the increasing availability of genomic data.
Affiliation(s)
- Hui Song
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Jinyu Chu
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Wangjiao Li
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
- Xinyun Li
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
- Lingzhao Fang
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus 8000, Denmark
- Jianlin Han
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - CAAS‐ILRI Joint Laboratory on Livestock and Forage Genetic Resources, Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), Beijing 100193, China
  - Livestock Genetics Program, International Livestock Research Institute (ILRI), Nairobi 00100, Kenya
- Shuhong Zhao
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
  - Lingnan Modern Agricultural Science and Technology Guangdong Laboratory, Guangzhou 510642, China
- Yunlong Ma
  - Key Laboratory of Agricultural Animal Genetics, Breeding, and Reproduction of the Ministry of Education & Key Laboratory of Swine Genetics and Breeding of the Ministry of Agriculture, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
  - Lingnan Modern Agricultural Science and Technology Guangdong Laboratory, Guangzhou 510642, China
|
17
|
Thom G, Moreira LR, Batista R, Gehara M, Aleixo A, Smith BT. Genomic Architecture Predicts Tree Topology, Population Structuring, and Demographic History in Amazonian Birds. Genome Biol Evol 2024; 16:evae002. [PMID: 38236173 PMCID: PMC10823491 DOI: 10.1093/gbe/evae002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Revised: 10/26/2023] [Accepted: 12/12/2023] [Indexed: 01/19/2024] Open
Abstract
Geographic barriers are frequently invoked to explain genetic structuring across the landscape. However, inferences on the spatial and temporal origins of population variation have been largely limited to evolutionarily neutral models, ignoring the potential role of natural selection and intrinsic genomic processes known as genomic architecture in producing heterogeneity in differentiation across the genome. To test how variation in genomic characteristics (e.g. recombination rate) impacts our ability to reconstruct general patterns of differentiation between species that co-occur across geographic barriers, we sequenced the whole genomes of multiple bird populations that are distributed across rivers in southeastern Amazonia. We found that phylogenetic relationships within species and demographic parameters varied across the genome in predictable ways. Genetic diversity was positively associated with recombination rate and negatively associated with species tree support. Gene flow was less pervasive in genomic regions of low recombination, making these windows more likely to retain patterns of population structuring that matched the species tree. We further found that approximately a third of the genome showed evidence of selective sweeps and linked selection, skewing genome-wide estimates of effective population sizes and gene flow between populations toward lower values. In sum, we showed that the effects of intrinsic genomic characteristics and selection can be disentangled from neutral processes to elucidate spatial patterns of population differentiation.
Affiliation(s)
- Gregory Thom
  - Department of Ornithology, American Museum of Natural History, New York, NY, USA
  - Museum of Natural Science, Louisiana State University, Baton Rouge, LA, USA
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA
- Lucas Rocha Moreira
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
  - Department of Vertebrate Genomics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Romina Batista
  - Programa de Coleções Biológicas, Instituto Nacional de Pesquisas da Amazônia, Manaus, Brazil
  - School of Science, Engineering and Environment, University of Salford, Manchester, UK
- Marcelo Gehara
  - Department of Earth and Environmental Sciences, Rutgers University, Newark, NJ, USA
- Alexandre Aleixo
  - Finnish Museum of Natural History, University of Helsinki, Helsinki, Finland
  - Department of Environmental Genomics, Instituto Tecnológico Vale, Belém, Brazil
- Brian Tilston Smith
  - Department of Ornithology, American Museum of Natural History, New York, NY, USA
|
18
|
Szpiech ZA. selscan 2.0: scanning for sweeps in unphased data. Bioinformatics 2024; 40:btae006. [PMID: 38180866 PMCID: PMC10789311 DOI: 10.1093/bioinformatics/btae006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Revised: 12/26/2023] [Accepted: 01/03/2024] [Indexed: 01/07/2024] Open
Abstract
Summary: Several popular haplotype-based statistics for identifying recent or ongoing positive selection in genomes require knowledge of haplotype phase. Here, we provide an update to selscan which implements a re-definition of these statistics for use in unphased data.
Availability and implementation: Source code and binaries are freely available at https://github.com/szpiech/selscan, implemented in C/C++, and supported on Linux, Windows, and MacOS.
Affiliation(s)
- Zachary A Szpiech
  - Department of Biology, Penn State University, University Park, PA 16802, United States
  - Institute for Computational and Data Sciences, Penn State University, University Park, PA 16802, United States
|
19
|
Lewanski AL, Grundler MC, Bradburd GS. The era of the ARG: An introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics. PLoS Genet 2024; 20:e1011110. [PMID: 38236805 PMCID: PMC10796009 DOI: 10.1371/journal.pgen.1011110] [Citation(s) in RCA: 26] [Impact Index Per Article: 26.0] [Indexed: 01/22/2024] Open
Abstract
In the presence of recombination, the evolutionary relationships between a set of sampled genomes cannot be described by a single genealogical tree. Instead, the genomes are related by a complex, interwoven collection of genealogies formalized in a structure called an ancestral recombination graph (ARG). An ARG extensively encodes the ancestry of the genome(s) and thus is replete with valuable information for addressing diverse questions in evolutionary biology. Despite its potential utility, technological and methodological limitations, along with a lack of approachable literature, have severely restricted awareness and application of ARGs in evolution research. Excitingly, recent progress in ARG reconstruction and simulation has made ARG-based approaches feasible for many questions and systems. In this review, we provide an accessible introduction and exploration of ARGs, survey recent methodological breakthroughs, and describe the potential for ARGs to further existing goals and open avenues of inquiry that were previously inaccessible in evolutionary genomics. Through this discussion, we aim to more widely disseminate the promise of ARGs in evolutionary genomics and encourage the broader development and adoption of ARG-based inference.
Affiliation(s)
- Alexander L. Lewanski
  - Department of Integrative Biology, Michigan State University, East Lansing, Michigan, United States of America
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, Michigan, United States of America
  - Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, Michigan, United States of America
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
- Michael C. Grundler
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
- Gideon S. Bradburd
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, Michigan, United States of America
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
|
20
|
Mo Z, Siepel A. Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data. PLoS Genet 2023; 19:e1011032. [PMID: 37934781 PMCID: PMC10655966 DOI: 10.1371/journal.pgen.1011032] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 03/17/2023] [Revised: 11/17/2023] [Accepted: 10/23/2023] [Indexed: 11/09/2023] Open
Abstract
Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods: SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.
Affiliation(s)
- Ziyi Mo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
  - School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Adam Siepel
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
  - School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
|
21
|
Lewanski AL, Grundler MC, Bradburd GS. The era of the ARG: an empiricist's guide to ancestral recombination graphs. arXiv 2023:arXiv:2310.12070v1. [PMID: 37904740 PMCID: PMC10614969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
In the presence of recombination, the evolutionary relationships between a set of sampled genomes cannot be described by a single genealogical tree. Instead, the genomes are related by a complex, interwoven collection of genealogies formalized in a structure called an ancestral recombination graph (ARG). An ARG extensively encodes the ancestry of the genome(s) and thus is replete with valuable information for addressing diverse questions in evolutionary biology. Despite its potential utility, technological and methodological limitations, along with a lack of approachable literature, have severely restricted awareness and application of ARGs in empirical evolution research. Excitingly, recent progress in ARG reconstruction and simulation has made ARG-based approaches feasible for many questions and systems. In this review, we provide an accessible introduction and exploration of ARGs, survey recent methodological breakthroughs, and describe the potential for ARGs to further existing goals and open avenues of inquiry that were previously inaccessible in evolutionary genomics. Through this discussion, we aim to more widely disseminate the promise of ARGs in evolutionary genomics and encourage the broader development and adoption of ARG-based inference.
Affiliation(s)
- Alexander L Lewanski
  - Department of Integrative Biology, Michigan State University, East Lansing, MI, US
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, US
  - Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI, US
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, US
- Michael C Grundler
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, US
- Gideon S Bradburd
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, US
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, US
|
22
|
Amin MR, Hasan M, Arnab SP, DeGiorgio M. Tensor Decomposition-based Feature Extraction and Classification to Detect Natural Selection from Genomic Data. Mol Biol Evol 2023; 40:msad216. [PMID: 37772983 PMCID: PMC10581699 DOI: 10.1093/molbev/msad216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2023] [Revised: 08/10/2023] [Accepted: 09/14/2023] [Indexed: 09/30/2023] Open
Abstract
Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved development of summary statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of settings they can explore. Due to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet, limitations of such techniques include estimation of large numbers of model parameters under nonconvex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data while preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx, which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, robustness to common technical hurdles, and easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
Affiliation(s)
- Md Ruhul Amin
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Mahmudul Hasan
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Sandipan Paul Arnab
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Michael DeGiorgio
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
|
23
|
Mo Z, Siepel A. Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data. bioRxiv: the preprint server for biology 2023:2023.03.01.529396. [PMID: 36909514 PMCID: PMC10002701 DOI: 10.1101/2023.03.01.529396] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023]
Abstract
Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods: SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.
Affiliation(s)
- Ziyi Mo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
  - School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
- Adam Siepel
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
  - School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
|
24
|
Arnab SP, Amin MR, DeGiorgio M. Uncovering Footprints of Natural Selection Through Spectral Analysis of Genomic Summary Statistics. Mol Biol Evol 2023; 40:msad157. [PMID: 37433019 PMCID: PMC10365025 DOI: 10.1093/molbev/msad157] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/07/2022] [Revised: 06/28/2023] [Accepted: 07/06/2023] [Indexed: 07/13/2023] Open
Abstract
Natural selection leaves a spatial pattern along the genome, with a haplotype distribution distortion near the selected locus that fades with distance. Evaluating the spatial signal of a population-genetic summary statistic across the genome allows for patterns of natural selection to be distinguished from neutrality. Considering the genomic spatial distribution of multiple summary statistics is expected to aid in uncovering subtle signatures of selection. In recent years, numerous methods have been devised that consider genomic spatial distributions across summary statistics, utilizing both classical machine learning and deep learning architectures. However, better predictions may be attainable by improving the way in which features are extracted from these summary statistics. We apply wavelet transform, multitaper spectral analysis, and S-transform to summary statistic arrays to achieve this goal. Each analysis method converts one-dimensional summary statistic arrays to two-dimensional images of spectral analysis, allowing simultaneous temporal and spectral assessment. We feed these images into convolutional neural networks and consider combining models using ensemble stacking. Our modeling framework achieves high accuracy and power across a diverse set of evolutionary settings, including population size changes and test sets of varying sweep strength, softness, and timing. A scan of central European whole-genome sequences recapitulated well-established sweep candidates and predicted novel cancer-associated genes as sweeps with high support. Given that this modeling framework is also robust to missing genomic segments, we believe that it will represent a welcome addition to the population-genomic toolkit for learning about adaptive processes from genomic data.
Affiliation(s)
- Sandipan Paul Arnab
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Md Ruhul Amin
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
- Michael DeGiorgio
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
|
25
|
Booker WW, Ray DD, Schrider DR. This population does not exist: learning the distribution of evolutionary histories with generative adversarial networks. Genetics 2023; 224:iyad063. [PMID: 37067864 PMCID: PMC10213497 DOI: 10.1093/genetics/iyad063] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/23/2023] [Revised: 02/23/2023] [Accepted: 04/05/2023] [Indexed: 04/18/2023] Open
Abstract
Numerous studies over the last decade have demonstrated the utility of machine learning methods when applied to population genetic tasks. More recent studies show the potential of deep-learning methods in particular, which allow researchers to approach problems without making prior assumptions about how the data should be summarized or manipulated, instead learning their own internal representation of the data in an attempt to maximize inferential accuracy. One type of deep neural network, the Generative Adversarial Network (GAN), can even be used to generate new data, and this approach has been used to create individual artificial human genomes free from privacy concerns. In this study, we further explore the application of GANs in population genetics by designing and training a network to learn the statistical distribution of population genetic alignments (i.e. data sets consisting of sequences from an entire population sample) under several diverse evolutionary histories, making this the first GAN capable of performing this task. After testing multiple neural network architectures, we report the results of a fully differentiable Deep-Convolutional Wasserstein GAN with gradient penalty that is capable of generating artificial examples of population genetic alignments that successfully mimic key aspects of the training data, including the site-frequency spectrum, differentiation between populations, and patterns of linkage disequilibrium. We demonstrate consistent training success across various evolutionary models, including models of panmictic and subdivided populations, populations at equilibrium and experiencing changes in size, and populations experiencing either no selection or positive selection of various strengths, all without the need for extensive hyperparameter tuning. Overall, our findings highlight the ability of GANs to learn and mimic population genetic data and suggest future areas where this work can be applied in population genetics research that we discuss herein.
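One of the summaries named in this abstract, the site-frequency spectrum, can be illustrated with a minimal sketch. It assumes an alignment stored as a list of equal-length haplotypes coded 0 (ancestral) and 1 (derived); the function name `site_frequency_spectrum` is hypothetical and is not taken from the authors' code.

```python
def site_frequency_spectrum(alignment):
    """Return the unfolded site-frequency spectrum of a 0/1 alignment.

    alignment: list of haplotypes, each a sequence of 0/1 alleles per site
    (an assumed encoding for illustration only).
    Entry k of the result counts the sites at which exactly k haplotypes
    carry the derived allele, for k = 0..n.
    """
    n = len(alignment)
    if n == 0:
        raise ValueError("alignment must contain at least one haplotype")
    num_sites = len(alignment[0])
    if any(len(hap) != num_sites for hap in alignment):
        raise ValueError("all haplotypes must have the same length")

    sfs = [0] * (n + 1)
    for site in range(num_sites):
        # Count derived alleles in this alignment column.
        derived = sum(hap[site] for hap in alignment)
        sfs[derived] += 1
    return sfs


example = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
spectrum = site_frequency_spectrum(example)
```

Comparing such spectra between simulated and generated alignments is one plausible way to check that generated data resemble the training data; the abstract does not say how the authors computed theirs.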
Affiliation(s)
- William W Booker
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27514-2916, USA
- Dylan D Ray
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27514-2916, USA
- Daniel R Schrider
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27514-2916, USA
26
Moran RL, Richards EJ, Ornelas-García CP, Gross JB, Donny A, Wiese J, Keene AC, Kowalko JE, Rohner N, McGaugh SE. Selection-driven trait loss in independently evolved cavefish populations. Nat Commun 2023; 14:2557. [PMID: 37137902 PMCID: PMC10156726 DOI: 10.1038/s41467-023-37909-8] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 11/24/2022] [Accepted: 04/03/2023] [Indexed: 05/05/2023]
Abstract
Laboratory studies have demonstrated that a single phenotype can be produced by many different genotypes; however, in natural systems, it is frequently found that phenotypic convergence is due to parallel genetic changes. This suggests a substantial role for constraint and determinism in evolution and indicates that certain mutations are more likely to contribute to phenotypic evolution. Here we use whole genome resequencing in the Mexican tetra, Astyanax mexicanus, to investigate how selection has shaped the repeated evolution of both trait loss and enhancement across independent cavefish lineages. We show that selection on standing genetic variation and de novo mutations both contribute substantially to repeated adaptation. Our findings provide empirical support for the hypothesis that genes with larger mutational targets are more likely to be the substrate of repeated evolution and indicate that features of the cave environment may impact the rate at which mutations occur.
Affiliation(s)
- Rachel L Moran
  - Department of Ecology, Evolution, and Behavior, University of Minnesota, Saint Paul, MN, USA
  - Department of Biology, Texas A&M University, College Station, TX, USA
- Emilie J Richards
  - Department of Ecology, Evolution, and Behavior, University of Minnesota, Saint Paul, MN, USA
- Claudia Patricia Ornelas-García
  - Colección Nacional de Peces, Departamento de Zoología, Instituto de Biología, Universidad Nacional Autónoma de México, Tercer Circuito Exterior S/N. CP 04510, D. F. México, México City, México
- Joshua B Gross
  - Department of Biological Sciences, University of Cincinnati, Cincinnati, OH, USA
- Alexandra Donny
  - Department of Ecology, Evolution, and Behavior, University of Minnesota, Saint Paul, MN, USA
- Jonathan Wiese
  - Department of Ecology, Evolution, and Behavior, University of Minnesota, Saint Paul, MN, USA
- Alex C Keene
  - Department of Biology, Texas A&M University, College Station, TX, USA
- Johanna E Kowalko
  - Department of Biological Sciences, Lehigh University, Bethlehem, PA, USA
- Nicolas Rohner
  - Stowers Institute for Medical Research, Kansas City, MO, USA
  - Department of Molecular & Integrative Physiology, KU Medical Center, Kansas City, KS, USA
- Suzanne E McGaugh
  - Department of Ecology, Evolution, and Behavior, University of Minnesota, Saint Paul, MN, USA
27
Amin MR, Hasan M, Arnab SP, DeGiorgio M. Tensor decomposition based feature extraction and classification to detect natural selection from genomic data. bioRxiv: The Preprint Server for Biology 2023:2023.03.27.527731. [PMID: 37034767 PMCID: PMC10081272 DOI: 10.1101/2023.03.27.527731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved development of summary statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of settings they can explore. Due to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet, limitations of such techniques include estimation of large numbers of model parameters under non-convex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data while preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx, which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, robustness to common technical hurdles, and easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
28
Korfmann K, Gaggiotti OE, Fumagalli M. Deep Learning in Population Genetics. Genome Biol Evol 2023; 15:evad008. [PMID: 36683406 PMCID: PMC9897193 DOI: 10.1093/gbe/evad008] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 10/13/2022] [Revised: 12/19/2022] [Accepted: 01/16/2023] [Indexed: 01/24/2023]
Abstract
Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally unfeasible, machine learning, and in particular deep learning, algorithms are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations on the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data, will provide further opportunities for technological advancements in the field.
Affiliation(s)
- Kevin Korfmann
  - Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, Germany
- Oscar E Gaggiotti
  - Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife KY16 9TF, UK
- Matteo Fumagalli
  - Department of Biological and Behavioural Sciences, Queen Mary University of London, UK
29
Gower G, Ragsdale AP, Bisschop G, Gutenkunst RN, Hartfield M, Noskova E, Schiffels S, Struck TJ, Kelleher J, Thornton KR. Demes: a standard format for demographic models. Genetics 2022; 222:iyac131. [PMID: 36173327 PMCID: PMC9630982 DOI: 10.1093/genetics/iyac131] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 06/13/2022] [Accepted: 08/23/2022] [Indexed: 11/12/2022]
Abstract
Understanding the demographic history of populations is a key goal in population genetics, and with improving methods and data, ever more complex models are being proposed and tested. Demographic models of current interest typically consist of a set of discrete populations, their sizes and growth rates, and continuous and pulse migrations between those populations over a number of epochs, which can require dozens of parameters to fully describe. There is currently no standard format to define such models, significantly hampering progress in the field. In particular, the important task of translating the model descriptions in published work into input suitable for population genetic simulators is labor intensive and error prone. We propose the Demes data model and file format, built on widely used technologies, to alleviate these issues. Demes provides a well-defined and unambiguous model of populations and their properties that is straightforward to implement in software, and a text file format that is designed for simplicity and clarity. We provide thoroughly tested implementations of Demes parsers in multiple languages including Python and C, and showcase initial support in several simulators and inference methods. An introduction to the file format and a detailed specification are available at https://popsim-consortium.github.io/demes-spec-docs/.
Affiliation(s)
- Graham Gower
  - Section for Molecular Ecology and Evolution, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark
- Aaron P Ragsdale
  - Department of Integrative Biology, University of Wisconsin–Madison, Madison, WI 53706, USA
- Gertjan Bisschop
  - Institute of Ecology and Evolution, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Ryan N Gutenkunst
  - Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
- Matthew Hartfield
  - Institute of Ecology and Evolution, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Ekaterina Noskova
  - Computer Technologies Laboratory, ITMO University, 197101 Saint-Petersburg, Russia
- Stephan Schiffels
  - Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany
- Travis J Struck
  - Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Kevin R Thornton
  - Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA
30
Shchur V, Spirin V, Sirotkin D, Burovski E, De Maio N, Corbett-Detig R. VGsim: Scalable viral genealogy simulator for global pandemic. PLoS Comput Biol 2022; 18:e1010409. [PMID: 36001646 PMCID: PMC9447924 DOI: 10.1371/journal.pcbi.1010409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2021] [Revised: 09/06/2022] [Accepted: 07/18/2022] [Indexed: 11/24/2022]
Abstract
Accurate simulation of complex biological processes is an essential component of developing and validating new technologies and inference approaches. As an effort to help contain the COVID-19 pandemic, large numbers of SARS-CoV-2 genomes have been sequenced from most regions in the world. More than 5.5 million viral sequences are publicly available as of November 2021. Many studies estimate viral genealogies from these sequences, as these can provide valuable information about the spread of the pandemic across time and space. Additionally, such data are a rich source of information about molecular evolutionary processes including natural selection, for example allowing the identification of new variants with transmissibility and immunity evasion advantages. To our knowledge, there is no framework that is both efficient and flexible enough to simulate the pandemic to approximate world-scale scenarios and generate viral genealogies of millions of samples. Here, we introduce a new fast simulator, VGsim, which addresses the problem of simulating genealogies under epidemiological models. The simulation process is split into two phases. During the forward run, the algorithm generates a chain of population-level events reflecting the dynamics of the pandemic using a hierarchical version of the Gillespie algorithm. During the backward run, a coalescent-like approach generates a tree genealogy of samples conditioning on the population-level events chain generated during the forward run. Our software can model complex population structure, epistasis and immunity escape. We develop a fast and flexible simulation software package, VGsim, for modeling epidemiological processes and generating genealogies of large pathogen samples. The software takes into account host population structure, pathogen evolution, host immunity and some other epidemiological aspects. The computational efficiency of the package allows the simulation of genealogies of tens of millions of samples, which is important, e.g., for SARS-CoV-2 genome studies.
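The forward run described in this abstract builds on the Gillespie algorithm. The following is a minimal sketch of the plain (non-hierarchical) Gillespie algorithm applied to a simple SIR epidemic, not VGsim's implementation; the function name `gillespie_sir`, the SIR model, and all parameter names are assumptions made for illustration.

```python
import random


def gillespie_sir(s, i, r, beta, gamma, t_max, seed=0):
    """Simulate an SIR epidemic with the plain Gillespie algorithm.

    s, i, r: initial susceptible, infectious, and recovered counts.
    beta, gamma: per-capita transmission and recovery rates (assumed model).
    Returns a list of (time, s, i, r) tuples, one per population-level event,
    starting with the initial state.
    """
    if beta < 0 or gamma <= 0:
        raise ValueError("beta must be >= 0 and gamma must be > 0")
    rng = random.Random(seed)
    n = s + i + r
    t = 0.0
    events = [(t, s, i, r)]
    while i > 0:
        rate_infection = beta * s * i / n
        rate_recovery = gamma * i
        total_rate = rate_infection + rate_recovery
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total_rate)
        if t >= t_max:
            break
        # Choose the event type with probability proportional to its rate.
        if rng.random() * total_rate < rate_infection:
            s, i = s - 1, i + 1
        else:
            i, r = i - 1, r + 1
        events.append((t, s, i, r))
    return events


chain = gillespie_sir(s=99, i=1, r=0, beta=0.3, gamma=0.1, t_max=1e9, seed=1)
```

The returned list plays the role of a chain of population-level events; in a two-phase design like the one described, a backward pass would then condition on such a chain.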
Affiliation(s)
- Vladimir Shchur
  - International laboratory of statistical and computational genomics, HSE University, Moscow, Russia
- Vadim Spirin
  - International laboratory of statistical and computational genomics, HSE University, Moscow, Russia
- Dmitry Sirotkin
  - International laboratory of statistical and computational genomics, HSE University, Moscow, Russia
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, United Kingdom
- Russell Corbett-Detig
  - Department of Biomolecular Engineering and Genomics Institute, UC Santa Cruz, California, United States of America
31
Lin X, Zhang N, Song H, Lin K, Pang E. Population-specific, recent positive selection signatures in cultivated Cucumis sativus L. (cucumber). G3 Genes|Genomes|Genetics 2022; 12:6585339. [PMID: 35554526 PMCID: PMC9258548 DOI: 10.1093/g3journal/jkac119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2022] [Accepted: 05/03/2022] [Indexed: 11/13/2022]
Abstract
Population-specific, positive selection promotes the diversity of populations and drives local adaptations in the population. However, little is known about population-specific, recent positive selection in the populations of cultivated cucumber (Cucumis sativus L.). Based on a genomic variation map of individuals worldwide, we implemented a Fisher’s combination method by combining 4 haplotype-based approaches: integrated haplotype score (iHS), number of segregating sites by length (nSL), cross-population extended haplotype homozygosity (XP-EHH), and Rsb. Overall, we detected 331, 2,147, and 3,772 population-specific, recent positive selective sites in the East Asian, Eurasian, and Xishuangbanna populations, respectively. Moreover, we found that these sites were related to processes for reproduction, response to abiotic and biotic stress, and regulation of developmental processes, indicating adaptations to their microenvironments. Meanwhile, the selective genes associated with traits of fruits were also observed, such as the gene related to the shorter fruit length in the Eurasian population and the gene controlling flesh thickness in the Xishuangbanna population. In addition, we noticed that soft sweeps were common in the East Asian and Xishuangbanna populations. Genes involved in hard or soft sweeps were related to developmental regulation and abiotic and biotic stress resistance. Our study offers a comprehensive candidate dataset of population-specific, selective signatures in cultivated cucumber populations. Our methods provide guidance for the analysis of population-specific, positive selection. These findings will help explore the biological mechanisms of adaptation and domestication of cucumber.
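Fisher's combination method, mentioned in this abstract, merges several p-values into one. The sketch below assumes each haplotype statistic has already been converted to a p-value at a given site and that the tests are independent, which is the classical assumption of the method and may not hold for correlated statistics such as iHS and nSL. The function name `fisher_combined_pvalue` is hypothetical, and the abstract does not describe how the authors derived their p-values.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(pvalues).
    For even degrees of freedom the survival function has the closed form
    exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!.
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2

    # Accumulate the finite series term by term to avoid large factorials.
    term = 1.0
    series = 1.0
    for j in range(1, k):
        term *= half_stat / j
        series += term
    return math.exp(-half_stat) * series


# Hypothetical per-site p-values from four haplotype-based tests.
combined = fisher_combined_pvalue([0.01, 0.02, 0.03, 0.04])
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed form.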
Affiliation(s)
- Xinrui Lin
  - MOE Key Laboratory for Biodiversity Science and Ecological Engineering and Beijing Key Laboratory of Gene Resource and Molecular Development, College of Life Sciences, Beijing Normal University, Beijing 100875, China
- Ning Zhang
  - MOE Key Laboratory for Biodiversity Science and Ecological Engineering and Beijing Key Laboratory of Gene Resource and Molecular Development, College of Life Sciences, Beijing Normal University, Beijing 100875, China
- Hongtao Song
  - MOE Key Laboratory for Biodiversity Science and Ecological Engineering and Beijing Key Laboratory of Gene Resource and Molecular Development, College of Life Sciences, Beijing Normal University, Beijing 100875, China
- Kui Lin
  - MOE Key Laboratory for Biodiversity Science and Ecological Engineering and Beijing Key Laboratory of Gene Resource and Molecular Development, College of Life Sciences, Beijing Normal University, Beijing 100875, China
- Erli Pang
  - MOE Key Laboratory for Biodiversity Science and Ecological Engineering and Beijing Key Laboratory of Gene Resource and Molecular Development, College of Life Sciences, Beijing Normal University, Beijing 100875, China
32
Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, Zhu S, Eldon B, Ellerman EC, Galloway JG, Gladstein AL, Gorjanc G, Guo B, Jeffery B, Kretzschumar WW, Lohse K, Matschiner M, Nelson D, Pope NS, Quinto-Cortés CD, Rodrigues MF, Saunack K, Sellinger T, Thornton K, van Kemenade H, Wohns AW, Wong Y, Gravel S, Kern AD, Koskela J, Ralph PL, Kelleher J. Efficient ancestry and mutation simulation with msprime 1.0. Genetics 2022; 220:iyab229. [PMID: 34897427 PMCID: PMC9176297 DOI: 10.1093/genetics/iyab229] [Citation(s) in RCA: 183] [Impact Index Per Article: 61.0] [Received: 09/22/2021] [Accepted: 12/03/2021] [Indexed: 11/13/2022]
Abstract
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
Affiliation(s)
- Franz Baumdicker
  - Cluster of Excellence “Controlling Microbes to Fight Infections”, Mathematical and Computational Population Genetics, University of Tübingen, 72076 Tübingen, Germany
- Gertjan Bisschop
  - Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Daniel Goldstein
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Graham Gower
  - Lundbeck GeoGenetics Centre, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark
- Aaron P Ragsdale
  - Department of Integrative Biology, University of Wisconsin–Madison, Madison, WI 53706, USA
- Georgia Tsambos
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia
- Sha Zhu
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Bjarki Eldon
  - Leibniz Institute for Evolution and Biodiversity Science, Museum für Naturkunde, Berlin 10115, Germany
- Jared G Galloway
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA
- Ariella L Gladstein
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7264, USA
  - Embark Veterinary, Inc., Boston, MA 02111, USA
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Bing Guo
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Warren W Kretzschumar
  - Center for Hematology and Regenerative Medicine, Karolinska Institute, 141 83 Huddinge, Sweden
- Konrad Lohse
  - Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Dominic Nelson
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Nathaniel S Pope
  - Department of Entomology, Pennsylvania State University, State College, PA 16802, USA
- Consuelo D Quinto-Cortés
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Unit of Advanced Genomics, CINVESTAV, Irapuato, Mexico
- Murillo F Rodrigues
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Thibaut Sellinger
  - Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, 85354 Freising, Germany
- Kevin Thornton
  - Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA
- Anthony W Wohns
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Simon Gravel
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Andrew D Kern
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Jere Koskela
  - Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
- Peter L Ralph
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
  - Department of Mathematics, University of Oregon, Eugene, OR 97403-5289, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
33
Moran RL, Jaggard JB, Roback EY, Kenzior A, Rohner N, Kowalko JE, Ornelas-García CP, McGaugh SE, Keene AC. Hybridization underlies localized trait evolution in cavefish. iScience 2022; 25:103778. [PMID: 35146393 PMCID: PMC8819016 DOI: 10.1016/j.isci.2022.103778] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 08/05/2021] [Revised: 09/13/2021] [Accepted: 01/12/2022] [Indexed: 11/04/2022]
Abstract
Introgressive hybridization may play an integral role in local adaptation and speciation (Taylor and Larson, 2019). In the Mexican tetra Astyanax mexicanus, cave populations have repeatedly evolved traits including eye loss, sleep loss, and albinism. Of the 30 caves inhabited by A. mexicanus, Chica cave is unique because it contains multiple pools inhabited by putative hybrids between surface and cave populations (Mitchell et al., 1977), providing an opportunity to investigate the impact of hybridization on complex trait evolution. We show that hybridization between cave and surface populations may contribute to localized variation in traits associated with cave evolution, including pigmentation, eye development, and sleep. We also uncover an example of convergent evolution in a circadian clock gene in multiple cavefish lineages and burrowing mammals, suggesting a shared genetic mechanism underlying circadian disruption in subterranean vertebrates. Our results provide insight into the role of hybridization in facilitating phenotypic evolution. Hybridization leads to a localized difference in sleep duration within a single cave. Genomic analysis identifies coding differences in Cry1A across cave pools. Changes in Cry1A appear to be conserved in cavefish and burrowing mammals.
34
Mueller JC, Botero-Delgadillo E, Espíndola-Hernández P, Gilsenan C, Ewels P, Gruselius J, Kempenaers B. Local selection signals in the genome of Blue tits emphasize regulatory and neuronal evolution. Mol Ecol 2022; 31:1504-1514. [PMID: 34995389 DOI: 10.1111/mec.16345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2021] [Revised: 11/18/2021] [Accepted: 12/15/2021] [Indexed: 11/30/2022]
Abstract
Understanding the genomic landscape of adaptation is central to the understanding of microevolution in wild populations. Genomic targets of selection and the underlying genomic mechanisms of adaptation can be elucidated by genome-wide scans for past selective sweeps or by scans for direct fitness associations. We sequenced and assembled 150 haplotypes of 75 Blue tits (Cyanistes caeruleus) of a single central-European population by a linked-read technology. We used these genome data in combination with coalescent simulations (1) to estimate an historical effective population size of ~250,000, which recently declined to ~10,000, and (2) to identify genome-wide distributed selective sweeps of beneficial variants most likely originating from standing genetic variation (soft sweeps). The genes linked to these soft sweeps, but also the ones linked to hard sweeps based on new beneficial mutants, showed a significant enrichment for functions associated with gene expression and transcription regulation. This emphasizes the importance of regulatory evolution in the population's adaptive history. Soft sweeps were further enriched for genes related to axon and synapse development, indicating the significance of neuronal connectivity changes in the brain potentially linked to behavioural adaptations. A previous scan of heterozygosity-fitness correlations revealed a consistent negative effect on arrival date at the breeding site for a single microsatellite in the MDGA2 gene. Here, we used the haplotype structure around this microsatellite to explain the effect as a local and direct outbreeding effect of a gene involved in synapse development.
Affiliation(s)
- Jakob C Mueller
  - Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
- Esteban Botero-Delgadillo
  - Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
- Pamela Espíndola-Hernández
  - Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
- Carol Gilsenan
  - Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
- Phil Ewels
  - Science for Life Laboratory (SciLifeLab), Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Joel Gruselius
  - Science for Life Laboratory, Department of Biosciences and Nutrition, Karolinska Institutet, Stockholm, Sweden; current address: Vanadis Diagnostics, PerkinElmer, Sollentuna, Sweden
- Bart Kempenaers
  - Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, Seewiesen, Germany
35
Hejase HA, Mo Z, Campagna L, Siepel A. A Deep-Learning Approach for Inference of Selective Sweeps from the Ancestral Recombination Graph. Mol Biol Evol 2022; 39:msab332. [PMID: 34888675 PMCID: PMC8789311 DOI: 10.1093/molbev/msab332] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Indexed: 01/11/2023]
Abstract
Detecting signals of selection from genomic data is a central problem in population genetics. Coupling the rich information in the ancestral recombination graph (ARG) with a powerful and scalable deep-learning framework, we developed a novel method to detect and quantify positive selection: Selection Inference using the Ancestral recombination graph (SIA). Built on a Long Short-Term Memory (LSTM) architecture, a particular type of Recurrent Neural Network (RNN), SIA can be trained to explicitly infer a full range of selection coefficients, as well as the allele frequency trajectory and time of selection onset. We benchmarked SIA extensively on simulations under a European human demographic model, and found that it performs as well as or better than some of the best available methods, including state-of-the-art machine-learning and ARG-based methods. In addition, we used SIA to estimate selection coefficients at several loci associated with human phenotypes of interest. SIA detected novel signals of selection particular to the European (CEU) population at the MC1R and ABCC11 loci. In addition, it recapitulated signals of selection at the LCT locus and several pigmentation-related genes. Finally, we reanalyzed polymorphism data of a collection of recently radiated southern capuchino seedeater taxa in the genus Sporophila to quantify the strength of selection and improved the power of our previous methods to detect partial soft sweeps. Overall, SIA uses deep learning to leverage the ARG and thereby provides new insight into how selective sweeps shape genomic diversity.
Affiliation(s)
- Hussein A Hejase
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Ziyi Mo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
  - School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Leonardo Campagna
  - Fuller Evolutionary Biology Program, Cornell Lab of Ornithology, Ithaca, NY, USA
  - Department of Ecology and Evolutionary Biology, Cornell University, Ithaca, NY, USA
- Adam Siepel
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
36
Shchur V, Spirin V, Sirotkin D, Burovski E, De Maio N, Corbett-Detig R. VGsim: scalable viral genealogy simulator for global pandemic. medRxiv: The Preprint Server for Health Sciences 2021:2021.04.21.21255891. [PMID: 33948608 PMCID: PMC8095227 DOI: 10.1101/2021.04.21.21255891] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/24/2022]
Abstract
Accurate simulation of complex biological processes is an essential component of developing and validating new technologies and inference approaches. As an effort to help contain the COVID-19 pandemic, large numbers of SARS-CoV-2 genomes have been sequenced from most regions in the world. More than 5.5 million viral sequences are publicly available as of November 2021. Many studies estimate viral genealogies from these sequences, as these can provide valuable information about the spread of the pandemic across time and space. Additionally, such data are a rich source of information about molecular evolutionary processes including natural selection, for example allowing the identification of new variants with transmissibility and immunity evasion advantages. To our knowledge, there is no framework that is both efficient and flexible enough to simulate the pandemic to approximate world-scale scenarios and generate viral genealogies of millions of samples. Here, we introduce a new fast simulator VGsim which addresses the problem of simulating genealogies under epidemiological models. The simulation process is split into two phases. During the forward run the algorithm generates a chain of population-level events reflecting the dynamics of the pandemic using a hierarchical version of the Gillespie algorithm. During the backward run a coalescent-like approach generates a tree genealogy of samples conditioning on the population-level events chain generated during the forward run. Our software can model complex population structure, epistasis and immunity escape. The code is freely available at https://github.com/Genomics-HSE/VGsim.
Affiliation(s)
| | | | | | | | - Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Russell Corbett-Detig
- HSE University, Russian Federation
- Department of Biomolecular Engineering and Genomics Institute, UC Santa Cruz, California 95064
| |
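The forward run described in the VGsim abstract above builds on the Gillespie algorithm. As a rough illustration only, the sketch below implements the plain direct-method Gillespie algorithm for a single-population SIR model. It is not VGsim's hierarchical version and does not use VGsim's API; the function name `gillespie_sir` and the parameters `beta`, `gamma`, `s0`, `i0`, and `t_max` are assumptions made for this example.

```python
import random


def gillespie_sir(beta, gamma, s0, i0, r0=0, t_max=100.0, seed=0):
    """Direct-method Gillespie simulation of a stochastic SIR epidemic.

    Hypothetical helper for illustration. Returns a list of
    (time, event, S, I, R) tuples, one per simulated event.
    """
    rng = random.Random(seed)
    s, i, r = s0, i0, r0
    n = s + i + r
    t = 0.0
    events = []
    while i > 0:
        # Rates of the two possible population-level events.
        rate_infection = beta * s * i / n
        rate_recovery = gamma * i
        total = rate_infection + rate_recovery
        if total <= 0:
            break  # nothing can happen any more
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total)
        if t > t_max:
            break
        # Pick the event type with probability proportional to its rate.
        if rng.random() * total < rate_infection:
            s, i = s - 1, i + 1
            event = "infection"
        else:
            i, r = i - 1, r + 1
            event = "recovery"
        events.append((t, event, s, i, r))
    return events


if __name__ == "__main__":
    chain = gillespie_sir(beta=0.3, gamma=0.1, s0=99, i0=1, seed=1)
    print(len(chain), "events simulated")
```

The returned list plays the role of a chain of population-level events. A backward, coalescent-like pass of the kind the abstract mentions would then condition on such a chain; that step is not sketched here.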
|
37
|
O'Gorman M, Thakur S, Imrie G, Moran RL, Choy S, Sifuentes-Romero I, Bilandžija H, Renner KJ, Duboué E, Rohner N, McGaugh SE, Keene AC, Kowalko JE. Pleiotropic function of the oca2 gene underlies the evolution of sleep loss and albinism in cavefish. Curr Biol 2021; 31:3694-3701.e4. [PMID: 34293332 DOI: 10.1016/j.cub.2021.06.077] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 09/27/2020] [Revised: 03/22/2021] [Accepted: 06/25/2021] [Indexed: 12/29/2022]
Abstract
Adaptation to novel environments often involves the evolution of multiple morphological, physiological, and behavioral traits. One striking example of multi-trait evolution is the suite of traits that has evolved repeatedly in cave animals, including regression of eyes, loss of pigmentation, and enhancement of non-visual sensory systems.1,2 The Mexican tetra, Astyanax mexicanus, consists of fish that inhabit at least 30 caves in Mexico and ancestral-like surface fish that inhabit the rivers of Mexico and southern Texas.3 Cave A. mexicanus are interfertile with surface fish and have evolved a number of traits, including reduced pigmentation, eye loss, and alterations to behavior.4-6 To define relationships between different cave-evolved traits, we phenotyped 208 surface-cave F2 hybrid fish for numerous morphological and behavioral traits. We found differences in sleep between pigmented and albino hybrid fish, raising the possibility that these traits share a genetic basis. In cavefish and other species, mutations in oculocutaneous albinism 2 (oca2) cause albinism.7-12 Surface fish with mutations in oca2 displayed both albinism and reduced sleep. Further, this mutation in oca2 fails to complement sleep loss when surface fish harboring this engineered mutation are crossed to independently evolved populations of albino cavefish with naturally occurring mutations in oca2. Analysis of the oca2 locus in wild-caught cave and surface fish suggests that oca2 is under positive selection in 3 cave populations. Taken together, these findings identify oca2 as a novel regulator of sleep and suggest that a pleiotropic function of oca2 underlies the adaptive evolution of albinism and sleep loss.
Affiliation(s)
- Morgan O'Gorman
- Jupiter Life Science Initiative, Florida Atlantic University, Jupiter, FL 33458, USA
| | - Sunishka Thakur
- Jupiter Life Science Initiative, Florida Atlantic University, Jupiter, FL 33458, USA
| | - Gillian Imrie
- Jupiter Life Science Initiative, Florida Atlantic University, Jupiter, FL 33458, USA
| | - Rachel L Moran
- Department of Ecology, Evolution, and Behavior, University of Minnesota, St. Paul, MN 55108, USA
| | - Stefan Choy
- Jupiter Life Science Initiative, Florida Atlantic University, Jupiter, FL 33458, USA
| | | | - Helena Bilandžija
- Department of Molecular Biology, Rudjer Boskovic Institute, 10000 Zagreb, Croatia
| | - Kenneth J Renner
- Department of Biology, University of South Dakota, Vermillion, SD 57069, USA
| | - Erik Duboué
- Jupiter Life Science Initiative, Florida Atlantic University, Jupiter, FL 33458, USA; Harriet L. Wilkes Honors College, Florida Atlantic University, Jupiter, FL 33458, USA
| | | | - Suzanne E McGaugh
- Department of Ecology, Evolution, and Behavior, University of Minnesota, St. Paul, MN 55108, USA
| | - Alex C Keene
- Jupiter Life Science Initiative, Florida Atlantic University, Jupiter, FL 33458, USA; Department of Biology Science, Florida Atlantic University, Jupiter, FL 33458, USA.
| | - Johanna E Kowalko
- Jupiter Life Science Initiative, Florida Atlantic University, Jupiter, FL 33458, USA; Harriet L. Wilkes Honors College, Florida Atlantic University, Jupiter, FL 33458, USA.
| |
|
38
|
Manthey JD, Klicka J, Spellman GM. The Genomic Signature of Allopatric Speciation in a Songbird Is Shaped by Genome Architecture (Aves: Certhia americana). Genome Biol Evol 2021; 13:evab120. [PMID: 34042960 PMCID: PMC8364988 DOI: 10.1093/gbe/evab120] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Accepted: 05/24/2021] [Indexed: 12/31/2022] Open
Abstract
The genomic signature of speciation with gene flow is often attributed to the strength of divergent selection and recombination rate in regions harboring targets for selection. In contrast, allopatric speciation provides a different geographic context and evolutionary scenario, whereby introgression is limited by isolation rather than selection against gene flow. Lacking shared divergent selection or selection against hybridization, we would predict the genomic signature of allopatric speciation would largely be shaped by genomic architecture (the nonrandom distribution of functional elements and chromosomal characteristics) through its role in affecting the processes of selection and drift. Here, we built and annotated a chromosome-scale genome assembly for a songbird (Passeriformes: Certhia americana). We show that the genomic signature of allopatric speciation between its two primary lineages is largely shaped by genomic architecture. Regionally, gene density and recombination rate variation explain a large proportion of variance in genomic diversity, differentiation, and divergence. We identified a heterogeneous landscape of selection and neutrality, with a large portion of the genome under the effects of indirect selection. We found higher proportions of small chromosomes under the effects of indirect selection, likely because they have relatively higher gene density. At the chromosome scale, differential genomic architecture of macro- and microchromosomes shapes the genomic signatures of speciation: chromosome size has: 1) a positive relationship with genetic differentiation, genetic divergence, rate of lineage sorting in the contact zone, and proportion neutral evolution and 2) a negative relationship with genetic diversity and recombination rate.
Affiliation(s)
- Joseph D Manthey
- Department of Biological Sciences, Texas Tech University, Lubbock, Texas, USA
| | - John Klicka
- Burke Museum of Natural History, University of Washington, Seattle, Washington, USA
- Department of Biology, University of Washington, Seattle, Washington, USA
| | - Garth M Spellman
- Department of Zoology, Denver Museum of Nature & Science, Denver, Colorado, USA
| |
|
39
|
Szpiech ZA, Novak TE, Bailey NP, Stevison LS. Application of a novel haplotype-based scan for local adaptation to study high-altitude adaptation in rhesus macaques. Evol Lett 2021; 5:408-421. [PMID: 34367665 PMCID: PMC8327953 DOI: 10.1002/evl3.232] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 05/19/2020] [Revised: 02/24/2021] [Accepted: 05/04/2021] [Indexed: 12/17/2022] Open
Abstract
When natural populations split and migrate to different environments, they may experience different selection pressures that can lead to local adaptation. To capture the genomic patterns of a local selective sweep, we develop XP-nSL, a genomic scan for local adaptation that compares haplotype patterns between two populations. We show that XP-nSL has power to detect ongoing and recently completed hard and soft sweeps, and we then apply this statistic to search for evidence of adaptation to high altitude in rhesus macaques. We compare the whole genomes of 23 wild rhesus macaques captured at high altitude (mean altitude > 4000 m above sea level) with those of 22 wild rhesus macaques captured at low altitude (mean altitude < 500 m above sea level) and find evidence of local adaptation in the high-altitude population at or near 303 known genes and several unannotated regions. We find the strongest signal for adaptation at EGLN1, a classic target for convergent evolution in several species living in low oxygen environments. Furthermore, many of the 303 genes are involved in processes related to hypoxia, regulation of ROS, DNA damage repair, synaptic signaling, and metabolism. These results suggest that, beyond adapting via a beneficial mutation in one single gene, adaptation to high altitude in rhesus macaques is polygenic and spread across numerous important biological systems.
Affiliation(s)
- Zachary A Szpiech
- Department of Biology Pennsylvania State University University Park Pennsylvania 16801
- Institute for Computational and Data Sciences Pennsylvania State University University Park Pennsylvania 16801
- Department of Biological Sciences Auburn University Auburn Ala 36842 USA
| | - Taylor E Novak
- Department of Biological Sciences Auburn University Auburn Ala 36842 USA
| | - Nick P Bailey
- Department of Biological Sciences Auburn University Auburn Ala 36842 USA
| | - Laurie S Stevison
- Department of Biological Sciences Auburn University Auburn Ala 36842 USA
| |
|
40
|
Bourgeois YXC, Warren BH. An overview of current population genomics methods for the analysis of whole-genome resequencing data in eukaryotes. Mol Ecol 2021; 30:6036-6071. [PMID: 34009688 DOI: 10.1111/mec.15989] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 12/30/2020] [Revised: 04/26/2021] [Accepted: 05/11/2021] [Indexed: 01/01/2023]
Abstract
Characterizing the population history of a species and identifying loci underlying local adaptation is crucial in functional ecology, evolutionary biology, conservation and agronomy. The constant improvement of high-throughput sequencing techniques has facilitated the production of whole genome data in a wide range of species. Population genomics now provides tools to better integrate selection into a historical framework, and take into account selection when reconstructing demographic history. However, this improvement has come with a profusion of analytical tools that can confuse and discourage users. Such confusion limits the amount of information effectively retrieved from complex genomic data sets, and impairs the diffusion of the most recent analytical tools into fields such as conservation biology. It may also lead to redundancy among methods. To address these issues, we propose an overview of more than 100 state-of-the-art methods that can deal with whole genome data. We summarize the strategies they use to infer demographic history and selection, and discuss some of their limitations. A website listing these methods is available at www.methodspopgen.com.
Affiliation(s)
| | - Ben H Warren
- Institut de Systématique, Evolution, Biodiversité (ISYEB), Muséum National d'Histoire Naturelle, CNRS, Sorbonne Université, EPHE, UA, CP 51, Paris, France
| |
|
41
|
Elhaik E, Graur D. On the Unfounded Enthusiasm for Soft Selective Sweeps III: The Supervised Machine Learning Algorithm That Isn't. Genes (Basel) 2021; 12:genes12040527. [PMID: 33916341 PMCID: PMC8066263 DOI: 10.3390/genes12040527] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/22/2020] [Revised: 03/22/2021] [Accepted: 03/29/2021] [Indexed: 12/12/2022] Open
Abstract
In the last 15 years or so, soft selective sweep mechanisms have been catapulted from a curiosity of little evolutionary importance to a ubiquitous mechanism claimed to explain most adaptive evolution and, in some cases, most evolution. This transformation was aided by a series of articles by Daniel Schrider and Andrew Kern. Within this series, a paper entitled “Soft sweeps are the dominant mode of adaptation in the human genome” (Schrider and Kern, Mol. Biol. Evolut. 2017, 34(8), 1863–1877) attracted a great deal of attention, in particular in conjunction with another paper (Kern and Hahn, Mol. Biol. Evolut. 2018, 35(6), 1366–1371), for purporting to discredit the Neutral Theory of Molecular Evolution (Kimura 1968). Here, we address an alleged novelty in Schrider and Kern’s paper, i.e., the claim that their study involved an artificial intelligence technique called supervised machine learning (SML). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known empirically to be true. Curiously, Schrider and Kern did not possess a training dataset of genomic segments known a priori to have evolved either neutrally or through soft or hard selective sweeps. Thus, their claim of using SML is thoroughly and utterly misleading. In the absence of legitimate training datasets, Schrider and Kern used: (1) simulations that employ many manipulatable variables and (2) a system of data cherry-picking rivaling the worst excesses in the literature. These two factors, in addition to the lack of negative controls and the irreproducibility of their results due to incomplete methodological detail, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S/HIC) should be taken with a huge shovel of salt.
Affiliation(s)
- Eran Elhaik
- Department of Biology, Lund University, Sölvegatan 35, 22362 Lund, Sweden
| | - Dan Graur
- Department of Biology & Biochemistry, University of Houston, Science & Research Building 2, Suite #342, 3455 Cullen Blvd., Houston, TX 77204-5001, USA
| |
|
42
|
Wang Z, Wang J, Kourakos M, Hoang N, Lee HH, Mathieson I, Mathieson S. Automatic inference of demographic parameters using generative adversarial networks. Mol Ecol Resour 2021; 21:2689-2705. [PMID: 33745225 PMCID: PMC8596911 DOI: 10.1111/1755-0998.13386] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 07/29/2020] [Accepted: 03/05/2021] [Indexed: 12/12/2022]
Abstract
Population genetics relies heavily on simulated data for validation, inference and intuition. In particular, since the evolutionary ‘ground truth’ for real data is always limited, simulated data are crucial for training supervised machine learning methods. Simulation software can accurately model evolutionary processes but requires many hand‐selected input parameters. As a result, simulated data often fail to mirror the properties of real genetic data, which limits the scope of methods that rely on it. Here, we develop a novel approach to estimating parameters in population genetic models that automatically adapts to data from any population. Our method, pg‐gan, is based on a generative adversarial network that gradually learns to generate realistic synthetic data. We demonstrate that our method is able to recover input parameters in a simulated isolation‐with‐migration model. We then apply our method to human data from the 1000 Genomes Project and show that we can accurately recapitulate the features of real data.
Affiliation(s)
- Zhanpeng Wang
- Department of Computer Science, Haverford College, Haverford, PA, USA
| | - Jiaping Wang
- Department of Computer Science, Haverford College, Haverford, PA, USA
| | - Michael Kourakos
- Department of Computer Science, Swarthmore College, Swarthmore, PA, USA
| | - Nhung Hoang
- Department of Computer Science, Swarthmore College, Swarthmore, PA, USA
| | - Hyong Hark Lee
- Department of Computer Science, Swarthmore College, Swarthmore, PA, USA
| | - Iain Mathieson
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
| | - Sara Mathieson
- Department of Computer Science, Haverford College, Haverford, PA, USA
| |
|
43
|
Xue AT, Schrider DR, Kern AD. Discovery of Ongoing Selective Sweeps within Anopheles Mosquito Populations Using Deep Learning. Mol Biol Evol 2021; 38:1168-1183. [PMID: 33022051 PMCID: PMC7947845 DOI: 10.1093/molbev/msaa259] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Indexed: 01/19/2023] Open
Abstract
Identification of partial sweeps, which include both hard and soft sweeps that have not currently reached fixation, provides crucial information about ongoing evolutionary responses. To this end, we introduce partialS/HIC, a deep learning method to discover selective sweeps from population genomic data. partialS/HIC uses a convolutional neural network for image processing, which is trained with a large suite of summary statistics derived from coalescent simulations incorporating population-specific history, to distinguish between completed versus partial sweeps, hard versus soft sweeps, and regions directly affected by selection versus those merely linked to nearby selective sweeps. We perform several simulation experiments under various demographic scenarios to demonstrate partialS/HIC's performance, which exhibits excellent resolution for detecting partial sweeps. We also apply our classifier to whole genomes from eight mosquito populations sampled across sub-Saharan Africa by the Anopheles gambiae 1000 Genomes Consortium, elucidating both continent-wide patterns as well as sweeps unique to specific geographic regions. These populations have experienced intense insecticide exposure over the past two decades, and we observe a strong overrepresentation of sweeps at insecticide resistance loci. Our analysis thus provides a list of candidate adaptive loci that may be relevant to mosquito control efforts. More broadly, our supervised machine learning approach introduces a method to distinguish between completed and partial sweeps, as well as between hard and soft sweeps, under a variety of demographic scenarios. As whole-genome data rapidly accumulate for a greater diversity of organisms, partialS/HIC addresses an increasing demand for useful selection scan tools that can track in-progress evolutionary dynamics.
Affiliation(s)
- Alexander T Xue
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
| | - Daniel R Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, NC
| | - Andrew D Kern
- Institute of Ecology and Evolution, 5289 University of Oregon, Eugene, OR
| |
|
44
|
Enard D, Petrov DA. Ancient RNA virus epidemics through the lens of recent adaptation in human genomes. Philos Trans R Soc Lond B Biol Sci 2020; 375:20190575. [PMID: 33012231 PMCID: PMC7702803 DOI: 10.1098/rstb.2019.0575] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Indexed: 12/12/2022] Open
Abstract
Over the course of the last several million years of evolution, humans probably have been plagued by hundreds or perhaps thousands of epidemics. Little is known about such ancient epidemics and a deep evolutionary perspective on current pathogenic threats is lacking. The study of past epidemics has typically been limited in temporal scope to recorded history, and in physical scope to pathogens that left sufficient DNA behind, such as Yersinia pestis during the Great Plague. Host genomes, however, offer an indirect way to detect ancient epidemics beyond the current temporal and physical limits. Arms races with pathogens have shaped the genomes of the hosts by driving a large number of adaptations at many genes, and these signals can be used to detect and further characterize ancient epidemics. Here, we detect the genomic footprints left by ancient viral epidemics that took place in the past approximately 50 000 years in the 26 human populations represented in the 1000 Genomes Project. By using the enrichment in signals of adaptation at approximately 4500 host loci that interact with specific types of viruses, we provide evidence that RNA viruses have driven a particularly large number of adaptive events across diverse human populations. These results suggest that different types of viruses may have exerted different selective pressures during human evolution. Knowledge of these past selective pressures will provide a deeper evolutionary perspective on current pathogenic threats. This article is part of the theme issue ‘Insights into health and disease from ancient biomolecules’.
Affiliation(s)
- David Enard
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
| | - Dmitri A Petrov
- Department of Biology, Stanford University, Stanford, CA, USA
| |
|
45
|
Schrider DR. Background Selection Does Not Mimic the Patterns of Genetic Diversity Produced by Selective Sweeps. Genetics 2020; 216:499-519. [PMID: 32847814 PMCID: PMC7536861 DOI: 10.1534/genetics.120.303469] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 12/13/2019] [Accepted: 08/04/2020] [Indexed: 12/28/2022] Open
Abstract
It is increasingly evident that natural selection plays a prominent role in shaping patterns of diversity across the genome. The most commonly studied modes of natural selection are positive selection and negative selection, which refer to directional selection for and against derived mutations, respectively. Positive selection can result in hitchhiking events, in which a beneficial allele rapidly replaces all others in the population, creating a valley of diversity around the selected site along with characteristic skews in allele frequencies and linkage disequilibrium among linked neutral polymorphisms. Similarly, negative selection reduces variation not only at selected sites but also at linked sites, a phenomenon called background selection (BGS). Thus, discriminating between these two forces may be difficult, and one might expect efforts to detect hitchhiking to produce an excess of false positives in regions affected by BGS. Here, we examine the similarity between BGS and hitchhiking models via simulation. First, we show that BGS may somewhat resemble hitchhiking in simplistic scenarios in which a region constrained by negative selection is flanked by large stretches of unconstrained sites, echoing previous results. However, this scenario does not mirror the actual spatial arrangement of selected sites across the genome. By performing forward simulations under more realistic scenarios of BGS, modeling the locations of protein-coding and conserved noncoding DNA in real genomes, we show that the spatial patterns of variation produced by BGS rarely mimic those of hitchhiking events. Indeed, BGS is not substantially more likely than neutrality to produce false signatures of hitchhiking. This holds for simulations modeled after both humans and Drosophila, and for several different demographic histories. These results demonstrate that appropriately designed scans for hitchhiking need not consider BGS's impact on false-positive rates. 
However, we do find evidence that BGS increases the false-negative rate for hitchhiking, an observation that demands further investigation.
Affiliation(s)
- Daniel R Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27514
| |
|
46
|
Mughal MR, Koch H, Huang J, Chiaromonte F, DeGiorgio M. Learning the properties of adaptive regions with functional data analysis. PLoS Genet 2020; 16:e1008896. [PMID: 32853200 PMCID: PMC7480868 DOI: 10.1371/journal.pgen.1008896] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 12/13/2019] [Revised: 09/09/2020] [Accepted: 05/29/2020] [Indexed: 12/12/2022] Open
Abstract
Identifying regions of positive selection in genomic data remains a challenge in population genetics. Most current approaches rely on comparing values of summary statistics calculated in windows. We present an approach termed SURFDAWave, which translates measures of genetic diversity calculated in genomic windows to functional data. By transforming our discrete data points to be outputs of continuous functions defined over genomic space, we are able to learn the features of these functions that signify selection. This enables us to confidently identify complex modes of natural selection, including adaptive introgression. We are also able to predict important selection parameters that are responsible for shaping the inferred selection events. By applying our model to human population-genomic data, we recapitulate previously identified regions of selective sweeps, such as OCA2 in Europeans, and predict that its beneficial mutation reached a frequency of 0.02 before it swept 1,802 generations ago, a time when humans were relatively new to Europe. In addition, we identify BNC2 in Europeans as a target of adaptive introgression, and predict that it harbors a beneficial mutation that arose in an archaic human population that split from modern humans within the hypothesized modern human-Neanderthal divergence range.
Affiliation(s)
- Mehreen R. Mughal
- Bioinformatics and Genomics at the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Hillary Koch
- Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Jinguo Huang
- Bioinformatics and Genomics at the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Francesca Chiaromonte
- Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Michael DeGiorgio
- Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, Florida, United States of America
| |
|
47
|
Mueller JC, Carrete M, Boerno S, Kuhl H, Tella JL, Kempenaers B. Genes acting in synapses and neuron projections are early targets of selection during urban colonization. Mol Ecol 2020; 29:3403-3412. [DOI: 10.1111/mec.15451] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 08/20/2019] [Accepted: 04/08/2020] [Indexed: 02/06/2023]
Affiliation(s)
- Jakob C. Mueller
- Department of Behavioural Ecology & Evolutionary Genetics Max Planck Institute for Ornithology Seewiesen Germany
| | - Martina Carrete
- Department of Conservation Biology Estación Biológica de Doñana – CSIC Sevilla Spain
- Department of Physical, Chemical and Natural Systems University Pablo de Olavide Sevilla Spain
| | - Stefan Boerno
- Sequencing Core Facility Max Planck Institute for Molecular Genetics Berlin Germany
| | - Heiner Kuhl
- Sequencing Core Facility Max Planck Institute for Molecular Genetics Berlin Germany
- Department of Ecophysiology and Aquaculture Leibniz‐Institute of Freshwater Ecology and Inland Fisheries Berlin Germany
| | - José L. Tella
- Department of Conservation Biology Estación Biológica de Doñana – CSIC Sevilla Spain
| | - Bart Kempenaers
- Department of Behavioural Ecology & Evolutionary Genetics Max Planck Institute for Ornithology Seewiesen Germany
| |
|
48
|
Abstract
Accurately inferring the genome-wide landscape of recombination rates in natural populations is a central aim in genomics, as patterns of linkage influence everything from genetic mapping to understanding evolutionary history. Here, we describe recombination landscape estimation using recurrent neural networks (ReLERNN), a deep learning method for estimating a genome-wide recombination map that is accurate even with small numbers of pooled or individually sequenced genomes. Rather than use summaries of linkage disequilibrium as its input, ReLERNN takes columns from a genotype alignment, which are then modeled as a sequence across the genome using a recurrent neural network. We demonstrate that ReLERNN improves accuracy and reduces bias relative to existing methods and maintains high accuracy in the face of demographic model misspecification, missing genotype calls, and genome inaccessibility. We apply ReLERNN to natural populations of African Drosophila melanogaster and show that genome-wide recombination landscapes, although largely correlated among populations, exhibit important population-specific differences. Lastly, we connect the inferred patterns of recombination with the frequencies of major inversions segregating in natural Drosophila populations.
Affiliation(s)
- Jeffrey R Adrion
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR
| | - Jared G Galloway
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR
| | - Andrew D Kern
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR
| |
|
49
|
Hejase HA, Dukler N, Siepel A. From Summary Statistics to Gene Trees: Methods for Inferring Positive Selection. Trends Genet 2020; 36:243-258. [PMID: 31954511 PMCID: PMC7177178 DOI: 10.1016/j.tig.2019.12.008] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 08/28/2019] [Revised: 11/15/2019] [Accepted: 12/11/2019] [Indexed: 01/01/2023]
Abstract
Methods to detect signals of natural selection from genomic data have traditionally emphasized the use of simple summary statistics. Here, we review a new generation of methods that consider combinations of conventional summary statistics and/or richer features derived from inferred gene trees and ancestral recombination graphs (ARGs). We also review recent advances in methods for population genetic simulation and ARG reconstruction. Finally, we describe opportunities for future work on a variety of related topics, including the genetics of speciation, estimation of selection coefficients, and inference of selection on polygenic traits. Together, these emerging methods offer promising new directions in the study of natural selection.
Affiliation(s)
- Hussein A Hejase
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Noah Dukler
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
|
50
|
Abstract
Cognitive abilities can vary dramatically among species. The relative importance of social and ecological challenges in shaping cognitive evolution has been the subject of a long-running and recently renewed debate, but little work has sought to understand the selective dynamics underlying the evolution of cognitive abilities. Here, we investigate recent selection related to cognition in the paper wasp Polistes fuscatus, a wasp that has uniquely evolved visual individual recognition abilities. We generate high-quality de novo genome assemblies and population genomic resources for multiple species of paper wasps and use a population genomic framework to interrogate the probable mode and tempo of cognitive evolution. Recent, strong, hard selective sweeps in P. fuscatus contain loci annotated with functions in long-term memory formation, mushroom body development, and visual processing, traits that have recently evolved in association with individual recognition. The homologous pathways are not under selection in closely related wasps that lack individual recognition. Indeed, the prevalence of candidate cognition loci within the strongest selective sweeps suggests that the evolution of cognitive abilities has been among the strongest selection pressures in P. fuscatus' recent evolutionary history. Detailed analyses of selective sweeps containing candidate cognition loci reveal multiple cases of hard selective sweeps within the last few thousand years on de novo mutations, mainly in noncoding regions. These data provide unprecedented insight into some of the processes by which cognition evolves.
|