1
Hobolth A, Rivas-González I, Bladt M, Futschik A. Phase-type distributions in mathematical population genetics: An emerging framework. Theor Popul Biol 2024; 157:14-32. [PMID: 38460602 DOI: 10.1016/j.tpb.2024.03.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2023] [Revised: 02/29/2024] [Accepted: 03/04/2024] [Indexed: 03/11/2024]
Abstract
A phase-type distribution is the time to absorption in a continuous- or discrete-time Markov chain. Phase-type distributions can be used as a general framework to calculate key properties of the standard coalescent model and many of its extensions. Here, the 'phases' in the phase-type distribution correspond to states in the ancestral process. For example, the time to the most recent common ancestor and the total branch length are phase-type distributed. Furthermore, the site frequency spectrum follows a multivariate discrete phase-type distribution and the joint distribution of total branch lengths in the two-locus coalescent-with-recombination model is multivariate phase-type distributed. In general, phase-type distributions provide a powerful mathematical framework for coalescent theory because they are analytically tractable using matrix manipulations. The purpose of this review is to explain the phase-type theory and demonstrate how the theory can be applied to derive basic properties of coalescent models. These properties can then be used to obtain insight into the ancestral process, or they can be applied for statistical inference. In particular, we show the relation between classical first-step analysis of coalescent models and phase-type calculations. We also show how reward transformations in phase-type theory lead to easy calculation of covariances and correlation coefficients between e.g. tree height, tree length, external branch length, and internal branch length. Furthermore, we discuss how these quantities can be used for statistical inference based on estimating equations. Providing an alternative to previous work based on the Laplace transform, we derive likelihoods for small-size coalescent trees based on phase-type theory. Overall, our main aim is to demonstrate that phase-type distributions provide a convenient general set of tools to understand aspects of coalescent models that are otherwise difficult to derive. 
Throughout the review, we emphasize the versatility of the phase-type framework, which is also illustrated by our accompanying R-code. All our analyses and figures can be reproduced from code available on GitHub.
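The matrix calculations mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' R code; it assumes the standard Kingman coalescent in coalescent time units (each pair of lineages merges at rate 1), and the function names `kingman_phase_type` and `reward_moments` are hypothetical. Tree height uses a reward of 1 in every state, and total branch length uses a reward equal to the number of lineages.

```python
import numpy as np

def kingman_phase_type(n):
    """Initial vector and sub-intensity matrix for states n, n-1, ..., 2 lineages."""
    ks = np.arange(n, 1, -1)                 # transient states (number of lineages)
    rates = ks * (ks - 1) / 2.0              # coalescence rate with k lineages
    S = np.diag(-rates)                      # leave each state at its coalescence rate
    idx = np.arange(len(ks) - 1)
    S[idx, idx + 1] = rates[:-1]             # move from k to k-1 lineages
    alpha = np.zeros(len(ks))
    alpha[0] = 1.0                           # start with all n lineages
    return alpha, S, ks

def reward_moments(alpha, S, rewards):
    """Mean and variance of the accumulated reward until absorption."""
    U = np.linalg.inv(-S)                    # expected time spent in each state
    r = np.asarray(rewards, dtype=float)
    mean = alpha @ U @ r
    second = 2.0 * alpha @ U @ (r * (U @ r)) # second moment: 2 * alpha U diag(r) U r
    return mean, second - mean ** 2

alpha, S, ks = kingman_phase_type(4)
height_mean, height_var = reward_moments(alpha, S, np.ones(len(ks)))  # tree height
length_mean, length_var = reward_moments(alpha, S, ks)                # total branch length
print(height_mean, height_var, length_mean, length_var)
```

For a sample of size 4 this reproduces the classical values: mean tree height 2(1 - 1/4) = 1.5 and mean total branch length 2(1 + 1/2 + 1/3) = 11/3.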
Affiliation(s)
- Asger Hobolth
  - Department of Mathematics, Aarhus University, Denmark
- Mogens Bladt
  - Department of Mathematical Sciences, University of Copenhagen, Denmark
- Andreas Futschik
  - Institute of Applied Statistics, Johannes Kepler University, Austria
2
Guo B, Takala-Harrison S, O'Connor TD. Benchmarking and Optimization of Methods for the Detection of Identity-By-Descent in Plasmodium falciparum. bioRxiv 2024:2024.05.04.592538. [PMID: 38746392 PMCID: PMC11092787 DOI: 10.1101/2024.05.04.592538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Genomic surveillance is crucial for identifying at-risk populations for targeted malaria control and elimination. Identity-by-descent (IBD) is being used in Plasmodium population genomics to estimate genetic relatedness, effective population size (Ne), population structure, and positive selection. However, a comprehensive evaluation of IBD segment detection tools is lacking for species with high rates of recombination. Here, we employ genetic simulations reflecting P. falciparum's high recombination rate and decreasing Ne to benchmark IBD callers, including probabilistic (hmmIBD, isoRelate), identity-by-state-based (hap-IBD, phased IBD) and others (Refined IBD), using genealogy-based true IBD and downstream inference of population characteristics. Our findings reveal that low marker density per genetic unit, related to high recombination rates relative to mutation rates, significantly affects the quality of detected IBD segments. Most IBD callers suffer from high false negative rates, which can be improved with parameter optimization. Optimized parameters allow for more accurate capture of selection signals and population structure, but hmmIBD is unique in providing less biased estimates of Ne. Empirical data subsampled from the MalariaGEN Pf7 database, representing different transmission settings, confirmed these patterns. We conclude that the detection of IBD in high-recombining species requires context-specific evaluation and parameter optimization and recommend that hmmIBD be used for quality-sensitive analysis, such as estimation of Ne in these species.
3
Tran LN, Sun CK, Struck TJ, Sajan M, Gutenkunst RN. Computationally Efficient Demographic History Inference from Allele Frequencies with Supervised Machine Learning. Mol Biol Evol 2024; 41:msae077. [PMID: 38636507 PMCID: PMC11082913 DOI: 10.1093/molbev/msae077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Revised: 04/08/2024] [Accepted: 04/12/2024] [Indexed: 04/20/2024] Open
Abstract
Inferring past demographic history of natural populations from genomic data is of central concern in many studies across research fields. Previously, our group had developed dadi, a widely used demographic history inference method based on the allele frequency spectrum (AFS) and maximum composite-likelihood optimization. However, dadi's optimization procedure can be computationally expensive. Here, we present donni (demography optimization via neural network inference), a new inference method based on dadi that is more efficient while maintaining comparable inference accuracy. For each dadi-supported demographic model, donni simulates the expected AFS for a range of model parameters then trains a set of Mean Variance Estimation neural networks using the simulated AFS. Trained networks can then be used to instantaneously infer the model parameters from future genomic data summarized by an AFS. We demonstrate that for many demographic models, donni can infer some parameters, such as population size changes, very well and other parameters, such as migration rates and times of demographic events, fairly well. Importantly, donni provides both parameter and confidence interval estimates from input AFS with accuracy comparable to parameters inferred by dadi's likelihood optimization while bypassing its long and computationally intensive evaluation process. donni's performance demonstrates that supervised machine learning algorithms may be a promising avenue for developing more sustainable and computationally efficient demographic history inference methods.
Affiliation(s)
- Linh N Tran
  - Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ 85721, USA
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
- Connie K Sun
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
- Travis J Struck
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
- Mathews Sajan
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
- Ryan N Gutenkunst
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
4
Eldon B, Stephan W. Sweepstakes reproduction facilitates rapid adaptation in highly fecund populations. Mol Ecol 2024; 33:e16903. [PMID: 36896794 DOI: 10.1111/mec.16903] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 02/21/2023] [Accepted: 02/23/2023] [Indexed: 03/11/2023]
Abstract
Adaptation enables natural populations to survive in a changing environment. Understanding the mechanics of adaptation is therefore crucial for learning about the evolution and ecology of natural populations. We focus on the impact of random sweepstakes on selection in highly fecund haploid and diploid populations partitioned into two genetic types, with one type conferring selective advantage. For the diploid populations, we incorporate various dominance mechanisms. We assume that the populations may experience recurrent bottlenecks. In random sweepstakes, the distribution of individual recruitment success is highly skewed, resulting in a huge variance in the number of offspring contributed by the individuals present in any given generation. Using computer simulations, we investigate the joint effects of random sweepstakes, recurrent bottlenecks and dominance mechanisms on selection. In our framework, bottlenecks allow random sweepstakes to have an effect on the time to fixation, and in diploid populations, the effect of random sweepstakes depends on the dominance mechanism. We describe selective sweepstakes that are approximated by recurrent sweeps of strongly beneficial allelic types arising by mutation. We demonstrate that both types of sweepstakes reproduction may facilitate rapid adaptation (as defined based on the average time to fixation of a type conferring selective advantage conditioned on fixation of the type). However, whether random sweepstakes cause rapid adaptation depends also on their interactions with bottlenecks and dominance mechanisms. Finally, we review a case study in which a model of recurrent sweeps is shown to essentially explain population genomic data from Atlantic cod.
Affiliation(s)
- Bjarki Eldon
  - Institute of Evolution and Biodiversity Science, Natural History Museum Berlin, Berlin, Germany
5
DeHaas D, Pan Z, Wei X. Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data. bioRxiv 2024:2024.04.23.590800. [PMID: 38712040 PMCID: PMC11071416 DOI: 10.1101/2024.04.23.590800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes has exceeded 350 terabytes (TB) in Variant Call Format (VCF), too large to fit into hard drives in uncompressed form. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure to losslessly present phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant-sharing across samples, leveraging on ideas inspired by Ancestral Recombination Graphs. Capturing variant-sharing in a graph format compresses biobank-scale data to the point where it can fit in a typical server's RAM (5-26GB per chromosome), and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable via both C++ and Python for constructing and processing GRG files which scales to a million whole genomes. It takes 160GB disk space to encode the information in 200,000 UK Biobank phased whole genomes as a GRG, more than 2000 times smaller than the size of VCF. Moreover, the size of GRG increases sublinearly with the number of samples stored, making it a sustainable solution to the increasing number of samples in large datasets. We show that summaries of genetic variants can be computed on GRG via graph traversal that runs 230 times faster than on VCF. 
We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.
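The idea that graph traversal can reuse computed values may be easier to see in a toy sketch. The graph below is a hypothetical hand-built DAG, not the GRG file format or the authors' library API. It assumes that the children of any node cover disjoint sets of samples, so the number of samples beneath a node is the sum over its children, and that each mutation is attached to a single node.

```python
# Hypothetical toy graph: internal nodes map to their children; leaves are samples.
children = {
    "m1": ["c", "s0"],
    "m2": ["c", "s4"],
    "c": ["s2", "s3"],   # shared by m1 and m2, so its count is computed once
}
samples = {"s0", "s2", "s3", "s4"}
mutation_node = {"mutA": "m1", "mutB": "m2", "mutC": "c"}  # mutation -> node carrying it

def make_sample_counter(children, samples):
    """Return a function counting samples beneath a node, caching internal results."""
    cache = {}

    def count(node):
        if node in samples:
            return 1
        if node not in cache:
            # Assumes children span disjoint sample sets, so counts are additive.
            cache[node] = sum(count(child) for child in children[node])
        return cache[node]

    return count, cache

count, cache = make_sample_counter(children, samples)
allele_counts = {mut: count(node) for mut, node in mutation_node.items()}
print(allele_counts)
```

Here the count for node `c` is computed while handling `mutA` and then read from the cache for `mutB` and `mutC`, which is the kind of reuse a tabular format does not offer directly.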
Affiliation(s)
- Drew DeHaas
  - Department of Computational Biology, Cornell University, Ithaca, NY
- Ziqing Pan
  - Department of Computational Biology, Cornell University, Ithaca, NY
- Xinzhu Wei
  - Department of Computational Biology, Cornell University, Ithaca, NY
6
Aktürk Ş, Mapelli I, Güler MN, Gürün K, Katırcıoğlu B, Vural KB, Sağlıcan E, Çetin M, Yaka R, Sürer E, Atağ G, Çokoğlu SS, Sevkar A, Altınışık NE, Koptekin D, Somel M. Benchmarking kinship estimation tools for ancient genomes using pedigree simulations. Mol Ecol Resour 2024:e13960. [PMID: 38676702 DOI: 10.1111/1755-0998.13960] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2023] [Revised: 03/19/2024] [Accepted: 03/28/2024] [Indexed: 04/29/2024]
Abstract
There is growing interest in uncovering genetic kinship patterns in past societies using low-coverage palaeogenomes. Here, we benchmark four tools for kinship estimation with such data: lcMLkin, NgsRelate, KIN, and READ, which differ in their input, IBD estimation methods, and statistical approaches. We used pedigree and ancient genome sequence simulations to evaluate these tools when only a limited number (1 to 50 K, with minor allele frequency ≥0.01) of shared SNPs are available. The performance of all four tools was comparable using ≥20 K SNPs. We found that first-degree related pairs can be accurately classified even with 1 K SNPs, with 85% F1 scores using READ and 96% using NgsRelate or lcMLkin. Distinguishing third-degree relatives from unrelated pairs or second-degree relatives was also possible with high accuracy (F1 > 90%) with 5 K SNPs using NgsRelate and lcMLkin, while READ and KIN showed lower success (69 and 79% respectively). Meanwhile, noise in population allele frequencies and inbreeding (first-cousin mating) led to deviations in kinship coefficients, with different sensitivities across tools. We conclude that using multiple tools in parallel might be an effective approach to achieve robust estimates on ultra-low-coverage genomes.
Affiliation(s)
- Şevval Aktürk
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Igor Mapelli
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Merve N Güler
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Kanat Gürün
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Büşra Katırcıoğlu
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Kıvılcım Başak Vural
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Ekin Sağlıcan
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Mehmet Çetin
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Reyhan Yaka
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
  - Centre for Palaeogenetics, Stockholm, Sweden
  - Department of Archaeology and Classical Studies, Stockholm University, Stockholm, Sweden
- Elif Sürer
  - Department of Modeling and Simulation, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Gözde Atağ
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Sevim Seda Çokoğlu
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Arda Sevkar
  - Department of Anthropology, Hacettepe University, Ankara, Turkey
- N Ezgi Altınışık
  - Department of Anthropology, Hacettepe University, Ankara, Turkey
- Dilek Koptekin
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Mehmet Somel
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
7
Sommer-Trembo C, Santos ME, Clark B, Werner M, Fages A, Matschiner M, Hornung S, Ronco F, Oliver C, Garcia C, Tschopp P, Malinsky M, Salzburger W. The genetics of niche-specific behavioral tendencies in an adaptive radiation of cichlid fishes. Science 2024; 384:470-475. [PMID: 38662824 DOI: 10.1126/science.adj9228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 03/12/2024] [Indexed: 05/03/2024]
Abstract
Behavior is critical for animal survival and reproduction, and possibly for diversification and evolutionary radiation. However, the genetics behind adaptive variation in behavior are poorly understood. In this work, we examined a fundamental and widespread behavioral trait, exploratory behavior, in one of the largest adaptive radiations on Earth, the cichlid fishes of Lake Tanganyika. By integrating quantitative behavioral data from 57 cichlid species (702 wild-caught individuals) with high-resolution ecomorphological and genomic information, we show that exploratory behavior is linked to macrohabitat niche adaptations in Tanganyikan cichlids. Furthermore, we uncovered a correlation between the genotypes at a single-nucleotide polymorphism upstream of the AMPA glutamate-receptor regulatory gene cacng5b and variation in exploratory tendency. We validated this association using behavioral predictions with a neural network approach and CRISPR-Cas9 genome editing.
Affiliation(s)
- Carolin Sommer-Trembo
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- M Emília Santos
  - Department of Zoology, University of Cambridge, Cambridge, UK
- Bethan Clark
  - Department of Zoology, University of Cambridge, Cambridge, UK
- Marco Werner
  - Leibniz-Institute for Polymer Research Dresden, Dresden, Germany
- Antoine Fages
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Simon Hornung
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Fabrizia Ronco
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
  - Natural History Museum, University of Oslo, Oslo, Norway
- Chantal Oliver
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Cody Garcia
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Patrick Tschopp
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
- Milan Malinsky
  - Department of Biology, Institute of Ecology and Evolution, University of Bern, Bern, Switzerland
- Walter Salzburger
  - Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
8
Guyon L, Guez J, Toupance B, Heyer E, Chaix R. Patrilineal segmentary systems provide a peaceful explanation for the post-Neolithic Y-chromosome bottleneck. Nat Commun 2024; 15:3243. [PMID: 38658560 PMCID: PMC11043392 DOI: 10.1038/s41467-024-47618-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Accepted: 04/08/2024] [Indexed: 04/26/2024] Open
Abstract
Studies have found a pronounced decline in male effective population sizes worldwide around 3000-5000 years ago. This bottleneck was not observed for female effective population sizes, which continued to increase over time. Until now, this remarkable genetic pattern was interpreted as the result of an ancient structuring of human populations into patrilineal groups (gathering closely related males) violently competing with each other. In this scenario, violence is responsible for the repeated extinctions of patrilineal groups, leading to a significant reduction in male effective population size. Here, we propose an alternative hypothesis by modelling a segmentary patrilineal system based on anthropological literature. We show that variance in reproductive success between patrilineal groups, combined with lineal fission (i.e., the splitting of a group into two new groups of patrilineally related individuals), can lead to a substantial reduction in the male effective population size without resorting to the violence hypothesis. Thus, a peaceful explanation involving ancient changes in social structures, linked to global changes in subsistence systems, may be sufficient to explain the reported decline in Y-chromosome diversity.
Affiliation(s)
- Léa Guyon
  - Eco-Anthropologie (UMR 7206), Muséum National d'Histoire Naturelle, CNRS, Université Paris Cité, Paris, 75116, France
- Jérémy Guez
  - Eco-Anthropologie (UMR 7206), Muséum National d'Histoire Naturelle, CNRS, Université Paris Cité, Paris, 75116, France
  - Université Paris-Saclay, CNRS, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, 91400, France
- Bruno Toupance
  - Eco-Anthropologie (UMR 7206), Muséum National d'Histoire Naturelle, CNRS, Université Paris Cité, Paris, 75116, France
  - Université Paris Cité, Eco-anthropologie, Paris, F-75006, France
- Evelyne Heyer
  - Eco-Anthropologie (UMR 7206), Muséum National d'Histoire Naturelle, CNRS, Université Paris Cité, Paris, 75116, France
- Raphaëlle Chaix
  - Eco-Anthropologie (UMR 7206), Muséum National d'Histoire Naturelle, CNRS, Université Paris Cité, Paris, 75116, France
9
Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. bioRxiv 2024:2023.11.03.565466. [PMID: 37961279 PMCID: PMC10635123 DOI: 10.1101/2023.11.03.565466] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/15/2023]
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. This approach is out of step with modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalises these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
Affiliation(s)
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Anastasia Ignatieva
  - School of Mathematics and Statistics, University of Glasgow, UK
  - Department of Statistics, University of Oxford, UK
- Jere Koskela
  - School of Mathematics, Statistics and Physics, Newcastle University, UK
  - Department of Statistics, University of Warwick, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, UK
- Anthony W. Wohns
  - Broad Institute of MIT and Harvard, Cambridge, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
10
Dabi A, Schrider DR. Population size rescaling significantly biases outcomes of forward-in-time population genetic simulations. bioRxiv 2024:2024.04.07.588318. [PMID: 38645049 PMCID: PMC11030438 DOI: 10.1101/2024.04.07.588318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/23/2024]
Abstract
Simulations are an essential tool in all areas of population genetic research, used in tasks such as the validation of theoretical analysis and the study of complex evolutionary models. Forward-in-time simulations are especially flexible, allowing for various types of natural selection, complex genetic architectures, and non-Wright-Fisher dynamics. However, their intense computational requirements can be prohibitive to simulating large populations and genomes. A popular method to alleviate this burden is to scale down the population size by some scaling factor while scaling up the mutation rate, selection coefficients, and recombination rate by the same factor. However, this rescaling approach may in some cases bias simulation results. To investigate the manner and degree to which rescaling impacts simulation outcomes, we carried out simulations with different demographic histories and distributions of fitness effects using several values of the rescaling factor, Q, and compared the deviation of key outcomes (fixation times, fixation probabilities, allele frequencies, and linkage disequilibrium) between the scaled and unscaled simulations. Our results indicate that scaling introduces substantial biases to each of these measured outcomes, even at small values of Q. Moreover, the nature of these effects depends on the evolutionary model and scaling factor being examined. While increasing the scaling factor tends to increase the observed biases, this relationship is not always straightforward, thus it may be difficult to know the impact of scaling on simulation outcomes a priori. However, it appears that for most models, only a small number of replicates was needed to accurately quantify the bias produced by rescaling for a given Q.
In summary, while rescaling forward-in-time simulations may be necessary in many cases, researchers should be aware of the rescaling effect's impact on simulation outcomes and consider investigating its magnitude in smaller scale simulations of the desired model(s) before selecting an appropriate value of Q.
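The rescaling scheme described in this abstract can be written down as a short sketch. The class `SimParams` and function `rescale` are hypothetical names for illustration and are not tied to any particular simulator. The sketch only applies the rule stated above: divide the population size by Q and multiply the mutation rate, recombination rate, and selection coefficient by Q, which leaves the products N·μ, N·r, and N·s unchanged.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class SimParams:
    """Hypothetical container for per-generation simulation parameters."""
    pop_size: int
    mutation_rate: float
    recombination_rate: float
    selection_coeff: float

def rescale(params: SimParams, q: float) -> SimParams:
    """Scale population size down by q and the rates and selection up by q."""
    if q < 1:
        raise ValueError("scaling factor q must be at least 1")
    return SimParams(
        pop_size=max(1, round(params.pop_size / q)),
        mutation_rate=params.mutation_rate * q,
        recombination_rate=params.recombination_rate * q,
        selection_coeff=params.selection_coeff * q,
    )

original = SimParams(pop_size=10_000, mutation_rate=1e-8,
                     recombination_rate=1e-8, selection_coeff=0.001)
scaled = rescale(original, q=10)
print(scaled)
# The population-scaled products are preserved (up to rounding of pop_size).
print(math.isclose(original.pop_size * original.mutation_rate,
                   scaled.pop_size * scaled.mutation_rate))
```

Preserving these products is the rationale for rescaling; the abstract's point is that outcomes such as fixation times and linkage disequilibrium can still differ between the scaled and unscaled runs.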
Affiliation(s)
- Amjad Dabi
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA
- Daniel R. Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA
11
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. Am J Hum Genet 2024; 111:691-700. [PMID: 38513668 PMCID: PMC11023918 DOI: 10.1016/j.ajhg.2024.02.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 02/26/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024] Open
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more computationally efficient inference of identity by descent (IBD) than approaches that infer pairwise IBD segments and provides locus-specific IBD clusters rather than IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2,900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach. Our IBD clustering method is implemented in the open-source ibd-cluster software package.
Affiliation(s)
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
- Brian L Browning
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA
12
Rivas-González I, Tung J. A multi-million-year natural experiment: Comparative genomics on a massive scale and its implications for human health. Evol Med Public Health 2024; 12:67-70. [PMID: 38601345 PMCID: PMC11005778 DOI: 10.1093/emph/eoae006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Revised: 03/18/2024] [Indexed: 04/12/2024] Open
Abstract
Improving the diversity and quality of genome assemblies for non-human mammals has been a long-standing goal of comparative genomics. The last year saw substantial progress towards this goal, including the release of genome alignments for 240 mammals and nearly half the primate order. These resources have increased our ability to identify evolutionarily constrained regions of the genome, and together strongly support the importance of these regions to biomedically relevant trait variation in humans. They also provide new strategies for identifying the genetic basis of changes unique to individual lineages, illustrating the value of evolutionary comparative approaches for understanding human health.
Collapse
Affiliation(s)
- Iker Rivas-González
- Department of Primate Behavior and Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
| | - Jenny Tung
- Department of Primate Behavior and Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Department of Evolutionary Anthropology, Duke University, Durham, NC, USA
- Department of Biology, Duke University, Durham, NC, USA
- Faculty of Life Sciences, Institute of Biology, Leipzig University, Leipzig, Germany
|
13
|
Riley R, Mathieson I, Mathieson S. Interpreting generative adversarial networks to infer natural selection from genetic data. Genetics 2024; 226:iyae024. [PMID: 38386895 PMCID: PMC10990424 DOI: 10.1093/genetics/iyae024] [Received: 11/10/2023] [Revised: 01/15/2024] [Accepted: 01/19/2024] [Indexed: 02/24/2024] [Open access]
Abstract
Understanding natural selection and other forms of non-neutrality is a major focus for the use of machine learning in population genetics. Existing methods rely on computationally intensive simulated training data. Unlike efficient neutral coalescent simulations for demographic inference, realistic simulations of selection typically require slow forward simulations. Because there are many possible modes of selection, a high dimensional parameter space must be explored, with no guarantee that the simulated models are close to the real processes. Finally, it is difficult to interpret trained neural networks, leading to a lack of understanding about what features contribute to classification. Here we develop a new approach to detect selection and other local evolutionary processes that requires relatively few selection simulations during training. We build upon a generative adversarial network trained to simulate realistic neutral data. This consists of a generator (fitted demographic model), and a discriminator (convolutional neural network) that predicts whether a genomic region is real or fake. As the generator can only generate data under neutral demographic processes, regions of real data that the discriminator recognizes as having a high probability of being "real" do not fit the neutral demographic model and are therefore candidates for targets of selection. To incentivize identification of a specific mode of selection, we fine-tune the discriminator with a small number of custom non-neutral simulations. We show that this approach has high power to detect various forms of selection in simulations, and that it finds regions under positive selection identified by state-of-the-art population genetic methods in three human populations. Finally, we show how to interpret the trained networks by clustering hidden units of the discriminator based on their correlation patterns with known summary statistics.
Affiliation(s)
- Rebecca Riley
- Department of Computer Science, Haverford College, Haverford, PA 19041, USA
| | - Iain Mathieson
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Sara Mathieson
- Department of Computer Science, Haverford College, Haverford, PA 19041, USA
|
14
|
Johnson OL, Tobler R, Schmidt JM, Huber CD. Population genetic simulation: Benchmarking frameworks for non-standard models of natural selection. Mol Ecol Resour 2024; 24:e13930. [PMID: 38247258 DOI: 10.1111/1755-0998.13930] [Received: 10/17/2023] [Revised: 12/21/2023] [Accepted: 01/09/2024] [Indexed: 01/23/2024]
Abstract
Population genetic simulation has emerged as a common tool for investigating increasingly complex evolutionary and demographic models. Software capable of handling high-level model complexity has recently been developed, and the advancement of tree sequence recording now allows simulations to merge the efficiency and genealogical insight of coalescent simulations with the flexibility of forward simulations. However, frameworks utilizing these features have not yet been compared and benchmarked. Here, we evaluate various simulation workflows using the coalescent simulator msprime and the forward simulator SLiM, to assess resource efficiency and determine an optimal simulation framework. Three aspects were evaluated: (1) the burn-in, to establish an equilibrium level of neutral diversity in the population; (2) the forward simulation, in which temporally fluctuating selection is acting; and (3) the final computation of summary statistics. We provide typical memory and computation time requirements for each step. We find that the fastest framework, a combination of coalescent and forward simulation with tree sequence recording, increases simulation speed by over twenty times compared to classical forward simulations without tree sequence recording, although it does require six times more memory. Overall, using efficient simulation workflows can lead to a substantial improvement when modelling complex evolutionary scenarios, although the optimal framework ultimately depends on the available computational resources.
Affiliation(s)
- Olivia L Johnson
- School of Biological Sciences, University of Adelaide, Adelaide, South Australia, Australia
| | - Raymond Tobler
- Evolution of Cultural Diversity Initiative, The Australian National University, Canberra, Australian Capital Territory, Australia
| | - Joshua M Schmidt
- Department of Ophthalmology, College of Medicine and Public Health, Flinders University, Adelaide, South Australia, Australia
| | - Christian D Huber
- School of Biological Sciences, University of Adelaide, Adelaide, South Australia, Australia
- Department of Biology, Pennsylvania State University, University Park, Pennsylvania, USA
|
15
|
Clark MI, Fitzpatrick SW, Bradburd GS. Pitfalls and windfalls of detecting demographic declines using population genetics in long-lived species. bioRxiv 2024:2024.03.27.586886. [PMID: 38585961 PMCID: PMC10996660 DOI: 10.1101/2024.03.27.586886] [Indexed: 04/09/2024]
Abstract
Detecting recent demographic changes is a crucial component of species conservation and management, as many natural populations face declines due to anthropogenic habitat alteration and climate change. Genetic methods allow researchers to detect changes in effective population size (Ne) from sampling at a single timepoint. However, in species with long lifespans, there is a lag between the start of a decline in a population and the resulting decrease in genetic diversity. This lag slows the rate at which diversity is lost, and therefore makes it difficult to detect recent declines using genetic data. However, the genomes of old individuals can provide a window into the past, and can be compared to those of younger individuals, a contrast that may help reveal recent demographic declines. To test whether comparing the genomes of young and old individuals can help infer recent demographic bottlenecks, we use forward-time, individual-based simulations with varying mean individual lifespans and extents of generational overlap. We find that age information can be used to aid in the detection of demographic declines when the decline has been severe. When average lifespan is long, comparing young and old individuals from a single timepoint has greater power to detect a recent (within the last 50 years) bottleneck event than comparing individuals sampled at different points in time. Our results demonstrate how longevity and generational overlap can be both a hindrance and a boon to detecting recent demographic declines from population genomic data.
|
16
|
Guardado M, Perez C, Jackson S, Magaña J, Campana S, Samperio E, Rojas BC, Hernandez S, Syas K, Hernandez R, Zavala EI, Rohlfs R. py_ped_sim - A flexible forward genetic simulator for complex family pedigree analysis. bioRxiv 2024:2024.03.25.586501. [PMID: 38585824 PMCID: PMC10996500 DOI: 10.1101/2024.03.25.586501] [Indexed: 04/09/2024]
Abstract
Background: Large-scale family pedigrees are commonly used across medical, evolutionary, and forensic genetics. These pedigrees are tools for identifying genetic disorders, tracking evolutionary patterns, and establishing familial relationships via forensic genetic identification. However, there is a lack of software to accurately simulate different pedigree structures along with genomes corresponding to those individuals in a family pedigree. This limits simulation-based evaluations of methods that use pedigrees. Results: We have developed a Python command-line-based tool called py_ped_sim that facilitates the simulation of pedigree structures and the genomes of individuals in a pedigree. py_ped_sim represents pedigrees as directed acyclic graphs, enabling conversion between standard pedigree formats and integration with the forward population genetic simulator, SLiM. Notably, py_ped_sim allows the simulation of varying numbers of offspring for a set of parents, with the capacity to shift the distribution of sibship sizes over generations. We additionally add simulations for events of misattributed paternity, which offers a way to simulate half-sibling relationships. We validated the accuracy of our software by simulating genomes onto diverse family pedigree structures, showing that the estimated kinship coefficients closely approximated expected values. Conclusions: py_ped_sim is a user-friendly and open-source solution for simulating pedigree structures and conducting pedigree genome simulations. It empowers medical, forensic, and evolutionary genetics researchers to gain deeper insights into the dynamics of genetic inheritance and relatedness within families.
Affiliation(s)
- Miguel Guardado
- San Francisco State University, Department of Mathematics, San Francisco CA, 94132, USA
- University of California San Francisco, Biological and Medical Informatics Graduate Program. San Francisco CA, 94158
- Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA; San Francisco, 94134, CA, USA
- University of Oregon; Department of Data Science; Eugene, OR, 97403, USA
| | - Cynthia Perez
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | - Shalom Jackson
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | - Joaquín Magaña
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | - Sthen Campana
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | - Emily Samperio
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | | | - Selena Hernandez
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
| | - Kaela Syas
- San Francisco State University, Department of Mathematics, San Francisco CA, 94132, USA
| | - Ryan Hernandez
- Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA; San Francisco, 94134, CA, USA
| | - Elena I. Zavala
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
- University of California, Berkeley, Department of Molecular and Cell Biology, Berkeley, CA, 94720, USA
| | - Rori Rohlfs
- San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
- University of Oregon; Department of Data Science; Eugene, OR, 97403, USA
|
17
|
Guo B, Borda V, Laboulaye R, Spring MD, Wojnarski M, Vesely BA, Silva JC, Waters NC, O'Connor TD, Takala-Harrison S. Strong positive selection biases identity-by-descent-based inferences of recent demography and population structure in Plasmodium falciparum. Nat Commun 2024; 15:2499. [PMID: 38509066 PMCID: PMC10954658 DOI: 10.1038/s41467-024-46659-0] [Received: 07/27/2023] [Accepted: 02/28/2024] [Indexed: 03/22/2024] [Open access]
Abstract
Malaria genomic surveillance often estimates parasite genetic relatedness using metrics such as Identity-By-Descent (IBD), yet strong positive selection stemming from antimalarial drug resistance or other interventions may bias IBD-based estimates. In this study, we use simulations, a true IBD inference algorithm, and empirical data sets from different malaria transmission settings to investigate the extent of this bias and explore potential correction strategies. We analyze whole genome sequence data generated from 640 new and 3089 publicly available Plasmodium falciparum clinical isolates. We demonstrate that positive selection distorts IBD distributions, leading to underestimated effective population size and blurred population structure. Additionally, we discover that the removal of IBD peak regions partially restores the accuracy of IBD-based inferences, with this effect contingent on the population's background genetic relatedness and extent of inbreeding. Consequently, we advocate for selection correction for parasite populations undergoing strong, recent positive selection, particularly in high malaria transmission settings.
Affiliation(s)
- Bing Guo
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Victor Borda
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Roland Laboulaye
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Michele D Spring
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Mariusz Wojnarski
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Brian A Vesely
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Joana C Silva
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, USA
- Global Health and Tropical Medicine (GHTM), Instituto de Higiene e Medicina Tropical (IHMT), Universidade NOVA de Lisboa (NOVA), Lisbon, Portugal
| | - Norman C Waters
- Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand
| | - Timothy D O'Connor
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA.
| | - Shannon Takala-Harrison
- Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA.
|
18
|
Smith CCR, Patterson G, Ralph PL, Kern AD. Estimation of spatial demographic maps from polymorphism data using a neural network. bioRxiv 2024:2024.03.15.585300. [PMID: 38559192 PMCID: PMC10980082 DOI: 10.1101/2024.03.15.585300] [Indexed: 04/04/2024]
Abstract
A fundamental goal in population genetics is to understand how variation is arrayed over natural landscapes. From first principles we know that common features such as heterogeneous population densities and source-sink dynamics of dispersal should shape genetic variation over space; however, there are few tools currently available that can deal with these ubiquitous complexities. Geographically referenced single nucleotide polymorphism (SNP) data are increasingly accessible, presenting an opportunity to study genetic variation across geographic space in myriad species. We present a new inference method that uses geo-referenced SNPs and a deep neural network to estimate spatially heterogeneous maps of population density and dispersal rate. Our neural network trains on simulated input and output pairings, where the input consists of genotypes and sampling locations generated from a continuous space population genetic simulator, and the output is a map of the true demographic parameters. We benchmark our tool against existing methods and discuss qualitative differences between the different approaches; in particular, our program is unique because it infers the magnitude of both dispersal and density as well as their variation over the landscape, and it does so using SNP data. Similar methods are constrained to estimating relative migration rates, or require identity by descent blocks as input. We applied our tool to empirical data from North American grey wolves, for which it estimated mostly reasonable demographic parameters, but was affected by incomplete spatial sampling. Genetic-based methods like ours complement other, direct methods for estimating past and present demography, and we believe they will serve as valuable tools for applications in conservation, ecology, and evolutionary biology. An open source software package implementing our method is available from https://github.com/kr-colab/mapNN.
Affiliation(s)
- Chris C. R. Smith
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
| | - Gilia Patterson
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
| | - Peter L. Ralph
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
| | - Andrew D. Kern
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
|
19
|
Tagami D, Bisschop G, Kelleher J. tstrait: a quantitative trait simulator for ancestral recombination graphs. bioRxiv 2024:2024.03.13.584790. [PMID: 38559118 PMCID: PMC10980058 DOI: 10.1101/2024.03.13.584790] [Indexed: 04/04/2024]
Abstract
Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and Implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait). Contact: daiki.tagami@hertford.ox.ac.uk.
Affiliation(s)
- Daiki Tagami
- Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford OX1 3LB, United Kingdom
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
| | - Gertjan Bisschop
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
| | - Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
|
20
|
Huang Z, Kelleher J, Chan YB, Balding DJ. Estimating evolutionary and demographic parameters via ARG-derived IBD. bioRxiv 2024:2024.03.07.583855. [PMID: 38559261 PMCID: PMC10979897 DOI: 10.1101/2024.03.07.583855] [Indexed: 04/04/2024]
Abstract
Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4,000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.
Affiliation(s)
- Zhendong Huang
- Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
| | - Jerome Kelleher
- Oxford Big Data Institute, University of Oxford, United Kingdom
| | - Yao-ban Chan
- Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
| | - David J. Balding
- Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Australia
|
21
|
Kent TV, Schrider DR, Matute DR. Demographic history and the efficacy of selection in the globally invasive mosquito Aedes aegypti. bioRxiv 2024:2024.03.07.584008. [PMID: 38559089 PMCID: PMC10979846 DOI: 10.1101/2024.03.07.584008] [Indexed: 04/04/2024]
Abstract
Aedes aegypti is the main vector species of yellow fever, dengue, Zika and chikungunya. The species is originally from Africa but has experienced a spectacular expansion in its geographic range to a large swath of the world, the demographic effects of which have remained largely understudied. In this report, we examine whole-genome sequences from 6 countries in Africa, North America, and South America to investigate the demographic history of the spread of Ae. aegypti into the Americas and its impact on genomic diversity. In the Americas, we observe patterns of strong population structure consistent with relatively low (but probably non-zero) levels of gene flow but occasional long-range dispersal and/or recolonization events. We also find evidence that the colonization of the Americas has resulted in introduction bottlenecks. However, while each sampling location shows evidence of a past population contraction and subsequent recovery, our results suggest that the bottlenecks in America have led to a reduction in genetic diversity of only ~35% relative to African populations, and the American samples have retained high levels of genetic diversity (expected heterozygosity of ~0.02 at synonymous sites) and have experienced only a minor reduction in the efficacy of selection. These results evoke the image of an invasive species that has expanded its range with remarkable genetic resilience in the face of strong eradication pressure.
Affiliation(s)
- Tyler V. Kent
- Department of Biology, College of Arts and Sciences, University of North Carolina, Chapel Hill, NC, USA
- Department of Genetics, School of Medicine, University of North Carolina, Chapel Hill, NC, USA
| | - Daniel R. Schrider
- Department of Genetics, School of Medicine, University of North Carolina, Chapel Hill, NC, USA
| | - Daniel R. Matute
- Department of Biology, College of Arts and Sciences, University of North Carolina, Chapel Hill, NC, USA
|
22
|
Schraiber JG, Edge MD, Pennell M. Unifying approaches from statistical genetics and phylogenetics for mapping phenotypes in structured populations. bioRxiv 2024:2024.02.10.579721. [PMID: 38496530 PMCID: PMC10942266 DOI: 10.1101/2024.02.10.579721] [Indexed: 03/19/2024]
Abstract
In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these two fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we derive a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., Genome-Wide Association Studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. The fact that these models share the same core architecture means that we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop intuition for why and when spurious correlations may occur using analytical theory and conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate this by showing how a standard GWAS technique, including both the genetic relatedness matrix (GRM) and its leading eigenvectors (corresponding to the principal components of the genotype matrix) in a regression model, can mitigate spurious correlations in phylogenetic analyses. As a case study of this, we re-examine an analysis testing for co-evolution of expression levels between genes across a fungal phylogeny, and show that including covariance matrix eigenvectors as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches for understanding the genetic architecture of phenotypes and how evolutionary processes shape it.
|
23
|
Simon A, Coop G. The contribution of gene flow, selection, and genetic drift to five thousand years of human allele frequency change. Proc Natl Acad Sci U S A 2024; 121:e2312377121. [PMID: 38363870 PMCID: PMC10907250 DOI: 10.1073/pnas.2312377121] [Received: 07/19/2023] [Accepted: 01/09/2024] [Indexed: 02/18/2024] [Open access]
Abstract
Genomic time series from experimental evolution studies and ancient DNA datasets offer us a chance to directly observe the interplay of various evolutionary forces. We show how the genome-wide variance in allele frequency change between two time points can be decomposed into the contributions of gene flow, genetic drift, and linked selection. In closed populations, the contribution of linked selection is identifiable because it creates covariances between time intervals, and genetic drift does not. However, repeated gene flow between populations can also produce directionality in allele frequency change, creating covariances. We show how to accurately separate the fraction of variance in allele frequency change due to admixture and linked selection in a population receiving gene flow. We use two human ancient DNA datasets, spanning around 5,000 y, as time transects to quantify the contributions to the genome-wide variance in allele frequency change. We find that a large fraction of genome-wide change is due to gene flow. In both cases, after correcting for known major gene flow events, we do not observe a signal of genome-wide linked selection. Thus despite the known role of selection in shaping long-term polymorphism levels, and an increasing number of examples of strong selection on single loci and polygenic scores from ancient DNA, it appears to be gene flow and drift, and not selection, that are the main determinants of recent genome-wide allele frequency change. Our approach should be applicable to the growing number of contemporary and ancient temporal population genomics datasets.
Affiliation(s)
- Alexis Simon
- Center for Population Biology, University of California, Davis, CA 95616
- Department of Evolution and Ecology, University of California, Davis, CA 95616
| | - Graham Coop
- Center for Population Biology, University of California, Davis, CA 95616
- Department of Evolution and Ecology, University of California, Davis, CA 95616
|
24
|
Tran LN, Sun CK, Struck TJ, Sajan M, Gutenkunst RN. Computationally efficient demographic history inference from allele frequencies with supervised machine learning. bioRxiv 2024:2023.05.24.542158. [PMID: 38405827 PMCID: PMC10888863 DOI: 10.1101/2023.05.24.542158] [Indexed: 02/27/2024]
Abstract
Inferring past demographic history of natural populations from genomic data is of central concern in many studies across research fields. Previously, our group had developed dadi, a widely used demographic history inference method based on the allele frequency spectrum (AFS) and maximum composite likelihood optimization. However, dadi's optimization procedure can be computationally expensive. Here, we developed donni (demography optimization via neural network inference), a new inference method based on dadi that is more efficient while maintaining comparable inference accuracy. For each dadi-supported demographic model, donni simulates the expected AFS for a range of model parameters then trains a set of Mean Variance Estimation neural networks using the simulated AFS. Trained networks can then be used to instantaneously infer the model parameters from future input data AFS. We demonstrated that for many demographic models, donni can infer some parameters, such as population size changes, very well and other parameters, such as migration rates and times of demographic events, fairly well. Importantly, donni provides both parameter and confidence interval estimates from input AFS with accuracy comparable to parameters inferred by dadi's likelihood optimization while bypassing its long and computationally intensive evaluation process. donni's performance demonstrates that supervised machine learning algorithms may be a promising avenue for developing more sustainable and computationally efficient demographic history inference methods.
Affiliation(s)
- Linh N. Tran
  - Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ, USA
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
- Connie K. Sun
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
- Travis J. Struck
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
- Mathews Sajan
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
- Ryan N. Gutenkunst
  - Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
|
25
|
Nunez JCB, Lenhart BA, Bangerter A, Murray CS, Mazzeo GR, Yu Y, Nystrom TL, Tern C, Erickson PA, Bergland AO. A cosmopolitan inversion facilitates seasonal adaptation in overwintering Drosophila. Genetics 2024; 226:iyad207. [PMID: 38051996 PMCID: PMC10847723 DOI: 10.1093/genetics/iyad207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2023] [Accepted: 11/28/2023] [Indexed: 12/07/2023] Open
Abstract
Fluctuations in the strength and direction of natural selection through time are a ubiquitous feature of life on Earth. One evolutionary outcome of such fluctuations is adaptive tracking, wherein populations rapidly adapt from standing genetic variation. In certain circumstances, adaptive tracking can lead to the long-term maintenance of functional polymorphism despite allele frequency change due to selection. Although adaptive tracking is likely a common process, we still have a limited understanding of aspects of its genetic architecture and its strength relative to other evolutionary forces such as drift. Drosophila melanogaster living in temperate regions evolve to track seasonal fluctuations and are an excellent system to tackle these gaps in knowledge. By sequencing orchard populations collected across multiple years, we characterized the genomic signal of seasonal demography and identified that the cosmopolitan inversion In(2L)t facilitates seasonal adaptive tracking and shows molecular footprints of selection. A meta-analysis of phenotypic studies shows that seasonal loci within In(2L)t are associated with behavior, life history, physiology, and morphological traits. We identify candidate loci and experimentally link them to phenotype. Our work contributes to our general understanding of fluctuating selection and highlights the evolutionary outcome and dynamics of contemporary selection on inversions.
Affiliation(s)
- Joaquin C B Nunez
  - Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
  - Department of Biology, University of Vermont, 109 Carrigan Drive, Burlington, VT 05405, USA
- Benedict A Lenhart
  - Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
- Alyssa Bangerter
  - Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
- Connor S Murray
  - Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
- Giovanni R Mazzeo
  - Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
- Yang Yu
  - Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
- Taylor L Nystrom
  - Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
- Courtney Tern
  - Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
- Priscilla A Erickson
  - Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
  - Department of Biology, University of Richmond, 138 UR Drive, Richmond, VA 23173, USA
- Alan O Bergland
  - Department of Biology, University of Virginia, 90 Geldard Drive, Charlottesville, VA 22901, USA
|
26
|
Rivas-González I, Schierup MH, Wakeley J, Hobolth A. TRAILS: Tree reconstruction of ancestry using incomplete lineage sorting. PLoS Genet 2024; 20:e1010836. [PMID: 38330138 PMCID: PMC10880969 DOI: 10.1371/journal.pgen.1010836] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 02/21/2024] [Accepted: 01/22/2024] [Indexed: 02/10/2024] Open
Abstract
Genome-wide genealogies of multiple species carry detailed information about demographic and selection processes on individual branches of the phylogeny. Here, we introduce TRAILS, a hidden Markov model that accurately infers time-resolved population genetics parameters, such as ancestral effective population sizes and speciation times, for ancestral branches using a multi-species alignment of three species and an outgroup. TRAILS leverages the information contained in incomplete lineage sorting fragments by modelling genealogies along the genome as rooted three-leaved trees, each with a topology and two coalescent events happening in discretized time intervals within the phylogeny. Posterior decoding of the hidden Markov model can be used to infer the ancestral recombination graph for the alignment and details on demographic changes within a branch. Since TRAILS performs posterior decoding at the base-pair level, genome-wide scans based on the posterior probabilities can be devised to detect deviations from neutrality. Using TRAILS on a human-chimp-gorilla-orangutan alignment, we recover speciation parameters and extract information about the topology and coalescent times at high resolution.
Affiliation(s)
- Mikkel H. Schierup
  - Bioinformatics Research Center (BiRC), Aarhus University, Aarhus, Denmark
- John Wakeley
  - Department of Organismic and Evolutionary Biology, Harvard University, Massachusetts, United States of America
- Asger Hobolth
  - Department of Mathematics, Aarhus University, Aarhus, Denmark
|
27
|
van der Valk T, Jensen A, Caillaud D, Guschanski K. Comparative genomic analyses provide new insights into evolutionary history and conservation genomics of gorillas. BMC Ecol Evol 2024; 24:14. [PMID: 38273244 PMCID: PMC10811819 DOI: 10.1186/s12862-023-02195-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Accepted: 12/22/2023] [Indexed: 01/27/2024] Open
Abstract
Genome sequencing is a powerful tool to understand species evolutionary history, uncover genes under selection, which could be informative of local adaptation, and infer measures of genetic diversity, inbreeding and mutational load that could be used to inform conservation efforts. Gorillas, critically endangered primates, have received considerable attention and with the recently sequenced Bwindi mountain gorilla population, genomic data is now available from all gorilla subspecies and both mountain gorilla populations. Here, we reanalysed this rich dataset with a focus on evolutionary history, local adaptation and genomic parameters relevant for conservation. We estimate a recent split between western and eastern gorillas of 150,000-180,000 years ago, with gene flow around 20,000 years ago, primarily between the Cross River and Grauer's gorilla subspecies. This gene flow event likely obscures evolutionary relationships within eastern gorillas: after excluding putatively introgressed genomic regions, we uncover a sister relationship between Virunga mountain gorillas and Grauer's gorillas to the exclusion of Bwindi mountain gorillas. This makes mountain gorillas paraphyletic. Eastern gorillas are less genetically diverse and more inbred than western gorillas, yet we detected lower genetic load in the eastern species. Analyses of indels fit remarkably well with differences in genetic diversity across gorilla taxa as recovered with nucleotide diversity measures. We also identified genes under selection and unique gene variants specific for each gorilla subspecies, encoding, among others, traits involved in immunity, diet, muscular development, hair morphology and behavior. The presence of this functional variation suggests that the subspecies may be locally adapted. 
In conclusion, using extensive genomic resources we provide a comprehensive overview of gorilla genomic diversity, including a so-far understudied Bwindi mountain gorilla population, identify putative genes involved in local adaptation, and detect population-specific gene flow across gorilla species.
Affiliation(s)
- Tom van der Valk
  - Centre for Palaeogenetics, Stockholm, Sweden
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden
  - SciLifeLab, Stockholm, Sweden
  - Department of Zoology, Stockholm University, Stockholm, Sweden
- Axel Jensen
  - Department of Ecology and Genetics, Animal Ecology, Uppsala University, Uppsala, Sweden
- Damien Caillaud
  - Department of Anthropology, University of CA - Davis, Davis, California, USA
- Katerina Guschanski
  - SciLifeLab, Stockholm, Sweden
  - Department of Ecology and Genetics, Animal Ecology, Uppsala University, Uppsala, Sweden
  - Institute of Ecology and Evolution, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
|
28
|
Simon A, Coop G. The contribution of gene flow, selection, and genetic drift to five thousand years of human allele frequency change. bioRxiv 2024:2023.07.11.548607. [PMID: 37503227 PMCID: PMC10370008 DOI: 10.1101/2023.07.11.548607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Genomic time series from experimental evolution studies and ancient DNA datasets offer us a chance to directly observe the interplay of various evolutionary forces. We show how the genome-wide variance in allele frequency change between two time points can be decomposed into the contributions of gene flow, genetic drift, and linked selection. In closed populations, the contribution of linked selection is identifiable because it creates covariances between time intervals, and genetic drift does not. However, repeated gene flow between populations can also produce directionality in allele frequency change, creating covariances. We show how to accurately separate the fraction of variance in allele frequency change due to admixture and linked selection in a population receiving gene flow. We use two human ancient DNA datasets, spanning around 5,000 years, as time transects to quantify the contributions to the genome-wide variance in allele frequency change. We find that a large fraction of genome-wide change is due to gene flow. In both cases, after correcting for known major gene flow events, we do not observe a signal of genome-wide linked selection. Thus despite the known role of selection in shaping long-term polymorphism levels, and an increasing number of examples of strong selection on single loci and polygenic scores from ancient DNA, it appears to be gene flow and drift, and not selection, that are the main determinants of recent genome-wide allele frequency change. Our approach should be applicable to the growing number of contemporary and ancient temporal population genomics datasets.
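The statement that drift alone creates no covariance between time intervals can be checked with a small simulation. The sketch below is a toy neutral Wright-Fisher model, not the authors' method: it splits the variance of the total allele frequency change across loci into two per-interval variances plus twice their covariance. All function names and the sample sizes are hypothetical choices for illustration.

```python
import random


def wright_fisher(p0, n_copies, generations, rng):
    """Neutral drift: binomial resampling of n_copies gene copies per generation."""
    p = p0
    for _ in range(generations):
        p = sum(rng.random() < p for _ in range(n_copies)) / n_copies
    return p


def covariance(xs, ys):
    """Unbiased sample covariance (the sample variance when xs is ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def decompose(p0, p1, p2):
    """Split Var(p2 - p0) across loci into interval variances and their covariance."""
    d1 = [b - a for a, b in zip(p0, p1)]
    d2 = [b - a for a, b in zip(p1, p2)]
    total = [b - a for a, b in zip(p0, p2)]
    return {
        "var_total": covariance(total, total),
        "var_1": covariance(d1, d1),
        "var_2": covariance(d2, d2),
        "cov_12": covariance(d1, d2),
    }


rng = random.Random(1)
n_loci, n_copies = 2000, 100  # hypothetical sizes, chosen only for illustration
p0 = [0.5] * n_loci
p1 = [wright_fisher(p, n_copies, 1, rng) for p in p0]
p2 = [wright_fisher(p, n_copies, 1, rng) for p in p1]
parts = decompose(p0, p1, p2)
```

The identity var_total = var_1 + var_2 + 2 * cov_12 always holds. Under pure drift the covariance term is close to zero relative to the variances, so a clearly non-zero covariance points to a directional force such as linked selection or, as the abstract notes, repeated gene flow.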
Affiliation(s)
- Alexis Simon
  - Center for Population Biology, University of California, Davis, CA 95616
  - Department of Evolution and Ecology, University of California, Davis, CA 95616
- Graham Coop
  - Center for Population Biology, University of California, Davis, CA 95616
  - Department of Evolution and Ecology, University of California, Davis, CA 95616
|
29
|
Zhang Y, Zhang H, Wu Y. A general approach for inferring the ancestry of recent ancestors of an admixed individual. Proc Natl Acad Sci U S A 2024; 121:e2316242120. [PMID: 38165936 PMCID: PMC10786287 DOI: 10.1073/pnas.2316242120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 11/27/2023] [Indexed: 01/04/2024] Open
Abstract
The genome of an individual from an admixed population consists of segments originated from different ancestral populations. Most existing ancestry inference approaches focus on calling these segments for the extant individual. In this paper, we present a general ancestry inference approach for inferring recent ancestors from an extant genome. Given the genome of an individual from a recently admixed population, our method can estimate the proportions of the genomes of the recent ancestors of this individual that originated from some ancestral populations. The key step of our method is the inference of ancestors (called founders) right after the formation of an admixed population. The inferred founders can then be used to infer the ancestry of recent ancestors of an extant individual. Our method is implemented in a computer program called PedMix2. To the best of our knowledge, there is no existing method that can practically infer ancestors beyond grandparents from an extant individual's genome. Results on both simulated and real data show that PedMix2 performs well in ancestry inference.
Affiliation(s)
- Yiming Zhang
  - School of Computing, College of Engineering, University of Connecticut, Storrs, CT 06269
- Haotian Zhang
  - School of Computing, College of Engineering, University of Connecticut, Storrs, CT 06269
- Yufeng Wu
  - School of Computing, College of Engineering, University of Connecticut, Storrs, CT 06269
|
30
|
Stankowski S, Zagrodzka ZB, Garlovsky MD, Pal A, Shipilina D, Castillo DG, Lifchitz H, Le Moan A, Leder E, Reeve J, Johannesson K, Westram AM, Butlin RK. The genetic basis of a recent transition to live-bearing in marine snails. Science 2024; 383:114-119. [PMID: 38175895 DOI: 10.1126/science.adi2982] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/25/2023] [Accepted: 10/25/2023] [Indexed: 01/06/2024]
Abstract
Key innovations are fundamental to biological diversification, but their genetic basis is poorly understood. A recent transition from egg-laying to live-bearing in marine snails (Littorina spp.) provides the opportunity to study the genetic architecture of an innovation that has evolved repeatedly across animals. Individuals do not cluster by reproductive mode in a genome-wide phylogeny, but local genealogical analysis revealed numerous small genomic regions where all live-bearers carry the same core haplotype. Candidate regions show evidence for live-bearer-specific positive selection and are enriched for genes that are differentially expressed between egg-laying and live-bearing reproductive systems. Ages of selective sweeps suggest that live-bearer-specific alleles accumulated over more than 200,000 generations. Our results suggest that new functions evolve through the recruitment of many alleles rather than in a single evolutionary step.
Affiliation(s)
- Sean Stankowski
  - Ecology and Evolutionary Biology, School of Biosciences, University of Sheffield, Sheffield S10 2TN, UK
  - Institute of Science and Technology Austria (ISTA), 3400 Klosterneuburg, Austria
  - Department of Ecology and Evolution, University of Sussex, Brighton BN1 9RH, UK
- Zuzanna B Zagrodzka
  - Ecology and Evolutionary Biology, School of Biosciences, University of Sheffield, Sheffield S10 2TN, UK
- Martin D Garlovsky
  - Department of Applied Zoology, Faculty of Biology, Technische Universität Dresden, 01069 Dresden, Germany
- Arka Pal
  - Institute of Science and Technology Austria (ISTA), 3400 Klosterneuburg, Austria
- Daria Shipilina
  - Institute of Science and Technology Austria (ISTA), 3400 Klosterneuburg, Austria
  - Department of Ecology and Genetics, Program of Evolutionary Biology, Uppsala University, SE-752 36 Uppsala, Sweden
- Hila Lifchitz
  - Institute of Science and Technology Austria (ISTA), 3400 Klosterneuburg, Austria
- Alan Le Moan
  - CNRS and Sorbonne Université, Station Biologique de Roscoff, 29680 Roscoff, France
  - Department of Marine Sciences, Tjärnö Marine Laboratory, University of Gothenburg, 452 96 Strömstad, Sweden
- Erica Leder
  - Department of Marine Sciences, Tjärnö Marine Laboratory, University of Gothenburg, 452 96 Strömstad, Sweden
  - Natural History Museum, University of Oslo, 0562 Oslo, Norway
- James Reeve
  - Department of Marine Sciences, Tjärnö Marine Laboratory, University of Gothenburg, 452 96 Strömstad, Sweden
- Kerstin Johannesson
  - Department of Marine Sciences, Tjärnö Marine Laboratory, University of Gothenburg, 452 96 Strömstad, Sweden
- Anja M Westram
  - Institute of Science and Technology Austria (ISTA), 3400 Klosterneuburg, Austria
  - Faculty of Biosciences and Aquaculture, Nord University, N-8049 Bodø, Norway
- Roger K Butlin
  - Ecology and Evolutionary Biology, School of Biosciences, University of Sheffield, Sheffield S10 2TN, UK
  - Department of Marine Sciences, Tjärnö Marine Laboratory, University of Gothenburg, 452 96 Strömstad, Sweden
|
31
|
Benham PM, Walsh J, Bowie RCK. Spatial variation in population genomic responses to over a century of anthropogenic change within a tidal marsh songbird. Glob Chang Biol 2024; 30:e17126. [PMID: 38273486 DOI: 10.1111/gcb.17126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Revised: 11/22/2023] [Accepted: 12/13/2023] [Indexed: 01/27/2024]
Abstract
Combating the current biodiversity crisis requires the accurate documentation of population responses to human-induced ecological change. However, our ability to pinpoint population responses to human activities is often limited to the analysis of populations studied well after the fact. Museum collections preserve a record of population responses to anthropogenic change that can provide critical baseline data on patterns of genetic diversity, connectivity, and population structure prior to the onset of human perturbation. Here, we leverage a spatially replicated time series of specimens to document population genomic responses to the destruction of nearly 90% of coastal habitats occupied by the Savannah sparrow (Passerculus sandwichensis) in California. We sequenced 219 sparrows collected from 1889 to 2017 across the state of California using an exome capture approach. Spatial-temporal analyses of genetic diversity found that the amount of habitat lost was not predictive of genetic diversity loss. Sparrow populations from southern California historically exhibited lower levels of genetic diversity and experienced the most significant temporal declines in genetic diversity. Despite experiencing the greatest levels of habitat loss, we found that genetic diversity in the San Francisco Bay area remained relatively high. This was potentially related to an observed increase in gene flow into the Bay Area from other populations. While gene flow may have minimized genetic diversity declines, we also found that immigration from inland freshwater-adapted populations into tidal marsh populations led to the erosion of divergence at loci associated with tidal marsh adaptation. Shifting patterns of gene flow through time in response to habitat loss may thus contribute to negative fitness consequences and outbreeding depression. 
Together, our results underscore the importance of tracing the genomic trajectories of multiple populations over time to address issues of fundamental conservation concern.
Affiliation(s)
- Phred M Benham
  - Museum of Vertebrate Zoology, University of California, Berkeley, Berkeley, California, USA
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, California, USA
- Jennifer Walsh
  - Fuller Evolutionary Biology Program, Cornell Lab of Ornithology, Cornell University, Ithaca, New York, USA
- Rauri C K Bowie
  - Museum of Vertebrate Zoology, University of California, Berkeley, Berkeley, California, USA
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, California, USA
|
32
|
Lewanski AL, Grundler MC, Bradburd GS. The era of the ARG: An introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics. PLoS Genet 2024; 20:e1011110. [PMID: 38236805 PMCID: PMC10796009 DOI: 10.1371/journal.pgen.1011110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2024] Open
Abstract
In the presence of recombination, the evolutionary relationships between a set of sampled genomes cannot be described by a single genealogical tree. Instead, the genomes are related by a complex, interwoven collection of genealogies formalized in a structure called an ancestral recombination graph (ARG). An ARG extensively encodes the ancestry of the genome(s) and thus is replete with valuable information for addressing diverse questions in evolutionary biology. Despite its potential utility, technological and methodological limitations, along with a lack of approachable literature, have severely restricted awareness and application of ARGs in evolution research. Excitingly, recent progress in ARG reconstruction and simulation has made ARG-based approaches feasible for many questions and systems. In this review, we provide an accessible introduction and exploration of ARGs, survey recent methodological breakthroughs, and describe the potential for ARGs to further existing goals and open avenues of inquiry that were previously inaccessible in evolutionary genomics. Through this discussion, we aim to more widely disseminate the promise of ARGs in evolutionary genomics and encourage the broader development and adoption of ARG-based inference.
Affiliation(s)
- Alexander L. Lewanski
  - Department of Integrative Biology, Michigan State University, East Lansing, Michigan, United States of America
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, Michigan, United States of America
  - Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, Michigan, United States of America
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
- Michael C. Grundler
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
- Gideon S. Bradburd
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, Michigan, United States of America
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
|
33
|
Abstract
In population genetics, the emergence of large-scale genomic data for various species and populations has provided new opportunities to understand the evolutionary forces that drive genetic diversity using statistical inference. However, the era of population genomics presents new challenges in analysing the massive amounts of genomes and variants. Deep learning has demonstrated state-of-the-art performance for numerous applications involving large-scale data. Recently, deep learning approaches have gained popularity in population genetics; facilitated by the advent of massive genomic data sets, powerful computational hardware and complex deep learning architectures, they have been used to identify population structure, infer demographic history and investigate natural selection. Here, we introduce common deep learning architectures and provide comprehensive guidelines for implementing deep learning models for population genetic inference. We also discuss current challenges and future directions for applying deep learning in population genetics, focusing on efficiency, robustness and interpretability.
Affiliation(s)
- Xin Huang
  - Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
  - Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
- Aigerim Rymbekova
  - Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
  - Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
- Olga Dolgova
  - Integrative Genomics Laboratory, CIC bioGUNE - Centro de Investigación Cooperativa en Biociencias, Derio, Biscaya, Spain
- Oscar Lao
  - Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, Barcelona, Spain
- Martin Kuhlwilm
  - Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
  - Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
|
34
|
Mendes FK, Landis MJ. PhyloJunction: a computational framework for simulating, developing, and teaching evolutionary models. bioRxiv 2023:2023.12.15.571907. [PMID: 38168278 PMCID: PMC10760140 DOI: 10.1101/2023.12.15.571907] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2024]
Abstract
We introduce PhyloJunction, a computational framework designed to facilitate the prototyping, testing, and characterization of evolutionary models. PhyloJunction is distributed as an open-source Python library that can be used to implement a variety of models, through its flexible graphical modeling architecture and dedicated model specification language. Model design and use are exposed to users via command-line and graphical interfaces, which integrate the steps of simulating, summarizing, and visualizing data. This paper describes the features of PhyloJunction - which include, but are not limited to, a general implementation of a popular family of phylogenetic diversification models - and, moving forward, how it may be expanded to not only include new models, but to also become a platform for conducting and teaching statistical learning.
Affiliation(s)
- Fábio K. Mendes
  - Department of Biology, Washington University in St. Louis, St. Louis, MO
- Michael J. Landis
  - Department of Biology, Washington University in St. Louis, St. Louis, MO
|
35
|
Link V, Schraiber JG, Fan C, Dinh B, Mancuso N, Chiang CWK, Edge MD. Tree-based QTL mapping with expected local genetic relatedness matrices. Am J Hum Genet 2023; 110:2077-2091. [PMID: 38065072 PMCID: PMC10716520 DOI: 10.1016/j.ajhg.2023.10.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/08/2023] [Revised: 10/26/2023] [Accepted: 10/27/2023] [Indexed: 12/18/2023] Open
Abstract
Understanding the genetic basis of complex phenotypes is a central pursuit of genetics. Genome-wide association studies (GWASs) are a powerful way to find genetic loci associated with phenotypes. GWASs are widely and successfully used, but they face challenges related to the fact that variants are tested for association with a phenotype independently, whereas in reality variants at different sites are correlated because of their shared evolutionary history. One way to model this shared history is through the ancestral recombination graph (ARG), which encodes a series of local coalescent trees. Recent computational and methodological breakthroughs have made it feasible to estimate approximate ARGs from large-scale samples. Here, we explore the potential of an ARG-based approach to quantitative-trait locus (QTL) mapping, echoing existing variance-components approaches. We propose a framework that relies on the conditional expectation of a local genetic relatedness matrix (local eGRM) given the ARG. Simulations show that our method is especially beneficial for finding QTLs in the presence of allelic heterogeneity. By framing QTL mapping in terms of the estimated ARG, we can also facilitate the detection of QTLs in understudied populations. We use local eGRM to analyze two chromosomes containing known body size loci in a sample of Native Hawaiians. Our investigations can provide intuition about the benefits of using estimated ARGs in population- and statistical-genetic methods in general.
Affiliation(s)
- Vivian Link
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Joshua G Schraiber
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Caoqi Fan
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Bryan Dinh
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Nicholas Mancuso
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Charleston W K Chiang
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Michael D Edge
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
|
36
|
Kim K, Kim D, Hanotte O, Lee C, Kim H, Jeong C. Inference of Admixture Origins in Indigenous African Cattle. Mol Biol Evol 2023; 40:msad257. [PMID: 37995300 PMCID: PMC10701095 DOI: 10.1093/molbev/msad257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2023] [Revised: 10/12/2023] [Accepted: 11/14/2023] [Indexed: 11/25/2023] Open
Abstract
Present-day African cattle retain a unique genetic profile composed of a mixture of the Bos taurus and Bos indicus populations introduced into the continent at different time periods. However, details of the admixture history and the exact origins of the source populations remain obscure. Here, we infer the source of admixture in the earliest domestic cattle in Africa, African taurine. We detect a significant contribution (up to ∼20%) from a basal taurine lineage, which might represent the now-extinct African aurochs. In addition, we show that the indicine ancestry of African cattle, although most closely related to so-far sampled North Indian indicine breeds, has a small amount of additional genetic affinity to Southeast Asian indicine breeds. Our findings support the hypothesis of aurochs introgression into African taurine and generate a novel hypothesis that the origin of indicine ancestry in Africa might be different indicine populations than the ones found in North India today.
Affiliation(s)
- Kwondo Kim
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
| | - Donghee Kim
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| | - Olivier Hanotte
- LiveGene, International Livestock Research Institute (ILRI), Addis Ababa, Ethiopia
- The Centre for Tropical Livestock Genetics and Health (CTLGH), The Roslin Institute, The University of Edinburgh, Midlothian, UK
- School of Life Sciences, University of Nottingham, Nottingham, UK
| | - Charles Lee
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Heebal Kim
- Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- eGnome, Inc., Seoul, Republic of Korea
| | - Choongwon Jeong
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| |
|
37
|
Otto M, Wiehe T. The structured coalescent in the context of gene copy number variation. Theor Popul Biol 2023; 154:67-78. [PMID: 37657649 DOI: 10.1016/j.tpb.2023.08.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 08/16/2023] [Accepted: 08/22/2023] [Indexed: 09/03/2023]
Abstract
The Structured Coalescent was introduced to describe the coalescent process in spatially subdivided populations with migration. Here, we re-interpret migration routes of individuals in the original model as "migration routes" of single genes in tandemly arranged gene arrays. A gene copy may change its position within the array via unequal recombination. Hence, in a coalescent framework, two copies sampled from two chromosomes may coalesce only if they are at exactly homologous positions. Otherwise, one or multiple recombination events have to occur before they can coalesce, thereby increasing mean coalescence time and expected genetic diversity among the copies in a gene array. We explicitly calculate the transition probabilities on these routes backward in time. We simulate the structured coalescent with migration and coalescence rates informed by the unequal recombination process of gene copies. With this novel interpretation of population structure models we determine coalescence times and expected genetic diversity in samples of orthologous and paralogous copies from a gene family. As a case study, we discuss the site frequency spectrum of a small gene family in the two scenarios of high and of no gene copy number variation among individuals. These examples underline the significance of our model, since standard test-statistics may lead to misinterpretations when analyzing sequence data of multi-copy genes due to their different expected genetic diversity.
Affiliation(s)
- Moritz Otto
- University of Cologne, Institute for Genetics, Zuelpicher Str. 47a, Cologne, 50674, Germany
| | - Thomas Wiehe
- University of Cologne, Institute for Genetics, Zuelpicher Str. 47a, Cologne, 50674, Germany.
| |
|
38
|
Silcocks M, Farlow A, Hermes A, Tsambos G, Patel HR, Huebner S, Baynam G, Jenkins MR, Vukcevic D, Easteal S, Leslie S. Indigenous Australian genomes show deep structure and rich novel variation. Nature 2023; 624:593-601. [PMID: 38093005 PMCID: PMC10733150 DOI: 10.1038/s41586-023-06831-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/29/2022] [Accepted: 11/03/2023] [Indexed: 12/20/2023]
Abstract
The Indigenous peoples of Australia have a rich linguistic and cultural history. How this relates to genetic diversity remains largely unknown because of their limited engagement with genomic studies. Here we analyse the genomes of 159 individuals from four remote Indigenous communities, including people who speak a language (Tiwi) not from the most widespread family (Pama-Nyungan). This large collection of Indigenous Australian genomes was made possible by careful community engagement and consultation. We observe exceptionally strong population structure across Australia, driven by divergence times between communities of 26,000-35,000 years ago and long-term low but stable effective population sizes. This demographic history, including early divergence from Papua New Guinean (47,000 years ago) and Eurasian groups (ref. 1), has generated the highest proportion of previously undescribed genetic variation seen outside Africa and the most extended homozygosity compared with global samples. A substantial proportion of this variation is not observed in global reference panels or clinical datasets, and variation with predicted functional consequence is more likely to be homozygous than in other populations, with consequent implications for medical genomics (ref. 2). Our results show that Indigenous Australians are not a single homogeneous genetic group and their genetic relationship with the peoples of New Guinea is not uniform. These patterns imply that the full breadth of Indigenous Australian genetic diversity remains uncharacterized, potentially limiting genomic medicine and equitable healthcare for Indigenous Australians.
Affiliation(s)
- Matthew Silcocks
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
- University of Melbourne, School of Biosciences, Parkville, Victoria, Australia
| | - Ashley Farlow
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
- University of Melbourne, School of Mathematics and Statistics, Parkville, Victoria, Australia
| | - Azure Hermes
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Georgia Tsambos
- University of Melbourne, School of Mathematics and Statistics, Parkville, Victoria, Australia
| | - Hardip R Patel
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Sharon Huebner
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Gareth Baynam
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
- Faculty of Health and Medical Sciences, Division of Paediatrics and Telethon Kids Institute, University of Western Australia, Perth, Western Australia, Australia
- Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital and Rare Care Centre, Perth Children's Hospital, Perth, Western Australia, Australia
| | - Misty R Jenkins
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
- Immunology Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
- University of Melbourne, Department of Medical Biology, Parkville, Victoria, Australia
| | - Damjan Vukcevic
- University of Melbourne, School of Mathematics and Statistics, Parkville, Victoria, Australia
| | - Simon Easteal
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Stephen Leslie
- National Centre for Indigenous Genomics, John Curtin School of Medical Research, Australian National University, Canberra, Australian Capital Territory, Australia.
- University of Melbourne, School of Biosciences, Parkville, Victoria, Australia.
- University of Melbourne, School of Mathematics and Statistics, Parkville, Victoria, Australia.
| |
|
39
|
Jensen A, Swift F, de Vries D, Beck RMD, Kuderna LFK, Knauf S, Chuma IS, Keyyu JD, Kitchener AC, Farh K, Rogers J, Marques-Bonet T, Detwiler KM, Roos C, Guschanski K. Complex Evolutionary History With Extensive Ancestral Gene Flow in an African Primate Radiation. Mol Biol Evol 2023; 40:msad247. [PMID: 37987553 PMCID: PMC10691879 DOI: 10.1093/molbev/msad247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2023] [Revised: 10/17/2023] [Accepted: 11/09/2023] [Indexed: 11/22/2023] Open
Abstract
Understanding the drivers of speciation is fundamental in evolutionary biology, and recent studies highlight hybridization as an important evolutionary force. Using whole-genome sequencing data from 22 species of guenons (tribe Cercopithecini), one of the world's largest primate radiations, we show that rampant gene flow characterizes their evolutionary history and identify ancient hybridization across deeply divergent lineages that differ in ecology, morphology, and karyotypes. Some hybridization events resulted in mitochondrial introgression between distant lineages, likely facilitated by cointrogression of coadapted nuclear variants. Although the genomic landscapes of introgression were largely lineage specific, we found that genes with immune functions were overrepresented in introgressing regions, in line with adaptive introgression, whereas genes involved in pigmentation and morphology may contribute to reproductive isolation. In line with reports from other systems that hybridization might facilitate diversification, we find that some of the most species-rich guenon clades are of admixed origin. This study provides important insights into the prevalence, role, and outcomes of ancestral hybridization in a large mammalian radiation.
Affiliation(s)
- Axel Jensen
- Department of Ecology and Genetics, Animal Ecology, Uppsala University, Uppsala SE-75236, Sweden
| | - Frances Swift
- School of Biological Sciences, Institute of Ecology and Evolution, University of Edinburgh, Edinburgh, UK
| | - Dorien de Vries
- School of Science, Engineering & Environment, University of Salford, Salford M5 4WT, UK
| | - Robin M D Beck
- School of Science, Engineering & Environment, University of Salford, Salford M5 4WT, UK
| | - Lukas F K Kuderna
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA 94404, USA
| | - Sascha Knauf
- Institute of International Animal Health/One Health, Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Greifswald – Insel Riems 17493, Germany
| | | | - Julius D Keyyu
- Tanzania Wildlife Research Institute (TAWIRI), Arusha, Tanzania
| | - Andrew C Kitchener
- Department of Natural Sciences, National Museums Scotland, Edinburgh EH1 1JF, UK
- School of Geosciences, University of Edinburgh, Edinburgh EH8 9XP, UK
| | - Kyle Farh
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA 94404, USA
| | - Jeffrey Rogers
- Human Genome Sequencing Center and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Tomas Marques-Bonet
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Barcelona 08003, Spain
- Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Barcelona, Spain
- Catalan Institution of Research and Advanced Studies (ICREA), Barcelona, Spain
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona 08028, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA) and Universitat Pompeu Fabra, Barcelona 08010, Spain
| | - Kate M Detwiler
- Department of Biological Sciences, Florida Atlantic University, Boca Raton, FL, USA
| | - Christian Roos
- Gene Bank of Primates and Primate Genetics Laboratory, German Primate Center, Leibniz Institute for Primate Research, Göttingen 37077, Germany
| | - Katerina Guschanski
- Department of Ecology and Genetics, Animal Ecology, Uppsala University, Uppsala SE-75236, Sweden
- School of Biological Sciences, Institute of Ecology and Evolution, University of Edinburgh, Edinburgh, UK
| |
|
40
|
Miró Pina V, Joly É, Siri-Jégousse A. Estimating the Lambda measure in multiple-merger coalescents. Theor Popul Biol 2023; 154:94-101. [PMID: 37742787 DOI: 10.1016/j.tpb.2023.09.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 09/13/2023] [Accepted: 09/15/2023] [Indexed: 09/26/2023]
Abstract
Multiple-merger coalescents, also known as Λ-coalescents, have been used to describe the genealogy of populations that have a skewed offspring distribution or that undergo strong selection. Inferring the characteristic measure Λ, which describes the rates of the multiple-merger events, is key to understanding these processes. So far, most inference methods work only for particular families of Λ-coalescents that are described by a single parameter, not for more general models. This article is devoted to the construction of a non-parametric estimator of the density of Λ that is based on the observation at a single time of the so-called Site Frequency Spectrum (SFS), which describes the allelic frequencies in a present population sample. First, we produce estimates of the multiple-merger rates by solving a linear system whose coefficients are obtained by appropriately subsampling the SFS. Then, we use a technique that aggregates the information extracted from the previous step through a kernel type of reconstruction to give a non-parametric estimate of the measure Λ. We give a consistency result for this estimator under mild conditions on the behavior of Λ around 0. We also show some numerical examples of how our method performs.
Affiliation(s)
- Verónica Miró Pina
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Instituto de Investigaciones en Matemáticas Aplicadas y Sistemas, Universidad Nacional Autónoma de México, CDMX, Mexico
| | - Émilien Joly
- Centro de Investigación en Matemáticas, AC (CIMAT), Guanajuato, Mexico
| | - Arno Siri-Jégousse
- Instituto de Investigaciones en Matemáticas Aplicadas y Sistemas, Universidad Nacional Autónoma de México, CDMX, Mexico.
| |
|
41
|
Tsambos G, Kelleher J, Ralph P, Leslie S, Vukcevic D. link-ancestors: fast simulation of local ancestry with tree sequence software. Bioinform Adv 2023; 3:vbad163. [PMID: 38033661 PMCID: PMC10682689 DOI: 10.1093/bioadv/vbad163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 10/23/2023] [Accepted: 11/16/2023] [Indexed: 12/02/2023]
Abstract
Summary: It is challenging to simulate realistic tracts of genetic ancestry on a scale suitable for simulation-based inference. We present an algorithm that enables this information to be extracted efficiently from tree sequences produced by simulations run with msprime and SLiM. Availability and implementation: A C-based implementation of the link-ancestors algorithm is in tskit (https://tskit.dev/tskit/docs/stable/). We also provide a user-friendly wrapper for link-ancestors in tspop, a Python-based utility package.
Affiliation(s)
- Georgia Tsambos
- School of Mathematics and Statistics, University of Melbourne, Melbourne, Victoria, 3052, Australia
- Melbourne Integrative Genomics, University of Melbourne, Melbourne, Victoria, 3052, Australia
- Department of Genome Sciences, University of Washington, Seattle, Washington, 98195, United States
| | - Jerome Kelleher
- Big Data Institute, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
| | - Peter Ralph
- Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, 97403, United States
| | - Stephen Leslie
- School of Mathematics and Statistics, University of Melbourne, Melbourne, Victoria, 3052, Australia
- Melbourne Integrative Genomics, University of Melbourne, Melbourne, Victoria, 3052, Australia
- School of BioSciences, University of Melbourne, Melbourne, Victoria, 3052, Australia
| | - Damjan Vukcevic
- School of Mathematics and Statistics, University of Melbourne, Melbourne, Victoria, 3052, Australia
- Melbourne Integrative Genomics, University of Melbourne, Melbourne, Victoria, 3052, Australia
- Department of Econometrics and Business Statistics, Monash University, Melbourne, Victoria, 3168, Australia
| |
|
42
|
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. bioRxiv 2023:2023.11.03.565574. [PMID: 37961601 PMCID: PMC10635131 DOI: 10.1101/2023.11.03.565574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more efficient collection and storage of identity by descent (IBD) information than approaches that detect and store pairwise IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach.
Affiliation(s)
| | - Brian L. Browning
- Department of Biostatistics, University of Washington, Seattle, WA
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA
| |
|
43
|
Farleigh K, Ascanio A, Farleigh ME, Schield DR, Card DC, Leal M, Castoe TA, Jezkova T, Rodríguez-Robles JA. Signals of differential introgression in the genome of natural hybrids of Caribbean anoles. Mol Ecol 2023; 32:6000-6017. [PMID: 37861454 DOI: 10.1111/mec.17170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2021] [Revised: 08/30/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023]
Abstract
Hybridization facilitates recombination between divergent genetic lineages and can be shaped by both neutral and selective processes. Upon hybridization, loci with no net fitness effects introgress randomly from parental species into the genomes of hybrid individuals. Conversely, alleles from one parental species at some loci may provide a selective advantage to hybrids, resulting in patterns of introgression that do not conform to random expectations. We investigated genomic patterns of differential introgression in natural hybrids of two species of Caribbean anoles, Anolis pulchellus and A. krugi in Puerto Rico. Hybrids exhibit A. pulchellus phenotypes but possess A. krugi mitochondrial DNA, originated from multiple, independent hybridization events, and appear to have replaced pure A. pulchellus across a large area in western Puerto Rico. Combining genome-wide SNP datasets with bioinformatic methods to identify signals of differential introgression in hybrids, we demonstrate that the genomes of hybrids are dominated by pulchellus-derived alleles and show only 10%-20% A. krugi ancestry. The majority of A. krugi loci in hybrids exhibit a signal of non-random differential introgression and include loci linked to genes involved in development and immune function. Three of these genes (delta like canonical notch ligand 1, jagged1 and notch receptor 1) affect cell differentiation and growth and interact with mitochondrial function. Our results suggest that differential non-random introgression for a subset of loci may be driven by selection favouring the inheritance of compatible mitochondrial and nuclear-encoded genes in hybrids.
Affiliation(s)
- Keaka Farleigh
- Department of Biology, Miami University, Oxford, Ohio, USA
| | | | | | - Drew R Schield
- Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, Colorado, USA
| | - Daren C Card
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts, USA
- Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts, USA
| | - Manuel Leal
- Division of Biological Sciences, University of Missouri, Columbia, Missouri, USA
| | - Todd A Castoe
- Department of Biology, University of Texas, Arlington, Arlington, Texas, USA
| | - Tereza Jezkova
- Department of Biology, Miami University, Oxford, Ohio, USA
| | | |
|
44
|
Rasmussen DA, Guo F. Espalier: Efficient Tree Reconciliation and Ancestral Recombination Graphs Reconstruction Using Maximum Agreement Forests. Syst Biol 2023; 72:1154-1170. [PMID: 37458753 PMCID: PMC10627558 DOI: 10.1093/sysbio/syad040] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/26/2022] [Revised: 06/26/2023] [Accepted: 06/30/2023] [Indexed: 11/08/2023] Open
Abstract
In the presence of recombination, individuals may inherit different regions of their genome from different ancestors, resulting in a mosaic of phylogenetic histories across their genome. Ancestral recombination graphs (ARGs) can capture how phylogenetic relationships vary across the genome due to recombination, but reconstructing ARGs from genomic sequence data is notoriously difficult. Here, we present a method for reconciling discordant phylogenetic trees and reconstructing ARGs using maximum agreement forests (MAFs). Given two discordant trees, a MAF identifies the smallest possible set of topologically concordant subtrees present in both trees. We show how discordant trees can be reconciled through their MAF in a way that retains discordances strongly supported by sequence data while eliminating conflicts likely attributable to phylogenetic noise. We further show how MAFs and our reconciliation approach can be combined to select a path of local trees across the genome that maximizes the likelihood of the genomic sequence data, minimizes discordance between neighboring local trees, and identifies the recombination events necessary to explain remaining discordances to obtain a fully connected ARG. While heuristic, our ARG reconstruction approach is often as accurate as more exact methods while being much more computationally efficient. Moreover, important demographic parameters such as recombination rates can be accurately estimated from reconstructed ARGs. Finally, we apply our approach to plant-infecting RNA viruses in the genus Potyvirus to demonstrate how true recombination events can be disentangled from phylogenetic noise using our ARG reconstruction methods.
Affiliation(s)
- David A Rasmussen
- Department of Entomology and Plant Pathology, North Carolina State University, Campus Box 7613, Raleigh, NC 27695, USA
- Bioinformatics Research Center, North Carolina State University, Campus Box 7566, Raleigh, NC 27695, USA
| | - Fangfang Guo
- Department of Entomology and Plant Pathology, North Carolina State University, Campus Box 7613, Raleigh, NC 27695, USA
| |
|
45
|
Mo Z, Siepel A. Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data. PLoS Genet 2023; 19:e1011032. [PMID: 37934781 PMCID: PMC10655966 DOI: 10.1371/journal.pgen.1011032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Revised: 11/17/2023] [Accepted: 10/23/2023] [Indexed: 11/09/2023] Open
Abstract
Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods: SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.
Affiliation(s)
- Ziyi Mo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| |
|
46
|
Yüncü E, Işıldak U, Williams MP, Huber CD, Flegontova O, Vyazov LA, Changmai P, Flegontov P. False discovery rates of qpAdm-based screens for genetic admixture. bioRxiv 2023:2023.04.25.538339. [PMID: 37904998 PMCID: PMC10614728 DOI: 10.1101/2023.04.25.538339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2023]
Abstract
Although a broad range of methods exists for reconstructing population history from genome-wide single nucleotide polymorphism data, just a few methods gained popularity in archaeogenetics: principal component analysis (PCA); ADMIXTURE, an algorithm that models individuals as mixtures of multiple ancestral sources represented by actual or inferred populations; formal tests for admixture such as f3-statistics and D/f4-statistics; and qpAdm, a tool for fitting two-component and more complex admixture models to groups or individuals. Despite their popularity in archaeogenetics, which is explained by modest computational requirements and ability to analyze data of various types and qualities, protocols relying on qpAdm that screen numerous alternative models of varying complexity and find "fitting" models (often considering both estimated admixture proportions and p-values as a composite criterion of model fit) remain untested on complex simulated population histories in the form of admixture graphs of random topology. We analyzed genotype data extracted from such simulations and tested various types of high-throughput qpAdm protocols ("rotating" and "non-rotating", with or without temporal stratification of target groups and proxy ancestry sources, and with or without a "model competition" step). We caution that high-throughput qpAdm protocols may be inappropriate for exploratory analyses in poorly studied regions/periods since their false discovery rates varied between 12% and 68% depending on the details of the protocol and on the amount and quality of simulated data (i.e., >12% of fitting two-way admixture models imply gene flows that were not simulated). 
We demonstrate that for reducing false discovery rates of qpAdm protocols to nearly 0% it is advisable to use large SNP sets with low missing data rates, the rotating qpAdm protocol with a strictly enforced rule that target groups do not pre-date their proxy sources, and an unsupervised ADMIXTURE analysis as a way to verify feasible qpAdm models. Our study has a number of limitations: for instance, these recommendations depend on the assumption that the underlying genetic history is a complex admixture graph and not a stepping-stone model.
Affiliation(s)
- Eren Yüncü
- Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czechia
| | - Ulaş Işıldak
- Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czechia
| | - Matthew P. Williams
- Department of Biology, Eberly College of Science, Pennsylvania State University, PA, USA
| | - Christian D. Huber
- Department of Biology, Eberly College of Science, Pennsylvania State University, PA, USA
| | - Olga Flegontova
- Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czechia
- Institute of Parasitology, Biology Centre of the Czech Academy of Sciences, České Budějovice, Czechia
| | - Leonid A. Vyazov
- Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czechia
| | - Piya Changmai
- Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czechia
| | - Pavel Flegontov
- Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czechia
- Department of Human Evolutionary Biology, Harvard University, Cambridge, MA, USA
| |
|
47
|
Lewanski AL, Grundler MC, Bradburd GS. The era of the ARG: an empiricist's guide to ancestral recombination graphs. ArXiv 2023:arXiv:2310.12070v1. [PMID: 37904740 PMCID: PMC10614969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
In the presence of recombination, the evolutionary relationships between a set of sampled genomes cannot be described by a single genealogical tree. Instead, the genomes are related by a complex, interwoven collection of genealogies formalized in a structure called an ancestral recombination graph (ARG). An ARG extensively encodes the ancestry of the genome(s) and thus is replete with valuable information for addressing diverse questions in evolutionary biology. Despite its potential utility, technological and methodological limitations, along with a lack of approachable literature, have severely restricted awareness and application of ARGs in empirical evolution research. Excitingly, recent progress in ARG reconstruction and simulation has made ARG-based approaches feasible for many questions and systems. In this review, we provide an accessible introduction to and exploration of ARGs, survey recent methodological breakthroughs, and describe the potential for ARGs to further existing goals and open avenues of inquiry that were previously inaccessible in evolutionary genomics. Through this discussion, we aim to more widely disseminate the promise of ARGs in evolutionary genomics and encourage the broader development and adoption of ARG-based inference.
Affiliation(s)
- Alexander L Lewanski
  - Department of Integrative Biology, Michigan State University, East Lansing, MI, US
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, US
  - Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI, US
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, US
- Michael C Grundler
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, US
- Gideon S Bradburd
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, US
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, US
|
48
|
Mozhaitseva K, Tourrain Z, Branca A. Population Genomics of the Mostly Thelytokous Diplolepis rosae (Linnaeus, 1758) (Hymenoptera: Cynipidae) Reveals Population-specific Selection for Sex. Genome Biol Evol 2023; 15:evad185. [PMID: 37831420 PMCID: PMC10608957 DOI: 10.1093/gbe/evad185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2023] [Revised: 10/02/2023] [Accepted: 10/09/2023] [Indexed: 10/14/2023]
Abstract
In Hymenoptera, arrhenotokous parthenogenesis (arrhenotoky) is a common reproductive mode. Thelytokous parthenogenesis (thelytoky), when virgin females produce only females, is less common and is found in several taxa. In our study, we assessed the efficacy of recombination and the effect of thelytoky on the genome structure of Diplolepis rosae, a gall wasp producing bedeguars in dog roses. We assembled a high-quality reference genome using Oxford Nanopore long-read technology and sequenced 17 samples collected in France with high-coverage Illumina reads. We found two D. rosae peripatric lineages that differed in the level of recombination and homozygosity. One of the D. rosae lineages showed a recombination rate that was 13.2 times higher and per-individual heterozygosity that was 1.6 times higher. In the more recombining lineage, the genes enriched in functions related to male traits ("sperm competition", "insemination", and "copulation" gene ontology terms) showed signals of purifying selection, whereas in the less recombining lineage, the same genes showed traces pointing towards balancing or relaxed selection. Thus, although D. rosae reproduces mainly by thelytoky, selection may act to maintain sexual reproduction.
Affiliation(s)
- Ksenia Mozhaitseva
  - Laboratoire Evolution, Génomes, Comportement, Ecologie, l’Institut Diversité, Ecologie et Evolution du Vivant, Université Paris-Saclay, Gif-sur-Yvette, France
- Zoé Tourrain
  - Laboratoire Evolution, Génomes, Comportement, Ecologie, l’Institut Diversité, Ecologie et Evolution du Vivant, Université Paris-Saclay, Gif-sur-Yvette, France
- Antoine Branca
  - Laboratoire Evolution, Génomes, Comportement, Ecologie, l’Institut Diversité, Ecologie et Evolution du Vivant, Université Paris-Saclay, Gif-sur-Yvette, France
|
49
|
Medina-Muñoz SG, Ortega-Del Vecchyo D, Cruz-Hervert LP, Ferreyra-Reyes L, García-García L, Moreno-Estrada A, Ragsdale AP. Demographic modeling of admixed Latin American populations from whole genomes. Am J Hum Genet 2023; 110:1804-1816. [PMID: 37725976 PMCID: PMC10577084 DOI: 10.1016/j.ajhg.2023.08.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 08/17/2023] [Accepted: 08/23/2023] [Indexed: 09/21/2023]
Abstract
Demographic models of Latin American populations often fail to fully capture their complex evolutionary history, which has been shaped by both recent admixture and deeper-in-time demographic events. To address this gap, we used high-coverage whole-genome data from Indigenous American ancestries in present-day Mexico and existing genomes from across Latin America to infer multiple demographic models that capture the impact of different timescales on genetic diversity. Our approach, which combines analyses of allele frequencies and ancestry tract length distributions, represents a significant improvement over current models in predicting patterns of genetic variation in admixed Latin American populations. We jointly modeled the contribution of European, African, East Asian, and Indigenous American ancestries into present-day Latin American populations. We infer that the ancestors of Indigenous Americans and East Asians diverged ∼30 thousand years ago, and we characterize genetic contributions of recent migrations from East and Southeast Asia to Peru and Mexico. Our inferred demographic histories are consistent across different genomic regions and annotations, suggesting that our inferences are robust to the potential effects of linked selection. In conjunction with published distributions of fitness effects for new nonsynonymous mutations in humans, we show in large-scale simulations that our models recover important features of both neutral and deleterious variation. By providing a more realistic framework for understanding the evolutionary history of Latin American populations, our models can help address the historical under-representation of admixed groups in genomics research and can be a valuable resource for future studies of populations with complex admixture and demographic histories.
Affiliation(s)
- Santiago G Medina-Muñoz
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Advanced Genomics Unit (UGA), CINVESTAV, Irapuato, Guanajuato 36824, Mexico
- Diego Ortega-Del Vecchyo
  - Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de Mexico, Juriquilla, Querétaro 76230, Mexico
- Andrés Moreno-Estrada
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Advanced Genomics Unit (UGA), CINVESTAV, Irapuato, Guanajuato 36824, Mexico
- Aaron P Ragsdale
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Advanced Genomics Unit (UGA), CINVESTAV, Irapuato, Guanajuato 36824, Mexico
  - Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI 53706, USA
|
50
|
Laetsch DR, Bisschop G, Martin SH, Aeschbacher S, Setter D, Lohse K. Demographically explicit scans for barriers to gene flow using gIMble. PLoS Genet 2023; 19:e1010999. [PMID: 37816069 PMCID: PMC10610087 DOI: 10.1371/journal.pgen.1010999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Revised: 10/27/2023] [Accepted: 09/25/2023] [Indexed: 10/12/2023]
Abstract
Identifying regions of the genome that act as barriers to gene flow between recently diverged taxa has remained challenging given the many evolutionary forces that generate variation in genetic diversity and divergence along the genome, and the stochastic nature of this variation. Progress has been impeded by a conceptual and methodological divide between analyses that infer the demographic history of speciation and genome scans aimed at identifying locally maladaptive alleles, i.e., genomic barriers to gene flow. Here we implement genomewide IM blockwise likelihood estimation (gIMble), a composite likelihood approach for the quantification of barriers that bridges this divide. This analytic framework captures background selection and selection against barriers in a model of isolation with migration (IM) as heterogeneity in effective population size (Ne) and effective migration rate (me), respectively. Variation in both effective demographic parameters is estimated in sliding windows via pre-computed likelihood grids. gIMble includes modules for pre-processing/filtering of genomic data and performing parametric bootstraps using coalescent simulations. To demonstrate the new approach, we analyse data from a well-studied pair of sister species of tropical butterflies with a known history of post-divergence gene flow: Heliconius melpomene and H. cydno. Our analyses uncover both large-effect barrier loci (including well-known wing-pattern genes) and a genome-wide signal of a polygenic barrier architecture.
Affiliation(s)
- Dominik R. Laetsch
  - Institute of Ecology and Evolution, University of Edinburgh, Edinburgh, United Kingdom
- Gertjan Bisschop
  - Institute of Ecology and Evolution, University of Edinburgh, Edinburgh, United Kingdom
- Simon H. Martin
  - Institute of Ecology and Evolution, University of Edinburgh, Edinburgh, United Kingdom
- Simon Aeschbacher
  - Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
- Derek Setter
  - Institute of Ecology and Evolution, University of Edinburgh, Edinburgh, United Kingdom
- Konrad Lohse
  - Institute of Ecology and Evolution, University of Edinburgh, Edinburgh, United Kingdom
|