1
|
Salles MMA, Domingos FMCB. Towards the next generation of species delimitation methods: an overview of machine learning applications. Mol Phylogenet Evol 2025; 210:108368. [PMID: 40348350 DOI: 10.1016/j.ympev.2025.108368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 02/25/2025] [Accepted: 05/04/2025] [Indexed: 05/14/2025]
Abstract
Species delimitation is the process of distinguishing between populations of the same species and distinct species of a particular group of organisms. Various methods exist for inferring species limits, whether based on morphological, molecular, or other types of data. In the case of methods based on DNA sequences, most of them are rooted in the coalescent theory. However, coalescence-based models have limitations, for instance regarding complex evolutionary scenarios and large datasets. In this context, machine learning (ML) can be considered as a promising analytical tool, and provides an effective way to explore dataset structures when species-level divergences are hypothesized. In this review, we examine the use of ML in species delimitation and provide an overview and critical appraisal of existing workflows. We also provide simple explanations on how the main types of ML approaches operate, which should help uninitiated researchers and students interested in the field. Our review suggests that while current ML methods designed to infer species limits are analytically powerful, they also present specific limitations and should not be considered as definitive alternatives to coalescent methods for species delimitation. Future ML enterprises to delimit species should consider the constraints related to the use of simulated data, as in other model-based methods relying on simulations. Conversely, the flexibility of ML algorithms offers a significant advantage by enabling the analysis of diverse data types (e.g., genetic and phenotypic) and handling large datasets effectively. We also propose best practices for the use of ML methods in species delimitation, offering insights into potential future applications. We expect that the proposed guidelines will be useful for enhancing the accessibility, effectiveness, and objectivity of ML in species delimitation.
Collapse
Affiliation(s)
- Matheus M A Salles
- Departamento de Zoologia, Universidade Federal do Paraná, Curitiba 81531-980, Brazil.
| | | |
Collapse
|
2
|
Cridland JM, Polston ES, Begun DJ. New perspectives on Drosophila melanogaster de novo gene origination revealed by investigation of ancient African genetic variation. Genetics 2025; 230:iyaf044. [PMID: 40106667 PMCID: PMC12059636 DOI: 10.1093/genetics/iyaf044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2024] [Accepted: 03/04/2025] [Indexed: 03/22/2025] Open
Abstract
De novo genes can be defined as sequences producing evolutionarily derived transcripts that are not homologous to transcripts produced in an ancestor. While they appear to be taxonomically widespread, there is little agreement regarding their abundance, their persistence times in genomes, the population genetic processes responsible for their spread or loss, or their possible functions. In Drosophila melanogaster, 2 approaches have been used to discover these genes and investigate their properties. One uses traditional comparative approaches and existing genomic resources and annotations. A second approach uses raw transcriptome data to discover unannotated genes for which there is no evidence of presence in related species. Investigations using the second approach have focused on D. melanogaster genotypes from recently established cosmopolitan populations. However, most of the genetic variation in the species is found in African populations, suggesting the possibility that fuller understanding of genetic novelties in the species may follow from studies of these populations. Here, we investigate de novo gene candidates expressed in testis and accessory glands in a sample of flies from Zambia and compare them with candidate de novo genes expressed in North American populations. We report a large number of previously undiscovered de novo gene candidates, most of which are expressed polymorphically. Many are predicted to code for secreted proteins. In spite of much different levels of genomic variation in Zambian and North American populations, they express similar numbers of candidate de novo genes. We find evidence from genetic analysis of Raleigh inbred lines that a fraction of rarely expressed gene candidates in this population represent deleterious transcription promoted by inbreeding depression. Many de novo gene candidates are expressed in multiple tissues and both sexes, raising questions about how they may interact with natural selection. The relative importance of positive and negative selection, however, remains unclear.
Collapse
Affiliation(s)
- Julie M Cridland
- Department of Evolution and Ecology, University of California, Davis, Davis, CA 95616, USA
| | - Elizabeth S Polston
- Department of Evolution and Ecology, University of California, Davis, Davis, CA 95616, USA
| | - David J Begun
- Department of Evolution and Ecology, University of California, Davis, Davis, CA 95616, USA
| |
Collapse
|
3
|
Raymond M, Descary MH, Beaulac C, Larribe F. Constructing ancestral recombination graphs through reinforcement learning. Front Genet 2025; 16:1569358. [PMID: 40364947 PMCID: PMC12069460 DOI: 10.3389/fgene.2025.1569358] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2025] [Accepted: 04/16/2025] [Indexed: 05/15/2025] Open
Abstract
Introduction Over the years, many approaches have been proposed to build ancestral recombination graphs (ARGs), graphs used to represent the genetic relationship between individuals. Among these methods, many rely on the assumption that the most likely graph is among those with the fewest recombination events. In this paper, we propose a new approach to build maximum parsimony ARGs: Reinforcement Learning (RL). Methods We exploit the similarities between finding the shortest path between a set of genetic sequences and their most recent common ancestor and finding the shortest path between the entrance and exit of a maze, a classic RL problem. In the maze problem, the learner, called the agent, must learn the directions to take in order to escape as quickly as possible, whereas in our problem, the agent must learn the actions to take between coalescence, mutation, and recombination in order to reach the most recent common ancestor as quickly as possible. Results Our results show that RL can be used to build ARGs with as few recombination events as those built with a heuristic algorithm optimized to build minimal ARGs, and sometimes even fewer. Moreover, our method allows to build a distribution of ARGs with few recombination events for a given sample, and can also generalize learning to new samples not used during the learning process. Discussion RL is a promising and innovative approach to build ARGs. By learning to construct ARGs just from the data, our method differs from conventional methods that rely on heuristic rules or complex theoretical models.
Collapse
Affiliation(s)
- Mélanie Raymond
- Department of Mathematics, Université du Québec à Montréal, Montréal, QC, Canada
| | | | | | | |
Collapse
|
4
|
Dabi A, Schrider DR. Population size rescaling significantly biases outcomes of forward-in-time population genetic simulations. Genetics 2025; 229:1-57. [PMID: 39503241 PMCID: PMC11708920 DOI: 10.1093/genetics/iyae180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Accepted: 10/18/2024] [Indexed: 11/13/2024] Open
Abstract
Simulations are an essential tool in all areas of population genetic research, used in tasks such as the validation of theoretical analysis and the study of complex evolutionary models. Forward-in-time simulations are especially flexible, allowing for various types of natural selection, complex genetic architectures, and non-Wright-Fisher dynamics. However, their intense computational requirements can be prohibitive to simulating large populations and genomes. A popular method to alleviate this burden is to scale down the population size by some scaling factor while scaling up the mutation rate, selection coefficients, and recombination rate by the same factor. However, this rescaling approach may in some cases bias simulation results. To investigate the manner and degree to which rescaling impacts simulation outcomes, we carried out simulations with different demographic histories and distributions of fitness effects using several values of the rescaling factor, Q, and compared the deviation of key outcomes (fixation times, allele frequencies, linkage disequilibrium, and the fraction of mutations that fix during the simulation) between the scaled and unscaled simulations. Our results indicate that scaling introduces substantial biases to each of these measured outcomes, even at small values of Q. Moreover, the nature of these effects depends on the evolutionary model and scaling factor being examined. While increasing the scaling factor tends to increase the observed biases, this relationship is not always straightforward; thus, it may be difficult to know the impact of scaling on simulation outcomes a priori. However, it appears that for most models, only a small number of replicates was needed to accurately quantify the bias produced by rescaling for a given Q. In summary, while rescaling forward-in-time simulations may be necessary in many cases, researchers should be aware of the rescaling procedure's impact on simulation outcomes and consider investigating its magnitude in smaller scale simulations of the desired model(s) before selecting an appropriate value of Q.
Collapse
Affiliation(s)
- Amjad Dabi
- Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Daniel R Schrider
- Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| |
Collapse
|
5
|
Tripathi D, Bhattacharyya C, Basu A. Deep learning insights into distinct patterns of polygenic adaptation across human populations. Nucleic Acids Res 2024; 52:e102. [PMID: 39558170 DOI: 10.1093/nar/gkae1027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2024] [Revised: 10/10/2024] [Accepted: 10/17/2024] [Indexed: 11/20/2024] Open
Abstract
Response to spatiotemporal variation in selection gradients resulted in signatures of polygenic adaptation in human genomes. We introduce RAISING, a two-stage deep learning framework that optimizes neural network architecture through hyperparameter tuning before performing feature selection and prediction tasks. We tested RAISING on published and newly designed simulations that incorporate the complex interplay between demographic history and selection gradients. RAISING outperformed Phylogenetic Generalized Least Squares (PGLS), ridge regression and DeepGenomeScan, with significantly higher true positive rates (TPR) in detecting genetic adaptation. It reduced computational time by 60-fold and increased TPR by up to 28% compared to DeepGenomeScan on published data. In more complex demographic simulations, RAISING showed lower false discoveries and significantly higher TPR, up to 17-fold, compared to other methods. RAISING demonstrated robustness with least sensitivity to demographic history, selection gradient and their interactions. We developed a sliding window method for genome-wide implementation of RAISING to overcome the computational challenges of high-dimensional genomic data. Applied to African, European, South Asian and East Asian populations, we identified multiple genomic regions undergoing polygenic selection. Notably, ∼70% of the regions identified in Africans are unique, with broad patterns distinguishing them from non-Africans, corroborating the Out of Africa dispersal model.
Collapse
Affiliation(s)
- Devashish Tripathi
- Biotechnology Research Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), Kalyani, 741251, West Bengal, India
- Regional Centre for Biotechnology, NCR Biotech Science Cluster, 3rd Milestone, Faridabad-Gurugram Expressway, Faridabad 121001, Haryana (Delhi NCR), India
| | - Chandrika Bhattacharyya
- Biotechnology Research Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), Kalyani, 741251, West Bengal, India
| | - Analabha Basu
- Biotechnology Research Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), Kalyani, 741251, West Bengal, India
- Regional Centre for Biotechnology, NCR Biotech Science Cluster, 3rd Milestone, Faridabad-Gurugram Expressway, Faridabad 121001, Haryana (Delhi NCR), India
| |
Collapse
|
6
|
Amin MR, Hasan M, DeGiorgio M. Digital Image Processing to Detect Adaptive Evolution. Mol Biol Evol 2024; 41:msae242. [PMID: 39565932 PMCID: PMC11631197 DOI: 10.1093/molbev/msae242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2024] [Revised: 10/28/2024] [Accepted: 11/13/2024] [Indexed: 11/22/2024] Open
Abstract
In recent years, advances in image processing and machine learning have fueled a paradigm shift in detecting genomic regions under natural selection. Early machine learning techniques employed population-genetic summary statistics as features, which focus on specific genomic patterns expected by adaptive and neutral processes. Though such engineered features are important when training data are limited, the ease at which simulated data can now be generated has led to the recent development of approaches that take in image representations of haplotype alignments and automatically extract important features using convolutional neural networks. Digital image processing methods termed α-molecules are a class of techniques for multiscale representation of objects that can extract a diverse set of features from images. One such α-molecule method, termed wavelet decomposition, lends greater control over high-frequency components of images. Another α-molecule method, termed curvelet decomposition, is an extension of the wavelet concept that considers events occurring along curves within images. We show that application of these α-molecule techniques to extract features from image representations of haplotype alignments yield high true positive rate and accuracy to detect hard and soft selective sweep signatures from genomic data with both linear and nonlinear machine learning classifiers. Moreover, we find that such models are easy to visualize and interpret, with performance rivaling those of contemporary deep learning approaches for detecting sweeps.
Collapse
Affiliation(s)
- Md Ruhul Amin
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Mahmudul Hasan
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Michael DeGiorgio
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| |
Collapse
|
7
|
Witt KE, Villanea FA. Computational Genomics and Its Applications to Anthropological Questions. AMERICAN JOURNAL OF BIOLOGICAL ANTHROPOLOGY 2024; 186 Suppl 78:e70010. [PMID: 40071816 PMCID: PMC11898561 DOI: 10.1002/ajpa.70010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/24/2024] [Revised: 10/14/2024] [Accepted: 12/19/2024] [Indexed: 03/15/2025]
Abstract
The advent of affordable genome sequencing and the development of new computational tools have established a new era of genomic knowledge. Sequenced human genomes number in the tens of thousands, including thousands of ancient human genomes. The abundance of data has been met with new analysis tools that can be used to understand populations' demographic and evolutionary histories. Thus, a variety of computational methods now exist that can be leveraged to answer anthropological questions. This includes novel likelihood and Bayesian methods, machine learning techniques, and a vast array of population simulators. These computational tools provide powerful insights gained from genomic datasets, although they are generally inaccessible to those with less computational experience. Here, we outline the theoretical workings behind computational genomics methods, limitations and other considerations when applying these computational methods, and examples of how computational methods have already been applied to anthropological questions. We hope this review will empower other anthropologists to utilize these powerful tools in their own research.
Collapse
Affiliation(s)
- Kelsey E. Witt
- Department of Genetics and Biochemistry and Center for Human GeneticsClemson UniversityClemsonSouth CarolinaUSA
| | | |
Collapse
|
8
|
Wu CI, Li C. Artificial intelligence and biological research. Natl Sci Rev 2024; 11:nwae415. [PMID: 39611042 PMCID: PMC11604050 DOI: 10.1093/nsr/nwae415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2024] [Accepted: 11/13/2024] [Indexed: 11/30/2024] Open
Affiliation(s)
| | - Cai Li
- School of Life Sciences, Sun Yat-sen University, China
| |
Collapse
|
9
|
Whitehouse LS, Ray DD, Schrider DR. Tree Sequences as a General-Purpose Tool for Population Genetic Inference. Mol Biol Evol 2024; 41:msae223. [PMID: 39460991 PMCID: PMC11600592 DOI: 10.1093/molbev/msae223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2024] [Revised: 10/05/2024] [Accepted: 10/17/2024] [Indexed: 10/28/2024] Open
Abstract
As population genetic data increase in size, new methods have been developed to store genetic information in efficient ways, such as tree sequences. These data structures are computationally and storage efficient but are not interchangeable with existing data structures used for many population genetic inference methodologies such as the use of convolutional neural networks applied to population genetic alignments. To better utilize these new data structures, we propose and implement a graph convolutional network to directly learn from tree sequence topology and node data, allowing for the use of neural network applications without an intermediate step of converting tree sequences to population genetic alignment format. We then compare our approach to standard convolutional neural network approaches on a set of previously defined benchmarking tasks including recombination rate estimation, positive selection detection, introgression detection, and demographic model parameter inference. We show that tree sequences can be directly learned from using a graph convolutional network approach and can be used to perform well on these common population genetic inference tasks with accuracies roughly matching or even exceeding that of a convolutional neural network-based method. As tree sequences become more widely used in population genetic research, we foresee developments and optimizations of this work to provide a foundation for population genetic inference moving forward.
Collapse
Affiliation(s)
- Logan S Whitehouse
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
| | - Dylan D Ray
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
| | - Daniel R Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
| |
Collapse
|
10
|
Cheng X, Steinrücken M. Population Genomic Scans for Natural Selection and Demography. Annu Rev Genet 2024; 58:319-339. [PMID: 39227130 DOI: 10.1146/annurev-genet-111523-102651] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/05/2024]
Abstract
Uncovering the fundamental processes that shape genomic variation in natural populations is a primary objective of population genetics. These processes include demographic effects such as past changes in effective population size or gene flow between structured populations. Furthermore, genomic variation is affected by selection on nonneutral genetic variants, for example, through the adaptation of beneficial alleles or balancing selection that maintains genetic variation. In this article, we discuss the characterization of these processes using population genetic models, and we review methods developed on the basis of these models to unravel the underlying processes from modern population genomic data sets. We briefly discuss the conditions in which these approaches can be used to infer demography or identify specific nonneutral genetic variants and cases in which caution is warranted. Moreover, we summarize the challenges of jointly inferring demography and selective processes that affect neutral variation genome-wide.
Collapse
Affiliation(s)
- Xiaoheng Cheng
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois, USA;
| | - Matthias Steinrücken
- Department of Human Genetics, University of Chicago, Chicago, Illinois, USA
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois, USA;
| |
Collapse
|
11
|
Laval G, Patin E, Quintana-Murci L, Kerner G. Deep estimation of the intensity and timing of natural selection from ancient genomes. Mol Ecol Resour 2024; 24:e14015. [PMID: 39215552 DOI: 10.1111/1755-0998.14015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2024] [Revised: 07/22/2024] [Accepted: 08/15/2024] [Indexed: 09/04/2024]
Abstract
Leveraging past allele frequencies has proven to be key for identifying the impact of natural selection across time. However, this approach suffers from imprecise estimations of the intensity (s) and timing (T) of selection, particularly when ancient samples are scarce in specific epochs. Here, we aimed to bypass the computation of allele frequencies across arbitrarily defined past epochs and refine the estimations of selection parameters by implementing convolutional neural networks (CNNs) algorithms that directly use ancient genotypes sampled across time. Using computer simulations, we first show that genotype-based CNNs consistently outperform an approximate Bayesian computation (ABC) approach based on past allele frequency trajectories, regardless of the selection model assumed and the number of available ancient genotypes. When applying this method to empirical data from modern and ancient Europeans, we replicated the reported increased number of selection events in post-Neolithic Europe, independently of the continental subregion studied. Furthermore, we substantially refined the ABC-based estimations of s and T for a set of positively and negatively selected variants, including iconic cases of positive selection and experimentally validated disease-risk variants. Our CNN predictions support a history of recent positive and negative selection targeting variants associated with host defence against pathogens, aligning with previous work that highlights the significant impact of infectious diseases, such as tuberculosis, in Europe. These findings collectively demonstrate that detecting the footprints of natural selection on ancient genomes is crucial for unravelling the history of severe human diseases.
Collapse
Affiliation(s)
- Guillaume Laval
- Human Evolutionary Genetics Unit, Institut Pasteur, Université Paris Cité, CNRS UMR2000, Paris, France
| | - Etienne Patin
- Human Evolutionary Genetics Unit, Institut Pasteur, Université Paris Cité, CNRS UMR2000, Paris, France
| | - Lluis Quintana-Murci
- Human Evolutionary Genetics Unit, Institut Pasteur, Université Paris Cité, CNRS UMR2000, Paris, France
- Chair of Human Genomics and Evolution, Collège de France, Paris, France
| | - Gaspard Kerner
- Human Evolutionary Genetics Unit, Institut Pasteur, Université Paris Cité, CNRS UMR2000, Paris, France
| |
Collapse
|
12
|
Whitehouse LS, Ray D, Schrider DR. Tree sequences as a general-purpose tool for population genetic inference. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.20.581288. [PMID: 39185244 PMCID: PMC11343121 DOI: 10.1101/2024.02.20.581288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/27/2024]
Abstract
As population genetics data increases in size new methods have been developed to store genetic information in efficient ways, such as tree sequences. These data structures are computationally and storage efficient, but are not interchangeable with existing data structures used for many population genetic inference methodologies such as the use of convolutional neural networks (CNNs) applied to population genetic alignments. To better utilize these new data structures we propose and implement a graph convolutional network (GCN) to directly learn from tree sequence topology and node data, allowing for the use of neural network applications without an intermediate step of converting tree sequences to population genetic alignment format. We then compare our approach to standard CNN approaches on a set of previously defined benchmarking tasks including recombination rate estimation, positive selection detection, introgression detection, and demographic model parameter inference. We show that tree sequences can be directly learned from using a GCN approach and can be used to perform well on these common population genetics inference tasks with accuracies roughly matching or even exceeding that of a CNN-based method. As tree sequences become more widely used in population genetics research we foresee developments and optimizations of this work to provide a foundation for population genetics inference moving forward.
Collapse
Affiliation(s)
- Logan S. Whitehouse
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA, 120 Mason Farm Rd, Chapel Hill, NC 27514
| | - Dylan Ray
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA, 120 Mason Farm Rd, Chapel Hill, NC 27514
| | - Daniel R. Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA, 120 Mason Farm Rd, Chapel Hill, NC 27514
| |
Collapse
|
13
|
Zhao S, Chi L, Fu M, Chen H. HaploSweep: Detecting and Distinguishing Recent Soft and Hard Selective Sweeps through Haplotype Structure. Mol Biol Evol 2024; 41:msae192. [PMID: 39288167 PMCID: PMC11452351 DOI: 10.1093/molbev/msae192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2024] [Revised: 07/29/2024] [Accepted: 09/03/2024] [Indexed: 09/19/2024] Open
Abstract
Identifying soft selective sweeps using genomic data is a challenging yet crucial task in population genetics. In this study, we present HaploSweep, a novel method for detecting and categorizing soft and hard selective sweeps based on haplotype structure. Through simulations spanning a broad range of selection intensities, softness levels, and demographic histories, we demonstrate that HaploSweep outperforms iHS, nSL, and H12 in detecting soft sweeps. HaploSweep achieves high classification accuracy-0.9247 for CHB, 0.9484 for CEU, and 0.9829 YRI-when applied to simulations in line with the human Out-of-Africa demographic model. We also observe that the classification accuracy remains consistently robust across different demographic models. Additionally, we introduce a refined method to accurately distinguish soft shoulders adjacent to hard sweeps from soft sweeps. Application of HaploSweep to genomic data of CHB, CEU, and YRI populations from the 1000 genomes project has led to the discovery of several new genes that bear strong evidence of population-specific soft sweeps (HRNR, AMBRA1, CBFA2T2, DYNC2H1, and RANBP2 etc.), with prevalent associations to immune functions and metabolic processes. The validated performance of HaploSweep, demonstrated through both simulated and real data, underscores its potential as a valuable tool for detecting and comprehending the role of soft sweeps in adaptive evolution.
Collapse
Affiliation(s)
- Shilei Zhao
- Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Lianjiang Chi
- Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Mincong Fu
- Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Hua Chen
- Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
- School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
- CAS Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
| |
Collapse
|
14
|
Smith CCR, Patterson G, Ralph PL, Kern AD. Estimation of spatial demographic maps from polymorphism data using a neural network. Mol Ecol Resour 2024; 24:e14005. [PMID: 39152666 DOI: 10.1111/1755-0998.14005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2024] [Revised: 07/16/2024] [Accepted: 08/06/2024] [Indexed: 08/19/2024]
Abstract
A fundamental goal in population genetics is to understand how variation is arrayed over natural landscapes. From first principles we know that common features such as heterogeneous population densities and barriers to dispersal should shape genetic variation over space, however there are few tools currently available that can deal with these ubiquitous complexities. Geographically referenced single nucleotide polymorphism (SNP) data are increasingly accessible, presenting an opportunity to study genetic variation across geographic space in myriad species. We present a new inference method that uses geo-referenced SNPs and a deep neural network to estimate spatially heterogeneous maps of population density and dispersal rate. Our neural network trains on simulated input and output pairings, where the input consists of genotypes and sampling locations generated from a continuous space population genetic simulator, and the output is a map of the true demographic parameters. We benchmark our tool against existing methods and discuss qualitative differences between the different approaches; in particular, our program is unique because it infers the magnitude of both dispersal and density as well as their variation over the landscape, and it does so using SNP data. Similar methods are constrained to estimating relative migration rates, or require identity-by-descent blocks as input. We applied our tool to empirical data from North American grey wolves, for which it estimated mostly reasonable demographic parameters, but was affected by incomplete spatial sampling. Genetic based methods like ours complement other, direct methods for estimating past and present demography, and we believe will serve as valuable tools for applications in conservation, ecology and evolutionary biology. An open source software package implementing our method is available from https://github.com/kr-colab/mapNN.
Collapse
Affiliation(s)
- Chris C R Smith
- Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, USA
| | - Gilia Patterson
- Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, USA
| | - Peter L Ralph
- Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, USA
| | - Andrew D Kern
- Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, USA
| |
Collapse
|
15
|
Kumar H, Qin X, Bhushan B, Dutt T, Panigrahi M. DeepGenomeScan of 15 Worldwide Bovine Populations Detects Spatially Varying Positive Selection Signals. OMICS : A JOURNAL OF INTEGRATIVE BIOLOGY 2024; 28:504-513. [PMID: 39315920 DOI: 10.1089/omi.2024.0154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/25/2024]
Abstract
Identifying genomic regions under selection is essential for understanding the genetic mechanisms driving species evolution and adaptation. Traditional methods often fall short in detecting complex, spatially varying selection signals. Recent advances in deep learning, however, present promising new approaches for uncovering subtle selection signals that traditional methods might miss. In this study, we utilized the deep learning framework DeepGenomeScan to detect spatially varying selection signatures across 15 bovine populations worldwide. Our analysis uncovered novel insights into selective sweep hotspots within the bovine genome, revealing key genes associated with physiological and adaptive traits that were previously undetected. We identified significant quantitative trait loci linked to milk protein and fat percentages. By comparing the selection signatures identified in this study with those reported in the Bovine Genome Variation Database, we discovered 38 novel genes under selection that were not identified through traditional methods. These genes are primarily associated with milk and meat yield and quality. Our findings enhance our understanding of spatially varying selection's impact on bovine genomic diversity, laying a foundation for future research in genetic improvement and conservation. This is the first deep learning-based study of selection signatures in cattle, offering new insights for evolutionary and livestock genomics research.
Collapse
Affiliation(s)
- Harshit Kumar
- Division of Animal Genetics, Indian Veterinary Research Institute, Izatnagar, India
- ICAR-National Research Centre on Mithun, Medziphema, India
| | - Xinghu Qin
- School of Ecology and Nature Conservation, Beijing Forestry University, Beijing, China
| | - Bharat Bhushan
- Division of Animal Genetics, Indian Veterinary Research Institute, Izatnagar, India
| | - Triveni Dutt
- Indian Veterinary Research Institute, Izatnagar, India
| | - Manjit Panigrahi
- Division of Animal Genetics, Indian Veterinary Research Institute, Izatnagar, India
| |
Collapse
|
16
|
Dabi A, Schrider DR. Population size rescaling significantly biases outcomes of forward-in-time population genetic simulations. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.04.07.588318. [PMID: 38645049 PMCID: PMC11030438 DOI: 10.1101/2024.04.07.588318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/23/2024]
Abstract
Simulations are an essential tool in all areas of population genetic research, used in tasks such as the validation of theoretical analysis and the study of complex evolutionary models. Forward-in-time simulations are especially flexible, allowing for various types of natural selection, complex genetic architectures, and non-Wright-Fisher dynamics. However, their intense computational requirements can be prohibitive to simulating large populations and genomes. A popular method to alleviate this burden is to scale down the population size by some scaling factor while scaling up the mutation rate, selection coefficients, and recombination rate by the same factor. However, this rescaling approach may in some cases bias simulation results. To investigate the manner and degree to which rescaling impacts simulation outcomes, we carried out simulations with different demographic histories and distributions of fitness effects using several values of the rescaling factor, Q , and compared the deviation of key outcomes (fixation times, allele frequencies, linkage disequilibrium, and the fraction of mutations that fix during the simulation) between the scaled and unscaled simulations. Our results indicate that scaling introduces substantial biases to each of these measured outcomes, even at small values of Q . Moreover, the nature of these effects depends on the evolutionary model and scaling factor being examined. While increasing the scaling factor tends to increase the observed biases, this relationship is not always straightforward, thus it may be difficult to know the impact of scaling on simulation outcomes a priori. However, it appears that for most models, only a small number of replicates was needed to accurately quantify the bias produced by rescaling for a given Q . In summary, while rescaling forward-in-time simulations may be necessary in many cases, researchers should be aware of the rescaling procedure's impact on simulation outcomes and consider investigating its magnitude in smaller scale simulations of the desired model(s) before selecting an appropriate value of Q .
Collapse
Affiliation(s)
- Amjad Dabi
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA
| | - Daniel R. Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA
| |
Collapse
|
17
|
Yang B, Zhou X, Liu S. Tracing the genealogy origin of geographic populations based on genomic variation and deep learning. Mol Phylogenet Evol 2024; 198:108142. [PMID: 38964594 DOI: 10.1016/j.ympev.2024.108142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2023] [Revised: 05/30/2024] [Accepted: 07/01/2024] [Indexed: 07/06/2024]
Abstract
Assigning a query individual animal or plant to its derived population is a prime task in diverse applications related to organismal genealogy. Such endeavors have conventionally relied on short DNA sequences under a phylogenetic framework. These methods naturally show constraints when the inferred population sources are ambiguously phylogenetically structured, a scenario demanding substantially more informative genetic signals. Recent advances in cost-effective production of whole-genome sequences and artificial intelligence have created an unprecedented opportunity to trace the population origin for essentially any given individual, as long as the genome reference data are comprehensive and standardized. Here, we developed a convolutional neural network method to identify population origins using genomic SNPs. Three empirical datasets (an Asian honeybee, a red fire ant, and a chicken datasets) and two simulated populations are used for the proof of concepts. The performance tests indicate that our method can accurately identify the genealogy origin of query individuals, with success rates ranging from 93 % to 100 %. We further showed that the accuracy of the model can be significantly increased by refining the informative sites through FST filtering. Our method is robust to configurations related to batch sizes and epochs, whereas model learning benefits from the setting of a proper preset learning rate. Moreover, we explained the importance score of key sites for algorithm interpretability and credibility, which has been largely ignored. We anticipate that by coupling genomics and deep learning, our method will see broad potential in conservation and management applications that involve natural resources, invasive pests and weeds, and illegal trades of wildlife products.
Collapse
Affiliation(s)
- Bing Yang
- Department of Entomology, China Agricultural University, Beijing 100193, China
| | - Xin Zhou
- Department of Entomology, China Agricultural University, Beijing 100193, China.
| | - Shanlin Liu
- Department of Entomology, China Agricultural University, Beijing 100193, China; Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.
| |
Collapse
|
18
|
Smith CCR, Patterson G, Ralph PL, Kern AD. Estimation of spatial demographic maps from polymorphism data using a neural network. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.15.585300. [PMID: 38559192 PMCID: PMC10980082 DOI: 10.1101/2024.03.15.585300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 04/04/2024]
Abstract
A fundamental goal in population genetics is to understand how variation is arrayed over natural landscapes. From first principles we know that common features such as heterogeneous population densities and barriers to dispersal should shape genetic variation over space, however there are few tools currently available that can deal with these ubiquitous complexities. Geographically referenced single nucleotide polymorphism (SNP) data are increasingly accessible, presenting an opportunity to study genetic variation across geographic space in myriad species. We present a new inference method that uses geo-referenced SNPs and a deep neural network to estimate spatially heterogeneous maps of population density and dispersal rate. Our neural network trains on simulated input and output pairings, where the input consists of genotypes and sampling locations generated from a continuous space population genetic simulator, and the output is a map of the true demographic parameters. We benchmark our tool against existing methods and discuss qualitative differences between the different approaches; in particular, our program is unique because it infers the magnitude of both dispersal and density as well as their variation over the landscape, and it does so using SNP data. Similar methods are constrained to estimating relative migration rates, or require identity by descent blocks as input. We applied our tool to empirical data from North American grey wolves, for which it estimated mostly reasonable demographic parameters, but was affected by incomplete spatial sampling. Genetic based methods like ours complement other, direct methods for estimating past and present demography, and we believe will serve as valuable tools for applications in conservation, ecology, and evolutionary biology. An open source software package implementing our method is available from https://github.com/kr-colab/mapNN .
Collapse
|
19
|
Braichenko S, Borges R, Kosiol C. Polymorphism-Aware Models in RevBayes: Species Trees, Disentangling Balancing Selection, and GC-Biased Gene Conversion. Mol Biol Evol 2024; 41:msae138. [PMID: 38980178 PMCID: PMC11272101 DOI: 10.1093/molbev/msae138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2023] [Revised: 04/19/2024] [Accepted: 07/06/2024] [Indexed: 07/10/2024] Open
Abstract
The role of balancing selection is a long-standing evolutionary puzzle. Balancing selection is a crucial evolutionary process that maintains genetic variation (polymorphism) over extended periods of time; however, detecting it poses a significant challenge. Building upon the Polymorphism-aware phylogenetic Models (PoMos) framework rooted in the Moran model, we introduce a PoMoBalance model. This novel approach is designed to disentangle the interplay of mutation, genetic drift, and directional selection (GC-biased gene conversion), along with the previously unexplored balancing selection pressures on ultra-long timescales comparable with species divergence times by analyzing multi-individual genomic and phylogenetic divergence data. Implemented in the open-source RevBayes Bayesian framework, PoMoBalance offers a versatile tool for inferring phylogenetic trees as well as quantifying various selective pressures. The novel aspect of our approach in studying balancing selection lies in polymorphism-aware phylogenetic models' ability to account for ancestral polymorphisms and incorporate parameters that measure frequency-dependent selection, allowing us to determine the strength of the effect and exact frequencies under selection. We implemented validation tests and assessed the model on the data simulated with SLiM and a custom Moran model simulator. Real sequence analysis of Drosophila populations reveals insights into the evolutionary dynamics of regions subject to frequency-dependent balancing selection, particularly in the context of sex-limited color dimorphism in Drosophila erecta.
Collapse
Affiliation(s)
- Svitlana Braichenko
- Centre for Biological Diversity, School of Biology, University of St Andrews, Fife KY16 9TH, UK
- Institute of Genetics and Cancer, University of Edinburgh, Edinburgh EH4 2XU, UK
| | - Rui Borges
- Institut für Populationsgenetik, Vetmeduni Vienna, Wien 1210, Austria
| | - Carolin Kosiol
- Centre for Biological Diversity, School of Biology, University of St Andrews, Fife KY16 9TH, UK
| |
Collapse
|
20
|
Marsh JI, Johri P. Biases in ARG-Based Inference of Historical Population Size in Populations Experiencing Selection. Mol Biol Evol 2024; 41:msae118. [PMID: 38874402 PMCID: PMC11245712 DOI: 10.1093/molbev/msae118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2024] [Revised: 06/05/2024] [Accepted: 06/11/2024] [Indexed: 06/15/2024] Open
Abstract
Inferring the demographic history of populations provides fundamental insights into species dynamics and is essential for developing a null model to accurately study selective processes. However, background selection and selective sweeps can produce genomic signatures at linked sites that mimic or mask signals associated with historical population size change. While the theoretical biases introduced by the linked effects of selection have been well established, it is unclear whether ancestral recombination graph (ARG)-based approaches to demographic inference in typical empirical analyses are susceptible to misinference due to these effects. To address this, we developed highly realistic forward simulations of human and Drosophila melanogaster populations, including empirically estimated variability of gene density, mutation rates, recombination rates, purifying, and positive selection, across different historical demographic scenarios, to broadly assess the impact of selection on demographic inference using a genealogy-based approach. Our results indicate that the linked effects of selection minimally impact demographic inference for human populations, although it could cause misinference in populations with similar genome architecture and population parameters experiencing more frequent recurrent sweeps. We found that accurate demographic inference of D. melanogaster populations by ARG-based methods is compromised by the presence of pervasive background selection alone, leading to spurious inferences of recent population expansion, which may be further worsened by recurrent sweeps, depending on the proportion and strength of beneficial mutations. Caution and additional testing with species-specific simulations are needed when inferring population history with non-human populations using ARG-based approaches to avoid misinference due to the linked effects of selection.
Collapse
Affiliation(s)
- Jacob I Marsh
- Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA
| | - Parul Johri
- Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
- Integrative Program for Biological and Genome Sciences, University of North Carolina, Chapel Hill, NC 27599, USA
| |
Collapse
|
21
|
Daron J, Bouafou L, Tennessen JA, Rahola N, Makanga B, Akone-Ella O, Ngangue MF, Longo Pendy NM, Paupy C, Neafsey DE, Fontaine MC, Ayala D. Genomic Signatures of Microgeographic Adaptation in Anopheles coluzzii Along an Anthropogenic Gradient in Gabon. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.16.594472. [PMID: 38798379 PMCID: PMC11118577 DOI: 10.1101/2024.05.16.594472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2024]
Abstract
Species distributed across heterogeneous environments often evolve locally adapted populations, but understanding how these persist in the presence of homogenizing gene flow remains puzzling. In Gabon, Anopheles coluzzii, a major African malaria mosquito is found along an ecological gradient, including a sylvatic population, away of any human presence. This study identifies into the genomic signatures of local adaptation in populations from distinct environments including the urban area of Libreville, and two proximate sites 10km apart in the La Lopé National Park (LLP), a village and its sylvatic neighborhood. Whole genome re-sequencing of 96 mosquitoes unveiled ∼ 5.7millions high-quality single nucleotide polymorphisms. Coalescent-based demographic analyses suggest an ∼ 8,000-year-old divergence between Libreville and La Lopé populations, followed by a secondary contact ( ∼ 4,000 ybp) resulting in asymmetric effective gene flow. The urban population displayed reduced effective size, evidence of inbreeding, and strong selection pressures for adaptation to urban settings, as suggested by the hard selective sweeps associated with genes involved in detoxification and insecticide resistance. In contrast, the two geographically proximate LLP populations showed larger effective sizes, and distinctive genomic differences in selective signals, notably soft-selective sweeps on the standing genetic variation. Although neutral loci and chromosomal inversions failed to discriminate between LLP populations, our findings support that microgeographic adaptation can swiftly emerge through selection on standing genetic variation despite high gene flow. This study contributes to the growing understanding of evolution of populations in heterogeneous environments amid ongoing gene flow and how major malaria mosquitoes adapt to human. Significance Anopheles coluzzii , a major African malaria vector, thrives from humid rainforests to dry savannahs and coastal areas. This ecological success is linked to its close association with domestic settings, with human playing significant roles in driving the recent urban evolution of this mosquito. Our research explores the assumption that these mosquitoes are strictly dependent on human habitats, by conducting whole-genome sequencing on An. coluzzii specimens from urban, rural, and sylvatic sites in Gabon. We found that urban mosquitoes show de novo genetic signatures of human-driven vector control, while rural and sylvatic mosquitoes exhibit distinctive genetic evidence of local adaptations derived from standing genetic variation. Understanding adaptation mechanisms of this mosquito is therefore crucial to predict evolution of vector control strategies.
Collapse
|
22
|
Tran LN, Sun CK, Struck TJ, Sajan M, Gutenkunst RN. Computationally Efficient Demographic History Inference from Allele Frequencies with Supervised Machine Learning. Mol Biol Evol 2024; 41:msae077. [PMID: 38636507 PMCID: PMC11082913 DOI: 10.1093/molbev/msae077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2023] [Revised: 04/08/2024] [Accepted: 04/12/2024] [Indexed: 04/20/2024] Open
Abstract
Inferring past demographic history of natural populations from genomic data is of central concern in many studies across research fields. Previously, our group had developed dadi, a widely used demographic history inference method based on the allele frequency spectrum (AFS) and maximum composite-likelihood optimization. However, dadi's optimization procedure can be computationally expensive. Here, we present donni (demography optimization via neural network inference), a new inference method based on dadi that is more efficient while maintaining comparable inference accuracy. For each dadi-supported demographic model, donni simulates the expected AFS for a range of model parameters then trains a set of Mean Variance Estimation neural networks using the simulated AFS. Trained networks can then be used to instantaneously infer the model parameters from future genomic data summarized by an AFS. We demonstrate that for many demographic models, donni can infer some parameters, such as population size changes, very well and other parameters, such as migration rates and times of demographic events, fairly well. Importantly, donni provides both parameter and confidence interval estimates from input AFS with accuracy comparable to parameters inferred by dadi's likelihood optimization while bypassing its long and computationally intensive evaluation process. donni's performance demonstrates that supervised machine learning algorithms may be a promising avenue for developing more sustainable and computationally efficient demographic history inference methods.
Collapse
Affiliation(s)
- Linh N Tran
- Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ 85721, USA
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
| | - Connie K Sun
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
| | - Travis J Struck
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
| | - Mathews Sajan
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
| | - Ryan N Gutenkunst
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ 85721, USA
| |
Collapse
|
23
|
Harris M, Kim BY, Garud N. Enrichment of hard sweeps on the X chromosome compared to autosomes in six Drosophila species. Genetics 2024; 226:iyae019. [PMID: 38366786 PMCID: PMC10990427 DOI: 10.1093/genetics/iyae019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2023] [Revised: 01/17/2024] [Accepted: 01/18/2024] [Indexed: 02/18/2024] Open
Abstract
The X chromosome, being hemizygous in males, is exposed one-third of the time increasing the visibility of new mutations to natural selection, potentially leading to different evolutionary dynamics than autosomes. Recently, we found an enrichment of hard selective sweeps over soft selective sweeps on the X chromosome relative to the autosomes in a North American population of Drosophila melanogaster. To understand whether this enrichment is a universal feature of evolution on the X chromosome, we analyze diversity patterns across 6 commonly studied Drosophila species. We find an increased proportion of regions with steep reductions in diversity and elevated homozygosity on the X chromosome compared to autosomes. To assess if these signatures are consistent with positive selection, we simulate a wide variety of evolutionary scenarios spanning variations in demography, mutation rate, recombination rate, background selection, hard sweeps, and soft sweeps and find that the diversity patterns observed on the X are most consistent with hard sweeps. Our findings highlight the importance of sex chromosomes in driving evolutionary processes and suggest that hard sweeps have played a significant role in shaping diversity patterns on the X chromosome across multiple Drosophila species.
Collapse
Affiliation(s)
- Mariana Harris
- Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA 90095, USA
| | - Bernard Y Kim
- Department of Biology, Stanford University, Stanford, CA 94305, USA
| | - Nandita Garud
- Department of Ecology and Evolutionary Biology, University of California Los Angeles, Los Angeles, CA 90095, USA
- Department of Human Genetics, University of California Los Angeles, Los Angeles, CA 90095, USA
| |
Collapse
|
24
|
Bollas AE, Rajkovic A, Ceyhan D, Gaither JB, Mardis ER, White P. SNVstory: inferring genetic ancestry from genome sequencing data. BMC Bioinformatics 2024; 25:76. [PMID: 38378494 PMCID: PMC10877842 DOI: 10.1186/s12859-024-05703-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2023] [Accepted: 02/13/2024] [Indexed: 02/22/2024] Open
Abstract
BACKGROUND Genetic ancestry, inferred from genomic data, is a quantifiable biological parameter. While much of the human genome is identical across populations, it is estimated that as much as 0.4% of the genome can differ due to ancestry. This variation is primarily characterized by single nucleotide variants (SNVs), which are often unique to specific genetic populations. Knowledge of a patient's genetic ancestry can inform clinical decisions, from genetic testing and health screenings to medication dosages, based on ancestral disease predispositions. Nevertheless, the current reliance on self-reported ancestry can introduce subjectivity and exacerbate health disparities. While genomic sequencing data enables objective determination of a patient's genetic ancestry, existing approaches are limited to ancestry inference at the continental level. RESULTS To address this challenge, and create an objective, measurable metric of genetic ancestry we present SNVstory, a method built upon three independent machine learning models for accurately inferring the sub-continental ancestry of individuals. We also introduce a novel method for simulating individual samples from aggregate allele frequencies from known populations. SNVstory includes a feature-importance scheme, unique among open-source ancestral tools, which allows the user to track the ancestral signal broadcast by a given gene or locus. We successfully evaluated SNVstory using a clinical exome sequencing dataset, comparing self-reported ethnicity and race to our inferred genetic ancestry, and demonstrate the capability of the algorithm to estimate ancestry from 36 different populations with high accuracy. CONCLUSIONS SNVstory represents a significant advance in methods to assign genetic ancestry, opening the door to ancestry-informed care. SNVstory, an open-source model, is packaged as a Docker container for enhanced reliability and interoperability. It can be accessed from https://github.com/nch-igm/snvstory .
Collapse
Affiliation(s)
- Audrey E Bollas
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, The Abigail Wexner Research Institute, Nationwide Children's Hospital, Columbus, OH, USA
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Andrei Rajkovic
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, The Abigail Wexner Research Institute, Nationwide Children's Hospital, Columbus, OH, USA
| | - Defne Ceyhan
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, The Abigail Wexner Research Institute, Nationwide Children's Hospital, Columbus, OH, USA
| | - Jeffrey B Gaither
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, The Abigail Wexner Research Institute, Nationwide Children's Hospital, Columbus, OH, USA
| | - Elaine R Mardis
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, The Abigail Wexner Research Institute, Nationwide Children's Hospital, Columbus, OH, USA
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
| | - Peter White
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, The Abigail Wexner Research Institute, Nationwide Children's Hospital, Columbus, OH, USA.
- Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA.
| |
Collapse
|
25
|
Tran LN, Sun CK, Struck TJ, Sajan M, Gutenkunst RN. Computationally efficient demographic history inference from allele frequencies with supervised machine learning. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.05.24.542158. [PMID: 38405827 PMCID: PMC10888863 DOI: 10.1101/2023.05.24.542158] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/27/2024]
Abstract
Inferring past demographic history of natural populations from genomic data is of central concern in many studies across research fields. Previously, our group had developed dadi, a widely used demographic history inference method based on the allele frequency spectrum (AFS) and maximum composite likelihood optimization. However, dadi's optimization procedure can be computationally expensive. Here, we developed donni (demography optimization via neural network inference), a new inference method based on dadi that is more efficient while maintaining comparable inference accuracy. For each dadi-supported demographic model, donni simulates the expected AFS for a range of model parameters then trains a set of Mean Variance Estimation neural networks using the simulated AFS. Trained networks can then be used to instantaneously infer the model parameters from future input data AFS. We demonstrated that for many demographic models, donni can infer some parameters, such as population size changes, very well and other parameters, such as migration rates and times of demographic events, fairly well. Importantly, donni provides both parameter and confidence interval estimates from input AFS with accuracy comparable to parameters inferred by dadi's likelihood optimization while bypassing its long and computationally intensive evaluation process. donni's performance demonstrates that supervised machine learning algorithms may be a promising avenue for developing more sustainable and computationally efficient demographic history inference methods.
Collapse
Affiliation(s)
- Linh N. Tran
- Genetics Graduate Interdisciplinary Program, University of Arizona, Tucson, AZ, USA
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
| | - Connie K. Sun
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
| | - Travis J. Struck
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
| | - Mathews Sajan
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
| | - Ryan N. Gutenkunst
- Department of Molecular & Cellular Biology, University of Arizona, Tucson, AZ, USA
| |
Collapse
|
26
|
Hayeck TJ, Li Y, Mosbruger TL, Bradfield JP, Gleason AG, Damianos G, Shaw GTW, Duke JL, Conlin LK, Turner TN, Fernández-Viña MA, Sarmady M, Monos DS. The Impact of Patterns in Linkage Disequilibrium and Sequencing Quality on the Imprint of Balancing Selection. Genome Biol Evol 2024; 16:evae009. [PMID: 38302106 PMCID: PMC10853003 DOI: 10.1093/gbe/evae009] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2023] [Revised: 01/08/2024] [Accepted: 01/12/2024] [Indexed: 02/03/2024] Open
Abstract
Regions under balancing selection are characterized by dense polymorphisms and multiple persistent haplotypes, along with other sequence complexities. Successful identification of these patterns depends on both the statistical approach and the quality of sequencing. To address this challenge, at first, a new statistical method called LD-ABF was developed, employing efficient Bayesian techniques to effectively test for balancing selection. LD-ABF demonstrated the most robust detection of selection in a variety of simulation scenarios, compared against a range of existing tests/tools (Tajima's D, HKA, Dng, BetaScan, and BalLerMix). Furthermore, the impact of the quality of sequencing on detection of balancing selection was explored, as well, using: (i) SNP genotyping and exome data, (ii) targeted high-resolution HLA genotyping (IHIW), and (iii) whole-genome long-read sequencing data (Pangenome). In the analysis of SNP genotyping and exome data, we identified known targets and 38 new selection signatures in genes not previously linked to balancing selection. To further investigate the impact of sequencing quality on detection of balancing selection, a detailed investigation of the MHC was performed with high-resolution HLA typing data. Higher quality sequencing revealed the HLA-DQ genes consistently demonstrated strong selection signatures otherwise not observed from the sparser SNP array and exome data. The HLA-DQ selection signature was also replicated in the Pangenome samples using considerably less samples but, with high-quality long-read sequence data. The improved statistical method, coupled with higher quality sequencing, leads to more consistent identification of selection and enhanced localization of variants under selection, particularly in complex regions.
Collapse
Affiliation(s)
- Tristan J Hayeck
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Yang Li
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Timothy L Mosbruger
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | | | - Adam G Gleason
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - George Damianos
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Grace Tzun-Wen Shaw
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Jamie L Duke
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Laura K Conlin
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Tychele N Turner
- Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
| | - Marcelo A Fernández-Viña
- Department of Pathology, Stanford University School of Medicine, Palo Alto, CA, USA
- Histocompatibility and Immunogenetics Laboratory, Stanford Blood Center, Palo Alto, CA, USA
| | - Mahdi Sarmady
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Dimitri S Monos
- Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| |
Collapse
|
27
|
Farzan R. Artificial intelligence in Immuno-genetics. Bioinformation 2024; 20:29-35. [PMID: 38352901 PMCID: PMC10859949 DOI: 10.6026/973206300200029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Revised: 01/31/2024] [Accepted: 01/31/2024] [Indexed: 02/16/2024] Open
Abstract
Rapid advancements in the field of artificial intelligence (AI) have opened up unprecedented opportunities to revolutionize various scientific domains, including immunology and genetics. Therefore, it is of interest to explore the emerging applications of AI in immunology and genetics, with the objective of enhancing our understanding of the dynamic intricacies of the immune system, disease etiology, and genetic variations. Hence, the use of AI methodologies in immunological and genetic datasets, thereby facilitating the development of innovative approaches in the realms of diagnosis, treatment, and personalized medicine is reviewed.
Collapse
Affiliation(s)
- Raed Farzan
- Department of Clinical Laboratory Sciences, College of Applied Medical Scienecs, King Saud University, Riyadh - 11433, Saudi Arabia
- Center of Excellence in Biotechnology Research, King Saud University, Riyadh - 11433, Saudi Arabia
- Medical and Molecular Genetics Research, King Saud University, Riyadh-11433, Saudi Arabia
| |
Collapse
|
28
|
Huang X, Rymbekova A, Dolgova O, Lao O, Kuhlwilm M. Harnessing deep learning for population genetic inference. Nat Rev Genet 2024; 25:61-78. [PMID: 37666948 DOI: 10.1038/s41576-023-00636-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/11/2023] [Indexed: 09/06/2023]
Abstract
In population genetics, the emergence of large-scale genomic data for various species and populations has provided new opportunities to understand the evolutionary forces that drive genetic diversity using statistical inference. However, the era of population genomics presents new challenges in analysing the massive amounts of genomes and variants. Deep learning has demonstrated state-of-the-art performance for numerous applications involving large-scale data. Recently, deep learning approaches have gained popularity in population genetics; facilitated by the advent of massive genomic data sets, powerful computational hardware and complex deep learning architectures, they have been used to identify population structure, infer demographic history and investigate natural selection. Here, we introduce common deep learning architectures and provide comprehensive guidelines for implementing deep learning models for population genetic inference. We also discuss current challenges and future directions for applying deep learning in population genetics, focusing on efficiency, robustness and interpretability.
Collapse
Affiliation(s)
- Xin Huang
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria.
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria.
| | - Aigerim Rymbekova
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
| | - Olga Dolgova
- Integrative Genomics Laboratory, CIC bioGUNE - Centro de Investigación Cooperativa en Biociencias, Derio, Biscaya, Spain
| | - Oscar Lao
- Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, Barcelona, Spain.
| | - Martin Kuhlwilm
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria.
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria.
| |
Collapse
|
29
|
Lambert S, Voznica J, Morlon H. Deep Learning from Phylogenies for Diversification Analyses. Syst Biol 2023; 72:1262-1279. [PMID: 37556735 DOI: 10.1093/sysbio/syad044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2022] [Revised: 06/20/2023] [Accepted: 08/08/2023] [Indexed: 08/11/2023] Open
Abstract
Birth-death (BD) models are widely used in combination with species phylogenies to study past diversification dynamics. Current inference approaches typically rely on likelihood-based methods. These methods are not generalizable, as a new likelihood formula must be established each time a new model is proposed; for some models, such a formula is not even tractable. Deep learning can bring solutions in such situations, as deep neural networks can be trained to learn the relation between simulations and parameter values as a regression problem. In this paper, we adapt a recently developed deep learning method from pathogen phylodynamics to the case of diversification inference, and we extend its applicability to the case of the inference of state-dependent diversification models from phylogenies associated with trait data. We demonstrate the accuracy and time efficiency of the approach for the time-constant homogeneous BD model and the Binary-State Speciation and Extinction model. Finally, we illustrate the use of the proposed inference machinery by reanalyzing a phylogeny of primates and their associated ecological role as seed dispersers. Deep learning inference provides at least the same accuracy as likelihood-based inference while being faster by several orders of magnitude, offering a promising new inference approach for the deployment of future models in the field.
Collapse
Affiliation(s)
- Sophia Lambert
- Institut de Biologie de l'École Normale Supérieure, École Normale Supérieure, CNRS, INSERM, Université Paris Sciences et Lettres, 46 Rue d'Ulm, 75005 Paris, France
- Institute of Ecology and Evolution, Department of Biology, 5289 University of Oregon, Eugene, OR 97403, USA
| | - Jakub Voznica
- Institut Pasteur, Université Paris Cité, Unité Bioinformatique Evolutive, 25-28 Rue du Dr Roux, 75015 Paris, France
- Unité de Biologie Computationnelle, USR 3756 CNRS, 25-28 Rue du Dr Roux, 75015 Paris, France
| | - Hélène Morlon
- Institut de Biologie de l'École Normale Supérieure, École Normale Supérieure, CNRS, INSERM, Université Paris Sciences et Lettres, 46 Rue d'Ulm, 75005 Paris, France
| |
Collapse
|
30
|
Harris M, Kim B, Garud N. Enrichment of hard sweeps on the X chromosome compared to autosomes in six Drosophila species. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.06.21.545888. [PMID: 38106201 PMCID: PMC10723260 DOI: 10.1101/2023.06.21.545888] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/19/2023]
Abstract
The X chromosome, being hemizygous in males, is exposed one third of the time increasing the visibility of new mutations to natural selection, potentially leading to different evolutionary dynamics than autosomes. Recently, we found an enrichment of hard selective sweeps over soft selective sweeps on the X chromosome relative to the autosomes in a North American population of Drosophila melanogaster. To understand whether this enrichment is a universal feature of evolution on the X chromosome, we analyze diversity patterns across six commonly studied Drosophila species. We find an increased proportion of regions with steep reductions in diversity and elevated homozygosity on the X chromosome compared to autosomes. To assess if these signatures are consistent with positive selection, we simulate a wide variety of evolutionary scenarios spanning variations in demography, mutation rate, recombination rate, background selection, hard sweeps, and soft sweeps, and find that the diversity patterns observed on the X are most consistent with hard sweeps. Our findings highlight the importance of sex chromosomes in driving evolutionary processes and suggest that hard sweeps have played a significant role in shaping diversity patterns on the X chromosome across multiple Drosophila species.
Collapse
Affiliation(s)
- Mariana Harris
- Department of Computational Medicine, University of California Los Angeles, Los Angeles California, United States of America
| | - Bernard Kim
- Department of Biology, Stanford University, Stanford, California, United States of America
| | - Nandita Garud
- Ecology and Evolutionary Biology, University of California Los Angeles, Los Angeles California, United States of America
- Department of Human Genetics, University of California, Los Angeles, California, United States of America
| |
Collapse
|
31
|
Schrider DR. Allelic gene conversion softens selective sweeps. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.05.570141. [PMID: 38106127 PMCID: PMC10723294 DOI: 10.1101/2023.12.05.570141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/19/2023]
Abstract
The prominence of positive selection, in which beneficial mutations are favored by natural selection and rapidly increase in frequency, is a subject of intense debate. Positive selection can result in selective sweeps, in which the haplotype(s) bearing the adaptive allele "sweep" through the population, thereby removing much of the genetic diversity from the region surrounding the target of selection. Two models of selective sweeps have been proposed: classical sweeps, or "hard sweeps", in which a single copy of the adaptive allele sweeps to fixation, and "soft sweeps", in which multiple distinct copies of the adaptive allele leave descendants after the sweep. Soft sweeps can be the outcome of recurrent mutation to the adaptive allele, or the presence of standing genetic variation consisting of multiple copies of the adaptive allele prior to the onset of selection. Importantly, soft sweeps will be common when populations can rapidly adapt to novel selective pressures, either because of a high mutation rate or because adaptive alleles are already present. The prevalence of soft sweeps is especially controversial, and it has been noted that selection on standing variation or recurrent mutations may not always produce soft sweeps. Here, we show that the inverse is true: selection on single-origin de novo mutations may often result in an outcome that is indistinguishable from a soft sweep. This is made possible by allelic gene conversion, which "softens" hard sweeps by copying the adaptive allele onto multiple genetic backgrounds, a process we refer to as a "pseudo-soft" sweep. We carried out a simulation study examining the impact of gene conversion on sweeps from a single de novo variant in models of human, Drosophila, and Arabidopsis populations. The fraction of simulations in which gene conversion had produced multiple haplotypes with the adaptive allele upon fixation was appreciable. Indeed, under realistic demographic histories and gene conversion rates, even if selection always acts on a single-origin mutation, sweeps involving multiple haplotypes are more likely than hard sweeps in large populations, especially when selection is not extremely strong. Thus, even when the mutation rate is low or there is no standing variation, hard sweeps are expected to be the exception rather than the rule in large populations. These results also imply that the presence of signatures of soft sweeps does not necessarily mean that adaptation has been especially rapid or is not mutation limited.
Collapse
Affiliation(s)
- Daniel R Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599
| |
Collapse
|
32
|
Kyriazis CC, Robinson JA, Lohmueller KE. Using Computational Simulations to Model Deleterious Variation and Genetic Load in Natural Populations. Am Nat 2023; 202:737-752. [PMID: 38033186 PMCID: PMC10897732 DOI: 10.1086/726736] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2023]
Abstract
AbstractDeleterious genetic variation is abundant in wild populations, and understanding the ecological and conservation implications of such variation is an area of active research. Genomic methods are increasingly used to quantify the impacts of deleterious variation in natural populations; however, these approaches remain limited by an inability to accurately predict the selective and dominance effects of mutations. Computational simulations of deleterious variation offer a complementary tool that can help overcome these limitations, although such approaches have yet to be widely employed. In this perspective article, we aim to encourage ecological and conservation genomics researchers to adopt greater use of computational simulations to aid in deepening our understanding of deleterious variation in natural populations. We first provide an overview of the components of a simulation of deleterious variation, describing the key parameters involved in such models. Next, we discuss several approaches for validating simulation models. Finally, we compare and validate several recently proposed deleterious mutation models, demonstrating that models based on estimates of selection parameters from experimental systems are biased toward highly deleterious mutations. We describe a new model that is supported by multiple orthogonal lines of evidence and provide example scripts for implementing this model (https://github.com/ckyriazis/simulations_review).
Collapse
Affiliation(s)
- Christopher C. Kyriazis
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles; Los Angeles, CA, USA
| | - Jacqueline A. Robinson
- Institute for Human Genetics, University of California, San Francisco; San Francisco, CA, USA
| | - Kirk E. Lohmueller
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles; Los Angeles, CA, USA
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles; Los Angeles, CA, USA
| |
Collapse
|
33
|
Panigrahi M, Rajawat D, Nayak SS, Ghildiyal K, Sharma A, Jain K, Lei C, Bhushan B, Mishra BP, Dutt T. Landmarks in the history of selective sweeps. Anim Genet 2023; 54:667-688. [PMID: 37710403 DOI: 10.1111/age.13355] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2023] [Accepted: 08/28/2023] [Indexed: 09/16/2023]
Abstract
Half a century ago, a seminal article on the hitchhiking effect by Smith and Haigh inaugurated the concept of the selection signature. Selective sweeps are characterised by the rapid spread of an advantageous genetic variant through a population and hence play an important role in shaping evolution and research on genetic diversity. The process by which a beneficial allele arises and becomes fixed in a population, leading to a increase in the frequency of other linked alleles, is known as genetic hitchhiking or genetic draft. Kimura's neutral theory and hitchhiking theory are complementary, with Kimura's neutral evolution as the 'null model' and positive selection as the 'signal'. Both are widely accepted in evolution, especially with genomics enabling precise measurements. Significant advances in genomic technologies, such as next-generation sequencing, high-density SNP arrays and powerful bioinformatics tools, have made it possible to systematically investigate selection signatures in a variety of species. Although the history of selection signatures is relatively recent, progress has been made in the last two decades, owing to the increasing availability of large-scale genomic data and the development of computational methods. In this review, we embark on a journey through the history of research on selective sweeps, ranging from early theoretical work to recent empirical studies that utilise genomic data.
Collapse
Affiliation(s)
- Manjit Panigrahi
- Division of Animal Genetics, Indian Veterinary Research Institute, Bareilly, India
| | - Divya Rajawat
- Division of Animal Genetics, Indian Veterinary Research Institute, Bareilly, India
| | | | - Kanika Ghildiyal
- Division of Animal Genetics, Indian Veterinary Research Institute, Bareilly, India
| | - Anurodh Sharma
- Division of Animal Genetics, Indian Veterinary Research Institute, Bareilly, India
| | - Karan Jain
- Division of Animal Genetics, Indian Veterinary Research Institute, Bareilly, India
| | - Chuzhao Lei
- College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, China
| | - Bharat Bhushan
- Division of Animal Genetics, Indian Veterinary Research Institute, Bareilly, India
| | - Bishnu Prasad Mishra
- Division of Animal Biotechnology, ICAR-National Bureau of Animal Genetic Resources, Karnal, India
| | - Triveni Dutt
- Livestock Production and Management Section, Indian Veterinary Research Institute, Bareilly, India
| |
Collapse
|
34
|
Mao J, Cao Y, Zhang Y, Huang B, Zhao Y. A novel method for identifying key genes in macroevolution based on deep learning with attention mechanism. Sci Rep 2023; 13:19727. [PMID: 37957311 PMCID: PMC10643560 DOI: 10.1038/s41598-023-47113-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2023] [Accepted: 11/09/2023] [Indexed: 11/15/2023] Open
Abstract
Macroevolution can be regarded as the result of evolutionary changes of synergistically acting genes. Unfortunately, the importance of these genes in macroevolution is difficult to assess and hence the identification of macroevolutionary key genes is a major challenge in evolutionary biology. In this study, we designed various word embedding libraries of natural language processing (NLP) considering the multiple mechanisms of evolutionary genomics. A novel method (IKGM) based on three types of attention mechanisms (domain attention, kmer attention and fused attention) were proposed to calculate the weights of different genes in macroevolution. Taking 34 species of diurnal butterflies and nocturnal moths in Lepidoptera as an example, we identified a few of key genes with high weights, which annotated to the functions of circadian rhythms, sensory organs, as well as behavioral habits etc. This study not only provides a novel method to identify the key genes of macroevolution at the genomic level, but also helps us to understand the microevolution mechanisms of diurnal butterflies and nocturnal moths in Lepidoptera.
Collapse
Affiliation(s)
- Jiawei Mao
- College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
| | - Yong Cao
- College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
| | - Yan Zhang
- College of Mathematics and Physics, Southwest Forestry University, Kunming, 650224, China
| | - Biaosheng Huang
- College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
| | - Youjie Zhao
- College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China.
| |
Collapse
|
35
|
Cecil RM, Sugden LA. On convolutional neural networks for selection inference: Revealing the effect of preprocessing on model learning and the capacity to discover novel patterns. PLoS Comput Biol 2023; 19:e1010979. [PMID: 38011281 PMCID: PMC10703409 DOI: 10.1371/journal.pcbi.1010979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2023] [Revised: 12/07/2023] [Accepted: 10/26/2023] [Indexed: 11/29/2023] Open
Abstract
A central challenge in population genetics is the detection of genomic footprints of selection. As machine learning tools including convolutional neural networks (CNNs) have become more sophisticated and applied more broadly, these provide a logical next step for increasing our power to learn and detect such patterns; indeed, CNNs trained on simulated genome sequences have recently been shown to be highly effective at this task. Unlike previous approaches, which rely upon human-crafted summary statistics, these methods are able to be applied directly to raw genomic data, allowing them to potentially learn new signatures that, if well-understood, could improve the current theory surrounding selective sweeps. Towards this end, we examine a representative CNN from the literature, paring it down to the minimal complexity needed to maintain comparable performance; this low-complexity CNN allows us to directly interpret the learned evolutionary signatures. We then validate these patterns in more complex models using metrics that evaluate feature importance. Our findings reveal that preprocessing steps, which determine how the population genetic data is presented to the model, play a central role in the learned prediction method. This results in models that mimic previously-defined summary statistics; in one case, the summary statistic itself achieves similarly high accuracy. For evolutionary processes that are less well understood than selective sweeps, we hope this provides an initial framework for using CNNs in ways that go beyond simply achieving high classification performance. Instead, we propose that CNNs might be useful as tools for learning novel patterns that can translate to easy-to-implement summary statistics available to a wider community of researchers.
Collapse
Affiliation(s)
- Ryan M. Cecil
- Department of Mathematics and Computer Science, Duquesne University, Pittsburgh, Pennsylvania, United States of America
- Department of Statistics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Lauren A. Sugden
- Department of Mathematics and Computer Science, Duquesne University, Pittsburgh, Pennsylvania, United States of America
| |
Collapse
|
36
|
Mo Z, Siepel A. Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data. PLoS Genet 2023; 19:e1011032. [PMID: 37934781 PMCID: PMC10655966 DOI: 10.1371/journal.pgen.1011032] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2023] [Revised: 11/17/2023] [Accepted: 10/23/2023] [Indexed: 11/09/2023] Open
Abstract
Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods-SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.
Collapse
Affiliation(s)
- Ziyi Mo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| |
Collapse
|
37
|
Kloska A, Giełczyk A, Grzybowski T, Płoski R, Kloska SM, Marciniak T, Pałczyński K, Rogalla-Ładniak U, Malyarchuk BA, Derenko MV, Kovačević-Grujičić N, Stevanović M, Drakulić D, Davidović S, Spólnicka M, Zubańska M, Woźniak M. A Machine-Learning-Based Approach to Prediction of Biogeographic Ancestry within Europe. Int J Mol Sci 2023; 24:15095. [PMID: 37894775 PMCID: PMC10606184 DOI: 10.3390/ijms242015095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2023] [Revised: 10/03/2023] [Accepted: 10/07/2023] [Indexed: 10/29/2023] Open
Abstract
Data obtained with the use of massive parallel sequencing (MPS) can be valuable in population genetics studies. In particular, such data harbor the potential for distinguishing samples from different populations, especially from those coming from adjacent populations of common origin. Machine learning (ML) techniques seem to be especially well suited for analyzing large datasets obtained using MPS. The Slavic populations constitute about a third of the population of Europe and inhabit a large area of the continent, while being relatively closely related in population genetics terms. In this proof-of-concept study, various ML techniques were used to classify DNA samples from Slavic and non-Slavic individuals. The primary objective of this study was to empirically evaluate the feasibility of discerning the genetic provenance of individuals of Slavic descent who exhibit genetic similarity, with the overarching goal of categorizing DNA specimens derived from diverse Slavic population representatives. Raw sequencing data were pre-processed, to obtain a 1200 character-long binary vector. A total of three classifiers were used-Random Forest, Support Vector Machine (SVM), and XGBoost. The most-promising results were obtained using SVM with a linear kernel, with 99.9% accuracy and F1-scores of 0.9846-1.000 for all classes.
Collapse
Affiliation(s)
- Anna Kloska
- Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
- Faculty of Medical Sciences, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
| | - Agata Giełczyk
- Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
| | - Tomasz Grzybowski
- Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
| | - Rafał Płoski
- Department of Medical Genetics, Warsaw Medical University, 02106 Warsaw, Poland
| | - Sylwester M. Kloska
- Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
- Faculty of Medical Sciences, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
| | - Tomasz Marciniak
- Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
| | - Krzysztof Pałczyński
- Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
| | - Urszula Rogalla-Ładniak
- Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
| | - Boris A. Malyarchuk
- Institute of Biological Problems of the North, Russian Academy of Sciences, 685000 Magadan, Russia
| | - Miroslava V. Derenko
- Institute of Biological Problems of the North, Russian Academy of Sciences, 685000 Magadan, Russia
| | - Nataša Kovačević-Grujičić
- Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia
| | - Milena Stevanović
- Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia
- Faculty of Biology, University of Belgrade, 11000 Belgrade, Serbia
- Serbian Academy of Sciences and Arts, 11000 Belgrade, Serbia
| | - Danijela Drakulić
- Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia
| | - Slobodan Davidović
- Institute for Biological Research “Siniša Stanković”, National Institute of Republic of Serbia, University of Belgrade, 11060 Belgrade, Serbia
| | | | - Magdalena Zubańska
- Faculty of Law and Administration, Department of Criminology and Forensic Sciences, University of Warmia and Mazury, 10726 Olsztyn, Poland
| | - Marcin Woźniak
- Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
| |
Collapse
|
38
|
Amin MR, Hasan M, Arnab SP, DeGiorgio M. Tensor Decomposition-based Feature Extraction and Classification to Detect Natural Selection from Genomic Data. Mol Biol Evol 2023; 40:msad216. [PMID: 37772983 PMCID: PMC10581699 DOI: 10.1093/molbev/msad216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2023] [Revised: 08/10/2023] [Accepted: 09/14/2023] [Indexed: 09/30/2023] Open
Abstract
Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved development of summary statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of settings they can explore. Due to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet, limitations of such techniques include estimation of large numbers of model parameters under nonconvex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data although preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx, which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, robustness to common technical hurdles, and easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
Collapse
Affiliation(s)
- Md Ruhul Amin
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Mahmudul Hasan
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Sandipan Paul Arnab
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Michael DeGiorgio
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| |
Collapse
|
39
|
Nait Saada J, Tsangalidou Z, Stricker M, Palamara PF. Inference of Coalescence Times and Variant Ages Using Convolutional Neural Networks. Mol Biol Evol 2023; 40:msad211. [PMID: 37738175 PMCID: PMC10581698 DOI: 10.1093/molbev/msad211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2023] [Revised: 09/11/2023] [Accepted: 09/18/2023] [Indexed: 09/24/2023] Open
Abstract
Accurate inference of the time to the most recent common ancestor (TMRCA) between pairs of individuals and of the age of genomic variants is key in several population genetic analyses. We developed a likelihood-free approach, called CoalNN, which uses a convolutional neural network to predict pairwise TMRCAs and allele ages from sequencing or SNP array data. CoalNN is trained through simulation and can be adapted to varying parameters, such as demographic history, using transfer learning. Across several simulated scenarios, CoalNN matched or outperformed the accuracy of model-based approaches for pairwise TMRCA and allele age prediction. We applied CoalNN to settings for which model-based approaches are under-developed and performed analyses to gain insights into the set of features it uses to perform TMRCA prediction. We next used CoalNN to analyze 2,504 samples from 26 populations in the 1,000 Genome Project data set, inferring the age of ∼80 million variants. We observed substantial variation across populations and for variants predicted to be pathogenic, reflecting heterogeneous demographic histories and the action of negative selection. We used CoalNN's predicted allele ages to construct genome-wide annotations capturing the signature of past negative selection. We performed LD-score regression analysis of heritability using summary association statistics from 63 independent complex traits and diseases (average N=314k), observing increased annotation-specific effects on heritability compared to a previous allele age annotation. These results highlight the effectiveness of using likelihood-free, simulation-trained models to infer properties of gene genealogies in large genomic data sets.
Collapse
Affiliation(s)
| | | | | | - Pier Francesco Palamara
- Department of Statistics, University of Oxford, Oxford, UK
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| |
Collapse
|
40
|
Carvalho J, Morales HE, Faria R, Butlin RK, Sousa VC. Integrating Pool-seq uncertainties into demographic inference. Mol Ecol Resour 2023; 23:1737-1755. [PMID: 37475177 DOI: 10.1111/1755-0998.13834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2022] [Revised: 06/16/2023] [Accepted: 06/30/2023] [Indexed: 07/22/2023]
Abstract
Next-generation sequencing of pooled samples (Pool-seq) is a popular method to assess genome-wide diversity patterns in natural and experimental populations. However, Pool-seq is associated with specific sources of noise, such as unequal individual contributions. Consequently, using Pool-seq for the reconstruction of evolutionary history has remained underexplored. Here we describe a novel Approximate Bayesian Computation (ABC) method to infer demographic history, explicitly modelling Pool-seq sources of error. By jointly modelling Pool-seq data, demographic history and the effects of selection due to barrier loci, we obtain estimates of demographic history parameters accounting for technical errors associated with Pool-seq. Our ABC approach is computationally efficient as it relies on simulating subsets of loci (rather than the whole-genome) and on using relative summary statistics and relative model parameters. Our simulation study results indicate Pool-seq data allows distinction between general scenarios of ecotype formation (single versus parallel origin) and to infer relevant demographic parameters (e.g. effective sizes and split times). We exemplify the application of our method to Pool-seq data from the rocky-shore gastropod Littorina saxatilis, sampled on a narrow geographical scale at two Swedish locations where two ecotypes (Wave and Crab) are found. Our model choice and parameter estimates show that ecotypes formed before colonization of the two locations (i.e. single origin) and are maintained despite gene flow. These results indicate that demographic modelling and inference can be successful based on pool-sequencing using ABC, contributing to the development of suitable null models that allow for a better understanding of the genetic basis of divergent adaptation.
Collapse
Affiliation(s)
- João Carvalho
- cE3c - Centre for Ecology, Evolution and Environmental Changes & CHANGE - Global Change and Sustainability Institute, Departamento de Biologia Animal, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Portugal
| | - Hernán E Morales
- Section for Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Rui Faria
- CIBIO - Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO, Laboratório Associado, Universidade do Porto, Vairão, Portugal
- BIOPOLIS Program in Genomics, Biodiversity and Land Planning, CIBIO, Vairão, Portugal
| | - Roger K Butlin
- Ecology and Evolutionary Biology, School of Biosciences, University of Sheffield, Sheffield, UK
- Department of Marine Sciences, University of Gothenburg, Gothenburg, Sweden
| | - Vítor C Sousa
- cE3c - Centre for Ecology, Evolution and Environmental Changes & CHANGE - Global Change and Sustainability Institute, Departamento de Biologia Animal, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Portugal
| |
Collapse
|
41
|
Mo Z, Siepel A. Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.01.529396. [PMID: 36909514 PMCID: PMC10002701 DOI: 10.1101/2023.03.01.529396] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/06/2023]
Abstract
Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods-SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.
Collapse
Affiliation(s)
- Ziyi Mo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
| | - Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
- School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
| |
Collapse
|
42
|
Arnab SP, Amin MR, DeGiorgio M. Uncovering Footprints of Natural Selection Through Spectral Analysis of Genomic Summary Statistics. Mol Biol Evol 2023; 40:msad157. [PMID: 37433019 PMCID: PMC10365025 DOI: 10.1093/molbev/msad157] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2022] [Revised: 06/28/2023] [Accepted: 07/06/2023] [Indexed: 07/13/2023] Open
Abstract
Natural selection leaves a spatial pattern along the genome, with a haplotype distribution distortion near the selected locus that fades with distance. Evaluating the spatial signal of a population-genetic summary statistic across the genome allows for patterns of natural selection to be distinguished from neutrality. Considering the genomic spatial distribution of multiple summary statistics is expected to aid in uncovering subtle signatures of selection. In recent years, numerous methods have been devised that consider genomic spatial distributions across summary statistics, utilizing both classical machine learning and deep learning architectures. However, better predictions may be attainable by improving the way in which features are extracted from these summary statistics. We apply wavelet transform, multitaper spectral analysis, and S-transform to summary statistic arrays to achieve this goal. Each analysis method converts one-dimensional summary statistic arrays to two-dimensional images of spectral analysis, allowing simultaneous temporal and spectral assessment. We feed these images into convolutional neural networks and consider combining models using ensemble stacking. Our modeling framework achieves high accuracy and power across a diverse set of evolutionary settings, including population size changes and test sets of varying sweep strength, softness, and timing. A scan of central European whole-genome sequences recapitulated well-established sweep candidates and predicted novel cancer-associated genes as sweeps with high support. Given that this modeling framework is also robust to missing genomic segments, we believe that it will represent a welcome addition to the population-genomic toolkit for learning about adaptive processes from genomic data.
Collapse
Affiliation(s)
- Sandipan Paul Arnab
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Md Ruhul Amin
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| | - Michael DeGiorgio
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
| |
Collapse
|
43
|
Wade EE, Kyriazis CC, Cavassim MIA, Lohmueller KE. Quantifying the fraction of new mutations that are recessive lethal. Evolution 2023; 77:1539-1549. [PMID: 37074880 PMCID: PMC10309970 DOI: 10.1093/evolut/qpad061] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2022] [Revised: 03/21/2023] [Accepted: 04/14/2023] [Indexed: 04/20/2023]
Abstract
The presence and impact of recessive lethal mutations have been widely documented in diploid outcrossing species. However, precise estimates of the proportion of new mutations that are recessive lethal remain limited. Here, we evaluate the performance of Fit∂a∂i, a commonly used method for inferring the distribution of fitness effects (DFE), in the presence of lethal mutations. Using simulations, we demonstrate that in both additive and recessive cases, inference of the deleterious nonlethal portion of the DFE is minimally affected by a small proportion (<10%) of lethal mutations. Additionally, we demonstrate that while Fit∂a∂i cannot estimate the fraction of recessive lethal mutations, Fit∂a∂i can accurately infer the fraction of additive lethal mutations. Finally, as an alternative approach to estimate the proportion of mutations that are recessive lethal, we employ models of mutation-selection-drift balance using existing genomic parameters and estimates of segregating recessive lethals for humans and Drosophila melanogaster. In both species, the segregating recessive lethal load can be explained by a very small fraction (<1%) of new nonsynonymous mutations being recessive lethal. Our results refute recent assertions of a much higher proportion of mutations being recessive lethal (4%-5%), while highlighting the need for additional information on the joint distribution of selection and dominance coefficients.
Collapse
Affiliation(s)
- Emma E Wade
- Department of Ecology and Evolutionary Biology, University of California–Los Angeles, Los Angeles, CA, United States
- Department of Computer Science and Engineering, Mississippi State University, Starkville, MS, United States
| | - Christopher C Kyriazis
- Department of Ecology and Evolutionary Biology, University of California–Los Angeles, Los Angeles, CA, United States
| | - Maria Izabel A Cavassim
- Department of Ecology and Evolutionary Biology, University of California–Los Angeles, Los Angeles, CA, United States
| | - Kirk E Lohmueller
- Department of Ecology and Evolutionary Biology, University of California–Los Angeles, Los Angeles, CA, United States
- Interdepartmental Program in Bioinformatics, University of California–Los Angeles, Los Angeles, CA, United States
- Department of Human Genetics, David Geffen School of Medicine, University of California–Los Angeles, Los Angeles, CA, United States
| |
Collapse
|
44
|
Smith CCR, Tittes S, Ralph PL, Kern AD. Dispersal inference from population genetic variation using a convolutional neural network. Genetics 2023; 224:iyad068. [PMID: 37052957 PMCID: PMC10213498 DOI: 10.1093/genetics/iyad068] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2023] [Revised: 02/08/2023] [Accepted: 04/07/2023] [Indexed: 04/14/2023] Open
Abstract
The geographic nature of biological dispersal shapes patterns of genetic variation over landscapes, making it possible to infer properties of dispersal from genetic variation data. Here, we present an inference tool that uses geographically distributed genotype data in combination with a convolutional neural network to estimate a critical population parameter: the mean per-generation dispersal distance. Using extensive simulation, we show that our deep learning approach is competitive with or outperforms state-of-the-art methods, particularly at small sample sizes. In addition, we evaluate varying nuisance parameters during training-including population density, demographic history, habitat size, and sampling area-and show that this strategy is effective for estimating dispersal distance when other model parameters are unknown. Whereas competing methods depend on information about local population density or accurate inference of identity-by-descent tracts, our method uses only single-nucleotide-polymorphism data and the spatial scale of sampling as input. Strikingly, and unlike other methods, our method does not use the geographic coordinates of the genotyped individuals. These features make our method, which we call "disperseNN," a potentially valuable new tool for estimating dispersal distance in nonmodel systems with whole genome data or reduced representation data. We apply disperseNN to 12 different species with publicly available data, yielding reasonable estimates for most species. Importantly, our method estimated consistently larger dispersal distances than mark-recapture calculations in the same species, which may be due to the limited geographic sampling area covered by some mark-recapture studies. Thus genetic tools like ours complement direct methods for improving our understanding of dispersal.
Collapse
Affiliation(s)
- Chris C R Smith
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
| | - Silas Tittes
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
| | - Peter L Ralph
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
| | - Andrew D Kern
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
| |
Collapse
|
45
|
Zhao L, Walkowiak S, Fernando WGD. Artificial Intelligence: A Promising Tool in Exploring the Phytomicrobiome in Managing Disease and Promoting Plant Health. PLANTS (BASEL, SWITZERLAND) 2023; 12:plants12091852. [PMID: 37176910 PMCID: PMC10180744 DOI: 10.3390/plants12091852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/06/2023] [Revised: 04/25/2023] [Accepted: 04/27/2023] [Indexed: 05/15/2023]
Abstract
There is increasing interest in harnessing the microbiome to improve cropping systems. With the availability of high-throughput and low-cost sequencing technologies, gathering microbiome data is becoming more routine. However, the analysis of microbiome data is challenged by the size and complexity of the data, and the incomplete nature of many microbiome databases. Further, to bring microbiome data value, it often needs to be analyzed in conjunction with other complex data that impact on crop health and disease management, such as plant genotype and environmental factors. Artificial intelligence (AI), boosted through deep learning (DL), has achieved significant breakthroughs and is a powerful tool for managing large complex datasets such as the interplay between the microbiome, crop plants, and their environment. In this review, we aim to provide readers with a brief introduction to AI techniques, and we introduce how AI has been applied to areas of microbiome sequencing taxonomy, the functional annotation for microbiome sequences, associating the microbiome community with host traits, designing synthetic communities, genomic selection, field phenotyping, and disease forecasting. At the end of this review, we proposed further efforts that are required to fully exploit the power of AI in studying phytomicrobiomes.
Collapse
Affiliation(s)
- Liang Zhao
- Department of Plant Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
| | | | | |
Collapse
|
46
|
Hamid I, Korunes KL, Schrider DR, Goldberg A. Localizing Post-Admixture Adaptive Variants with Object Detection on Ancestry-Painted Chromosomes. Mol Biol Evol 2023; 40:msad074. [PMID: 36947126 PMCID: PMC10116606 DOI: 10.1093/molbev/msad074] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2022] [Revised: 03/14/2023] [Accepted: 03/20/2023] [Indexed: 03/23/2023] Open
Abstract
Gene flow between previously differentiated populations during the founding of an admixed or hybrid population has the potential to introduce adaptive alleles into the new population. If the adaptive allele is common in one source population, but not the other, then as the adaptive allele rises in frequency in the admixed population, genetic ancestry from the source containing the adaptive allele will increase nearby as well. Patterns of genetic ancestry have therefore been used to identify post-admixture positive selection in humans and other animals, including examples in immunity, metabolism, and animal coloration. A common method identifies regions of the genome that have local ancestry "outliers" compared with the distribution across the rest of the genome, considering each locus independently. However, we lack theoretical models for expected distributions of ancestry under various demographic scenarios, resulting in potential false positives and false negatives. Further, ancestry patterns between distant sites are often not independent. As a result, current methods tend to infer wide genomic regions containing many genes as under selection, limiting biological interpretation. Instead, we develop a deep learning object detection method applied to images generated from local ancestry-painted genomes. This approach preserves information from the surrounding genomic context and avoids potential pitfalls of user-defined summary statistics. We find the method is robust to a variety of demographic misspecifications using simulated data. Applied to human genotype data from Cabo Verde, we localize a known adaptive locus to a single narrow region compared with multiple or long windows obtained using two other ancestry-based methods.
Collapse
Affiliation(s)
- Iman Hamid
- Department of Evolutionary Anthropology, Duke University, Durham, NC
| | | | - Daniel R Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, NC
| | - Amy Goldberg
- Department of Evolutionary Anthropology, Duke University, Durham, NC
| |
Collapse
|
47
|
Korfmann K, Gaggiotti OE, Fumagalli M. Deep Learning in Population Genetics. Genome Biol Evol 2023; 15:evad008. [PMID: 36683406 PMCID: PMC9897193 DOI: 10.1093/gbe/evad008] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2022] [Revised: 12/19/2022] [Accepted: 01/16/2023] [Indexed: 01/24/2023] Open
Abstract
Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally unfeasible, machine learning, and in particular deep learning, algorithms are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations on the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data, will provide further opportunities for technological advancements in the field.
Collapse
Affiliation(s)
- Kevin Korfmann
- Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, Germany
| | - Oscar E Gaggiotti
- Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife KY16 9TF, UK
| | - Matteo Fumagalli
- Department of Biological and Behavioural Sciences, Queen Mary University of London, UK
| |
Collapse
|
48
|
Zhang X, Kim B, Singh A, Sankararaman S, Durvasula A, Lohmueller KE. MaLAdapt Reveals Novel Targets of Adaptive Introgression From Neanderthals and Denisovans in Worldwide Human Populations. Mol Biol Evol 2023; 40:msad001. [PMID: 36617238 PMCID: PMC9887621 DOI: 10.1093/molbev/msad001] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2022] [Revised: 12/25/2022] [Accepted: 12/28/2022] [Indexed: 01/09/2023] Open
Abstract
Adaptive introgression (AI) facilitates local adaptation in a wide range of species. Many state-of-the-art methods detect AI with ad-hoc approaches that identify summary statistic outliers or intersect scans for positive selection with scans for introgressed genomic regions. Although widely used, approaches intersecting outliers are vulnerable to a high false-negative rate as the power of different methods varies, especially for complex introgression events. Moreover, population genetic processes unrelated to AI, such as background selection or heterosis, may create similar genomic signals to AI, compromising the reliability of methods that rely on neutral null distributions. In recent years, machine learning (ML) methods have been increasingly applied to population genetic questions. Here, we present a ML-based method called MaLAdapt for identifying AI loci from genome-wide sequencing data. Using an Extra-Trees Classifier algorithm, our method combines information from a large number of biologically meaningful summary statistics to capture a powerful composite signature of AI across the genome. In contrast to existing methods, MaLAdapt is especially well-powered to detect AI with mild beneficial effects, including selection on standing archaic variation, and is robust to non-AI selective sweeps, heterosis from deleterious mutations, and demographic misspecification. Furthermore, MaLAdapt outperforms existing methods for detecting AI based on the analysis of simulated data and the validation of empirical signals through visual inspection of haplotype patterns. We apply MaLAdapt to the 1000 Genomes Project human genomic data and discover novel AI candidate regions in non-African populations, including genes that are enriched in functionally important biological pathways regulating metabolism and immune responses.
Collapse
Affiliation(s)
- Xinjun Zhang
- Department of Ecology and Evolutionary Biology, UCLA, Los Angeles, CA
| | - Bernard Kim
- Department of Biology, Stanford University, Palo Alto, CA
| | - Armaan Singh
- Department of Computer Science, UCLA, Los Angeles, CA
| | - Sriram Sankararaman
- Department of Computer Science, UCLA, Los Angeles, CA
- Department of Computational Medicine, UCLA, Los Angeles, CA
- Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, CA
| | - Arun Durvasula
- Department of Genetics, Harvard Medical School, Boston, MA
| | - Kirk E Lohmueller
- Department of Ecology and Evolutionary Biology, UCLA, Los Angeles, CA
- Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, CA
| |
Collapse
|
49
|
Harris M, Garud NR. Enrichment of Hard Sweeps on the X Chromosome in Drosophila melanogaster. Mol Biol Evol 2022; 40:6955808. [PMID: 36546413 PMCID: PMC9825254 DOI: 10.1093/molbev/msac268] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2022] [Revised: 11/11/2022] [Accepted: 12/05/2022] [Indexed: 12/24/2022] Open
Abstract
The characteristic properties of the X chromosome, such as male hemizygosity and its unique inheritance pattern, expose it to natural selection in a way that can be different from the autosomes. Here, we investigate the differences in the tempo and mode of adaptation on the X chromosome and autosomes in a population of Drosophila melanogaster. Specifically, we test the hypothesis that due to hemizygosity and a lower effective population size on the X, the relative proportion of hard sweeps, which are expected when adaptation is gradual, compared with soft sweeps, which are expected when adaptation is rapid, is greater on the X than on the autosomes. We quantify the incidence of hard versus soft sweeps in North American D. melanogaster population genomic data with haplotype homozygosity statistics and find an enrichment of the proportion of hard versus soft sweeps on the X chromosome compared with the autosomes, confirming predictions we make from simulations. Understanding these differences may enable a deeper understanding of how important phenotypes arise as well as the impact of fundamental evolutionary parameters on adaptation, such as dominance, sex-specific selection, and sex-biased demography.
Collapse
Affiliation(s)
- Mariana Harris
- Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA
| | | |
Collapse
|
50
|
Provost KL, Yang J, Carstens BC. The impacts of fine-tuning, phylogenetic distance, and sample size on big-data bioacoustics. PLoS One 2022; 17:e0278522. [PMID: 36477744 PMCID: PMC9728902 DOI: 10.1371/journal.pone.0278522] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2022] [Accepted: 11/17/2022] [Indexed: 12/12/2022] Open
Abstract
Vocalizations in animals, particularly birds, are critically important behaviors that influence their reproductive fitness. While recordings of bioacoustic data have been captured and stored in collections for decades, the automated extraction of data from these recordings has only recently been facilitated by artificial intelligence methods. These have yet to be evaluated with respect to accuracy of different automation strategies and features. Here, we use a recently published machine learning framework to extract syllables from ten bird species ranging in their phylogenetic relatedness from 1 to 85 million years, to compare how phylogenetic relatedness influences accuracy. We also evaluate the utility of applying trained models to novel species. Our results indicate that model performance is best on conspecifics, with accuracy progressively decreasing as phylogenetic distance increases between taxa. However, we also find that the application of models trained on multiple distantly related species can improve the overall accuracy to levels near that of training and analyzing a model on the same species. When planning big-data bioacoustics studies, care must be taken in sample design to maximize sample size and minimize human labor without sacrificing accuracy.
Collapse
Affiliation(s)
- Kaiya L. Provost
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, Ohio, United States of America
| | - Jiaying Yang
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, Ohio, United States of America
| | - Bryan C. Carstens
- Department of Evolution, Ecology and Organismal Biology, The Ohio State University, Columbus, Ohio, United States of America
| |
Collapse
|