1
Dash HR, Patel A. Genealogically bewildered individuals and forensic identification: a review of current and emerging solutions. Int J Legal Med 2025:10.1007/s00414-025-03513-2. [PMID: 40411594 DOI: 10.1007/s00414-025-03513-2] [Received: 01/28/2025] [Accepted: 05/10/2025] [Indexed: 05/26/2025]
Abstract
The increasing use of assisted reproductive technologies (ART) with donor gametes is driven by rising infertility rates, delayed parenthood, and the need to prevent hereditary diseases. Greater social acceptance of diverse family structures, advancements in reproductive medicine, and improving success rates also contribute. Accessibility, affordability, and cross-border reproductive care further expand ART's reach, making donor gametes a preferred option for many individuals and couples worldwide. The widespread application of ART has led to an increasing number of donor-conceived individuals, many of whom are now reaching reproductive maturity. This demographic shift introduces significant challenges for traditional forensic genetic identification methods, which rely on biological reference samples from genetically related individuals. The absence of such samples complicates the identification process, particularly for individuals conceived via gamete donation or adoption, where biological and legal parentage are incongruent. Conventional forensic genetic analyses, including short tandem repeat (STR) and single nucleotide polymorphism (SNP) profiling of autosomal, Y-chromosome, X-chromosome, and mitochondrial DNA, exhibit limited efficacy in these scenarios. While these methods can sometimes identify individuals conceived using a single donor gamete, they are insufficient for cases involving dual donor gametes or mitochondrial replacement therapy. Emerging methodologies such as forensic genetic genealogy, DNA methylation profiling, and human microbiome analysis offer innovative approaches but necessitate further clinical validation and standardization.
Affiliation(s)
- Hirak Ranjan Dash
  - Department of Forensic Science, National Forensic Sciences University, Delhi Campus, New Delhi, 110085, India.
  - School of Forensic Sciences, Centurion University of Technology and Management, Bhubaneswar, Odisha, 752050, India.
- Anubhuti Patel
  - Department of Reproductive Medicine and the Center for Human Reproduction, IMS and SUM Hospital, Bhubaneswar, Odisha, 751003, India
2
Arnab SP, Campelo dos Santos AL, Fumagalli M, DeGiorgio M. Efficient Detection and Characterization of Targets of Natural Selection Using Transfer Learning. Mol Biol Evol 2025; 42:msaf094. [PMID: 40341942 PMCID: PMC12062966 DOI: 10.1093/molbev/msaf094] [Received: 11/04/2024] [Revised: 04/16/2025] [Accepted: 04/17/2025] [Indexed: 05/11/2025]
Abstract
Natural selection leaves detectable patterns of altered spatial diversity within genomes, and identifying affected regions is crucial for understanding species evolution. Recently, machine learning approaches applied to raw population genomic data have been developed to uncover these adaptive signatures. Convolutional neural networks (CNNs) are particularly effective for this task, as they handle large data arrays while maintaining element correlations. However, shallow CNNs may miss complex patterns due to their limited capacity, while deep CNNs can capture these patterns but require extensive data and computational power. Transfer learning addresses these challenges by utilizing a deep CNN pretrained on a large dataset as a feature extraction tool for downstream classification and evolutionary parameter prediction. This approach reduces extensive training data generation requirements and computational needs while maintaining high performance. In this study, we developed TrIdent, a tool that uses transfer learning to enhance detection of adaptive genomic regions from image representations of multilocus variation. We evaluated TrIdent across various genetic, demographic, and adaptive settings, in addition to unphased data and other confounding factors. TrIdent demonstrated improved detection of adaptive regions compared to recent methods using similar data representations. We further explored model interpretability through class activation maps and adapted TrIdent to infer selection parameters for identified adaptive candidates. Using whole-genome haplotype data from European and African populations, TrIdent effectively recapitulated known sweep candidates and identified novel cancer, and other disease-associated genes as potential sweeps.
Affiliation(s)
- Sandipan Paul Arnab
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
- Matteo Fumagalli
  - School of Biological and Behavioural Sciences, Queen Mary University of London, London, UK
  - The Alan Turing Institute, London, UK
- Michael DeGiorgio
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
3
Tan M, Tan Y, Jiang H, Xue J, Wu Q, Zheng Y, Liu G, Xiao Y, Lv M, Liao M, Zhang L, Qu S, Liang W. Explainable artificial intelligence in forensic DNA analysis: Alleles identification in challenging electropherograms using supervised machine learning methods. Forensic Sci Int Genet 2025; 78:103289. [PMID: 40288204 DOI: 10.1016/j.fsigen.2025.103289] [Received: 09/08/2024] [Revised: 03/17/2025] [Accepted: 04/23/2025] [Indexed: 04/29/2025]
Abstract
Challenging samples in capillary electrophoresis (CE)-based short tandem repeat (STR) analysis often produce artefactual signals that cannot be completely filtered out by expert electropherogram (EPG) reading systems, complicating allele interpretation. Previous studies have demonstrated the potential of artificial intelligence (AI) to address this issue by accurately distinguishing allele signals from artefacts in EPGs. Traditional machine learning models offer significant advantages in enhancing the interpretability and transparency of AI models used in DNA analysis, particularly in criminal investigations and legal contexts. In this study, five traditional machine learning algorithms were employed to train and construct models using EPG signal datasets from single-source low-template EPGs, mixture EPGs, and combined datasets. Performance evaluation and validation with additional datasets demonstrated the feasibility of these models in improving the reportability of potential information in EPGs. However, further optimization is needed for mixture EPGs to enhance classification accuracy. Implementing Receiver Operating Characteristic (ROC) curve analysis and prediction probability thresholds effectively reduced false positive classifications. Additionally, a user-friendly platform was developed for EPG signal classification based on machine learning and ensemble learning, allowing for the classification of any signal datasets using traditional machine learning models and combining the prediction results of multiple models. This platform will provide analysts with more optimal and robust results. This study shows that machine-learning-based EPG signal classification models can significantly enhance the efficiency of sample analysis and interpretation, providing a solid foundation for future research.
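The effect of a prediction-probability threshold described in this abstract can be illustrated with a minimal sketch. The scores, labels, and the function name `confusion_at_threshold` are hypothetical and are not taken from the study. The example only shows that raising the threshold for calling a signal an allele can remove false positives at the cost of extra false negatives.

```python
from typing import List, Tuple


def confusion_at_threshold(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) when signals scoring >= threshold are called alleles."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        called = score >= threshold
        if called and label == 1:
            tp += 1
        elif called and label == 0:
            fp += 1
        elif not called and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical classifier outputs: predicted probability that a signal is an allele.
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
# Hypothetical ground truth: 1 = allele, 0 = artefact.
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

for threshold in (0.5, 0.7):
    tp, fp, tn, fn = confusion_at_threshold(scores, labels, threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} TN={tn} FN={fn}")
```

In this toy data the stricter threshold removes both false positives but also misses one more true allele, which is the trade-off a ROC curve summarises across all thresholds.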
Affiliation(s)
- Mengyu Tan
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China
- Yuxuan Tan
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China
- Haoyan Jiang
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China
- Jiaming Xue
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China
- Qiushuo Wu
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China
- Yazi Zheng
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China
- Guihong Liu
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China
- Yuanyuan Xiao
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China
- Meili Lv
  - Department of Immunology, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, Sichuan 610041, China
- Miao Liao
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China
- Lin Zhang
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China
- Shengqiu Qu
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China.
- Weibo Liang
  - Department of Forensic Genetics, West China School of Basic Medical Sciences and Forensic Medicine, Sichuan University, Chengdu, China.
4
Rehmann CT, Small ST, Ralph PL, Kern AD. Sweeps in space: leveraging geographic data to identify beneficial alleles in Anopheles gambiae. bioRxiv 2025:2025.02.07.637123. [PMID: 39975147 PMCID: PMC11839090 DOI: 10.1101/2025.02.07.637123] [Indexed: 02/21/2025]
Abstract
As organisms adapt to environmental changes, natural selection modifies the frequency of non-neutral alleles. For beneficial mutations, the outcome of this process may be a selective sweep, in which an allele rapidly increases in frequency and perhaps reaches fixation within a population. Selective sweeps have well-studied effects on patterns of local genetic variation in panmictic populations, but much less is known about the dynamics of sweeps in continuous space. In particular, because limited movement across a landscape leads to unique patterns of population structure, spatial dynamics may influence the trajectory of selected mutations. Here, we use forward-in-time, individual-based simulations in continuous space to study the impact of space on beneficial mutations as they sweep through a population. In particular, we show that selection changes the joint distribution of allele frequency and geographic range occupied by a focal allele and demonstrate that this signal can be used to identify selective sweeps. We then leverage this signal to identify in-progress selective sweeps within the malaria vector Anopheles gambiae, a species under strong selection pressure from vector control measures. By considering space, we identify multiple previously undescribed variants with potential phenotypic consequences, including mutations impacting genes known to be associated with insecticide resistance (IR) and altering protein structure and properties. Our results demonstrate a novel signal for detecting selection in spatial population genetic data that may have implications for genomic surveillance and understanding geographic patterns of genetic variation.
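The two quantities whose joint distribution this abstract refers to, allele frequency and geographic range, can be computed per allele with a minimal sketch. The abstract does not say how the authors measure geographic range, so the definition used here (the largest pairwise distance between carriers) is an assumption, as are the function name `frequency_and_range`, the haploid one-sample-per-location simplification, and the toy coordinates.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple


def frequency_and_range(
    coords: Sequence[Tuple[float, float]], carries: Sequence[bool]
) -> Tuple[float, float]:
    """Return (allele frequency, geographic range) for one focal allele.

    Range is taken here as the maximum pairwise Euclidean distance between
    carriers; this is an illustrative choice, not the study's definition.
    """
    carriers: List[Tuple[float, float]] = [
        xy for xy, has_allele in zip(coords, carries) if has_allele
    ]
    frequency = len(carriers) / len(coords)
    span = max((dist(a, b) for a, b in combinations(carriers, 2)), default=0.0)
    return frequency, span


# Hypothetical sampling locations (arbitrary units) and carrier status.
coords = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0), (1.0, 1.0)]
carries = [True, True, False, False]

print(frequency_and_range(coords, carries))
```

Computing this pair for every allele in a data set gives the empirical joint distribution against which unusual alleles could be compared.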
5
Oyeniran KA, Tenibiaje MO. Detectable episodic positive selection in the virion strand a-strain Maize streak virus genes may have a role in its host adaptation. Virus Genes 2025:10.1007/s11262-025-02157-z. [PMID: 40237943 DOI: 10.1007/s11262-025-02157-z] [Received: 11/06/2024] [Accepted: 04/06/2025] [Indexed: 04/18/2025]
Abstract
Maize streak virus (MSV) has four genes: cp, encoding the coat protein; mp, the movement protein; and repA and rep, encoding two distinct replication-associated proteins from an alternatively spliced transcript. These genes play roles in encapsidation, movement, replication, and interactions with the external environment, making them prone to stimuli-driven molecular adaptation. We performed selection analyses on publicly available curated, recombination-free, complete coding sequences for representative A-strain maize streak virus (MSV-A) cp and mp genes. We found evidence of gene-wide selection in these two MSV genes at specific sites within the genes (cp: 1.23% and mp: 0.99%). Positively selected sites have amino acids that are 60% hydrophilic and 40% hydrophobic in nature. We found significant evidence of positive selection at branches (cp: 0.76% and mp: 1.66%) representing the diversity of MSV-A-strain in South Africa, which is related to the MSV-A-matA isolate (GenBank accession number: AF329881), well disseminated and adapted to the maize plant in sub-Saharan Africa. In the mp gene, selection significantly intensified for the overall diversities of the MSV-A sequences and those more related to the MSV-Mat-A isolate. These findings reveal that despite predominantly undergoing non-diversifying selection, the detectable diversifying positive selection observed in these genes may play a major role in MSV-A host adaptive evolution, ensuring sufficient pathogenicity for onward transmission without killing the host.
Affiliation(s)
- Kehinde A Oyeniran
  - Department of Biological Sciences, Bamidele Olumilua University of Education Science and Technology, P.M.B. 250, Ikere-Ekiti, Ekiti, Nigeria.
  - Plant Systems Biology, International Centre for Genetic Engineering and Biotechnology, Cape Town, 7925, South Africa.
- Mobolaji O Tenibiaje
  - Department of Computing and Information Science, Bamidele Olumilua University of Education Science and Technology, P.M.B. 250, Ikere-Ekiti, Ekiti, Nigeria
6
Wang C, Wang S, Zhao Y, Liu J, Zhang D, Wang F, Fan H, Li C, Jiang L. A biogeographical ancestry inference pipeline using PCA-XGBoost model and its application in Asian populations. Forensic Sci Int Genet 2025; 77:103239. [PMID: 40037006 DOI: 10.1016/j.fsigen.2025.103239] [Received: 07/26/2024] [Revised: 01/09/2025] [Accepted: 02/12/2025] [Indexed: 03/06/2025]
Abstract
Biogeographical ancestry (BGA) inference plays a crucial role in genetics, anthropology, forensic science, and medical research. Current methods like principal component analysis (PCA) and ADMIXTURE, based on single nucleotide polymorphisms, are commonly used. Here, we introduce a biogeographical ancestry inference pipeline that integrates prior population structure and clustering. Our pipeline first analyzes genetic structure on cleaned data to obtain optimal parameters and classification model labels. An XGBoost (eXtreme Gradient Boosting) classification model is constructed using principal components from PCA, and model predictions are evaluated with a likelihood ratio (LR). The pipeline was applied to a dataset of Asian populations, achieving an initial prediction accuracy of 96.27%. The LR-based evaluation accuracy reached 98.96%, an improvement of 2.69 percentage points with the introduction of LR assessment. This highlights the robust predictive capability of our pipeline and the improved accuracy in evaluation with LR. This successful application will benefit genetic research, human history studies, and criminal investigations. Additionally, the pipeline's versatility allows application to new datasets.
Affiliation(s)
- Chunnain Wang
  - School of Computer Science, Shaanxi Normal University, Xian, Shaanxi 710119, China; Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing 100038, China
- Shuaiqi Wang
  - Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing 100038, China; School of Investigation, People's Public Security University of China, Beijing 100038, China
- Yiru Zhao
  - Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing 100038, China; Jiangsu International Joint Research Center of Genomics, Jiangsu Key Laboratory of Phylogenomics and Comparative Genomics, School of Life Science, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China
- Jun Liu
  - Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing 100038, China; School of Forensic Medicine, Shanxi Medical University, Jinzhong, Shanxi 030600, China
- Deqin Zhang
  - Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing 100038, China; Institute of Forensic Medicine, Guizhou Medical University, Guiyang, Guizhou 550004, China
- Fuyang Wang
  - Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing 100038, China
- Hong Fan
  - School of Computer Science, Shaanxi Normal University, Xian, Shaanxi 710119, China.
- Caixia Li
  - Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing 100038, China.
- Li Jiang
  - Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing 100038, China.
7
Shastry V, Musiani M, Novembre J. Jointly representing long-range genetic similarity and spatially heterogeneous isolation-by-distance. bioRxiv 2025:2025.02.10.637386. [PMID: 39990319 PMCID: PMC11844421 DOI: 10.1101/2025.02.10.637386] [Indexed: 02/25/2025]
Abstract
Isolation-by-distance patterns in genetic variation are a widespread feature of the geographic structure of genetic variation in many species, and many methods have been developed to illuminate such patterns in genetic data. However, long-range genetic similarities also exist, often as a result of rare or episodic long-range gene flow. Jointly characterizing patterns of isolation-by-distance and long-range genetic similarity in genetic data is an open data analysis challenge that, if resolved, could help produce more complete representations of the geographic structure of genetic data in any given species. Here, we present a computationally tractable method that identifies long-range genetic similarities in a background of spatially heterogeneous isolation-by-distance variation. The method uses a coalescent-based framework, and models long-range genetic similarity in terms of directional events with source fractions describing the fraction of ancestry at a location tracing back to a remote source. The method produces geographic maps annotated with inferred long-range edges, as well as maps of uncertainty in the geographic location of each source of long-range gene flow. We have implemented the method in a package called FEEMSmix (an extension to FEEMS from Marcus et al., 2021), and validated its implementation using simulations representative of typical data applications. We also apply this method to two empirical data sets. In a data set of over 4,000 humans (Homo sapiens) across Afro-Eurasia, we recover many known signals of long-distance dispersal from recent centuries. Similarly, in a data set of over 100 gray wolves (Canis lupus) across North America, we identify several previously unknown long-range connections, some of which were attributable to recording errors in sampling locations. Therefore, beyond identifying genuine long-range dispersals, our approach also serves as a useful tool for quality control in spatial genetic studies.
Affiliation(s)
- Vivaswat Shastry
  - Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
- Marco Musiani
  - Department of Biological, Geological, and Environmental Sciences, University of Bologna, Bologna, Italy
- John Novembre
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
8
Arnab SP, Dos Santos ALC, Fumagalli M, DeGiorgio M. Efficient detection and characterization of targets of natural selection using transfer learning. bioRxiv 2025:2025.03.05.641710. [PMID: 40093065 PMCID: PMC11908262 DOI: 10.1101/2025.03.05.641710] [Indexed: 03/19/2025]
Abstract
Natural selection leaves detectable patterns of altered spatial diversity within genomes, and identifying affected regions is crucial for understanding species evolution. Recently, machine learning approaches applied to raw population genomic data have been developed to uncover these adaptive signatures. Convolutional neural networks (CNNs) are particularly effective for this task, as they handle large data arrays while maintaining element correlations. However, shallow CNNs may miss complex patterns due to their limited capacity, while deep CNNs can capture these patterns but require extensive data and computational power. Transfer learning addresses these challenges by utilizing a deep CNN pre-trained on a large dataset as a feature extraction tool for downstream classification and evolutionary parameter prediction. This approach reduces extensive training data generation requirements and computational needs while maintaining high performance. In this study, we developed TrIdent, a tool that uses transfer learning to enhance detection of adaptive genomic regions from image representations of multilocus variation. We evaluated TrIdent across various genetic, demographic, and adaptive settings, in addition to unphased data and other confounding factors. TrIdent demonstrated improved detection of adaptive regions compared to recent methods using similar data representations. We further explored model interpretability through class activation maps and adapted TrIdent to infer selection parameters for identified adaptive candidates. Using whole-genome haplotype data from European and African populations, TrIdent effectively recapitulated known sweep candidates and identified novel cancer, and other disease-associated genes as potential sweeps.
Affiliation(s)
- Sandipan Paul Arnab
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
- Matteo Fumagalli
  - School of Biological and Behavioural Sciences, Queen Mary University of London, London, UK
  - The Alan Turing Institute, London, UK
- Michael DeGiorgio
  - Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
9
Hodgins KA, Battlay P, Bock DG. The genomic secrets of invasive plants. New Phytol 2025; 245:1846-1863. [PMID: 39748162 DOI: 10.1111/nph.20368] [Received: 07/16/2024] [Accepted: 11/28/2024] [Indexed: 01/04/2025]
Abstract
Genomics has revolutionised the study of invasive species, allowing evolutionary biologists to dissect mechanisms of invasion in unprecedented detail. Botanical research has played an important role in these advances, driving much of what we currently know about key determinants of invasion success (e.g. hybridisation, whole-genome duplication). Despite this, a comprehensive review of plant invasion genomics has been lacking. Here, we aim to address this gap, highlighting recent discoveries that have helped progress the field. For example, by leveraging genomics in natural and experimental populations, botanical research has confirmed the importance of large-effect standing variation during adaptation in invasive species. Further, genomic investigations of plants are increasingly revealing that large structural variants, as well as genetic changes induced by whole-genome duplication such as genomic redundancy or the breakdown of dosage-sensitive reproductive barriers, can play an important role during adaptive evolution of invaders. However, numerous questions remain, including when chromosomal inversions might help or hinder invasions, whether adaptive gene reuse is common during invasions, and whether epigenetically induced mutations can underpin the adaptive evolution of plasticity in invasive populations. We conclude by highlighting these and other outstanding questions that genomic studies of invasive plants are poised to help answer.
Affiliation(s)
- Kathryn A Hodgins
  - School of Biological Sciences, Monash University, 25 Rainforest Walk, Clayton, Vic., 3800, Australia
- Paul Battlay
  - School of Biological Sciences, Monash University, 25 Rainforest Walk, Clayton, Vic., 3800, Australia
- Dan G Bock
  - School of Environment and Science, Griffith University, 170 Kessels Road, Nathan, Qld, 4111, Australia
10
Battlay P, Craig S, Putra AR, Monro K, De Silva NP, Wilson J, Bieker VC, Kabir S, Shamaya N, van Boheemen L, Rieseberg LH, Stinchcombe JR, Fournier-Level A, Martin MD, Hodgins KA. Rapid Parallel Adaptation in Distinct Invasions of Ambrosia Artemisiifolia Is Driven by Large-Effect Structural Variants. Mol Biol Evol 2025; 42:msae270. [PMID: 39812008 PMCID: PMC11733498 DOI: 10.1093/molbev/msae270] [Received: 09/24/2024] [Revised: 11/21/2024] [Accepted: 12/17/2024] [Indexed: 01/16/2025]
Abstract
When introduced to multiple distinct ranges, invasive species provide a compelling natural experiment for understanding the repeatability of adaptation. Ambrosia artemisiifolia is an invasive, noxious weed and a chief cause of hay fever. Leveraging over 400 whole-genome sequences spanning the native range in North America and two invasions in Europe and Australia, we inferred demographically distinct invasion histories on each continent. Despite substantial differences in genetic source and effective population size changes during introduction, scans of both local climate adaptation and divergence from the native range revealed genomic signatures of parallel adaptation between invasions. Disproportionately represented among these parallel signatures are 37 large haploblocks (indicators of structural variation) that cover almost 20% of the genome and exist as standing genetic variation in the native range. Many of these haploblocks are associated with traits important for adaptation to local climate, like size and the timing of flowering, and have rapidly reformed native-range clines in invaded ranges. Others show extreme frequency divergence between ranges, consistent with a response to divergent selection on different continents. Our results demonstrate the key role of large-effect standing variants in rapid adaptation during range expansion, a pattern that is robust to diverse invasion histories.
Affiliation(s)
- Paul Battlay
  - School of Biological Sciences, Monash University, Clayton, Victoria 3800, Australia
- Samuel Craig
  - School of Biological Sciences, Monash University, Clayton, Victoria 3800, Australia
- Andhika R Putra
  - School of BioSciences, The University of Melbourne, Parkville, Victoria 3010, Australia
- Keyne Monro
  - School of Biological Sciences, Monash University, Clayton, Victoria 3800, Australia
- Nissanka P De Silva
  - School of Biological Sciences, Monash University, Clayton, Victoria 3800, Australia
- Jonathan Wilson
  - School of Biological Sciences, Monash University, Clayton, Victoria 3800, Australia
- Vanessa C Bieker
  - Department of Natural History, NTNU University Museum, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
- Saila Kabir
  - School of Biological Sciences, Monash University, Clayton, Victoria 3800, Australia
- Nawar Shamaya
  - School of Biological Sciences, Monash University, Clayton, Victoria 3800, Australia
- Lotte van Boheemen
  - School of Biological Sciences, Monash University, Clayton, Victoria 3800, Australia
- Loren H Rieseberg
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, Canada
- John R Stinchcombe
  - Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario M5S3B2, Canada
- Michael D Martin
  - Department of Natural History, NTNU University Museum, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
- Kathryn A Hodgins
  - School of Biological Sciences, Monash University, Clayton, Victoria 3800, Australia
11
Witt KE, Villanea FA. Computational Genomics and Its Applications to Anthropological Questions. Am J Biol Anthropol 2024; 186 Suppl 78:e70010. [PMID: 40071816 PMCID: PMC11898561 DOI: 10.1002/ajpa.70010] [Received: 06/24/2024] [Revised: 10/14/2024] [Accepted: 12/19/2024] [Indexed: 03/15/2025]
Abstract
The advent of affordable genome sequencing and the development of new computational tools have established a new era of genomic knowledge. Sequenced human genomes number in the tens of thousands, including thousands of ancient human genomes. The abundance of data has been met with new analysis tools that can be used to understand populations' demographic and evolutionary histories. Thus, a variety of computational methods now exist that can be leveraged to answer anthropological questions. This includes novel likelihood and Bayesian methods, machine learning techniques, and a vast array of population simulators. These computational tools provide powerful insights gained from genomic datasets, although they are generally inaccessible to those with less computational experience. Here, we outline the theoretical workings behind computational genomics methods, limitations and other considerations when applying these computational methods, and examples of how computational methods have already been applied to anthropological questions. We hope this review will empower other anthropologists to utilize these powerful tools in their own research.
Affiliation(s)
- Kelsey E. Witt
  - Department of Genetics and Biochemistry and Center for Human Genetics, Clemson University, Clemson, South Carolina, USA
12
Osmond M, Coop G. Estimating dispersal rates and locating genetic ancestors with genome-wide genealogies. eLife 2024; 13:e72177. [PMID: 39589398 DOI: 10.7554/elife.72177] [Received: 07/14/2021] [Accepted: 11/24/2024] [Indexed: 11/27/2024] Open
Abstract
Spatial patterns in genetic diversity are shaped by individuals dispersing from their parents and larger-scale population movements. It has long been appreciated that these patterns of movement shape the underlying genealogies along the genome, leading to geographic patterns of isolation-by-distance in contemporary population genetic data. However, extracting the enormous amount of information contained in genealogies along recombining sequences has, until recently, not been computationally feasible. Here, we capitalize on important recent advances in genome-wide gene-genealogy reconstruction and develop methods to use thousands of trees to estimate per-generation dispersal rates and to locate the genetic ancestors of a sample back through time. We take a likelihood approach in continuous space using a simple approximate model (branching Brownian motion) as our prior distribution of spatial genealogies. After testing our method with simulations, we apply it to Arabidopsis thaliana. We estimate a dispersal rate of roughly 60 km²/generation, slightly higher across latitude than across longitude, potentially reflecting a northward post-glacial expansion. Locating ancestors allows us to visualize major geographic movements, alternative geographic histories, and admixture. Our method highlights the huge amount of information about past dispersal events and population movements contained in genome-wide genealogies.
Affiliation(s)
- Matthew Osmond
  - Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
- Graham Coop
  - Department of Evolution & Ecology and Center for Population Biology, University of California, Davis, Davis, United States
13
Whitehouse LS, Ray DD, Schrider DR. Tree Sequences as a General-Purpose Tool for Population Genetic Inference. Mol Biol Evol 2024; 41:msae223. [PMID: 39460991 PMCID: PMC11600592 DOI: 10.1093/molbev/msae223] [Received: 02/20/2024] [Revised: 10/05/2024] [Accepted: 10/17/2024] [Indexed: 10/28/2024] Open
Abstract
As population genetic data increase in size, new methods have been developed to store genetic information in efficient ways, such as tree sequences. These data structures are computationally and storage efficient but are not interchangeable with existing data structures used for many population genetic inference methodologies such as the use of convolutional neural networks applied to population genetic alignments. To better utilize these new data structures, we propose and implement a graph convolutional network to directly learn from tree sequence topology and node data, allowing for the use of neural network applications without an intermediate step of converting tree sequences to population genetic alignment format. We then compare our approach to standard convolutional neural network approaches on a set of previously defined benchmarking tasks including recombination rate estimation, positive selection detection, introgression detection, and demographic model parameter inference. We show that tree sequences can be directly learned from using a graph convolutional network approach and can be used to perform well on these common population genetic inference tasks with accuracies roughly matching or even exceeding that of a convolutional neural network-based method. As tree sequences become more widely used in population genetic research, we foresee developments and optimizations of this work to provide a foundation for population genetic inference moving forward.
Affiliation(s)
- Logan S Whitehouse
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
- Dylan D Ray
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
- Daniel R Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
14
Hernández F, Vercellino RB, Todesco M, Bercovich N, Alvarez D, Brunet J, Presotto A, Rieseberg LH. Admixture With Cultivated Sunflower Likely Facilitated Establishment and Spread of Wild Sunflower (Helianthus annuus) in Argentina. Mol Ecol 2024; 33:e17560. [PMID: 39422702 DOI: 10.1111/mec.17560] [Received: 04/10/2024] [Revised: 08/20/2024] [Accepted: 08/29/2024] [Indexed: 10/19/2024]
Abstract
A better understanding of the genetic and ecological factors underlying successful invasions is critical to mitigate the negative impacts of invasive species. Here, we study the invasion history of Helianthus annuus populations from Argentina, with particular emphasis on the role of post-introduction admixture with cultivated sunflower (also H. annuus) and climate adaptation driven by large haploblocks. We conducted genotyping-by-sequencing of samples of wild populations as well as Argentinian cultivars and compared them with wild (including related annual Helianthus species) and cultivated samples from the native range. We also characterised samples for 11 known haploblocks associated with environmental variation in native populations to test whether haploblocks contributed to invasion success. Population genomics analyses supported two independent geographic sources for Argentinian populations, the central United States and Texas, but no significant contribution of related annual Helianthus species. We found pervasive admixture with cultivated sunflower, likely as a result of post-introduction hybridization. Genomic scans between invasive populations and their native sources identified multiple genomic regions of divergence, possibly indicative of selection, in the invaded range. These regions significantly overlapped between the two native-invasive comparisons and showed disproportionately high crop ancestry, suggesting that crop alleles contributed to invasion success. We did not find evidence of climate adaptation mediated by haploblocks, yet outliers of genome scans were enriched in haploblock regions and, for at least two haploblocks, the cultivar haplotype was favoured in Argentina. Our results show that admixture with cultivated sunflower played a major role in the establishment and spread of H. annuus populations in Argentina.
Affiliation(s)
- Fernando Hernández
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia, Canada
- Román B Vercellino
  - Departamento de Agronomía, CERZOS, Universidad Nacional del Sur (UNS)-CONICET, Bahía Blanca, Argentina
- Marco Todesco
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia, Canada
  - Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
  - Irving K. Barber Faculty of Science, University of British Columbia Okanagan, Kelowna, British Columbia, Canada
- Natalia Bercovich
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia, Canada
- Daniel Alvarez
  - Estación Experimental Agropecuaria INTA Manfredi, Córdoba, Argentina
- Johanne Brunet
  - Vegetable Crops Research Unit, USDA-ARS, Madison, Wisconsin, USA
- Alejandro Presotto
  - Departamento de Agronomía, CERZOS, Universidad Nacional del Sur (UNS)-CONICET, Bahía Blanca, Argentina
- Loren H Rieseberg
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia, Canada
15
Whitehouse LS, Ray D, Schrider DR. Tree sequences as a general-purpose tool for population genetic inference. bioRxiv 2024:2024.02.20.581288. [PMID: 39185244 PMCID: PMC11343121 DOI: 10.1101/2024.02.20.581288] [Indexed: 08/27/2024]
Abstract
As population genetics data increases in size, new methods have been developed to store genetic information in efficient ways, such as tree sequences. These data structures are computationally and storage efficient, but are not interchangeable with existing data structures used for many population genetic inference methodologies such as the use of convolutional neural networks (CNNs) applied to population genetic alignments. To better utilize these new data structures, we propose and implement a graph convolutional network (GCN) to directly learn from tree sequence topology and node data, allowing for the use of neural network applications without an intermediate step of converting tree sequences to population genetic alignment format. We then compare our approach to standard CNN approaches on a set of previously defined benchmarking tasks including recombination rate estimation, positive selection detection, introgression detection, and demographic model parameter inference. We show that tree sequences can be directly learned from using a GCN approach and can be used to perform well on these common population genetics inference tasks with accuracies roughly matching or even exceeding that of a CNN-based method. As tree sequences become more widely used in population genetics research, we foresee developments and optimizations of this work to provide a foundation for population genetics inference moving forward.
Affiliation(s)
- Logan S. Whitehouse
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA, 120 Mason Farm Rd, Chapel Hill, NC 27514
- Dylan Ray
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA, 120 Mason Farm Rd, Chapel Hill, NC 27514
- Daniel R. Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA, 120 Mason Farm Rd, Chapel Hill, NC 27514
16
Smith CCR, Patterson G, Ralph PL, Kern AD. Estimation of spatial demographic maps from polymorphism data using a neural network. Mol Ecol Resour 2024; 24:e14005. [PMID: 39152666 DOI: 10.1111/1755-0998.14005] [Received: 03/26/2024] [Revised: 07/16/2024] [Accepted: 08/06/2024] [Indexed: 08/19/2024]
Abstract
A fundamental goal in population genetics is to understand how variation is arrayed over natural landscapes. From first principles we know that common features such as heterogeneous population densities and barriers to dispersal should shape genetic variation over space; however, there are few tools currently available that can deal with these ubiquitous complexities. Geographically referenced single nucleotide polymorphism (SNP) data are increasingly accessible, presenting an opportunity to study genetic variation across geographic space in myriad species. We present a new inference method that uses geo-referenced SNPs and a deep neural network to estimate spatially heterogeneous maps of population density and dispersal rate. Our neural network trains on simulated input and output pairings, where the input consists of genotypes and sampling locations generated from a continuous space population genetic simulator, and the output is a map of the true demographic parameters. We benchmark our tool against existing methods and discuss qualitative differences between the different approaches; in particular, our program is unique because it infers the magnitude of both dispersal and density as well as their variation over the landscape, and it does so using SNP data. Similar methods are constrained to estimating relative migration rates, or require identity-by-descent blocks as input. We applied our tool to empirical data from North American grey wolves, for which it estimated mostly reasonable demographic parameters, but was affected by incomplete spatial sampling. Genetic-based methods like ours complement other, direct methods for estimating past and present demography and, we believe, will serve as valuable tools for applications in conservation, ecology and evolutionary biology. An open source software package implementing our method is available from https://github.com/kr-colab/mapNN.
Affiliation(s)
- Chris C R Smith
  - Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, USA
- Gilia Patterson
  - Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, USA
- Peter L Ralph
  - Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, USA
- Andrew D Kern
  - Institute of Ecology and Evolution, University of Oregon, Eugene, Oregon, USA
17
Kumar H, Qin X, Bhushan B, Dutt T, Panigrahi M. DeepGenomeScan of 15 Worldwide Bovine Populations Detects Spatially Varying Positive Selection Signals. OMICS 2024; 28:504-513. [PMID: 39315920 DOI: 10.1089/omi.2024.0154] [Indexed: 09/25/2024]
Abstract
Identifying genomic regions under selection is essential for understanding the genetic mechanisms driving species evolution and adaptation. Traditional methods often fall short in detecting complex, spatially varying selection signals. Recent advances in deep learning, however, present promising new approaches for uncovering subtle selection signals that traditional methods might miss. In this study, we utilized the deep learning framework DeepGenomeScan to detect spatially varying selection signatures across 15 bovine populations worldwide. Our analysis uncovered novel insights into selective sweep hotspots within the bovine genome, revealing key genes associated with physiological and adaptive traits that were previously undetected. We identified significant quantitative trait loci linked to milk protein and fat percentages. By comparing the selection signatures identified in this study with those reported in the Bovine Genome Variation Database, we discovered 38 novel genes under selection that were not identified through traditional methods. These genes are primarily associated with milk and meat yield and quality. Our findings enhance our understanding of spatially varying selection's impact on bovine genomic diversity, laying a foundation for future research in genetic improvement and conservation. This is the first deep learning-based study of selection signatures in cattle, offering new insights for evolutionary and livestock genomics research.
Affiliation(s)
- Harshit Kumar
  - Division of Animal Genetics, Indian Veterinary Research Institute, Izatnagar, India
  - ICAR-National Research Centre on Mithun, Medziphema, India
- Xinghu Qin
  - School of Ecology and Nature Conservation, Beijing Forestry University, Beijing, China
- Bharat Bhushan
  - Division of Animal Genetics, Indian Veterinary Research Institute, Izatnagar, India
- Triveni Dutt
  - Indian Veterinary Research Institute, Izatnagar, India
- Manjit Panigrahi
  - Division of Animal Genetics, Indian Veterinary Research Institute, Izatnagar, India
18
Gillespie LE, Ruffley M, Exposito-Alonso M. Deep learning models map rapid plant species changes from citizen science and remote sensing data. Proc Natl Acad Sci U S A 2024; 121:e2318296121. [PMID: 39236239 PMCID: PMC11406280 DOI: 10.1073/pnas.2318296121] [Received: 10/20/2023] [Accepted: 07/17/2024] [Indexed: 09/07/2024] Open
Abstract
Anthropogenic habitat destruction and climate change are reshaping the geographic distribution of plants worldwide. However, we are still unable to map species shifts at high spatial, temporal, and taxonomic resolution. Here, we develop a deep learning model trained using remote sensing images from California paired with half a million citizen science observations that can map the distribution of over 2,000 plant species. Our model, Deepbiosphere, not only outperforms many common species distribution modeling approaches (AUC 0.95 vs. 0.88) but can map species at up to a few meters resolution and finely delineate plant communities with high accuracy, including the pristine and clear-cut forests of Redwood National Park. These fine-scale predictions can further be used to map the intensity of habitat fragmentation and sharp ecosystem transitions across human-altered landscapes. In addition, from frequent collections of remote sensing data, Deepbiosphere can detect the rapid effects of severe wildfire on plant community composition across a 2-y time period. These findings demonstrate that integrating public earth observations and citizen science with deep learning can pave the way toward automated systems for monitoring biodiversity change in real-time worldwide.
Affiliation(s)
- Lauren E. Gillespie
  - Department of Plant Biology, Carnegie Science, Stanford, CA 94305
  - Department of Computer Science, Stanford University, Stanford, CA 94305
  - Department of Integrative Biology, University of California, Berkeley, CA 94720
- Megan Ruffley
  - Department of Plant Biology, Carnegie Science, Stanford, CA 94305
- Moises Exposito-Alonso
  - Department of Plant Biology, Carnegie Science, Stanford, CA 94305
  - Department of Integrative Biology, University of California, Berkeley, CA 94720
  - Department of Biology, Stanford University, Stanford, CA 94305
  - Department of Global Ecology, Carnegie Science, Stanford, CA 94305
  - HHMI, University of California, Berkeley, CA 94720
19
Giglio RM, Bowden CF, Brook RK, Piaggio AJ, Smyser TJ. Characterizing feral swine movement across the contiguous United States using neural networks and genetic data. Mol Ecol 2024; 33:e17489. [PMID: 39148259 DOI: 10.1111/mec.17489] [Received: 04/16/2024] [Revised: 07/03/2024] [Accepted: 07/09/2024] [Indexed: 08/17/2024]
Abstract
Globalization has led to the frequent movement of species out of their native habitat. Some of these species become highly invasive and capable of profoundly altering invaded ecosystems. Feral swine (Sus scrofa × domesticus) are recognized as being among the most destructive invasive species, with populations established on all continents except Antarctica. Within the United States (US), feral swine are responsible for extensive crop damage, the destruction of native ecosystems, and the spread of disease. Purposeful human-mediated movement of feral swine has contributed to their rapid range expansion over the past 30 years. Patterns of deliberate introduction of feral swine have not been well described as populations may be established or augmented through small, undocumented releases. By leveraging an extensive genomic database of 18,789 samples genotyped at 35,141 single nucleotide polymorphisms (SNPs), we used deep neural networks to identify translocated feral swine across the contiguous US. We classified 20% (3364/16,774) of sampled animals as having been translocated and described general patterns of translocation using measures of centrality in a network analysis. These findings unveil extensive movement of feral swine well beyond their dispersal capabilities, including individuals with predicted origins >1000 km away from their sampling locations. Our study provides insight into the patterns of human-mediated movement of feral swine across the US and from Canada to the northern areas of the US. Further, our study validates the use of neural networks for studying the spread of invasive species.
Affiliation(s)
- Rachael M Giglio
  - United States Department of Agriculture, Animal and Plant Health Inspection Service, Wildlife Services, National Wildlife Research Center, Fort Collins, Colorado, USA
- Courtney F Bowden
  - United States Department of Agriculture, Animal and Plant Health Inspection Service, Wildlife Services, National Wildlife Research Center, Fort Collins, Colorado, USA
- Ryan K Brook
  - Department of Animal and Poultry Science, College of Agriculture and Bioresources, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
- Antoinette J Piaggio
  - United States Department of Agriculture, Animal and Plant Health Inspection Service, Wildlife Services, National Wildlife Research Center, Fort Collins, Colorado, USA
- Timothy J Smyser
  - United States Department of Agriculture, Animal and Plant Health Inspection Service, Wildlife Services, National Wildlife Research Center, Fort Collins, Colorado, USA
20
Faraggi E, Jernigan RL, Kloczkowski A. Rapid discrimination between deleterious and benign missense mutations in the CAGI 6 experiment. Hum Genomics 2024; 18:89. [PMID: 39192324 PMCID: PMC11350969 DOI: 10.1186/s40246-024-00655-z] [Received: 06/15/2023] [Accepted: 08/08/2024] [Indexed: 08/29/2024] Open
Abstract
We describe the machine learning tool that we applied in the CAGI 6 experiment to predict whether single residue mutations in proteins are deleterious or benign. This tool was trained using only single sequences, i.e., without multiple sequence alignments or structural information. Instead, we used global characterizations of the protein sequence. Training and testing data for human gene mutations were obtained from ClinVar (ncbi.nlm.nih.gov/pub/ClinVar/), and for non-human gene mutations from UniProt (www.uniprot.org). Testing was done on post-training data from ClinVar. This testing yielded high AUC and Matthews correlation coefficient (MCC) for well-trained examples but low generalizability. For genes with either sparse or unbalanced training data, the prediction accuracy is poor. The resulting prediction server is available online at http://www.mamiris.com/Shoni.cagi6.
Affiliation(s)
- Eshel Faraggi
  - Research and Information Systems, LLC, 1620 E. 72nd ST., Indianapolis, IN, 46240, USA
  - Physics Department, Indiana University Purdue University Indianapolis, Indianapolis, IN, 46202, USA
- Robert L Jernigan
  - Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA, 50011, USA
- Andrzej Kloczkowski
  - The Steve and Cindy Rasmussen Institute for Genomic Medicine, Columbus, OH, 43205, USA
  - Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, 43205, USA
  - Department of Pediatrics, The Ohio State University, Columbus, OH, 43205, USA
21
Song M, Zhou Y, Zhao C, Song F, Hou Y. YHP: Y-chromosome Haplogroup Predictor for predicting male lineages based on Y-STRs. Forensic Sci Int 2024; 361:112113. [PMID: 38936202 DOI: 10.1016/j.forsciint.2024.112113] [Received: 03/18/2024] [Revised: 05/24/2024] [Accepted: 06/16/2024] [Indexed: 06/29/2024]
Abstract
The human Y chromosome reflects the evolutionary process of males. Male lineage tracing by Y chromosome is of great use in evolutionary, forensic, and anthropological studies. Identifying the male lineage based on the specific distribution of Y haplogroups narrows down the investigation scope, which has been used in forensic scenarios. However, although existing software aids familial searching by using Y-STRs (Y-chromosome short tandem repeats) to predict Y-SNP (Y-chromosome single nucleotide polymorphism) haplogroups, it often lacks resolution. In this study, we developed YHP (Y Haplogroup Predictor), a novel software offering high-resolution haplogroup inference without requiring extensive Y-SNP sequencing. Leveraging existing datasets (219 haplogroups, 4064 samples in total), YHP predicts haplogroups with 0.923 accuracy under the highest haplogroup resolution, employing a random forest algorithm. YHP, available on GitHub (https://github.com/cissy123/YHP-Y-Haplogroup-Predictor-), facilitates high-resolution haplogroup prediction, haplotype mismatch analysis, and haplotype similarity comparison. Notably, it demonstrates efficacy in East Asian populations, benefiting from training data from eight distinct East Asian ethnic populations. Moreover, it enables seamless integration of additional training sets, extending its utility to diverse populations.
Affiliation(s)
- Mengyuan Song
  - Department of Forensic Genetics, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China; Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
- Yuxiang Zhou
  - Department of Forensic Genetics, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China
- Chenxi Zhao
  - College of Computer Science, Sichuan University, Chengdu, China
- Feng Song
  - Department of Forensic Genetics, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China
- Yiping Hou
  - Department of Forensic Genetics, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China
22
Hong S, Choi YA, Joo DS, Gürsoy G. Privacy-preserving model evaluation for logistic and linear regression using homomorphically encrypted genotype data. J Biomed Inform 2024; 156:104678. [PMID: 38936565 PMCID: PMC11272436 DOI: 10.1016/j.jbi.2024.104678] [Received: 04/16/2024] [Revised: 05/29/2024] [Accepted: 06/19/2024] [Indexed: 06/29/2024]
Abstract
OBJECTIVE: Linear and logistic regression are widely used statistical techniques in population genetics for analyzing genetic data and uncovering patterns and associations in large genetic datasets, such as identifying genetic variations linked to specific diseases or traits. However, obtaining statistically significant results from these studies requires large amounts of sensitive genotype and phenotype information from thousands of patients, which raises privacy concerns. Although cryptographic techniques such as homomorphic encryption offer a potential solution to the privacy concerns because they allow computations on encrypted data, previous methods leveraging homomorphic encryption have not addressed the confidentiality of shared models, which can leak information about the training data. METHODS: In this work, we present a secure model evaluation method for linear and logistic regression using homomorphic encryption for six prediction tasks, where input genotypes, output phenotypes, and model parameters are all encrypted. RESULTS: Our method ensures no private information leakage during inference and achieves high accuracy (≥93% for all outcomes) with each inference taking less than ten seconds for ∼200 genomes. CONCLUSION: Our study demonstrates that it is possible to perform linear and logistic regression model evaluation while protecting patient confidentiality with theoretical security guarantees. Our implementation and test data are available at https://github.com/G2Lab/privateML/.
Affiliation(s)
- Seungwan Hong
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA
- Yoolim A Choi
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA
- Daniel S Joo
  - New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10032, USA
- Gamze Gürsoy
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10032, USA
23
Sandercock AM, Westbrook JW, Zhang Q, Holliday JA. A genome-guided strategy for climate resilience in American chestnut restoration populations. Proc Natl Acad Sci U S A 2024; 121:e2403505121. [PMID: 39012830 PMCID: PMC11287244 DOI: 10.1073/pnas.2403505121] [Received: 02/19/2024] [Accepted: 06/11/2024] [Indexed: 07/18/2024] Open
Abstract
American chestnut (Castanea dentata) is a deciduous tree species of eastern North America that was decimated by the introduction of the chestnut blight fungus (Cryphonectria parasitica) in the early 20th century. Although millions of American chestnuts survive as root collar sprouts, these trees rarely reproduce. Thus, the species is considered functionally extinct. American chestnuts with improved blight resistance have been developed through interspecific hybridization followed by conspecific backcrossing, and by genetic engineering. Incorporating adaptive genomic diversity into these backcross families and transgenic lines is important for restoring the species across broad climatic gradients. To develop sampling recommendations for ex situ conservation of wild adaptive genetic variation, we coupled whole-genome resequencing of 384 stump sprouts with genotype-environment association analyses and found that the species range can be subdivided into three seed zones characterized by relatively homogeneous adaptive allele frequencies. We estimated that 21 to 29 trees per seed zone will need to be conserved to capture most extant adaptive diversity. We also resequenced the genomes of 269 backcross trees to understand the extent to which the breeding program has already captured wild adaptive diversity, and to estimate optimal reintroduction sites for specific families on the basis of their adaptive portfolio and future climate projections. Taken together, these results inform the development of an ex situ germplasm conservation and breeding plan to target blight-resistant breeding populations to specific environments and provide a blueprint for developing restoration plans for other imperiled tree species.
Collapse
Affiliation(s)
| | | | - Qian Zhang
- Department of Forest Resources and Environmental Conservation, Virginia Tech, Blacksburg, VA 24060
| | - Jason A. Holliday
- Department of Forest Resources and Environmental Conservation, Virginia Tech, Blacksburg, VA 24060
| |
|
24
|
Smith CCR, Patterson G, Ralph PL, Kern AD. Estimation of spatial demographic maps from polymorphism data using a neural network. bioRxiv 2024:2024.03.15.585300. [PMID: 38559192 PMCID: PMC10980082 DOI: 10.1101/2024.03.15.585300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
A fundamental goal in population genetics is to understand how variation is arrayed over natural landscapes. From first principles we know that common features such as heterogeneous population densities and barriers to dispersal should shape genetic variation over space; however, there are few tools currently available that can deal with these ubiquitous complexities. Geographically referenced single nucleotide polymorphism (SNP) data are increasingly accessible, presenting an opportunity to study genetic variation across geographic space in myriad species. We present a new inference method that uses geo-referenced SNPs and a deep neural network to estimate spatially heterogeneous maps of population density and dispersal rate. Our neural network trains on simulated input and output pairings, where the input consists of genotypes and sampling locations generated from a continuous space population genetic simulator, and the output is a map of the true demographic parameters. We benchmark our tool against existing methods and discuss qualitative differences between the different approaches; in particular, our program is unique because it infers the magnitude of both dispersal and density as well as their variation over the landscape, and it does so using SNP data. Similar methods are constrained to estimating relative migration rates, or require identity by descent blocks as input. We applied our tool to empirical data from North American grey wolves, for which it estimated mostly reasonable demographic parameters, but was affected by incomplete spatial sampling. Genetic-based methods like ours complement other, direct methods for estimating past and present demography, and we believe they will serve as valuable tools for applications in conservation, ecology, and evolutionary biology. An open source software package implementing our method is available from https://github.com/kr-colab/mapNN .
|
25
|
Lasky JR, Takou M, Gamba D, Keitt TH. Estimating scale-specific and localized spatial patterns in allele frequency. Genetics 2024; 227:iyae082. [PMID: 38758968 PMCID: PMC11339607 DOI: 10.1093/genetics/iyae082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 09/07/2023] [Accepted: 04/28/2024] [Indexed: 05/19/2024]
Abstract
Characterizing spatial patterns in allele frequencies is fundamental to evolutionary biology because these patterns contain evidence of underlying processes. However, the spatial scales at which gene flow, changing selection, and drift act are often unknown. Many of these processes can operate inconsistently across space, causing nonstationary patterns. We present a wavelet approach to characterize spatial pattern in allele frequency that helps solve these problems. We show how our approach can characterize spatial patterns in relatedness at multiple spatial scales, i.e. a multilocus wavelet genetic dissimilarity. We also develop wavelet tests of spatial differentiation in allele frequency and quantitative trait loci (QTL). With simulation, we illustrate these methods under different scenarios. We also apply our approach to natural populations of Arabidopsis thaliana to characterize population structure and identify locally adapted loci across scales. We find, for example, that Arabidopsis flowering time QTL show significantly elevated genetic differentiation at 300-1,300 km scales. Wavelet transforms of allele frequencies offer a flexible way to reveal geographic patterns and underlying evolutionary processes.
Affiliation(s)
- Jesse R Lasky
- Department of Biology, Pennsylvania State University, University Park, PA 16802, USA
| | - Margarita Takou
- Department of Biology, Pennsylvania State University, University Park, PA 16802, USA
| | - Diana Gamba
- Department of Biology, Pennsylvania State University, University Park, PA 16802, USA
| | - Timothy H Keitt
- Department of Integrative Biology, University of Texas at Austin, Austin, TX 78712, USA
| |
|
26
|
Librado P, Tressières G, Chauvey L, Fages A, Khan N, Schiavinato S, Calvière-Tonasso L, Kusliy MA, Gaunitz C, Liu X, Wagner S, Der Sarkissian C, Seguin-Orlando A, Perdereau A, Aury JM, Southon J, Shapiro B, Bouchez O, Donnadieu C, Collin YRH, Gregersen KM, Jessen MD, Christensen K, Claudi-Hansen L, Pruvost M, Pucher E, Vulic H, Novak M, Rimpf A, Turk P, Reiter S, Brem G, Schwall C, Barrey É, Robert C, Degueurce C, Horwitz LK, Klassen L, Rasmussen U, Kveiborg J, Johannsen NN, Makowiecki D, Makarowicz P, Szeliga M, Ilchyshyn V, Rud V, Romaniszyn J, Mullin VE, Verdugo M, Bradley DG, Cardoso JL, Valente MJ, Telles Antunes M, Ameen C, Thomas R, Ludwig A, Marzullo M, Prato O, Bagnasco Gianni G, Tecchiati U, Granado J, Schlumbaum A, Deschler-Erb S, Mráz MS, Boulbes N, Gardeisen A, Mayer C, Döhle HJ, Vicze M, Kosintsev PA, Kyselý R, Peške L, O'Connor T, Ananyevskaya E, Shevnina I, Logvin A, Kovalev AA, Iderkhangai TO, Sablin MV, Dashkovskiy PK, Graphodatsky AS, Merts I, Merts V, Kasparov AK, Pitulko VV, Onar V, Öztan A, Arbuckle BS, McColl H, Renaud G, Khaskhanov R, Demidenko S, Kadieva A, Atabiev B, Sundqvist M, Lindgren G, López-Cachero FJ, Albizuri S, Trbojević Vukičević T, Rapan Papeša A, Burić M, Rajić Šikanjić P, Weinstock J, Asensio Vilaró D, Codina F, García Dalmau C, Morer de Llorens J, Pou J, de Prado G, Sanmartí J, Kallala N, Torres JR, Maraoui-Telmini B, Belarte Franco MC, Valenzuela-Lamas S, Zazzo A, Lepetz S, Duchesne S, Alexeev A, Bayarsaikhan J, Houle JL, Bayarkhuu N, Turbat T, Crubézy É, Shingiray I, Mashkour M, Berezina NY, Korobov DS, Belinskiy A, Kalmykov A, Demoule JP, Reinhold S, Hansen S, Wallner B, Roslyakova N, Kuznetsov PF, Tishkin AA, Wincker P, Kanne K, Outram A, Orlando L. Widespread horse-based mobility arose around 2200 BCE in Eurasia. Nature 2024; 631:819-825. [PMID: 38843826 PMCID: PMC11269178 DOI: 10.1038/s41586-024-07597-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Accepted: 05/23/2024] [Indexed: 07/19/2024]
Abstract
Horses revolutionized human history with fast mobility [1]. However, the timeline between their domestication and their widespread integration as a means of transport remains contentious [2-4]. Here we assemble a collection of 475 ancient horse genomes to assess the period when these animals were first reshaped by human agency in Eurasia. We find that reproductive control of the modern domestic lineage emerged around 2200 BCE, through close-kin mating and shortened generation times. Reproductive control emerged following a severe domestication bottleneck starting no earlier than approximately 2700 BCE, and coincided with a sudden expansion across Eurasia that ultimately resulted in the replacement of nearly every local horse lineage. This expansion marked the rise of widespread horse-based mobility in human history, which refutes the commonly held narrative of large horse herds accompanying the massive migration of steppe peoples across Europe around 3000 BCE and earlier [3,5]. Finally, we detect significantly shortened generation times at Botai around 3500 BCE, a settlement from central Asia associated with corrals and a subsistence economy centred on horses [6,7]. This supports local horse husbandry before the rise of modern domestic bloodlines.
Affiliation(s)
- Pablo Librado
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France.
- Institut de Biologia Evolutiva (CSIC - Universitat Pompeu Fabra), Barcelona, Spain.
| | - Gaetan Tressières
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
| | - Lorelei Chauvey
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
| | - Antoine Fages
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
- Zoological Institute, Department of Environmental Sciences, University of Basel, Basel, Switzerland
| | - Naveed Khan
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
- Department of Biotechnology, Abdul Wali Khan University, Mardan, Pakistan
| | - Stéphanie Schiavinato
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
| | - Laure Calvière-Tonasso
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
| | - Mariya A Kusliy
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
- Department of the Diversity and Evolution of Genomes, Institute of Molecular and Cellular Biology, Novosibirsk, Russia
| | - Charleen Gaunitz
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
- Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Xuexue Liu
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
| | - Stefanie Wagner
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
- INRAE Division Ecology and Biodiversity (ECODIV), Plant Genomic Resources Center (CNRGV), Castanet Tolosan Cedex, France
| | - Clio Der Sarkissian
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
| | - Andaine Seguin-Orlando
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
| | - Aude Perdereau
- Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Université d'Évry, Université Paris-Saclay, Évry, France
| | - Jean-Marc Aury
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Université d'Évry, Université Paris-Saclay, Évry, France
| | - John Southon
- Department of Earth System Science, University of California, Irvine, CA, USA
| | - Beth Shapiro
- Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA
| | | | | | - Yvette Running Horse Collin
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
- Taku Skan Skan Wasakliyapi: Global Institute for Traditional Sciences, Rapid City, SD, USA
| | | | - Mads Dengsø Jessen
- Department for Prehistory Middle Ages and Renaissance, National Museum of Denmark, Copenhagen K, Denmark
| | | | | | - Mélanie Pruvost
- UMR 5199 De la Préhistoire à l'Actuel: Culture, Environnement et Anthropologie (PACEA), CNRS, Université de Bordeaux, Pessac Cédex, France
| | | | | | - Mario Novak
- Centre for Applied Bioanthropology, Institute for Anthropological Research, Zagreb, Croatia
| | | | - Peter Turk
- Narodni muzej Slovenije, Ljubljana, Slovenia
| | - Simone Reiter
- Institute of Animal Breeding and Genetics, Department of Biomedical Sciences, University of Veterinary Medicine Vienna, Vienna, Austria
| | - Gottfried Brem
- Institute of Animal Breeding and Genetics, Department of Biomedical Sciences, University of Veterinary Medicine Vienna, Vienna, Austria
| | - Christoph Schwall
- Leibniz-Zentrum für Archäologie (LEIZA), Mainz, Germany
- Department of Prehistory & Western Asian/Northeast African Archaeology, Austrian Archaeological Institute (OeAI), Austrian Academy of Sciences (OeAW), Vienna, Austria
| | - Éric Barrey
- Université Paris-Saclay, AgroParisTech, INRAE GABI UMR1313, Jouy-en-Josas, France
| | - Céline Robert
- Université Paris-Saclay, AgroParisTech, INRAE GABI UMR1313, Jouy-en-Josas, France
- Ecole Nationale Vétérinaire d'Alfort, Maisons-Alfort, France
| | | | - Liora Kolska Horwitz
- National Natural History Collections, Edmond J. Safra Campus, Givat Ram, The Hebrew University, Jerusalem, Israel
| | | | - Uffe Rasmussen
- Department of Archaeology, Moesgaard Museum, Højbjerg, Denmark
| | - Jacob Kveiborg
- Department of Archaeological Science and Conservation, Moesgaard Museum, Højbjerg, Denmark
| | | | - Daniel Makowiecki
- Institute of Archaeology, Faculty of History, Nicolaus Copernicus University, Toruń, Poland
| | | | - Marcin Szeliga
- Institute of Archaeology, Maria Curie-Skłodowska University, Lublin, Poland
| | - Vasyl Ilchyshyn
- Kremenetsko-Pochaivskii Derzhavnyi Istoriko-arkhitekturnyi Zapovidnik, Kremenets, Ukraine
| | - Vitalii Rud
- Institute of Archaeology, National Academy of Sciences of Ukraine, Kyiv, Ukraine
| | - Jan Romaniszyn
- Faculty of Archaeology, Adam Mickiewicz University, Poznań, Poland
| | - Victoria E Mullin
- Smurfit Institute of Genetics, Trinity College Dublin, Dublin, Ireland
| | - Marta Verdugo
- Smurfit Institute of Genetics, Trinity College Dublin, Dublin, Ireland
| | - Daniel G Bradley
- Smurfit Institute of Genetics, Trinity College Dublin, Dublin, Ireland
| | - João L Cardoso
- ICArEHB, Campus de Gambelas, University of Algarve, Faro, Portugal
- Universidade Aberta, Lisbon, Portugal
| | - Maria J Valente
- Faculdade de Ciências Humanas e Sociais, Centro de Estudos de Arqueologia, Artes e Ciências do Património, Universidade do Algarve, Faro, Portugal
| | - Miguel Telles Antunes
- Centre for Research on Science and Geological Engineering, Universidade Nova de Lisboa, Lisbon, Portugal
| | - Carly Ameen
- Department of Archaeology and History, University of Exeter, Exeter, UK
| | - Richard Thomas
- School of Archaeology and Ancient History, University of Leicester, Leicester, UK
| | - Arne Ludwig
- Department of Evolutionary Genetics, Leibniz-Institute for Zoo and Wildlife Research, Berlin, Germany
- Albrecht Daniel Thaer-Institute, Faculty of Life Sciences, Humboldt University Berlin, Berlin, Germany
| | - Matilde Marzullo
- Dipartimento di Beni Culturali e Ambientali, Università degli Studi di Milano, Milan, Italy
| | - Ornella Prato
- Dipartimento di Beni Culturali e Ambientali, Università degli Studi di Milano, Milan, Italy
| | | | - Umberto Tecchiati
- Dipartimento di Beni Culturali e Ambientali, Università degli Studi di Milano, Milan, Italy
| | - José Granado
- Department of Environmental Sciences, Integrative Prehistory and Archaeological Science, Basel University, Basel, Switzerland
| | - Angela Schlumbaum
- Department of Environmental Sciences, Integrative Prehistory and Archaeological Science, Basel University, Basel, Switzerland
| | - Sabine Deschler-Erb
- Department of Environmental Sciences, Integrative Prehistory and Archaeological Science, Basel University, Basel, Switzerland
| | - Monika Schernig Mráz
- Department of Environmental Sciences, Integrative Prehistory and Archaeological Science, Basel University, Basel, Switzerland
| | - Nicolas Boulbes
- Institut de Paléontologie Humaine, Fondation Albert Ier, Paris/UMR 7194 HNHP, MNHN-CNRS-UPVD/EPCC Centre Européen de Recherche Préhistorique, Tautavel, France
| | - Armelle Gardeisen
- Archéologie des Sociétés Méditerranéennes, Archimède IA-ANR-11-LABX-0032-01, CNRS UMR 5140, Université Paul Valéry, Montpellier, France
| | - Christian Mayer
- Department for Digitalization and Knowledge Transfer, Federal Monuments Authority Austria, Vienna, Austria
| | - Hans-Jürgen Döhle
- Landesamt für Denkmalpflege und Archäologie Sachsen-Anhalt - Landesmuseum für Vorgeschichte, Halle (Saale), Germany
| | - Magdolna Vicze
- National Institute of Archaeology, Hungarian National Museum, Budapest, Hungary
| | - Pavel A Kosintsev
- Paleoecology Laboratory, Institute of Plant and Animal Ecology, Ural Branch of the Russian Academy of Sciences, Ekaterinburg, Russia
- Department of History of the Institute of Humanities, Ural Federal University, Ekaterinburg, Russia
| | - René Kyselý
- Department of Natural Sciences and Archaeometry, Institute of Archaeology of the Czech Academy of Sciences, Prague, Czechia
| | | | | | - Elina Ananyevskaya
- Department of Archaeology, History Faculty, Vilnius University, Vilnius, Lithuania
| | - Irina Shevnina
- Laboratory for Archaeological Research, Akhmet Baitursynuly Kostanay Regional University, Kostanay, Kazakhstan
| | - Andrey Logvin
- Laboratory for Archaeological Research, Akhmet Baitursynuly Kostanay Regional University, Kostanay, Kazakhstan
| | - Alexey A Kovalev
- Department of Archaeological Heritage Preservation, Institute of Archaeology of the Russian Academy of Sciences, Moscow, Russia
| | - Tumur-Ochir Iderkhangai
- Department of Innovation and Technology, Ulaanbaatar Science and Technology Park, National University of Mongolia, Ulaanbaatar, Mongolia
| | - Mikhail V Sablin
- Zoological Institute, Russian Academy of Sciences, St Petersburg, Russia
| | - Petr K Dashkovskiy
- Department of Russian Regional Studies, National and State-confessional Relations, Altai State University, Barnaul, Russia
| | - Alexander S Graphodatsky
- Department of the Diversity and Evolution of Genomes, Institute of Molecular and Cellular Biology, Novosibirsk, Russia
| | - Ilia Merts
- Toraighyrov University, Joint Research Center for Archeological Studies, Pavlodar, Kazakhstan
- Department of Archaeology, Ethnography and Museology, Altai State University, Barnaul, Russia
| | - Viktor Merts
- Toraighyrov University, Joint Research Center for Archeological Studies, Pavlodar, Kazakhstan
| | - Aleksei K Kasparov
- Institute of the History of Material Culture, Russian Academy of Sciences, St. Petersburg, Russia
| | - Vladimir V Pitulko
- Institute of the History of Material Culture, Russian Academy of Sciences, St. Petersburg, Russia
- Peter the Great Museum of Anthropology and Ethnography (Kunstkamera), Russian Academy of Sciences, St Petersburg, Russia
| | - Vedat Onar
- Osteoarchaeology Practice and Research Center and Department of Anatomy, Faculty of Veterinary Medicine, Istanbul University-Cerrahpaşa, Istanbul, Türkiye
| | - Aliye Öztan
- Archaeology Department, Ankara University, Ankara, Türkiye
| | - Benjamin S Arbuckle
- Department of Anthropology, Alumni Building, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Hugh McColl
- Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Gabriel Renaud
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
- Department of Health Technology, Section for Bioinformatics, Technical University of Denmark (DTU), Copenhagen, Denmark
| | - Ruslan Khaskhanov
- Kh. Ibragimov Complex Institute of the Russian Academy of Sciences (CI RAS), Grozny, Russia
| | - Sergey Demidenko
- Institute of Archaeology, Russian Academy of Sciences, Moscow, Russia
| | - Anna Kadieva
- Department of Archaeological Monuments, State Historical Museum, Moscow, Russian Federation
| | | | | | - Gabriella Lindgren
- Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden
- Center for Animal Breeding and Genetics, Department of Biosystems, KU Leuven, Leuven, Belgium
| | - F Javier López-Cachero
- Institut d'Arqueologia de la Universitat de Barcelona (IAUB), Seminari d'Estudis i Recerques Prehistoriques (SERP-UB), Universitat de Barcelona (UB), Barcelona, Spain
| | - Silvia Albizuri
- Institut d'Arqueologia de la Universitat de Barcelona (IAUB), Seminari d'Estudis i Recerques Prehistoriques (SERP-UB), Universitat de Barcelona (UB), Barcelona, Spain
| | - Tajana Trbojević Vukičević
- Department of Anatomy, Histology and Embryology, Faculty of Veterinary Medicine, University of Zagreb, Zagreb, Croatia
| | | | - Marcel Burić
- Department of Archaeology, Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia
| | | | - Jaco Weinstock
- Faculty of Arts and Humanities (Archaeology), University of Southampton, Southampton, UK
| | - David Asensio Vilaró
- Secció de Prehistòria i Arqueologia, IAUB Institut d'Arqueologia de la Universitat de Barcelona, Barcelona, Spain
| | - Ferran Codina
- C/Major, 20, Norfeu, Arqueologia Art i Patrimoni S.C., La Tallada d'Empordà, Spain
| | | | | | - Josep Pou
- Ajuntament de Calafell, Calafell (Tarragona), Spain
| | - Gabriel de Prado
- Museu d'Arqueologia de Catalunya (MAC-Ullastret), Ullastret, Spain
| | - Joan Sanmartí
- IEC-Institut d'Estudis Catalans (Union Académique Internationale), Barcelona, Spain
- Departament d'Història i Arqueologia, Facultat de Geografia i Història, Universitat de Barcelona, Barcelona, Spain
| | - Nabil Kallala
- Ecole Tunisienne d'Histoire et d'Anthropologie, Tunis, Tunisia
- University of Tunis, Institut National du Patrimoine, Tunis, Tunisia
| | | | | | - Maria-Carme Belarte Franco
- IEC-Institut d'Estudis Catalans (Union Académique Internationale), Barcelona, Spain
- ICREA, Catalan Institution for Research and Advanced Studies, Barcelona, Spain
- ICAC (Catalan Institute of Classical Archaeology), Tarragona, Spain
| | - Silvia Valenzuela-Lamas
- Archaeology of Social Dynamics (ASD), Institució Milà i Fontanals, Consejo Superior de Investigaciones Científicas (IMF-CSIC), Barcelona, Spain
- UNIARQ - Unidade de Arqueologia, Universidade de Lisboa, Alameda da Universidade, Lisboa, Portugal
| | - Antoine Zazzo
- Centre National de Recherche Scientifique, Muséum national d'Histoire naturelle, Archéozoologie, Archéobotanique (AASPE), CP 56, Paris, France
| | - Sébastien Lepetz
- Centre National de Recherche Scientifique, Muséum national d'Histoire naturelle, Archéozoologie, Archéobotanique (AASPE), CP 56, Paris, France
| | - Sylvie Duchesne
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
| | - Anatoly Alexeev
- Institute for Humanities Research and Indigenous Studies of the North (IHRISN), Yakutsk, Russia
| | - Jamsranjav Bayarsaikhan
- Max Planck Institute of Geoanthropology, Jena, Germany
- Institute of Archaeology, Mongolian Academy of Science, Ulaanbaatar, Mongolia
| | - Jean-Luc Houle
- Department of Folk Studies and Anthropology, Western Kentucky University, Bowling Green, KY, USA
| | - Noost Bayarkhuu
- Archaeological Research Center and Department of Anthropology and Archaeology, National University of Mongolia, Ulaanbaatar, Mongolia
| | - Tsagaan Turbat
- Archaeological Research Center and Department of Anthropology and Archaeology, National University of Mongolia, Ulaanbaatar, Mongolia
| | - Éric Crubézy
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France
| | | | - Marjan Mashkour
- Centre National de Recherche Scientifique, Muséum national d'Histoire naturelle, Archéozoologie, Archéobotanique (AASPE), CP 56, Paris, France
- Central Laboratory, Bioarchaeology Laboratory, Archaeozoology section, University of Tehran, Tehran, Iran
| | - Natalia Ya Berezina
- Research Institute and Museum of Anthropology, Lomonosov Moscow State University, Moscow, Russia
| | - Dmitriy S Korobov
- Institute of Archaeology, Russian Academy of Sciences, Moscow, Russia
| | | | | | - Jean-Paul Demoule
- UMR du CNRS 8215 Trajectoires, Institut d'Art et Archéologie, Paris, France
| | - Sabine Reinhold
- Eurasia Department of the German Archaeological Institute, Berlin, Germany
| | - Svend Hansen
- Eurasia Department of the German Archaeological Institute, Berlin, Germany
| | - Barbara Wallner
- Institute of Animal Breeding and Genetics, Department of Biomedical Sciences, University of Veterinary Medicine Vienna, Vienna, Austria
| | - Natalia Roslyakova
- Department of Russian History and Archaeology, Samara State University of Social Sciences and Education, Samara, Russia
| | - Pavel F Kuznetsov
- Department of Russian History and Archaeology, Samara State University of Social Sciences and Education, Samara, Russia
| | - Alexey A Tishkin
- Department of Archaeology, Ethnography and Museology, Altai State University, Barnaul, Russia
| | - Patrick Wincker
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Université d'Évry, Université Paris-Saclay, Évry, France
| | - Katherine Kanne
- Department of Archaeology and History, University of Exeter, Exeter, UK
- School of Archaeology, University College Dublin, Dublin, Ireland
| | - Alan Outram
- Department of Archaeology and History, University of Exeter, Exeter, UK
| | - Ludovic Orlando
- Centre d'Anthropobiologie et de Génomique de Toulouse, CNRS UMR 5288, Université Paul Sabatier, Faculté de Médecine Purpan, Toulouse, France.
| |
|
27
|
Thompson A, Liebeskind BJ, Scully EJ, Landis MJ. Deep Learning and Likelihood Approaches for Viral Phylogeography Converge on the Same Answers Whether the Inference Model Is Right or Wrong. Syst Biol 2024; 73:183-206. [PMID: 38189575 PMCID: PMC11249978 DOI: 10.1093/sysbio/syad074] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 02/15/2023] [Revised: 11/22/2023] [Accepted: 01/05/2024] [Indexed: 01/09/2024]
Abstract
Analysis of phylogenetic trees has become an essential tool in epidemiology. Likelihood-based methods fit models to phylogenies to draw inferences about the phylodynamics and history of viral transmission. However, these methods are often computationally expensive, which limits the complexity and realism of phylodynamic models and makes them ill-suited for informing policy decisions in real time during rapidly developing outbreaks. Likelihood-free methods using deep learning are pushing the boundaries of inference beyond these constraints. In this paper, we extend, compare, and contrast a recently developed deep learning method for likelihood-free inference from trees. We trained multiple deep neural networks using phylogenies from simulated outbreaks that spread among 5 locations and found they achieve close to the same levels of accuracy as Bayesian inference under the true simulation model. We compared the robustness to model misspecification of a trained neural network to that of a Bayesian method. We found that both models had comparable performance, converging on similar biases. We also implemented a method of uncertainty quantification called conformalized quantile regression, which we demonstrate has patterns of sensitivity to model misspecification similar to those of Bayesian highest posterior density (HPD) intervals; its intervals greatly overlap with HPDs but have lower precision (they are more conservative). Finally, we trained and tested a neural network against phylogeographic data from a recent study of the SARS-CoV-2 pandemic in Europe and obtained similar estimates of region-specific epidemiological parameters and the location of the common ancestor in Europe. Along with being as accurate and robust as likelihood-based methods, our trained neural networks are on average over 3 orders of magnitude faster after training. Our results support the notion that neural networks can be trained with simulated data to accurately mimic the good and bad statistical properties of the likelihood functions of generative phylogenetic models.
Affiliation(s)
- Ammon Thompson
- Participant in an Education Program Sponsored by U.S. Department of Defense (DOD) at the National Geospatial-Intelligence Agency, Springfield, VA 22150, USA
| | | | - Erik J Scully
- National Geospatial-Intelligence Agency, Springfield, VA 22150, USA
| | - Michael J Landis
- Department of Biology, Washington University in St. Louis, Rebstock Hall, St. Louis, MO 63130, USA
| |
|
28
|
Rehmann CT, Ralph PL, Kern AD. Evaluating evidence for co-geography in the Anopheles-Plasmodium host-parasite system. G3 (Bethesda) 2024; 14:jkae008. [PMID: 38230808 PMCID: PMC10917517 DOI: 10.1093/g3journal/jkae008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 11/08/2023] [Accepted: 12/22/2023] [Indexed: 01/18/2024]
Abstract
The often tight association between parasites and their hosts means that under certain scenarios, the evolutionary histories of the two species can become closely coupled both through time and across space. Using spatial genetic inference, we identify a potential signal of common dispersal patterns in the Anopheles gambiae and Plasmodium falciparum host-parasite system as seen through a between-species correlation of the differences between geographic sampling location and geographic location predicted from the genome. This correlation may be due to coupled dispersal dynamics between host and parasite but may also reflect statistical artifacts due to uneven spatial distribution of sampling locations. Using continuous-space population genetics simulations, we investigate the degree to which uneven distribution of sampling locations leads to bias in prediction of spatial location from genetic data and implement methods to counter this effect. We demonstrate that while algorithmic bias presents a problem in inference from spatio-genetic data, the correlation structure between A. gambiae and P. falciparum predictions cannot be attributed to spatial bias alone and is thus likely a genetic signal of co-dispersal in a host-parasite system.
Affiliation(s)
- Clara T Rehmann
- Institute of Ecology and Evolution and Department of Biology, University of Oregon, Eugene 97403, USA
| | - Peter L Ralph
- Institute of Ecology and Evolution and Department of Biology, University of Oregon, Eugene 97403, USA
- Department of Mathematics, University of Oregon, Eugene 97403, USA
| | - Andrew D Kern
- Institute of Ecology and Evolution and Department of Biology, University of Oregon, Eugene 97403, USA
| |
Collapse
|
29
|
Ray DD, Flagel L, Schrider DR. IntroUNET: Identifying introgressed alleles via semantic segmentation. PLoS Genet 2024; 20:e1010657. [PMID: 38377104 PMCID: PMC10906877 DOI: 10.1371/journal.pgen.1010657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2023] [Revised: 03/01/2024] [Accepted: 01/29/2024] [Indexed: 02/22/2024] Open
Abstract
A growing body of evidence suggests that gene flow between closely related species is a widespread phenomenon. Alleles that introgress from one species into a close relative are typically neutral or deleterious, but sometimes confer a significant fitness advantage. Given the potential relevance to speciation and adaptation, numerous methods have therefore been devised to identify regions of the genome that have experienced introgression. Recently, supervised machine learning approaches have been shown to be highly effective for detecting introgression. One especially promising approach is to treat population genetic inference as an image classification problem, and feed an image representation of a population genetic alignment as input to a deep neural network that distinguishes among evolutionary models (i.e. introgression or no introgression). However, if we wish to investigate the full extent and fitness effects of introgression, merely identifying genomic regions in a population genetic alignment that harbor introgressed loci is insufficient; ideally we would be able to infer precisely which individuals have introgressed material and at which positions in the genome. Here we adapt a deep learning algorithm for semantic segmentation, the task of correctly identifying the type of object to which each individual pixel in an image belongs, to the task of identifying introgressed alleles. Our trained neural network is thus able to infer, for each individual in a two-population alignment, which of that individual's alleles were introgressed from the other population. We use simulated data to show that this approach is highly accurate, and that it can be readily extended to identify alleles that are introgressed from an unsampled "ghost" population, performing comparably to a supervised learning method tailored specifically to that task.
Finally, we apply this method to data from Drosophila, showing that it is able to accurately recover introgressed haplotypes from real data. This analysis reveals that introgressed alleles are typically confined to lower frequencies within genic regions, suggestive of purifying selection, but are found at much higher frequencies in a region previously shown to be affected by adaptive introgression. Our method's success in recovering introgressed haplotypes in challenging real-world scenarios underscores the utility of deep learning approaches for making richer evolutionary inferences from genomic data.
Affiliation(s)
- Dylan D. Ray
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Lex Flagel
  - Division of Data Science, Gencove Inc., New York, New York, United States of America
  - Department of Plant and Microbial Biology, University of Minnesota, Saint Paul, Minnesota, United States of America
- Daniel R. Schrider
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America

30
Ray DD, Flagel L, Schrider DR. IntroUNET: identifying introgressed alleles via semantic segmentation. bioRxiv 2024:2023.02.07.527435. [PMID: 36865105 PMCID: PMC9979274 DOI: 10.1101/2023.02.07.527435] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 06/18/2023]
Abstract
A growing body of evidence suggests that gene flow between closely related species is a widespread phenomenon. Alleles that introgress from one species into a close relative are typically neutral or deleterious, but sometimes confer a significant fitness advantage. Given the potential relevance to speciation and adaptation, numerous methods have therefore been devised to identify regions of the genome that have experienced introgression. Recently, supervised machine learning approaches have been shown to be highly effective for detecting introgression. One especially promising approach is to treat population genetic inference as an image classification problem, and feed an image representation of a population genetic alignment as input to a deep neural network that distinguishes among evolutionary models (i.e. introgression or no introgression). However, if we wish to investigate the full extent and fitness effects of introgression, merely identifying genomic regions in a population genetic alignment that harbor introgressed loci is insufficient; ideally we would be able to infer precisely which individuals have introgressed material and at which positions in the genome. Here we adapt a deep learning algorithm for semantic segmentation, the task of correctly identifying the type of object to which each individual pixel in an image belongs, to the task of identifying introgressed alleles. Our trained neural network is thus able to infer, for each individual in a two-population alignment, which of that individual's alleles were introgressed from the other population. We use simulated data to show that this approach is highly accurate, and that it can be readily extended to identify alleles that are introgressed from an unsampled "ghost" population, performing comparably to a supervised learning method tailored specifically to that task.
Finally, we apply this method to data from Drosophila, showing that it is able to accurately recover introgressed haplotypes from real data. This analysis reveals that introgressed alleles are typically confined to lower frequencies within genic regions, suggestive of purifying selection, but are found at much higher frequencies in a region previously shown to be affected by adaptive introgression. Our method's success in recovering introgressed haplotypes in challenging real-world scenarios underscores the utility of deep learning approaches for making richer evolutionary inferences from genomic data.
Affiliation(s)
- Dylan D. Ray
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Lex Flagel
  - Division of Data Science, Gencove Inc., New York, NY 11101, USA
  - Department of Plant and Microbial Biology, University of Minnesota, St Paul, MN 55108, USA
- Daniel R. Schrider
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

31
McGaughran A, Dhami MK, Parvizi E, Vaughan AL, Gleeson DM, Hodgins KA, Rollins LA, Tepolt CK, Turner KG, Atsawawaranunt K, Battlay P, Congrains C, Crottini A, Dennis TPW, Lange C, Liu XP, Matheson P, North HL, Popovic I, Rius M, Santure AW, Stuart KC, Tan HZ, Wang C, Wilson J. Genomic Tools in Biological Invasions: Current State and Future Frontiers. Genome Biol Evol 2024; 16:evad230. [PMID: 38109935 PMCID: PMC10776249 DOI: 10.1093/gbe/evad230] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 09/15/2023] [Revised: 11/16/2023] [Accepted: 12/12/2023] [Indexed: 12/20/2023] Open
Abstract
Human activities are accelerating rates of biological invasions and climate-driven range expansions globally, yet we understand little of how genomic processes facilitate the invasion process. Although most of the literature has focused on underlying phenotypic correlates of invasiveness, advances in genomic technologies are showing a strong link between genomic variation and invasion success. Here, we consider the ability of genomic tools and technologies to (i) inform mechanistic understanding of biological invasions and (ii) solve real-world issues in predicting and managing biological invasions. For both, we examine the current state of the field and discuss how genomics can be leveraged in the future. In addition, we make recommendations pertinent to broader research issues, such as data sovereignty, metadata standards, collaboration, and science communication best practices that will require concerted efforts from the global invasion genomics community.
Affiliation(s)
- Angela McGaughran
  - Te Aka Mātuatua/School of Science, University of Waikato, Hamilton, New Zealand
- Manpreet K Dhami
  - Biocontrol and Molecular Ecology, Manaaki Whenua Landcare Research, Lincoln, New Zealand
  - School of Biological Sciences, Waipapa Taumata Rau/University of Auckland, Auckland, New Zealand
- Elahe Parvizi
  - Te Aka Mātuatua/School of Science, University of Waikato, Hamilton, New Zealand
- Amy L Vaughan
  - Biocontrol and Molecular Ecology, Manaaki Whenua Landcare Research, Lincoln, New Zealand
- Dianne M Gleeson
  - Centre for Conservation Ecology and Genomics, Faculty of Science and Technology, University of Canberra, Canberra, ACT, Australia
- Kathryn A Hodgins
  - School of Biological Sciences, Monash University, Melbourne, VIC, Australia
- Lee A Rollins
  - Evolution and Ecology Research Centre, University of New South Wales, Sydney, NSW, Australia
- Carolyn K Tepolt
  - Department of Biology, Woods Hole Oceanographic Institution, Woods Hole, MA, USA
- Kathryn G Turner
  - Department of Biological Sciences, Idaho State University, Pocatello, ID, USA
- Kamolphat Atsawawaranunt
  - School of Biological Sciences, Waipapa Taumata Rau/University of Auckland, Auckland, New Zealand
- Paul Battlay
  - School of Biological Sciences, Monash University, Melbourne, VIC, Australia
- Carlos Congrains
  - Entomology Section, Department of Plant and Environmental Protection Sciences, University of Hawaiʻi at Mānoa, Honolulu, HI 96822, USA
  - US Department of Agriculture-Agricultural Research Service, Daniel K. Inouye US Pacific Basin Agricultural Research Center, Hilo, HI 96720, USA
- Angelica Crottini
  - CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO Laboratório Associado, Campus de Vairão, Universidade do Porto, Vairão 4485-661, Portugal
  - Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto 4169-007, Portugal
  - BIOPOLIS Program in Genomics, Biodiversity and Land Planning, CIBIO, Vairão 4485-661, Portugal
- Tristan P W Dennis
  - Department of Vector Biology, Liverpool School of Tropical Medicine, Liverpool, UK
- Claudia Lange
  - Biocontrol and Molecular Ecology, Manaaki Whenua Landcare Research, Lincoln, New Zealand
- Xiaoyue P Liu
  - Department of Marine Science, University of Otago, Dunedin, New Zealand
- Paige Matheson
  - Te Aka Mātuatua/School of Science, University of Waikato, Hamilton, New Zealand
- Henry L North
  - Department of Zoology, University of Cambridge, Cambridge, UK
- Iva Popovic
  - School of the Environment, University of Queensland, Brisbane, QLD, Australia
- Marc Rius
  - Centre for Advanced Studies of Blanes (CEAB, CSIC), Accés a la Cala Sant Francesc, Blanes, Spain
  - Department of Zoology, Centre for Ecological Genomics and Wildlife Conservation, University of Johannesburg, Johannesburg 2006, South Africa
- Anna W Santure
  - School of Biological Sciences, Waipapa Taumata Rau/University of Auckland, Auckland, New Zealand
- Katarina C Stuart
  - School of Biological Sciences, Waipapa Taumata Rau/University of Auckland, Auckland, New Zealand
- Hui Zhen Tan
  - School of Biological Sciences, Waipapa Taumata Rau/University of Auckland, Auckland, New Zealand
- Cui Wang
  - The Organismal and Evolutionary Biology Research Programme, University of Helsinki, Helsinki, Finland
- Jonathan Wilson
  - School of Biological Sciences, Monash University, Melbourne, VIC, Australia

32
Huang X, Rymbekova A, Dolgova O, Lao O, Kuhlwilm M. Harnessing deep learning for population genetic inference. Nat Rev Genet 2024; 25:61-78. [PMID: 37666948 DOI: 10.1038/s41576-023-00636-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Accepted: 07/11/2023] [Indexed: 09/06/2023]
Abstract
In population genetics, the emergence of large-scale genomic data for various species and populations has provided new opportunities to understand the evolutionary forces that drive genetic diversity using statistical inference. However, the era of population genomics presents new challenges in analysing the massive amounts of genomes and variants. Deep learning has demonstrated state-of-the-art performance for numerous applications involving large-scale data. Recently, deep learning approaches have gained popularity in population genetics; facilitated by the advent of massive genomic data sets, powerful computational hardware and complex deep learning architectures, they have been used to identify population structure, infer demographic history and investigate natural selection. Here, we introduce common deep learning architectures and provide comprehensive guidelines for implementing deep learning models for population genetic inference. We also discuss current challenges and future directions for applying deep learning in population genetics, focusing on efficiency, robustness and interpretability.
Affiliation(s)
- Xin Huang
  - Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
  - Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
- Aigerim Rymbekova
  - Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
  - Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
- Olga Dolgova
  - Integrative Genomics Laboratory, CIC bioGUNE - Centro de Investigación Cooperativa en Biociencias, Derio, Biscaya, Spain
- Oscar Lao
  - Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, Barcelona, Spain
- Martin Kuhlwilm
  - Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
  - Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria

33
Rehmann CT, Ralph PL, Kern AD. Evaluating evidence for co-geography in the Anopheles-Plasmodium host-parasite system. bioRxiv 2023:2023.07.17.549405. [PMID: 37503196 PMCID: PMC10370088 DOI: 10.1101/2023.07.17.549405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
The often tight association between parasites and their hosts means that under certain scenarios, the evolutionary histories of the two species can become closely coupled both through time and across space. Using spatial genetic inference, we identify a potential signal of common dispersal patterns in the Anopheles gambiae and Plasmodium falciparum host-parasite system as seen through a between-species correlation of the differences between geographic sampling location and geographic location predicted from the genome. This correlation may be due to coupled dispersal dynamics between host and parasite, but may also reflect statistical artifacts due to uneven spatial distribution of sampling locations. Using continuous-space population genetics simulations, we investigate the degree to which uneven distribution of sampling locations leads to bias in prediction of spatial location from genetic data and implement methods to counter this effect. We demonstrate that while algorithmic bias presents a problem in inference from spatio-genetic data, the correlation structure between A. gambiae and P. falciparum predictions cannot be attributed to spatial bias alone, and is thus likely a genetic signal of co-dispersal in a host-parasite system.
Affiliation(s)
- Clara T Rehmann
  - University of Oregon, Institute of Ecology and Evolution and Department of Biology
- Peter L Ralph
  - University of Oregon, Institute of Ecology and Evolution and Department of Biology
  - University of Oregon, Department of Mathematics
- Andrew D Kern
  - University of Oregon, Institute of Ecology and Evolution and Department of Biology

34
Kloska A, Giełczyk A, Grzybowski T, Płoski R, Kloska SM, Marciniak T, Pałczyński K, Rogalla-Ładniak U, Malyarchuk BA, Derenko MV, Kovačević-Grujičić N, Stevanović M, Drakulić D, Davidović S, Spólnicka M, Zubańska M, Woźniak M. A Machine-Learning-Based Approach to Prediction of Biogeographic Ancestry within Europe. Int J Mol Sci 2023; 24:15095. [PMID: 37894775 PMCID: PMC10606184 DOI: 10.3390/ijms242015095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 10/03/2023] [Accepted: 10/07/2023] [Indexed: 10/29/2023] Open
Abstract
Data obtained with the use of massively parallel sequencing (MPS) can be valuable in population genetics studies. In particular, such data harbor the potential for distinguishing samples from different populations, especially those from adjacent populations of common origin. Machine learning (ML) techniques seem to be especially well suited for analyzing large datasets obtained using MPS. The Slavic populations constitute about a third of the population of Europe and inhabit a large area of the continent, while being relatively closely related in population genetics terms. In this proof-of-concept study, various ML techniques were used to classify DNA samples from Slavic and non-Slavic individuals. The primary objective of this study was to empirically evaluate the feasibility of discerning the genetic provenance of individuals of Slavic descent who exhibit genetic similarity, with the overarching goal of categorizing DNA specimens derived from diverse Slavic population representatives. Raw sequencing data were pre-processed to obtain a 1200-character-long binary vector. Three classifiers were used: Random Forest, Support Vector Machine (SVM), and XGBoost. The most promising results were obtained using SVM with a linear kernel, with 99.9% accuracy and F1-scores of 0.9846-1.000 for all classes.
Affiliation(s)
- Anna Kloska
  - Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
  - Faculty of Medical Sciences, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
- Agata Giełczyk
  - Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
- Tomasz Grzybowski
  - Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
- Rafał Płoski
  - Department of Medical Genetics, Warsaw Medical University, 02106 Warsaw, Poland
- Sylwester M. Kloska
  - Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
  - Faculty of Medical Sciences, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
- Tomasz Marciniak
  - Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
- Krzysztof Pałczyński
  - Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland
- Urszula Rogalla-Ładniak
  - Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland
- Boris A. Malyarchuk
  - Institute of Biological Problems of the North, Russian Academy of Sciences, 685000 Magadan, Russia
- Miroslava V. Derenko
  - Institute of Biological Problems of the North, Russian Academy of Sciences, 685000 Magadan, Russia
- Nataša Kovačević-Grujičić
  - Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia
- Milena Stevanović
  - Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia
  - Faculty of Biology, University of Belgrade, 11000 Belgrade, Serbia
  - Serbian Academy of Sciences and Arts, 11000 Belgrade, Serbia
- Danijela Drakulić
  - Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia
- Slobodan Davidović
  - Institute for Biological Research “Siniša Stanković”, National Institute of Republic of Serbia, University of Belgrade, 11060 Belgrade, Serbia
- Magdalena Zubańska
  - Faculty of Law and Administration, Department of Criminology and Forensic Sciences, University of Warmia and Mazury, 10726 Olsztyn, Poland
- Marcin Woźniak
  - Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland

35
Nait Saada J, Tsangalidou Z, Stricker M, Palamara PF. Inference of Coalescence Times and Variant Ages Using Convolutional Neural Networks. Mol Biol Evol 2023; 40:msad211. [PMID: 37738175 PMCID: PMC10581698 DOI: 10.1093/molbev/msad211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Revised: 09/11/2023] [Accepted: 09/18/2023] [Indexed: 09/24/2023] Open
Abstract
Accurate inference of the time to the most recent common ancestor (TMRCA) between pairs of individuals and of the age of genomic variants is key in several population genetic analyses. We developed a likelihood-free approach, called CoalNN, which uses a convolutional neural network to predict pairwise TMRCAs and allele ages from sequencing or SNP array data. CoalNN is trained through simulation and can be adapted to varying parameters, such as demographic history, using transfer learning. Across several simulated scenarios, CoalNN matched or outperformed the accuracy of model-based approaches for pairwise TMRCA and allele age prediction. We applied CoalNN to settings for which model-based approaches are under-developed and performed analyses to gain insights into the set of features it uses to perform TMRCA prediction. We next used CoalNN to analyze 2,504 samples from 26 populations in the 1000 Genomes Project data set, inferring the age of ∼80 million variants. We observed substantial variation across populations and for variants predicted to be pathogenic, reflecting heterogeneous demographic histories and the action of negative selection. We used CoalNN's predicted allele ages to construct genome-wide annotations capturing the signature of past negative selection. We performed LD-score regression analysis of heritability using summary association statistics from 63 independent complex traits and diseases (average N=314k), observing increased annotation-specific effects on heritability compared to a previous allele age annotation. These results highlight the effectiveness of using likelihood-free, simulation-trained models to infer properties of gene genealogies in large genomic data sets.
Affiliation(s)
- Pier Francesco Palamara
  - Department of Statistics, University of Oxford, Oxford, UK
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK

36
Mantes AD, Montserrat DM, Bustamante CD, Giró-i-Nieto X, Ioannidis AG. Neural ADMIXTURE for rapid genomic clustering. Nat Comput Sci 2023; 3:621-629. [PMID: 37600116 PMCID: PMC10438426 DOI: 10.1038/s43588-023-00482-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 02/25/2022] [Accepted: 06/06/2023] [Indexed: 08/22/2023]
Abstract
Characterizing the genetic structure of large cohorts has become increasingly important as genetic studies extend to massive, increasingly diverse biobanks. Popular methods decompose individual genomes into fractional cluster assignments with each cluster representing a vector of DNA variant frequencies. However, with rapidly increasing biobank sizes, these methods have become computationally intractable. Here we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, while reducing the compute time by orders of magnitude, surpassing even the fastest alternatives. One month of continuous compute using ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. A multi-head approach allows Neural ADMIXTURE to offer even further acceleration by calculating multiple cluster numbers in a single run. Furthermore, the models can be stored, allowing cluster assignment to be performed on new data in linear time without needing to share the training samples.
Affiliation(s)
- Albert Dominguez Mantes
  - Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, United States
  - Signal Theory and Communications Department, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
  - School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Vaud, Switzerland
- Daniel Mas Montserrat
  - Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, United States
- Xavier Giró-i-Nieto
  - Signal Theory and Communications Department, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
- Alexander G. Ioannidis
  - Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, United States
  - Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, United States

37
Smith CCR, Tittes S, Ralph PL, Kern AD. Dispersal inference from population genetic variation using a convolutional neural network. Genetics 2023; 224:iyad068. [PMID: 37052957 PMCID: PMC10213498 DOI: 10.1093/genetics/iyad068] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Received: 02/08/2023] [Revised: 02/08/2023] [Accepted: 04/07/2023] [Indexed: 04/14/2023] Open
Abstract
The geographic nature of biological dispersal shapes patterns of genetic variation over landscapes, making it possible to infer properties of dispersal from genetic variation data. Here, we present an inference tool that uses geographically distributed genotype data in combination with a convolutional neural network to estimate a critical population parameter: the mean per-generation dispersal distance. Using extensive simulation, we show that our deep learning approach is competitive with or outperforms state-of-the-art methods, particularly at small sample sizes. In addition, we evaluate varying nuisance parameters during training (including population density, demographic history, habitat size, and sampling area) and show that this strategy is effective for estimating dispersal distance when other model parameters are unknown. Whereas competing methods depend on information about local population density or accurate inference of identity-by-descent tracts, our method uses only single-nucleotide-polymorphism data and the spatial scale of sampling as input. Strikingly, and unlike other methods, our method does not use the geographic coordinates of the genotyped individuals. These features make our method, which we call "disperseNN," a potentially valuable new tool for estimating dispersal distance in nonmodel systems with whole genome data or reduced representation data. We apply disperseNN to 12 different species with publicly available data, yielding reasonable estimates for most species. Importantly, our method estimated consistently larger dispersal distances than mark-recapture calculations in the same species, which may be due to the limited geographic sampling area covered by some mark-recapture studies. Thus genetic tools like ours complement direct methods for improving our understanding of dispersal.
Affiliation(s)
- Chris C R Smith
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
- Silas Tittes
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
- Peter L Ralph
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
- Andrew D Kern
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA

38
Ahlquist KD, Sugden LA, Ramachandran S. Enabling interpretable machine learning for biological data with reliability scores. PLoS Comput Biol 2023; 19:e1011175. [PMID: 37235578 PMCID: PMC10249903 DOI: 10.1371/journal.pcbi.1011175] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/24/2022] [Revised: 06/08/2023] [Accepted: 05/10/2023] [Indexed: 05/28/2023] Open
Abstract
Machine learning tools have proven useful across biological disciplines, allowing researchers to draw conclusions from large datasets, and opening up new opportunities for interpreting complex and heterogeneous biological data. Alongside the rapid growth of machine learning, there have also been growing pains: some models that appear to perform well have later been revealed to rely on features of the data that are artifactual or biased; this feeds into the general criticism that machine learning models are designed to optimize model performance over the creation of new biological insights. A natural question arises: how do we develop machine learning models that are inherently interpretable or explainable? In this manuscript, we describe the SWIF(r) reliability score (SRS), a method building on the SWIF(r) generative framework that reflects the trustworthiness of the classification of a specific instance. The concept of the reliability score has the potential to generalize to other machine learning methods. We demonstrate the utility of the SRS when faced with common challenges in machine learning including: 1) an unknown class present in testing data that was not present in training data, 2) systemic mismatch between training and testing data, and 3) instances of testing data that have missing values for some attributes. We explore these applications of the SRS using a range of biological datasets, from agricultural data on seed morphology, to 22 quantitative traits in the UK Biobank, and population genetic simulations and 1000 Genomes Project data. With each of these examples, we demonstrate how the SRS can allow researchers to interrogate their data and training approach thoroughly, and to pair their domain-specific knowledge with powerful machine-learning frameworks. We also compare the SRS to related tools for outlier and novelty detection, and find that it has comparable performance, with the advantage of being able to operate when some data are missing. 
The SRS, and the broader discussion of interpretable scientific machine learning, will aid researchers in the biological machine learning space as they seek to harness the power of machine learning without sacrificing rigor and biological insight.
Affiliation(s)
- K. D. Ahlquist
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Department of Molecular Biology, Cell Biology, and Biochemistry, Brown University, Providence, Rhode Island, United States of America
- Lauren A. Sugden
- Department of Mathematics and Computer Science, Duquesne University, Pittsburgh, Pennsylvania, United States of America
- Sohini Ramachandran
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Department of Ecology, Evolution and Organismal Biology, Brown University, Providence, Rhode Island, United States of America
- Data Science Initiative, Brown University, Providence, Rhode Island, United States of America
39
Hamid I, Korunes KL, Schrider DR, Goldberg A. Localizing Post-Admixture Adaptive Variants with Object Detection on Ancestry-Painted Chromosomes. Mol Biol Evol 2023; 40:msad074. [PMID: 36947126 PMCID: PMC10116606 DOI: 10.1093/molbev/msad074] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 09/04/2022] [Revised: 03/14/2023] [Accepted: 03/20/2023] [Indexed: 03/23/2023] Open
Abstract
Gene flow between previously differentiated populations during the founding of an admixed or hybrid population has the potential to introduce adaptive alleles into the new population. If the adaptive allele is common in one source population, but not the other, then as the adaptive allele rises in frequency in the admixed population, genetic ancestry from the source containing the adaptive allele will increase nearby as well. Patterns of genetic ancestry have therefore been used to identify post-admixture positive selection in humans and other animals, including examples in immunity, metabolism, and animal coloration. A common method identifies regions of the genome that have local ancestry "outliers" compared with the distribution across the rest of the genome, considering each locus independently. However, we lack theoretical models for expected distributions of ancestry under various demographic scenarios, resulting in potential false positives and false negatives. Further, ancestry patterns between distant sites are often not independent. As a result, current methods tend to infer wide genomic regions containing many genes as under selection, limiting biological interpretation. Instead, we develop a deep learning object detection method applied to images generated from local ancestry-painted genomes. This approach preserves information from the surrounding genomic context and avoids potential pitfalls of user-defined summary statistics. We find the method is robust to a variety of demographic misspecifications using simulated data. Applied to human genotype data from Cabo Verde, we localize a known adaptive locus to a single narrow region compared with multiple or long windows obtained using two other ancestry-based methods.
Affiliation(s)
- Iman Hamid
- Department of Evolutionary Anthropology, Duke University, Durham, NC
- Daniel R Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, NC
- Amy Goldberg
- Department of Evolutionary Anthropology, Duke University, Durham, NC
40
Estimating human mobility in Holocene Western Eurasia with large-scale ancient genomic data. Proc Natl Acad Sci U S A 2023; 120:e2218375120. [PMID: 36821583 PMCID: PMC9992830 DOI: 10.1073/pnas.2218375120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/24/2023] Open
Abstract
The recent increase in openly available ancient human DNA samples allows for large-scale meta-analysis applications. Trans-generational past human mobility is one of the key aspects that ancient genomics can contribute to, since changes in genetic ancestry, unlike cultural changes seen in the archaeological record, necessarily reflect movements of people. Here, we present an algorithm for spatiotemporal mapping of genetic profiles, which allows for direct estimates of past human mobility from large ancient genomic datasets. The key idea of the method is to derive a spatial probability surface of genetic similarity for each individual in its respective past. This is achieved by first creating an interpolated ancestry field through space and time based on multivariate statistics and Gaussian process regression and then using this field to map the ancient individuals into space according to their genetic profile. We apply this algorithm to a dataset of 3138 aDNA samples with genome-wide data from Western Eurasia in the last 10,000 years. Finally, we condense this sample-wise record with a simple summary statistic into a diachronic measure of mobility for subregions in Western, Central, and Southern Europe. For regions and periods with sufficient data coverage, our similarity surfaces and mobility estimates show general concordance with previous results and provide a meta-perspective of genetic changes and human mobility.
41
Korfmann K, Gaggiotti OE, Fumagalli M. Deep Learning in Population Genetics. Genome Biol Evol 2023; 15:evad008. [PMID: 36683406 PMCID: PMC9897193 DOI: 10.1093/gbe/evad008] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 10/13/2022] [Revised: 12/19/2022] [Accepted: 01/16/2023] [Indexed: 01/24/2023] Open
Abstract
Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally unfeasible, machine learning, and in particular deep learning, algorithms are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations on the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data, will provide further opportunities for technological advancements in the field.
Affiliation(s)
- Kevin Korfmann
- Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, Germany
- Oscar E Gaggiotti
- Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife KY16 9TF, UK
- Matteo Fumagalli
- Department of Biological and Behavioural Sciences, Queen Mary University of London, UK
42
Image Geo-Site Estimation Using Convolutional Auto-Encoder and Multi-Label Support Vector Machine. INFORMATION 2023. [DOI: 10.3390/info14010029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2023] Open
Abstract
The estimation of an image geo-site solely based on its contents is a promising task. Compelling image labelling relies heavily on contextual information, which is not as simple as recognizing a single object in an image. An Auto-Encoder-based support vector machine approach is proposed in this work to estimate the image geo-site and to address the issue of misclassified estimations. The proposed method for geo-site estimation is evaluated on a dataset consisting of 125 classes of various images captured within 125 countries. The proposed work uses a convolutional Auto-Encoder for training and dimensionality reduction. The resulting preprocessed dataset is then further processed by a multi-label support vector machine. The performance of the proposed approach is assessed using accuracy, sensitivity, specificity, and F1-score as evaluation parameters. The proposed approach for image geo-site estimation presented in this article outperforms Auto-Encoder-based K-Nearest Neighbor and Auto-Encoder-Random Forest methods.
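The four evaluation parameters named in this abstract have standard definitions in terms of confusion-matrix counts. The following is a minimal sketch under the assumption of a one-vs-rest view of a single class; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the article.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity and F1 from confusion counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives for one class treated as "positive" (one-vs-rest).
    No guard against empty denominators is included in this sketch.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true-positive rate (recall)
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

For a multi-class problem such as country-level labelling, these per-class values would typically be averaged across classes; the abstract does not say which averaging scheme the authors used.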
43
Vermant M, Goos T, Gogaert S, De Cock D, Verschueren P, Wuyts WA. Are genes the missing link to detect and prognosticate RA-ILD? Rheumatol Adv Pract 2023; 7:rkad023. [PMID: 36923263 PMCID: PMC10010659 DOI: 10.1093/rap/rkad023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 01/31/2023] [Indexed: 03/09/2023] Open
Affiliation(s)
- Marie Vermant
- Laboratory of Respiratory Diseases and Thoracic Surgery (BREATHE), Department of Chronic Diseases and Metabolism, KU Leuven, Leuven, Belgium; Pulmonology, University Hospitals Leuven, Leuven, Belgium
- Tinne Goos
- Laboratory of Respiratory Diseases and Thoracic Surgery (BREATHE), Department of Chronic Diseases and Metabolism, KU Leuven, Leuven, Belgium; Pulmonology, University Hospitals Leuven, Leuven, Belgium
- Stefan Gogaert
- Laboratory of Respiratory Diseases and Thoracic Surgery (BREATHE), Department of Chronic Diseases and Metabolism, KU Leuven, Leuven, Belgium; Pulmonology, University Hospitals Leuven, Leuven, Belgium
- Diederik De Cock
- Biostatistics and Medical Informatics Research Group, Department of Public Health, Vrije Universiteit Brussel, Brussels, Belgium
- Patrick Verschueren
- Skeletal Biology and Engineering Research Center, Department of Development and Regeneration, KU Leuven, Leuven, Belgium; Rheumatology, University Hospitals Leuven, Leuven, Belgium
- Wim A Wuyts
- Laboratory of Respiratory Diseases and Thoracic Surgery (BREATHE), Department of Chronic Diseases and Metabolism, KU Leuven, Leuven, Belgium; Pulmonology, University Hospitals Leuven, Leuven, Belgium
44
Deelder W, Manko E, Phelan JE, Campino S, Palla L, Clark TG. Geographical classification of malaria parasites through applying machine learning to whole genome sequence data. Sci Rep 2022; 12:21150. [PMID: 36476815 PMCID: PMC9729610 DOI: 10.1038/s41598-022-25568-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/31/2021] [Accepted: 12/01/2022] [Indexed: 12/12/2022] Open
Abstract
Malaria, caused by Plasmodium parasites, is a major global health challenge. Whole genome sequencing (WGS) of Plasmodium falciparum and Plasmodium vivax genomes is providing insights into parasite genetic diversity, transmission patterns, and can inform decision making for clinical and surveillance purposes. Advances in sequencing technologies are helping to generate timely and big genomic datasets, with the prospect of applying Artificial Intelligence analytical techniques (e.g., machine learning) to support programmatic malaria control and elimination. Here, we assess the potential of applying deep learning convolutional neural network approaches to predict the geographic origin of infections (continents, countries, GPS locations) using WGS data of P. falciparum (n = 5957; 27 countries) and P. vivax (n = 659; 13 countries) isolates. Using identified high-quality genome-wide single nucleotide polymorphisms (SNPs) (P. falciparum: 750 k, P. vivax: 588 k), an analysis of population structure and ancestry revealed clustering at the country-level. When predicting locations for both species, classification (compared to regression) methods had the lowest distance errors, and > 90% accuracy at a country level. Our work demonstrates the utility of machine learning approaches for geo-classification of malaria parasites. With timelier WGS data generation across more malaria-affected regions, the performance of machine learning approaches for geo-classification will improve, thereby supporting disease control activities.
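The "distance error" of a predicted location is commonly measured as the great-circle distance between the true and predicted coordinates. The abstract does not state how the authors computed it, so the following is only a sketch under that assumption; `haversine_km`, `mean_distance_error_km`, and the mean Earth radius constant are illustrative choices.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to protect asin against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def mean_distance_error_km(true_coords, predicted_coords):
    """Average great-circle error over paired (lat, lon) tuples."""
    errors = [haversine_km(t[0], t[1], p[0], p[1])
              for t, p in zip(true_coords, predicted_coords)]
    return sum(errors) / len(errors)
```

With a classifier, the predicted coordinate would be a representative point for the predicted class (for example a country centroid), which is one way classification and regression outputs can be compared on the same distance scale.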
Affiliation(s)
- Wouter Deelder
- London School of Hygiene & Tropical Medicine, Keppel Street, London, WC1E 7HT, UK
- Dalberg Advisors, 7 Rue de Chantepoulet, 1201, Geneva, Switzerland
- Emilia Manko
- London School of Hygiene & Tropical Medicine, Keppel Street, London, WC1E 7HT, UK
- Jody E Phelan
- London School of Hygiene & Tropical Medicine, Keppel Street, London, WC1E 7HT, UK
- Susana Campino
- London School of Hygiene & Tropical Medicine, Keppel Street, London, WC1E 7HT, UK
- Luigi Palla
- London School of Hygiene & Tropical Medicine, Keppel Street, London, WC1E 7HT, UK
- Department of Public Health and Infectious Diseases, University of Rome La Sapienza, Rome, Italy
- Taane G Clark
- London School of Hygiene & Tropical Medicine, Keppel Street, London, WC1E 7HT, UK
45
Sanchez T, Bray EM, Jobic P, Guez J, Letournel AC, Charpiat G, Cury J, Jay F. dnadna: a deep learning framework for population genetics inference. Bioinformatics 2022; 39:6851140. [PMID: 36445000 PMCID: PMC9825738 DOI: 10.1093/bioinformatics/btac765] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/17/2021] [Revised: 10/30/2022] [Accepted: 11/28/2022] [Indexed: 11/30/2022] Open
Abstract
MOTIVATION: We present dnadna, a flexible Python-based software for deep learning inference in population genetics. It is task-agnostic and aims at facilitating the development, reproducibility, dissemination and re-usability of neural networks designed for population genetic data. RESULTS: dnadna defines multiple user-friendly workflows. First, users can implement new architectures and tasks, while benefiting from dnadna utility functions, training procedure and test environment, which saves time and decreases the likelihood of bugs. Second, the implemented networks can be re-optimized based on user-specified training sets and/or tasks. Newly implemented architectures and pre-trained networks are easily shareable with the community for further benchmarking or other applications. Finally, users can apply pre-trained networks in order to predict evolutionary history from alternative real or simulated genetic datasets, without requiring extensive knowledge in deep learning or coding in general. dnadna comes with a peer-reviewed, exchangeable neural network, allowing demographic inference from SNP data, that can be used directly or retrained to solve other tasks. Toy networks are also available to ease the exploration of the software, and we expect that the range of available architectures will keep expanding thanks to community contributions. AVAILABILITY AND IMPLEMENTATION: dnadna is a Python (≥3.7) package; its repository is available at gitlab.com/mlgenetics/dnadna and its associated documentation at mlgenetics.gitlab.io/dnadna/.
Affiliation(s)
- Pierre Jobic
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- ENS Paris-Saclay, 91190 Gif-sur-Yvette, France
- Jérémy Guez
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- UMR7206 Eco-Anthropologie, Muséum National d’Histoire Naturelle, CNRS, Université de Paris, 75016 Paris, France
- Anne-Catherine Letournel
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- Guillaume Charpiat
- Université Paris-Saclay, CNRS UMR 9015, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- Jean Cury
- To whom correspondence should be addressed.
- Flora Jay
- To whom correspondence should be addressed.
46
Gretzinger J, Sayer D, Justeau P, Altena E, Pala M, Dulias K, Edwards CJ, Jodoin S, Lacher L, Sabin S, Vågene ÅJ, Haak W, Ebenesersdóttir SS, Moore KHS, Radzeviciute R, Schmidt K, Brace S, Bager MA, Patterson N, Papac L, Broomandkhoshbacht N, Callan K, Harney É, Iliev L, Lawson AM, Michel M, Stewardson K, Zalzala F, Rohland N, Kappelhoff-Beckmann S, Both F, Winger D, Neumann D, Saalow L, Krabath S, Beckett S, Van Twest M, Faulkner N, Read C, Barton T, Caruth J, Hines J, Krause-Kyora B, Warnke U, Schuenemann VJ, Barnes I, Dahlström H, Clausen JJ, Richardson A, Popescu E, Dodwell N, Ladd S, Phillips T, Mortimer R, Sayer F, Swales D, Stewart A, Powlesland D, Kenyon R, Ladle L, Peek C, Grefen-Peters S, Ponce P, Daniels R, Spall C, Woolcock J, Jones AM, Roberts AV, Symmons R, Rawden AC, Cooper A, Bos KI, Booth T, Schroeder H, Thomas MG, Helgason A, Richards MB, Reich D, Krause J, Schiffels S. The Anglo-Saxon migration and the formation of the early English gene pool. Nature 2022; 610:112-119. [PMID: 36131019 PMCID: PMC9534755 DOI: 10.1038/s41586-022-05247-2] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 12/12/2021] [Accepted: 08/17/2022] [Indexed: 11/09/2022]
Abstract
The history of the British Isles and Ireland is characterized by multiple periods of major cultural change, including the influential transformation after the end of Roman rule, which precipitated shifts in language, settlement patterns and material culture [1]. The extent to which migration from continental Europe mediated these transitions is a matter of long-standing debate [2-4]. Here we study genome-wide ancient DNA from 460 medieval northwestern Europeans, including 278 individuals from England, alongside archaeological data, to infer contemporary population dynamics. We identify a substantial increase of continental northern European ancestry in early medieval England, which is closely related to the early medieval and present-day inhabitants of Germany and Denmark, implying large-scale substantial migration across the North Sea into Britain during the Early Middle Ages. As a result, the individuals who we analysed from eastern England derived up to 76% of their ancestry from the continental North Sea zone, albeit with substantial regional variation and heterogeneity within sites. We show that women with immigrant ancestry were more often furnished with grave goods than women with local ancestry, whereas men with weapons were as likely not to be of immigrant ancestry. A comparison with present-day Britain indicates that subsequent demographic events reduced the fraction of continental northern European ancestry while introducing further ancestry components into the English gene pool, including substantial southwestern European ancestry most closely related to that seen in Iron Age France [5,6].
Affiliation(s)
- Joscha Gretzinger
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Maria Pala
- University of Huddersfield, Huddersfield, UK
- Katharina Dulias
- University of Huddersfield, Huddersfield, UK
- Institute of Geosystems and Bioindication, Technische Universität Braunschweig, Braunschweig, Germany
- Ceiridwen J Edwards
- University of Huddersfield, Huddersfield, UK
- University of Oxford, Oxford, UK
- Laura Lacher
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Susanna Sabin
- Center for Evolution and Medicine, Arizona State University, Tempe, AZ, USA
- Åshild J Vågene
- Globe Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Wolfgang Haak
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- S Sunna Ebenesersdóttir
- deCODE Genetics/AMGEN Inc., Reykjavík, Iceland
- Department of Anthropology, School of Social Sciences, University of Iceland, Reykjavík, Iceland
- Rita Radzeviciute
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Selina Brace
- Department of Earth Sciences, Natural History Museum, London, UK
- Martina Abenhus Bager
- Globe Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Nick Patterson
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Luka Papac
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Nasreen Broomandkhoshbacht
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Harvard Medical School, Boston, MA, USA
- Kimberly Callan
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Harvard Medical School, Boston, MA, USA
- Éadaoin Harney
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Lora Iliev
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Harvard Medical School, Boston, MA, USA
- Ann Marie Lawson
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Harvard Medical School, Boston, MA, USA
- Megan Michel
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Harvard Medical School, Boston, MA, USA
- Kristin Stewardson
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Harvard Medical School, Boston, MA, USA
- Fatma Zalzala
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Howard Hughes Medical Institute, Harvard Medical School, Boston, MA, USA
- Nadin Rohland
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Frank Both
- Landesmuseum Natur und Mensch, Oldenburg, Germany
- Lars Saalow
- Landesamt für Kultur und Denkmalpflege Mecklenburg-Vorpommern, Schwerin, Germany
- Stefan Krabath
- Institute for Historical Coastal Research (NIhK), Wilhelmshaven, Germany
- Sophie Beckett
- Sedgeford Historical and Archaeological Research Project, Sedgeford, UK
- Cranfield Forensic Institute, Cranfield Defence and Security, Cranfield University, Cranfield, UK
- Melbourne Dental School, University of Melbourne, Melbourne, Victoria, Australia
- Melanie Van Twest
- Sedgeford Historical and Archaeological Research Project, Sedgeford, UK
- Neil Faulkner
- Sedgeford Historical and Archaeological Research Project, Sedgeford, UK
- Chris Read
- The Atlantic Technological University, Sligo, Ireland
- Verena J Schuenemann
- University of Zurich, Zurich, Switzerland
- Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
- Human Evolution and Archaeological Sciences, University of Vienna, Vienna, Austria
- Ian Barnes
- Department of Earth Sciences, Natural History Museum, London, UK
- Andrew Richardson
- Canterbury Archaeological Trust, Canterbury, UK
- Isle Heritage CIC, Sandgate, UK
- Richard Mortimer
- Oxford Archaeology East, Cambridge, UK
- Cotswold Archaeology, Needham Market, UK
- Faye Sayer
- University of Birmingham, Birmingham, UK
- Diana Swales
- Centre for Anatomy and Human Identification (CAHID), University of Dundee, Dundee, UK
- Robert Kenyon
- East Dorset Antiquarian Society (EDAS), West Bexington, UK
- Lilian Ladle
- Department of Archaeology and Anthropology, Bournemouth University, Poole, UK
- Christina Peek
- Institute for Historical Coastal Research (NIhK), Wilhelmshaven, Germany
- Anooshka C Rawden
- Fishbourne Roman Palace, Fishbourne, UK
- South Downs Centre, Midhurst, UK
- Alan Cooper
- BlueSkyGenetics, Adelaide, South Australia, Australia
- Kirsten I Bos
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Hannes Schroeder
- Globe Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Agnar Helgason
- deCODE Genetics/AMGEN Inc., Reykjavík, Iceland
- Department of Anthropology, School of Social Sciences, University of Iceland, Reykjavík, Iceland
- David Reich
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Howard Hughes Medical Institute, Harvard Medical School, Boston, MA, USA
- Department of Human Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Johannes Krause
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Stephan Schiffels
- Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
47
Nikolakis ZL, Adams RH, Wade KJ, Lund AJ, Carlton EJ, Castoe TA, Pollock DD. Prospects for genomic surveillance for selection in schistosome parasites. FRONTIERS IN EPIDEMIOLOGY 2022; 2:932021. [PMID: 38455290 PMCID: PMC10910990 DOI: 10.3389/fepid.2022.932021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2022] [Accepted: 09/12/2022] [Indexed: 03/09/2024]
Abstract
Schistosomiasis is a neglected tropical disease caused by multiple parasitic Schistosoma species, and which impacts over 200 million people globally, mainly in low- and middle-income countries. Genomic surveillance to detect evidence for natural selection in schistosome populations represents an emerging and promising approach to identify and interpret schistosome responses to ongoing control efforts or other environmental factors. Here we review how genomic variation is used to detect selection, how these approaches have been applied to schistosomes, and how future studies to detect selection may be improved. We discuss the theory of genomic analyses to detect selection, identify experimental designs for such analyses, and review studies that have applied these approaches to schistosomes. We then consider the biological characteristics of schistosomes that are expected to respond to selection, particularly those that may be impacted by control programs. Examples include drug resistance, host specificity, and life history traits, and we review our current understanding of specific genes that underlie them in schistosomes. We also discuss how inherent features of schistosome reproduction and demography pose substantial challenges for effective identification of these traits and their genomic bases. We conclude by discussing how genomic surveillance for selection should be designed to improve understanding of schistosome biology, and how the parasite changes in response to selection.
Affiliation(s)
- Zachary L. Nikolakis
- Department of Biology, University of Texas at Arlington, Arlington, TX, United States
- Richard H. Adams
- Department of Biological and Environmental Sciences, Georgia College and State University, Milledgeville, GA, United States
- Kristen J. Wade
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, United States
- Andrea J. Lund
- Department of Environmental and Occupational Health, Colorado School of Public Health, University of Colorado, Anschutz, Aurora, CO, United States
- Elizabeth J. Carlton
- Department of Environmental and Occupational Health, Colorado School of Public Health, University of Colorado, Anschutz, Aurora, CO, United States
- Todd A. Castoe
- Department of Biology, University of Texas at Arlington, Arlington, TX, United States
- David D. Pollock
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, United States
48
Qin X, Chiang CWK, Gaggiotti OE. Deciphering signatures of natural selection via deep learning. Brief Bioinform 2022; 23:6686736. [PMID: 36056746 PMCID: PMC9487700 DOI: 10.1093/bib/bbac354] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 05/19/2022] [Revised: 07/11/2022] [Accepted: 07/28/2022] [Indexed: 11/12/2022] Open
Abstract
Identifying genomic regions influenced by natural selection provides fundamental insights into the genetic basis of local adaptation. However, it remains challenging to detect loci under complex spatially varying selection. We propose a deep learning-based framework, DeepGenomeScan, which can detect signatures of spatially varying selection. We demonstrate that DeepGenomeScan outperformed principal component analysis- and redundancy analysis-based genome scans in identifying loci underlying quantitative traits subject to complex spatial patterns of selection. Noticeably, DeepGenomeScan increases statistical power by up to 47.25% under nonlinear environmental selection patterns. We applied DeepGenomeScan to a European human genetic dataset and identified some well-known genes under selection and a substantial number of clinically important genes that were not identified by SPA, iHS, Fst and Bayenv when applied to the same dataset.
Affiliation(s)
- Xinghu Qin
- Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife, KY16 9TF, UK
- Charleston W K Chiang
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine & Department of Quantitative and Computational Biology, University of Southern California, USA
- Oscar E Gaggiotti
- Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife, KY16 9TF, UK
49
Gloria-Soria A, Faraji A, Hamik J, White G, Amsberry S, Donahue M, Buss B, Pless E, Cosme LV, Powell JR. Origins of high latitude introductions of Aedes aegypti to Nebraska and Utah during 2019. INFECTION, GENETICS AND EVOLUTION : JOURNAL OF MOLECULAR EPIDEMIOLOGY AND EVOLUTIONARY GENETICS IN INFECTIOUS DISEASES 2022; 103:105333. [PMID: 35817397 DOI: 10.1016/j.meegid.2022.105333] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/06/2022] [Revised: 06/27/2022] [Accepted: 07/05/2022] [Indexed: 06/15/2023]
Abstract
Aedes aegypti (L.), the yellow fever mosquito, is also an important vector of dengue and Zika viruses, and an invasive species in North America. Aedes aegypti inhabits tropical and sub-tropical areas of the world and in North America is primarily distributed throughout the southern US states and Mexico. The northern range of Ae. aegypti is limited by cold winter months and establishment in these areas has been mostly unsuccessful. However, frequent introductions of Ae. aegypti to temperate, non-endemic areas during the warmer months can lead to seasonal activity and disease outbreaks. Two Ae. aegypti incursions were reported in the late summer of 2019 into York, Nebraska and Moab, Utah. These states had no history of established populations of this mosquito and no evidence of previous seasonal activity. We genotyped a subset of individuals from each location at 12 microsatellite loci and ~ 14,000 single nucleotide polymorphic markers to determine their genetic affinities to other populations worldwide and investigate their potential source of introduction. Our results support a single origin for each of the introductions from different sources. Aedes aegypti from Utah likely derived from Tucson, Arizona, or a nearby location. Nebraska specimen results were not as conclusive, but point to an origin from southcentral or southeastern US. In addition to an effective, efficient, and sustainable control of invasive mosquitoes, such as Ae. aegypti, identifying the potential routes of introduction will be key to prevent future incursions and assess their potential health threat based on the ability of the source population to transmit a particular virus and its insecticide resistance profile, which may complicate vector control.
Affiliation(s)
- Andrea Gloria-Soria
  - Department of Entomology, Center for Vector Biology & Zoonotic Diseases, The Connecticut Agricultural Experiment Station, 123 Huntington Street, P.O. Box 1106, New Haven, CT 06511, USA; Yale University, Department of Ecology and Evolutionary Biology, 21 Sachem Street, New Haven, CT 06511, USA.
- Ary Faraji
  - Salt Lake City Mosquito Abatement District, 2215 North 2200 West, Salt Lake City, UT 84116-1108, USA.
- Jeff Hamik
  - Nebraska Department of Health and Human Services, Epidemiology and Informatics Unit, 301 Centennial Mall South, Lincoln, NE 68509, USA; University of Nebraska-Lincoln, Department of Educational Psychology, 114 Teachers College Hall, Lincoln, NE 68588, USA.
- Gregory White
  - Salt Lake City Mosquito Abatement District, 2215 North 2200 West, Salt Lake City, UT 84116-1108, USA.
- Shanon Amsberry
  - Moab Mosquito Abatement District, 1000 Sand Flats Rd, Moab, UT 84532, USA.
- Matthew Donahue
  - Nebraska Department of Health and Human Services, Epidemiology and Informatics Unit, 301 Centennial Mall South, Lincoln, NE 68509, USA; Epidemic Intelligence Service, CDC, USA.
- Bryan Buss
  - Nebraska Department of Health and Human Services, Epidemiology and Informatics Unit, 301 Centennial Mall South, Lincoln, NE 68509, USA; Career Epidemiology Field Officer Program, Division of State and Local Readiness, Center for Preparedness and Response, CDC, USA.
- Luciano Veiga Cosme
  - Yale University, Department of Ecology and Evolutionary Biology, 21 Sachem Street, New Haven, CT 06511, USA.
- Jeffrey R Powell
  - Yale University, Department of Ecology and Evolutionary Biology, 21 Sachem Street, New Haven, CT 06511, USA.
50
Qin X, Chiang CWK, Gaggiotti OE. KLFDAPC: a supervised machine learning approach for spatial genetic structure analysis. Brief Bioinform 2022; 23:bbac202. [PMID: 35649387 PMCID: PMC9294434 DOI: 10.1093/bib/bbac202] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/12/2022] [Revised: 04/05/2022] [Accepted: 04/29/2022] [Indexed: 12/30/2022]
Abstract
Geographic patterns of human genetic variation provide important insights into human evolution and disease. A commonly used tool to detect and describe them is principal component analysis (PCA) or the supervised linear discriminant analysis of principal components (DAPC). However, genetic features produced from both approaches could fail to correctly characterize population structure for complex scenarios involving admixture. In this study, we introduce Kernel Local Fisher Discriminant Analysis of Principal Components (KLFDAPC), a supervised non-linear approach for inferring individual geographic genetic structure that could rectify the limitations of these approaches by preserving the multimodal space of samples. We tested the power of KLFDAPC to infer population structure and to predict individual geographic origin using neural networks. Simulation results showed that KLFDAPC has higher discriminatory power than PCA and DAPC. The application of our method to empirical European and East Asian genome-wide genetic datasets indicated that the first two reduced features of KLFDAPC correctly recapitulated the geography of individuals and significantly improved the accuracy of predicting individual geographic origin when compared to PCA and DAPC. Therefore, KLFDAPC can be useful for geographic ancestry inference, design of genome scans and correction for spatial stratification in GWAS that link genes to adaptation or disease susceptibility.
Affiliation(s)
- Xinghu Qin
  - Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife, KY16 9TF, UK
- Charleston W K Chiang
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine & Department of Quantitative and Computational Biology, University of Southern California, USA
- Oscar E Gaggiotti
  - Centre for Biological Diversity, Sir Harold Mitchell Building, University of St Andrews, Fife, KY16 9TF, UK