1
Gravel S. Mapping a complex evolutionary history. Science 2025; 387:1352-1353. [PMID: 40146848 DOI: 10.1126/science.adw5484] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/29/2025]
Abstract
Tracking the geographic origins of genetic ancestors reveals past human migrations.
Affiliation(s)
- Simon Gravel
- Department of Human Genetics, McGill University, Montreal, QC, Canada
2
Haag J, Jordan AI, Stamatakis A. Pandora: a tool to estimate dimensionality reduction stability of genotype data. Bioinformatics Advances 2025; 5:vbaf040. [PMID: 40160475 PMCID: PMC11955236 DOI: 10.1093/bioadv/vbaf040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2025] [Accepted: 02/27/2025] [Indexed: 04/02/2025]
Abstract
Motivation: Genotype datasets typically contain a large number of single-nucleotide polymorphisms for a comparatively small number of individuals. To identify similarities between individuals and to infer an individual's origin or membership to a population, dimensionality reduction techniques are routinely deployed. However, inherent (technical) difficulties such as missing or noisy data need to be accounted for when analyzing a lower dimensional representation of genotype data, and the intrinsic uncertainty of such analyses should be reported in all studies. However, to date, there exists no stability assessment technique for genotype data that can estimate this uncertainty. Results: Here, we present Pandora, a stability estimation framework for genotype data based on bootstrapping. Pandora computes an overall score to quantify the stability of the entire embedding, infers per-individual support values, and also deploys a k-means clustering approach to assess the uncertainty of assignments to potential cultural groups. Using published empirical and simulated datasets, we demonstrate the usage and utility of Pandora for studies that rely on dimensionality reduction techniques. Availability and implementation: Pandora is available on GitHub: https://github.com/tschuelia/Pandora.
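To make the bootstrapping idea in this abstract concrete, here is a minimal sketch. It is not Pandora's actual algorithm or API: it resamples SNP columns with replacement, recomputes a one-dimensional embedding (the first principal component, found by power iteration), and reports the mean absolute correlation with the embedding of the full data. All function names, the use of correlation as the similarity measure, and the toy genotype matrix are assumptions made for illustration.

```python
import random
from statistics import mean


def center_columns(G):
    """Subtract each SNP (column) mean so that PCA operates on centered data."""
    n, m = len(G), len(G[0])
    col_means = [sum(G[i][j] for i in range(n)) / n for j in range(m)]
    return [[G[i][j] - col_means[j] for j in range(m)] for i in range(n)]


def pc1_scores(G, iters=200):
    """First principal-component direction over individuals (power iteration)."""
    X = center_columns(G)
    n = len(X)
    # Gram matrix between individuals (n x n); its top eigenvector gives PC1 scores
    # up to scale and sign.
    K = [[sum(a * b for a, b in zip(X[i], X[k])) for k in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant start vector
    for _ in range(iters):
        w = [sum(K[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # no variance left in the resampled data
            return [0.0] * n
        v = [x / norm for x in w]
    return v


def abs_corr(a, b):
    """Absolute Pearson correlation; the sign of a principal component is arbitrary."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return 0.0
    return abs(cov) / (va * vb) ** 0.5


def bootstrap_stability(G, n_boot=20, seed=0):
    """Mean similarity between the full-data PC1 and PC1 of SNP-bootstrapped data."""
    rng = random.Random(seed)
    m = len(G[0])
    reference = pc1_scores(G)
    similarities = []
    for _ in range(n_boot):
        cols = [rng.randrange(m) for _ in range(m)]  # resample SNPs with replacement
        Gb = [[row[j] for j in cols] for row in G]
        similarities.append(abs_corr(reference, pc1_scores(Gb)))
    return mean(similarities)


# Hypothetical toy data: two perfectly separated groups of three individuals.
toy = [[0] * 10 for _ in range(3)] + [[2] * 10 for _ in range(3)]
stability = bootstrap_stability(toy)
```

On this toy matrix every bootstrap replicate preserves the two-group structure, so the score is close to 1; noisier data with weaker structure would be expected to score lower.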
Affiliation(s)
- Julia Haag
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
- Alexander I Jordan
- Computational Statistics Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
- Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
- Biodiversity Computing Group, Institute of Computer Science, Foundation for Research and Technology—Hellas, Heraklion, Crete 70013, Greece
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe 76131, Germany
3
Goli RC, Chishi KG, Ganguly I, Singh S, Dixit S, Rathi P, Diwakar V, Sree C C, Limbalkar OM, Sukhija N, Kanaka K. Global and Local Ancestry and its Importance: A Review. Curr Genomics 2024; 25:237-260. [PMID: 39156729 PMCID: PMC11327809 DOI: 10.2174/0113892029298909240426094055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2024] [Revised: 03/02/2024] [Accepted: 03/11/2024] [Indexed: 08/20/2024]
Abstract
Admixture, an evolutionary mechanism, is the fastest way to significantly change the composition of a population. In the history of animal breeding, genetic admixture has provided short-term advantages through complementarity and heterosis in several traits, and long-term advantages through genetic diversity. The traditional method of admixture analysis based on pedigree records has largely been replaced by genome-wide marker data, which enable more precise estimates. Among these markers, SNPs have been the popular choice because they are cost-effective, less laborious, and easy to genotype automatically. Markers that can indicate the population of origin of a DNA sample when the source individual is unknown or unwilling to disclose their lineage are called Ancestry-Informative Markers (AIMs). Admixture resolved at the level of individual loci is termed local ancestry; it can be exploited to identify signs of recent selective response and can account for genetic drift. Considering the importance of genetic admixture and local ancestry, this mini-review illustrates both concepts, covering the basics, estimation and identification methods, tools and software, and applications.
Affiliation(s)
- Kiyevi G. Chishi
- ICAR-National Dairy Research Institute, Karnal, 132001, Haryana, India
- Indrajit Ganguly
- ICAR-National Bureau of Animal Genetic Resources, Karnal, 132001, Haryana, India
- Sanjeev Singh
- ICAR-National Bureau of Animal Genetic Resources, Karnal, 132001, Haryana, India
- S.P. Dixit
- ICAR-National Bureau of Animal Genetic Resources, Karnal, 132001, Haryana, India
- Pallavi Rathi
- ICAR-National Dairy Research Institute, Karnal, 132001, Haryana, India
- Vikas Diwakar
- ICAR-National Dairy Research Institute, Karnal, 132001, Haryana, India
- Chandana Sree C
- ICAR-National Dairy Research Institute, Karnal, 132001, Haryana, India
- Nidhi Sukhija
- ICAR-National Dairy Research Institute, Karnal, 132001, Haryana, India
- Central Tasar Research and Training Institute, Ranchi, 835303, Jharkhand, India
- K.K Kanaka
- ICAR-Indian Institute of Agricultural Biotechnology, Ranchi, 834010, Jharkhand, India
4
Bzdok D, Wolf G, Kopal J. Harnessing population diversity: in search of tools of the trade. Gigascience 2024; 13:giae068. [PMID: 39331809 PMCID: PMC11427908 DOI: 10.1093/gigascience/giae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2024] [Revised: 08/19/2024] [Accepted: 08/20/2024] [Indexed: 09/29/2024]
Abstract
Big neuroscience datasets are not big small datasets when it comes to quantitative data analysis. Neuroscience has now witnessed the advent of many population cohort studies that deep-profile participants, yielding hundreds of measures, capturing dimensions of each individual's position in the broader society. Indeed, there is a rebalancing from small, strictly selected, and thus homogenized cohorts toward always larger, more representative, and thus diverse cohorts. This shift in cohort composition is prompting the revision of incumbent modeling practices. Major sources of population stratification increasingly overshadow the subtle effects that neuroscientists are typically studying. In our opinion, as we sample individuals from always wider diversity backgrounds, we will require a new stack of quantitative tools to realize diversity-aware modeling. We here take inventory of candidate analytical frameworks. Better incorporating driving factors behind population structure will allow refining our understanding of how brain-behavior relationships depend on human subgroups.
Affiliation(s)
- Danilo Bzdok
- MNI-Montreal Neurological Institute, Department of Biomedical Engineering, McGill University, Montreal, Quebec H3A 2B4, Canada
- MILA-Quebec Artificial Intelligence Institute, Montreal H2S 3H1, Canada
- Guy Wolf
- MILA-Quebec Artificial Intelligence Institute, Montreal H2S 3H1, Canada
- Jakub Kopal
- MNI-Montreal Neurological Institute, Department of Biomedical Engineering, McGill University, Montreal, Quebec H3A 2B4, Canada
- MILA-Quebec Artificial Intelligence Institute, Montreal H2S 3H1, Canada
5
Lewanski AL, Grundler MC, Bradburd GS. The era of the ARG: An introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics. PLoS Genet 2024; 20:e1011110. [PMID: 38236805 PMCID: PMC10796009 DOI: 10.1371/journal.pgen.1011110] [Citation(s) in RCA: 26] [Impact Index Per Article: 26.0] [Indexed: 01/22/2024]
Abstract
In the presence of recombination, the evolutionary relationships between a set of sampled genomes cannot be described by a single genealogical tree. Instead, the genomes are related by a complex, interwoven collection of genealogies formalized in a structure called an ancestral recombination graph (ARG). An ARG extensively encodes the ancestry of the genome(s) and thus is replete with valuable information for addressing diverse questions in evolutionary biology. Despite its potential utility, technological and methodological limitations, along with a lack of approachable literature, have severely restricted awareness and application of ARGs in evolution research. Excitingly, recent progress in ARG reconstruction and simulation has made ARG-based approaches feasible for many questions and systems. In this review, we provide an accessible introduction and exploration of ARGs, survey recent methodological breakthroughs, and describe the potential for ARGs to further existing goals and open avenues of inquiry that were previously inaccessible in evolutionary genomics. Through this discussion, we aim to more widely disseminate the promise of ARGs in evolutionary genomics and encourage the broader development and adoption of ARG-based inference.
Affiliation(s)
- Alexander L. Lewanski
- Department of Integrative Biology, Michigan State University, East Lansing, Michigan, United States of America
- W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, Michigan, United States of America
- Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, Michigan, United States of America
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
- Michael C. Grundler
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
- Gideon S. Bradburd
- W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, Michigan, United States of America
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
6
Neijzen D, Lunter G. Unsupervised learning for medical data: A review of probabilistic factorization methods. Stat Med 2023; 42:5541-5554. [PMID: 37850249 DOI: 10.1002/sim.9924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Accepted: 09/13/2023] [Indexed: 10/19/2023]
Abstract
We review popular unsupervised learning methods for the analysis of high-dimensional data encountered in, for example, genomics, medical imaging, cohort studies, and biobanks. We show that four commonly used methods, principal component analysis, K-means clustering, nonnegative matrix factorization, and latent Dirichlet allocation, can be written as probabilistic models underpinned by a low-rank matrix factorization. In addition to highlighting their similarities, this formulation clarifies the various assumptions and restrictions of each approach, which makes it easier for applied medical researchers to identify the appropriate method for a specific application. We also touch upon the most important aspects of inference and model selection for the application of these methods to health data.
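As an illustration of the low-rank factorization view mentioned in this abstract, the sketch below treats K-means as an approximation X ≈ Z C, where Z is a one-hot assignment matrix and C holds the centroids. This is a simplified, non-probabilistic sketch and not the authors' formulation; the helper names and toy data are assumptions made for illustration.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(X, C):
    """Index of the nearest centroid for every row of X."""
    return [min(range(len(C)), key=lambda k: sqdist(x, C[k])) for x in X]


def kmeans(X, init_centroids, iters=20):
    """Plain Lloyd iterations starting from caller-supplied centroids."""
    C = [list(c) for c in init_centroids]  # copy so the input is not mutated
    for _ in range(iters):
        labels = assign(X, C)
        for k in range(len(C)):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                C[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign(X, C), C


def one_hot(labels, k):
    """Assignment matrix Z (n x k): one 1 per row marking the chosen cluster."""
    return [[1 if lab == j else 0 for j in range(k)] for lab in labels]


def matmul(A, B):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical toy data: two well-separated pairs of points.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels, C = kmeans(X, init_centroids=[[0.0, 0.0], [10.0, 0.0]])

Z = one_hot(labels, len(C))   # n x k factor
X_hat = matmul(Z, C)          # rank-k reconstruction of X

# The K-means objective equals the squared Frobenius error of X - Z C.
objective = sum(sqdist(x, C[lab]) for x, lab in zip(X, labels))
reconstruction_error = sum(sqdist(x, xh) for x, xh in zip(X, X_hat))
```

Seen this way, the constraint that each row of Z is one-hot is what separates K-means from factorizations such as PCA or nonnegative matrix factorization, which place different restrictions on the factors.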
Affiliation(s)
- Dorien Neijzen
- Department of Epidemiology, University of Groningen, University Medical Center Groningen, Groningen, the Netherlands
- Gerton Lunter
- Department of Epidemiology, University of Groningen, University Medical Center Groningen, Groningen, the Netherlands
- Weatherall Institute of Molecular Medicine, Oxford University, Oxford, UK
7
Köksal Z, Meyer OL, Andersen JD, Gusmão L, Mogensen HS, Pereira V, Børsting C. Pitfalls and challenges with population assignments of individuals from admixed populations: Applying Genogeographer on Brazilian individuals. Forensic Sci Int Genet 2023; 67:102934. [PMID: 37713981 DOI: 10.1016/j.fsigen.2023.102934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 06/06/2023] [Accepted: 09/06/2023] [Indexed: 09/17/2023]
Abstract
The assignment of individuals to a population can be of importance for the identification of mass disaster victims or criminal offenders in the field of forensic genetics. This assignment is based on biostatistical methods that process data of ancestry informative markers (AIMs), which are selected based on large allele frequency differences between the populations of interest. However, population assignments of individuals with an admixed genetic background are challenging. Admixed individuals are genetic mosaics of chromosomal segments from the parental populations, which may lead to ambiguous or no population assignment. This is problematic since admixture events are a substantial part of human history. In this study, we present challenges of interpreting the evidential weight of population assignments. We used Genogeographer for likelihood ratio (LR) calculations and Brazilians as examples of admixed individuals. Brazilians are a very heterogeneous population representing a three-way admixture between Native Americans, Europeans, and Africans. Ancestry informative markers were typed in a total of 589 individuals from Brazil using the Precision ID Ancestry Panel. The Brazilians were assigned to six metapopulations (East Asia, Europe, Middle East, North Africa, South-Central Asia, Sub-Saharan Africa) defined in the Genogeographer software and LRs were calculated if the AIM profile was not an outlier in all metapopulations and simulated two-way (1:1) admixtures of the six metapopulations. Population assignments failed for 55% of the samples. These samples had significantly higher genetic contributions from East Asia, South-Central Asia and Sub-Saharan Africa, and significantly lower genetic contributions from Europe. Most of the individuals with population assignments were assigned to the metapopulations of Middle East (58%) or North Africa (36%), followed by Europe (4%), South-Central Asia (1%), and Sub-Saharan Africa (1%). For 8% of the samples, population assignments were only possible when assignments to simulated two-way (1:1) admixtures of the six metapopulations were considered. Most of these individuals were assigned to two-way admixtures of North Africa, South-Central Asia, or Sub-Saharan Africa. Relatively low median likelihood ratios (LRs<1000) were observed when comparing population likelihoods for Europe, Middle East, North Africa, South-Central Asia, or simulated 1:1 admixtures of these metapopulations. Comparisons including East Asian or Sub-Saharan African populations resulted in larger median LRs (LR>10^10). The results suggested that the Precision ID Ancestry Panel provided too little information and that additional markers specifically selected for sub-continental differentiation may be required for accurate population assignment of admixed individuals. Furthermore, a Genogeographer database with additional populations including admixed populations would be advantageous for interpretation of admixed AIM profiles. It would likely increase the number of population assignments and illustrate alternatives to the most likely population, which would be valuable information for the case officer when writing the case report.
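The likelihood ratio idea in this abstract can be sketched as follows. This is a minimal illustration and not Genogeographer's implementation: it assumes biallelic markers, Hardy-Weinberg proportions within each population, independence between markers, and allele frequencies strictly between 0 and 1. The function names and the frequencies in the example are hypothetical.

```python
from math import log10


def genotype_prob(g, p):
    """Hardy-Weinberg probability of carrying g (0, 1 or 2) copies of an allele
    whose population frequency is p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[g]


def log10_likelihood(genotypes, freqs):
    """log10 likelihood of a marker profile, treating markers as independent.
    Frequencies must lie strictly between 0 and 1 to avoid log10(0)."""
    return sum(log10(genotype_prob(g, p)) for g, p in zip(genotypes, freqs))


def likelihood_ratio(genotypes, freqs_a, freqs_b):
    """LR comparing 'profile comes from population A' with 'from population B'."""
    return 10.0 ** (log10_likelihood(genotypes, freqs_a)
                    - log10_likelihood(genotypes, freqs_b))


# Hypothetical two-marker profile and made-up allele frequencies.
profile = [2, 0]
freqs_a = [0.9, 0.1]   # markers strongly differentiated in population A
freqs_b = [0.5, 0.5]   # uninformative frequencies in population B

lr = likelihood_ratio(profile, freqs_a, freqs_b)  # 0.6561 / 0.0625
```

Because the per-marker terms multiply, a panel whose allele frequencies differ only slightly between two populations yields LRs close to 1, which is consistent with the low LRs the abstract reports for comparisons among closely related metapopulations.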
Affiliation(s)
- Zehra Köksal
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Olivia Luxford Meyer
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Jeppe Dyrberg Andersen
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Leonor Gusmão
- DNA Diagnostic Laboratory (LDD), State University of Rio de Janeiro (UERJ), Rio de Janeiro, Brazil
- Helle Smidt Mogensen
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Vania Pereira
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Claus Børsting
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
8
Yuan R, Li J, Ma X, Feng Z, Xing R, Chen S, Gao Q. Investigation of phylogenetic relationships within Saxifraga diversifolia complex (Saxifragaceae) based on restriction-site associated DNA sequence markers. Ecol Evol 2023; 13:e10675. [PMID: 37928197 PMCID: PMC10620575 DOI: 10.1002/ece3.10675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Revised: 09/28/2023] [Accepted: 10/16/2023] [Indexed: 11/07/2023]
Abstract
Subsect. Hirculoideae Engl. & Irmsch., belonging to Saxifraga sect. Ciliatae Haw., has high species richness. It can be divided into S. diversifolia, S. pseudohirculus, and S. sinomontana complexes based on morphological characteristics. The species with prominent leaf veins on the posterior leaf edge were placed in the S. diversifolia complex, which is mainly distributed on the eastern and southern margins of the Qinghai-Tibetan Plateau. In this study, 53 samples, representing 15 of the 33 described species in the S. diversifolia complex, were sequenced using the Restriction-site Associated DNA Sequence (RAD-seq) technique. A total of 111,938 high-quality SNP loci were screened to investigate the phylogenetic relationships within the S. diversifolia complex. The result of the neighbor-joining (NJ) tree shows that the S. diversifolia complex is a paraphyletic group. Despite some inconsistencies revealed by genetic structure analysis, the clustering of representative species in both the NJ tree and principal component analysis supports previous biogeographic and morphological evidence. In addition, long-distance gene flow events for 11 taxa were detected in the S. diversifolia complex, from S. implicans 1 to S. implicans 2, S. diversifolia, and S. maxionggouensis, and from S. maxionggouensis to S. nigroglandulifera. These findings may improve our comprehension of the phylogeny, classification, and evolution of the S. diversifolia complex.
Affiliation(s)
- Rui Yuan
- Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology & Institute of Sanjiangyuan National Park, Chinese Academy of Sciences, Xining, China
- University of Chinese Academy of Sciences, Beijing, China
- Jiaxin Li
- Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology & Institute of Sanjiangyuan National Park, Chinese Academy of Sciences, Xining, China
- University of Chinese Academy of Sciences, Beijing, China
- Xiaolei Ma
- Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology & Institute of Sanjiangyuan National Park, Chinese Academy of Sciences, Xining, China
- University of Chinese Academy of Sciences, Beijing, China
- Zhilin Feng
- Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology & Institute of Sanjiangyuan National Park, Chinese Academy of Sciences, Xining, China
- University of Chinese Academy of Sciences, Beijing, China
- Rui Xing
- Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology & Institute of Sanjiangyuan National Park, Chinese Academy of Sciences, Xining, China
- Shilong Chen
- Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology & Institute of Sanjiangyuan National Park, Chinese Academy of Sciences, Xining, China
- Qingbo Gao
- Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology & Institute of Sanjiangyuan National Park, Chinese Academy of Sciences, Xining, China
- Qinghai Provincial Key Laboratory of Crop Molecular Breeding, Northwest Institute of Plateau Biology, Chinese Academy of Sciences, Xining, China
9
Moorjani P, Hellenthal G. Methods for Assessing Population Relationships and History Using Genomic Data. Annu Rev Genomics Hum Genet 2023; 24:305-332. [PMID: 37220313 PMCID: PMC11040641 DOI: 10.1146/annurev-genom-111422-025117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
Abstract
Genetic data contain a record of our evolutionary history. The availability of large-scale datasets of human populations from various geographic areas and timescales, coupled with advances in the computational methods to analyze these data, has transformed our ability to use genetic data to learn about our evolutionary past. Here, we review some of the widely used statistical methods to explore and characterize population relationships and history using genomic data. We describe the intuition behind commonly used approaches, their interpretation, and important limitations. For illustration, we apply some of these techniques to genome-wide autosomal data from 929 individuals representing 53 worldwide populations that are part of the Human Genome Diversity Project. Finally, we discuss the new frontiers in genomic methods to learn about population history. In sum, this review highlights the power (and limitations) of DNA to infer features of human evolutionary history, complementing the knowledge gleaned from other disciplines, such as archaeology, anthropology, and linguistics.
Affiliation(s)
- Priya Moorjani
- Department of Molecular and Cell Biology and Center for Computational Biology, University of California, Berkeley, California, USA
- Garrett Hellenthal
- UCL Genetics Institute and Research Department of Genetics, Evolution, and Environment, University College London, London, United Kingdom
10
Anderson-Trocmé L, Nelson D, Zabad S, Diaz-Papkovich A, Kryukov I, Baya N, Touvier M, Jeffery B, Dina C, Vézina H, Kelleher J, Gravel S. On the genes, genealogies, and geographies of Quebec. Science 2023; 380:849-855. [PMID: 37228217 DOI: 10.1126/science.add5300] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 07/04/2022] [Accepted: 04/24/2023] [Indexed: 05/27/2023]
Abstract
Population genetic models only provide coarse representations of real-world ancestry. We used a pedigree compiled from 4 million parish records and genotype data from 2276 French and 20,451 French Canadian individuals to finely model and trace French Canadian ancestry through space and time. The loss of ancestral French population structure and the appearance of spatial and regional structure highlight a wide range of population expansion models. Geographic features shaped migrations, and we find enrichments for migration, genetic, and genealogical relatedness patterns within river networks across regions of Quebec. Finally, we provide a freely accessible simulated whole-genome sequence dataset with spatiotemporal metadata for 1,426,749 individuals reflecting intricate French Canadian population structure. Such realistic population-scale simulations provide opportunities to investigate population genetics at an unprecedented resolution.
Affiliation(s)
- Luke Anderson-Trocmé
- Department of Human Genetics, McGill University, Montreal, QC, Canada
- McGill University Genome Centre, Montreal, QC, Canada
- Dominic Nelson
- Department of Human Genetics, McGill University, Montreal, QC, Canada
- McGill University Genome Centre, Montreal, QC, Canada
- Shadi Zabad
- School of Computer Science, McGill University, Montreal, QC, Canada
- Alex Diaz-Papkovich
- Department of Human Genetics, McGill University, Montreal, QC, Canada
- Quantitative Life Sciences, McGill University, Montreal, QC, Canada
- Ivan Kryukov
- Department of Human Genetics, McGill University, Montreal, QC, Canada
- McGill University Genome Centre, Montreal, QC, Canada
- Nikolas Baya
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
- Mathilde Touvier
- Sorbonne Paris Nord University, INSERM U1153, INRAE U1125, CNAM, Nutritional Epidemiology Research Team (EREN), Epidemiology and Statistics Research Center, University Paris Cité (CRESS), Bobigny, France
- Ben Jeffery
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
- Christian Dina
- Nantes Université, CNRS, INSERM, l'institut du thorax, Nantes, France
- Hélène Vézina
- BALSAC Project, Université du Québec à Chicoutimi, Chicoutimi, QC, Canada
- Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
- Simon Gravel
- Department of Human Genetics, McGill University, Montreal, QC, Canada
- McGill University Genome Centre, Montreal, QC, Canada
11
Yang J, Xu Y, Yao M, Wang G, Liu Z. ERStruct: a fast Python package for inferring the number of top principal components from whole genome sequencing data. BMC Bioinformatics 2023; 24:180. [PMID: 37131141 PMCID: PMC10155328 DOI: 10.1186/s12859-023-05305-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 04/25/2023] [Indexed: 05/04/2023]
Abstract
Background: Large-scale multi-ethnic DNA sequencing data is increasingly available owing to decreasing cost of modern sequencing technologies. Inference of the population structure with such sequencing data is fundamentally important. However, the ultra-dimensionality and complicated linkage disequilibrium patterns across the whole genome make it challenging to infer population structure using traditional principal component analysis based methods and software. Results: We present the ERStruct Python Package, which enables the inference of population structure using whole-genome sequencing data. By leveraging parallel computing and GPU acceleration, our package achieves significant improvements in the speed of matrix operations for large-scale data. Additionally, our package features adaptive data splitting capabilities to facilitate computation on GPUs with limited memory. Conclusion: Our Python package ERStruct is an efficient and user-friendly tool for estimating the number of top informative principal components that capture population structure from whole genome sequencing data.
Affiliation(s)
- Jinghan Yang
- Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Yuyang Xu
- Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Minhao Yao
- Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Gao Wang
- Department of Neurology, Gertrude H. Sergievsky Center, Columbia University, New York, NY, USA
- Zhonghua Liu
- Department of Biostatistics, Columbia University, New York, NY, USA
12
Zhang L, Yuan Y, Peng W, Tang B, Li MJ, Gui H, Wang Q, Li M. GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species. Genome Biol 2023; 24:76. [PMID: 37069653 PMCID: PMC10108510 DOI: 10.1186/s13059-023-02906-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2022] [Accepted: 03/22/2023] [Indexed: 04/19/2023]
Abstract
Whole-genome sequencing projects of millions of subjects contain enormous numbers of genotypes, entailing a huge memory burden and long computation times. Here, we present GBC, a toolkit for rapidly compressing large-scale genotypes into highly addressable byte-encoding blocks under an optimized parallel framework. We demonstrate that GBC is up to 1000 times faster than state-of-the-art methods to access and manage compressed large-scale genotypes while maintaining a competitive compression ratio. We also show that conventional analysis would be substantially sped up if built on GBC to access genotypes of a large population. GBC's data structure and algorithms are valuable for accelerating large-scale genomic research.
Affiliation(s)
- Liubin Zhang
- Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China
- Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China
- Center for Disease Genome Research, Sun Yat-Sen University, Guangzhou, China
- Yangyang Yuan
- Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China
- Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China
- Center for Disease Genome Research, Sun Yat-Sen University, Guangzhou, China
- School of Medical Technology and Information Engineering, Zhejiang Chinese Medical University, Hangzhou, China
- Wenjie Peng
- Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China
- Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China
- Center for Disease Genome Research, Sun Yat-Sen University, Guangzhou, China
- Bin Tang
- Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China
- Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China
- Center for Disease Genome Research, Sun Yat-Sen University, Guangzhou, China
- Mulin Jun Li
- The Province and Ministry Co-Sponsored Collaborative Innovation Center for Medical Epigenetics, Tianjin Medical University, Tianjin, China
- Hongsheng Gui
- Behavioral Health Services, Henry Ford Health, Detroit, MI, USA
- Center for Health Policy & Health Services Research, Henry Ford Health, Detroit, MI, USA
- Qiang Wang
- Mental Health Center, West China Hospital, Sichuan University, Chengdu, China
- Miaoxin Li
- Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China
- Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China
- Center for Disease Genome Research, Sun Yat-Sen University, Guangzhou, China
- Key Laboratory of Tropical Disease Control (SYSU), Ministry of Education, Guangzhou, 510080, China
- Guangdong Provincial Key Laboratory of Biomedical Imaging and Guangdong Provincial Engineering Research Center of Molecular Imaging, The Fifth Affiliated Hospital, Sun Yat-sen University, Zhuhai, China
| |
|
13
|
Rougemont Q, Xuereb A, Dallaire X, Moore JS, Normandeau E, Perreault-Payette A, Bougas B, Rondeau EB, Withler RE, Van Doornik DM, Crane PA, Naish KA, Garza JC, Beacham TD, Koop BF, Bernatchez L. Long-distance migration is a major factor driving local adaptation at continental scale in Coho salmon. Mol Ecol 2023; 32:542-559. [PMID: 35000273 DOI: 10.1111/mec.16339] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/26/2021] [Revised: 11/19/2021] [Accepted: 12/23/2021] [Indexed: 01/25/2023]
Abstract
Inferring the genomic basis of local adaptation is a long-standing goal of evolutionary biology. Beyond its fundamental evolutionary implications, such knowledge can guide conservation decisions for populations of conservation and management concern. Here, we investigated the genomic basis of local adaptation in the Coho salmon (Oncorhynchus kisutch) across its entire North American range. We hypothesized that extensive spatial variation in environmental conditions and the species' homing behaviour may promote the establishment of local adaptation. We genotyped 7829 individuals representing 217 sampling locations at more than 100,000 high-quality RADseq loci to investigate how recombination might affect the detection of loci putatively under selection and took advantage of the precise description of the demographic history of the species from our previous work to draw accurate population genomic inferences about local adaptation. The results indicated that genetic differentiation scans and genetic-environment association analyses were both significantly affected by variation in recombination rate as low recombination regions displayed an increased number of outliers. By taking these confounding factors into consideration, we revealed that migration distance was the primary selective factor driving local adaptation and partial parallel divergence among distant populations. Moreover, we identified several candidate single nucleotide polymorphisms associated with long-distance migration and altitude including a gene known to be involved in adaptation to altitude in other species. The evolutionary implications of our findings are discussed along with conservation applications.
Affiliation(s)
- Quentin Rougemont
- Département de Biologie, Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Québec, Canada; CEFE, Centre d'Ecologie Fonctionnelle et Evolutive, UMR 5175, CNRS, Univ Montpellier, CNRS, EPHE, IRD, Univ Paul Valéry Montpellier, Montpellier, France
| | - Amanda Xuereb
- Département de Biologie, Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Québec, Canada
| | - Xavier Dallaire
- Département de Biologie, Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Québec, Canada
| | - Jean-Sébastien Moore
- Département de Biologie, Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Québec, Canada
| | - Eric Normandeau
- Département de Biologie, Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Québec, Canada
| | - Alysse Perreault-Payette
- Département de Biologie, Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Québec, Canada
| | - Bérénice Bougas
- Département de Biologie, Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Québec, Canada
| | - Eric B Rondeau
- Department of Fisheries and Ocean, Pacific Biological Station, Nanaimo, British Columbia, Canada; Department of Biology, University of Victoria, Victoria, British Columbia, Canada
| | - Ruth E Withler
- Department of Fisheries and Ocean, Pacific Biological Station, Nanaimo, British Columbia, Canada
| | - Donald M Van Doornik
- National Oceanic and Atmospheric Administration, National Marine Fisheries Service, Northwest Fisheries Science Center, Manchester Research Station, Port Orchard, Washington, USA
| | - Penelope A Crane
- Conservation Genetics Laboratory, U.S. Fish and Wildlife Service, Anchorage, Alaska, USA
| | - Kerry A Naish
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, Washington, USA
| | - John Carlos Garza
- Department of Ocean Sciences and Institute of Marine Sciences, University of California Santa Cruz, Santa Cruz, California, USA
| | - Terry D Beacham
- Department of Fisheries and Ocean, Pacific Biological Station, Nanaimo, British Columbia, Canada
| | - Ben F Koop
- Department of Biology, University of Victoria, Victoria, British Columbia, Canada
| | - Louis Bernatchez
- Département de Biologie, Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, Québec, Canada
| |
|
14
|
Baxter‐Koenigs AR, El Nesr G, Barrick D. Singular value decomposition of protein sequences as a method to visualize sequence and residue space. Protein Sci 2022; 31:e4422. [PMID: 36173173 PMCID: PMC9514065 DOI: 10.1002/pro.4422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Revised: 07/05/2022] [Accepted: 08/06/2022] [Indexed: 11/08/2022]
Abstract
Singular value decomposition (SVD) of multiple sequence alignments (MSAs) is an important and rigorous method to identify subgroups of sequences within the MSA, and to extract consensus and covariance sequence features that define the alignment and distinguish the subgroups. This information can be correlated to structure, function, stability, and taxonomy. However, the mathematics of SVD is unfamiliar to many in the field of protein science. Here, we attempt to present an intuitive yet comprehensive description of SVD analysis of MSAs. We begin by describing the underlying mathematics of SVD in a way that is both rigorous and accessible. Next, we use SVD to analyze sequences generated with a simplified model in which the extent of sequence conservation and covariance between different positions is controlled, to show how conservation and covariance produce features in the decomposed coordinate system. We then use SVD to analyze alignments of two protein families, the homeodomain and the Ras superfamilies. Both families show clear evidence of sequence clustering when projected into singular value space. We use k-means clustering to group MSA sequences into specific clusters, show how the residues that distinguish these clusters can be identified, and show how these clusters can be related to taxonomy and function. We end by providing a description of a set of Python scripts that can be used for SVD analysis of MSAs, displaying results, and identifying and analyzing sequence clusters. These scripts are freely available on GitHub.
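As a rough illustration of the idea summarized in this abstract, the sketch below one-hot encodes a toy alignment and applies NumPy's SVD to it. It is not the authors' published scripts: the alphabet, the `one_hot_encode` helper, the toy sequences, and the choice to leave the matrix uncentered are all assumptions made for the example.

```python
import numpy as np

# Assumed alphabet: 20 standard residues plus a gap character.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot_encode(msa):
    """Return an (n_sequences, alignment_length * alphabet_size) binary matrix."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    n_seqs, length = len(msa), len(msa[0])
    encoded = np.zeros((n_seqs, length * len(AMINO_ACIDS)))
    for row, seq in enumerate(msa):
        for pos, aa in enumerate(seq):
            encoded[row, pos * len(AMINO_ACIDS) + index[aa]] = 1.0
    return encoded


# Hypothetical toy alignment with two obvious subgroups of three sequences each.
toy_msa = ["ACDA", "ACDC", "ACEA", "GHKA", "GHKC", "GHRA"]

X = one_hot_encode(toy_msa)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Coordinates of each sequence along the singular axes.
coords = U * s
```

In this uncentered toy case the first singular axis reflects what all sequences share, and the second axis separates the two subgroups by sign; a clustering step such as k-means could then be run on the leading columns of `coords`.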
Affiliation(s)
- Autum R. Baxter‐Koenigs
- T.C. Jenkins Department of Biophysics, Johns Hopkins University, Baltimore, Maryland, USA
- Department of Genetics, Harvard Medical School, New Research Building 0356, 77 Avenue Louis Pasteur, Boston, Massachusetts 02115, USA
| | - Gina El Nesr
- T.C. Jenkins Department of Biophysics, Johns Hopkins University, Baltimore, Maryland, USA
- Program in Biophysics, Stanford University, Stanford, California 94305, USA
| | - Doug Barrick
- T.C. Jenkins Department of Biophysics, Johns Hopkins University, Baltimore, Maryland, USA
| |
|
15
|
Valikhova LV, Kharkov VN, Zarubin AA, Kolesnikov NA, Svarovskaya MG, Khitrinskaya IY, Shtygasheva OV, Volkov VG, Stepanov VA. Genetic Interrelation of the Chulym Turks with Khakass and Kets according to Autosomal SNP Data and Y-Chromosome Haplogroups. RUSS J GENET+ 2022. [DOI: 10.1134/s1022795422100118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
|
16
|
Estavoyer M, François O. Theoretical analysis of principal components in an umbrella model of intraspecific evolution. Theor Popul Biol 2022; 148:11-21. [PMID: 36122755 DOI: 10.1016/j.tpb.2022.08.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/02/2021] [Revised: 08/23/2022] [Accepted: 08/23/2022] [Indexed: 10/14/2022]
Abstract
Principal component analysis (PCA) is one of the most frequently used approaches to describe population structure from multilocus genotype data. Regarding geographic range expansions of modern humans, interpretations of PCA have, however, been questioned, as there is uncertainty about the wave-like patterns that have been observed in principal components. It has indeed been argued that wave-like patterns are mathematical artifacts that arise generally when PCA is applied to data in which genetic differentiation increases with geographic distance. Here, we present an alternative theory for the observation of wave-like patterns in PCA. We study a coalescent model, the umbrella model, for the diffusion of genetic variants. The model is based on genetic drift without any particular geographical structure. In the umbrella model, splits from an ancestral population occur almost continuously in time, giving birth to small daughter populations at a regular pace. Our results provide detailed mathematical descriptions of eigenvalues and eigenvectors for the PCA of sampled genomic sequences under the model. When variants uniquely represented in the sample are removed, the PCA eigenvectors are defined as cosine functions of increasing periodicity, reproducing wave-like patterns observed in equilibrium isolation-by-distance models. Including singleton variants in the analysis, the eigenvectors corresponding to the largest eigenvalues exhibit complex wave shapes. The accuracy of our predictions is further investigated with coalescent simulations. Our analysis supports the hypothesis that highly structured wave-like patterns could arise from genetic drift only, and may not always be artificial outcomes of spatially structured data. Genomic data related to the peopling of the Americas are reanalyzed in the light of our new theory.
Affiliation(s)
- Maxime Estavoyer
- Université Grenoble-Alpes, Centre National de la Recherche Scientifique, Grenoble INP, TIMC UMR 5525, 38000 Grenoble, France
| | - Olivier François
- Université Grenoble-Alpes, Centre National de la Recherche Scientifique, Grenoble INP, TIMC UMR 5525, 38000 Grenoble, France; Inria Grenoble - Rhône-Alpes Inovallée, 655 Avenue de l'Europe - CS 90051 38334 Montbonnot, France.
| |
|
17
|
Elhaik E. Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated. Sci Rep 2022; 12:14683. [PMID: 36038559 PMCID: PMC9424212 DOI: 10.1038/s41598-022-14395-4] [Citation(s) in RCA: 54] [Impact Index Per Article: 18.0] [Received: 01/14/2022] [Accepted: 06/06/2022] [Indexed: 12/29/2022] Open
Abstract
Principal Component Analysis (PCA) is a multivariate analysis that reduces the complexity of datasets while preserving data covariance. The outcome can be visualized on colorful scatterplots, ideally with only a minimal loss of information. PCA applications, implemented in well-cited packages like EIGENSOFT and PLINK, are extensively used as the foremost analyses in population genetics and related fields (e.g., animal and plant or medical genetics). PCA outcomes are used to shape study design, identify, and characterize individuals and populations, and draw historical and ethnobiological conclusions on origins, evolution, dispersion, and relatedness. The replicability crisis in science has prompted us to evaluate whether PCA results are reliable, robust, and replicable. We analyzed twelve common test cases using an intuitive color-based model alongside human population data. We demonstrate that PCA results can be artifacts of the data and can be easily manipulated to generate desired outcomes. PCA adjustment also yielded unfavorable outcomes in association studies. PCA results may not be reliable, robust, or replicable as the field assumes. Our findings raise concerns about the validity of results reported in the population genetics literature and related fields that place a disproportionate reliance upon PCA outcomes and the insights derived from them. We conclude that PCA may have a biasing role in genetic investigations and that 32,000-216,000 genetic studies should be reevaluated. An alternative mixed-admixture population genetic model is discussed.
Affiliation(s)
- Eran Elhaik
- Department of Biology, Lund University, 22362, Lund, Sweden.
| |
|
18
|
Blood Lines of the British People. Blood 2022. [DOI: 10.1017/9781009205528.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022] Open
|
19
|
Herzig AF, Ciullo M, Leutenegger AL, Perdry H. Moment estimators of relatedness from low-depth whole-genome sequencing data. BMC Bioinformatics 2022; 23:254. [PMID: 35751014 PMCID: PMC9233360 DOI: 10.1186/s12859-022-04795-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2021] [Accepted: 06/09/2022] [Indexed: 11/29/2022] Open
Abstract
Background Estimating relatedness is an important step for many genetic study designs. A variety of methods for estimating coefficients of pairwise relatedness from genotype data have been proposed. Both the kinship coefficient φ and the fraternity coefficient ψ for all pairs of individuals are of interest. However, when dealing with low-depth sequencing or imputation data, individual level genotypes cannot be confidently called. To ignore such uncertainty is known to result in biased estimates. Accordingly, methods have recently been developed to estimate kinship from uncertain genotypes. Results We present new method-of-moment estimators of both the coefficients φ and ψ calculated directly from genotype likelihoods. We have simulated low-depth genetic data for a sample of individuals with extensive relatedness by using the complex pedigree of the known genetic isolates of Cilento in South Italy. Through this simulation, we explore the behaviour of our estimators, demonstrate their properties, and show advantages over alternative methods. A demonstration of our method is given for a sample of 150 French individuals with down-sampled sequencing data. Conclusions We find that our method can provide accurate relatedness estimates whilst holding advantages over existing methods in terms of robustness, independence from external software, and required computation time. The method presented in this paper is referred to as LowKi (Low-depth Kinship) and has been made available in an R package (https://github.com/genostats/LowKi). Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04795-8.
Affiliation(s)
| | - M Ciullo
- Institute of Genetics and Biophysics A. Buzzati-Traverso - CNR, Naples, Italy; IRCCS Neuromed, Pozzilli, Isernia, Italy
| | - A-L Leutenegger
- Inserm, Université Paris Cité, UMR 1141, NeuroDiderot, 75019, Paris, France
| | - H Perdry
- CESP Inserm U1018, Université Paris-Saclay, UVSQ, Villejuif, France
| |
|
20
|
Cong PK, Bai WY, Li JC, Yang MY, Khederzadeh S, Gai SR, Li N, Liu YH, Yu SH, Zhao WW, Liu JQ, Sun Y, Zhu XW, Zhao PP, Xia JW, Guan PL, Qian Y, Tao JG, Xu L, Tian G, Wang PY, Xie SY, Qiu MC, Liu KQ, Tang BS, Zheng HF. Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat Commun 2022; 13:2939. [PMID: 35618720 PMCID: PMC9135724 DOI: 10.1038/s41467-022-30526-x] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 08/15/2021] [Accepted: 05/05/2022] [Indexed: 01/04/2023] Open
Abstract
We initiate the Westlake BioBank for Chinese (WBBC) pilot project with 4,535 whole-genome sequencing (WGS) individuals and 5,841 high-density genotyping individuals, and identify 81.5 million SNPs and INDELs, of which 38.5% are absent in dbSNP Build 151. We provide a population-specific reference panel and an online imputation server ( https://wbbc.westlake.edu.cn/ ) which could yield substantial improvement of imputation performance in Chinese population, especially for low-frequency and rare variants. By analyzing the singleton density of the WGS data, we find selection signatures in SNX29, DNAH1 and WDR1 genes, and the derived alleles of the alcohol metabolism genes (ADH1A and ADH1B) emerge around 7,000 years ago and tend to be more common from 4,000 years ago in East Asia. Genetic evidence supports the corresponding geographical boundaries of the Qinling-Huaihe Line and Nanling Mountains, which separate the Han Chinese into subgroups, and we reveal that North Han was more homogeneous than South Han.
Affiliation(s)
- Pei-Kuan Cong
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Wei-Yang Bai
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Jin-Chen Li
- Department of Neurology, Xiangya Hospital, Central South University, Changsha, Hunan, China
- National Clinical Research Center for Geriatric Disorders, Department of Geriatrics, Xiangya Hospital, Central South University, Changsha, Hunan, China
- Center for Medical Genetics & Hunan Key Laboratory, School of Life Sciences, Central South University, Changsha, Hunan, China
| | - Meng-Yuan Yang
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Saber Khederzadeh
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Si-Rui Gai
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Nan Li
- The High-Performance Computing Center, Westlake University, Hangzhou, Zhejiang, China
| | - Yu-Heng Liu
- The High-Performance Computing Center, Westlake University, Hangzhou, Zhejiang, China
| | - Shi-Hui Yu
- Clinical Genome Center, KingMed Diagnostics, Co., Ltd., Guangzhou, Guangdong, China
| | - Wei-Wei Zhao
- Clinical Genome Center, KingMed Diagnostics, Co., Ltd., Guangzhou, Guangdong, China
| | - Jun-Quan Liu
- Clinical Genome Center, KingMed Diagnostics, Co., Ltd., Guangzhou, Guangdong, China
| | - Yi Sun
- Clinical Genome Center, KingMed Diagnostics, Co., Ltd., Guangzhou, Guangdong, China
| | - Xiao-Wei Zhu
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Pian-Pian Zhao
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Jiang-Wei Xia
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Peng-Lin Guan
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Yu Qian
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Jian-Guo Tao
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
| | - Lin Xu
- WBBC Shandong Center, Binzhou Medical University, Yantai, Shandong, China
| | - Geng Tian
- WBBC Shandong Center, Binzhou Medical University, Yantai, Shandong, China
| | - Ping-Yu Wang
- WBBC Shandong Center, Binzhou Medical University, Yantai, Shandong, China
| | - Shu-Yang Xie
- WBBC Shandong Center, Binzhou Medical University, Yantai, Shandong, China
| | - Mo-Chang Qiu
- WBBC Jiangxi Center, Jiangxi Medical College, Shangrao, Jiangxi, China
| | - Ke-Qi Liu
- WBBC Jiangxi Center, Jiangxi Medical College, Shangrao, Jiangxi, China
| | - Bei-Sha Tang
- Department of Neurology, Xiangya Hospital, Central South University, Changsha, Hunan, China.
- National Clinical Research Center for Geriatric Disorders, Department of Geriatrics, Xiangya Hospital, Central South University, Changsha, Hunan, China.
| | - Hou-Feng Zheng
- Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China.
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China.
- Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China.
| |
|
21
|
Tvedebrink T. Review of the Forensic Applicability of Biostatistical Methods for Inferring Ancestry from Autosomal Genetic Markers. Genes (Basel) 2022; 13:genes13010141. [PMID: 35052480 PMCID: PMC8774801 DOI: 10.3390/genes13010141] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/14/2021] [Revised: 01/10/2022] [Accepted: 01/11/2022] [Indexed: 02/01/2023] Open
Abstract
The inference of ancestry has become a part of the services many forensic genetic laboratories provide. Interest in ancestry may be to provide investigative leads or identify the region of origin in cases of unidentified missing persons. There exist many biostatistical methods developed for the study of population structure in the area of population genetics. However, the challenges and questions are slightly different in the context of forensic genetics, where the origin of a specific sample is of interest compared to the understanding of population histories and genealogies. In this paper, the methodologies for modelling population admixture and inferring ancestral populations are reviewed with a focus on their strengths and weaknesses in relation to ancestry inference in the forensic context.
Affiliation(s)
- Torben Tvedebrink
- Department of Mathematical Sciences, Aalborg University, DK-9220 Aalborg, Denmark
- Section of Forensic Genetics, Department of Forensic Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, DK-1165 Copenhagen, Denmark
| |
|
22
|
Malle S. Population Structure and Relatedness for Genome-Wide Association Studies. Methods Mol Biol 2022; 2481:185-196. [PMID: 35641766 DOI: 10.1007/978-1-0716-2237-7_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
The estimation of the population structure and genetic relatedness between individuals within a collection of accessions is important in the formation of core collections for the conservation of genetic resources, uncovering the demographic history of the population under study, as well as for association studies. With the recent development of high-throughput genotyping technologies, several algorithms and methods have been developed and implemented in software to estimate the extent of genetic diversity between individuals. In this chapter, our objective is to describe methods to capture population structure and relatedness in a step-by-step fashion. To exemplify the process, two pruned datasets (14K and 243K SNP markers) were used to investigate population structure and relatedness among a soybean GWAS panel using different approaches and methods.
Affiliation(s)
- Sidiki Malle
- Assistant professor at Institut Polytechnique Rural de Formation et de Recherche Appliquée (IPR/IFRA) de Katibougou, Koulikoro, Mali.
| |
|
23
|
Brinster R, Scherer D, Lorenzo Bermejo J. Optimal selection of genetic variants for adjustment of population stratification in European association studies. Brief Bioinform 2021; 21:753-761. [PMID: 30863848 DOI: 10.1093/bib/bbz023] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/11/2018] [Revised: 01/24/2019] [Accepted: 02/10/2019] [Indexed: 01/14/2023] Open
Abstract
Population stratification is usually corrected relying on principal component analysis (PCA) of genome-wide genotype data, even in populations considered genetically homogeneous, such as Europeans. The need to genotype only a small number of genetic variants that show large differences in allele frequency among subpopulations (so-called ancestry-informative markers, AIMs), instead of the whole genome, for stratification adjustment could represent an advantage for replication studies and candidate gene/pathway studies. Here we compare the correction performance of classical and robust principal components (PCs) with the use of AIMs selected according to four different methods: the informativeness for assignment measure ($IN$-AIMs), the combination of PCA and F-statistics, PCA-correlated measurement and the PCA weighted loadings for each genetic variant. We used real genotype data from the Population Reference Sample and The Cancer Genome Atlas to simulate European genetic association studies and to quantify type I error rate and statistical power in different case-control settings. In studies with the same numbers of cases and controls per country and control-to-case ratios reflecting actual rates of disease prevalence, no adjustment for population stratification was required. The unnecessary inclusion of the country of origin, PCs or AIMs as covariates in the regression models translated into increasing type I error rates. In studies with cases and controls from separate countries, no investigated method was able to adequately correct for population stratification. The first classical and the first two robust PCs achieved the lowest (although inflated) type I error, followed at some distance by the first eight $IN$-AIMs.
Affiliation(s)
- Regina Brinster
- Institute of Medical Biometry and Informatics, University of Heidelberg, Im Neuenheimer Feld 130.3, Heidelberg, Germany
| | - Dominique Scherer
- Institute of Medical Biometry and Informatics, University of Heidelberg, Im Neuenheimer Feld 130.3, Heidelberg, Germany
| | - Justo Lorenzo Bermejo
- Institute of Medical Biometry and Informatics, University of Heidelberg, Im Neuenheimer Feld 130.3, Heidelberg, Germany
| |
|
24
|
Calò CM, Vona G, Robledo R, Francalacci P. From old markers to next generation: reconstructing the history of the peopling of Sardinia. Ann Hum Biol 2021; 48:203-212. [PMID: 34459339 DOI: 10.1080/03014460.2021.1944312] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/25/2022]
Abstract
CONTEXT For many years the Sardinian population has been the object of numerous studies because of its unique genetic structure. Despite the extreme abundance of papers, various aspects of the peopling and genetic structure of Sardinia still remain uncertain and sometimes controversial. OBJECTIVE We reviewed what has emerged from different studies, focussing on some still open questions, such as the origin of Sardinians, their relationship with the Corsican population, and the intra-regional genetic heterogeneity. METHODS The various issues have been addressed through the analysis of classical markers, molecular markers and, finally, genomic data through next generation sequencing. RESULTS AND CONCLUSIONS Although the most ancient human remains date back to the end of the Palaeolithic, Mesolithic populations brought founding lineages that left evident traces in the modern population. Then, with the Neolithic, the island underwent an important demographic expansion. Subsequently, isolation and genetic drift contributed to maintaining a significant genetic heterogeneity, while preserving the overall homogeneity on a regional scale. At the same time, isolation and genetic drift contributed to differentiating Sardinia from Corsica, which saw an important gene flow from the mainland. However, the isolation did not prevent gene flow from the neighbouring populations, whose contributions are still recognisable in the genome of Sardinians.
Affiliation(s)
- Carla Maria Calò
- Department of Life and Environmental Sciences, University of Cagliari, Cagliari, Italy
| | - Giuseppe Vona
- Department of Life and Environmental Sciences, University of Cagliari, Cagliari, Italy
| | - Renato Robledo
- Department of Biomedical Sciences, University of Cagliari, Cagliari, Italy
| | - Paolo Francalacci
- Department of Life and Environmental Sciences, University of Cagliari, Cagliari, Italy
|
25
|
A spectral theory for Wright's inbreeding coefficients and related quantities. PLoS Genet 2021; 17:e1009665. [PMID: 34280184 PMCID: PMC8320931 DOI: 10.1371/journal.pgen.1009665] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/09/2020] [Revised: 07/29/2021] [Accepted: 06/13/2021] [Indexed: 12/20/2022] Open
Abstract
Wright’s inbreeding coefficient, FST, is a fundamental measure in population genetics. Assuming a predefined population subdivision, this statistic is classically used to evaluate population structure at a given genomic locus. With large numbers of loci, unsupervised approaches such as principal component analysis (PCA) have, however, become prominent in recent analyses of population structure. In this study, we describe the relationships between Wright’s inbreeding coefficients and PCA for a model of K discrete populations. Our theory provides an equivalent definition of FST based on the decomposition of the genotype matrix into between and within-population matrices. The average value of Wright’s FST over all loci included in the genotype matrix can be obtained from the PCA of the between-population matrix. Assuming that a separation condition is fulfilled and for reasonably large data sets, this value of FST approximates the proportion of genetic variation explained by the first (K − 1) principal components accurately. The new definition of FST is useful for computing inbreeding coefficients from surrogate genotypes, for example, obtained after correction of experimental artifacts or after removing adaptive genetic variation associated with environmental variables. The relationships between inbreeding coefficients and the spectrum of the genotype matrix not only allow interpretations of PCA results in terms of population genetic concepts but extend those concepts to population genetic analyses accounting for temporal, geographical and environmental contexts. Principal component analysis (PCA) is the most-frequently used approach to describe population genetic structure from large population genomic data sets. In this study, we show that PCA not only estimates ancestries of sampled individuals, but also computes the average value of Wright’s inbreeding coefficient over the loci included in the genotype matrix. 
Our result shows that inbreeding coefficients and PCA eigenvalues provide equivalent descriptions of population structure. As a consequence, PCA extends the definition of those coefficients beyond the framework of allelic frequencies. We give examples on how FST can be computed from ancient DNA samples for which genotypes are corrected for coverage, and in an ecological genomic example where a proportion of genetic variation is explained by environmental variables.
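The between/within decomposition described in this abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: haploid 0/1 genotypes, known population labels, and a ratio-of-sums average of FST over loci. The function names `fst_from_between_pca` and `fst_from_frequencies` are hypothetical. Under these assumptions, the sum of squared singular values of the between-population matrix, divided by the total sum of squares, coincides with the allele-frequency form of FST, and the between-population matrix has at most K − 1 nonzero singular values.

```python
import numpy as np


def fst_from_between_pca(G, labels):
    """Variance-ratio FST from the spectrum of the between-population matrix.

    G is an (individuals x loci) array of haploid 0/1 genotypes (an assumption
    of this sketch); labels gives each individual's population.
    """
    G = np.asarray(G, dtype=float)
    labels = np.asarray(labels)
    Z = G - G.mean(axis=0)              # center each locus
    ZB = np.zeros_like(Z)               # between-population matrix
    for pop in np.unique(labels):
        idx = labels == pop
        ZB[idx] = Z[idx].mean(axis=0)   # replace each row by its population mean
    sv = np.linalg.svd(ZB, compute_uv=False)
    return (sv ** 2).sum() / (Z ** 2).sum(), sv


def fst_from_frequencies(G, labels):
    """Same quantity from allele frequencies: sum of between-population
    frequency variances over sum of p(1 - p), both summed across loci."""
    G = np.asarray(G, dtype=float)
    labels = np.asarray(labels)
    p_bar = G.mean(axis=0)
    between = np.zeros_like(p_bar)
    for pop in np.unique(labels):
        idx = labels == pop
        # idx.mean() is the population's share of the sample, n_k / n
        between += idx.mean() * (G[idx].mean(axis=0) - p_bar) ** 2
    return between.sum() / (p_bar * (1.0 - p_bar)).sum()
```

For diploid 0/1/2 genotypes the total variance also depends on within-individual correlation, so the two functions would no longer agree exactly; that case is outside this sketch.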
|
26
|
Abdel Moniem H, Yusuf MS, Chen G. Ecology and population structure of some indigenous geese breeds and the impact of four GH and Pit-1 SNPs on their body weights. Environmental Science and Pollution Research International 2021; 28:37603-37615. [PMID: 33715132 DOI: 10.1007/s11356-021-13402-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/14/2020] [Accepted: 03/08/2021] [Indexed: 06/12/2023]
Abstract
This study aims to determine the genetic correlation using nine microsatellite markers to reconstruct the history of some indigenous geese populations, along with the use of four single-nucleotide polymorphisms (SNPs) to investigate their correlation with the geese body weight. Microsatellite markers are mainly used to provide updated information on changes in the population structure of geese breeds. The eight goose populations showed 24% private alleles specific for each population. Expected heterozygosity (He) ranged from 0.46 to 0.70. Three breeds were reported to be highly polymorphic. The inbreeding coefficient (Fis) revealed that three breeds were at a minimal level of extinction danger, while one breed was potentially endangered. A phylogenetic tree, principal component analysis (PCA), and self-organizing map (SOM) were constructed using MATLAB to study the population distribution and relationship among these breeds. Four SNPs were detected, two SNPs at GH gene exon (C123T and C158T), and two SNPs at Pit-1 gene exons (G161A and T282G). Four SNP loci were reported to have a significant effect on geese body weight. They were CT genotype for C123T locus, TT genotype for C158T locus, GG genotype for G161A locus, and GG genotype for T282G locus.
Affiliation(s)
- Hebatallah Abdel Moniem
- College of Animal Science and Technology, Yangzhou University, Yangzhou, 225009, China
- Department of Animal Wealth Development, Faculty of Veterinary Medicine, Suez Canal University, Ismailia, 41522, Egypt
| | - Mohamed Sayed Yusuf
- Department of Nutrition and Clinical Nutrition, Faculty of Veterinary Medicine, Suez Canal University, Ismailia, 41522, Egypt
| | - Guohong Chen
- College of Animal Science and Technology, Yangzhou University, Yangzhou, 225009, China
|
27
|
Belbin GM, Cullina S, Wenric S, Soper ER, Glicksberg BS, Torre D, Moscati A, Wojcik GL, Shemirani R, Beckmann ND, Cohain A, Sorokin EP, Park DS, Ambite JL, Ellis S, Auton A, Bottinger EP, Cho JH, Loos RJF, Abul-Husn NS, Zaitlen NA, Gignoux CR, Kenny EE. Toward a fine-scale population health monitoring system. Cell 2021; 184:2068-2083.e11. [PMID: 33861964 DOI: 10.1016/j.cell.2021.03.034] [Citation(s) in RCA: 83] [Impact Index Per Article: 20.8] [Received: 09/23/2019] [Revised: 11/18/2020] [Accepted: 03/12/2021] [Indexed: 12/22/2022]
Abstract
Understanding population health disparities is an essential component of equitable precision health efforts. Epidemiology research often relies on definitions of race and ethnicity, but these population labels may not adequately capture disease burdens and environmental factors impacting specific sub-populations. Here, we propose a framework for repurposing data from electronic health records (EHRs) in concert with genomic data to explore the demographic ties that can impact disease burdens. Using data from a diverse biobank in New York City, we identified 17 communities sharing recent genetic ancestry. We observed 1,177 health outcomes that were statistically associated with a specific group and demonstrated significant differences in the segregation of genetic variants contributing to Mendelian diseases. We also demonstrated that fine-scale population structure can impact the prediction of complex disease risk within groups. This work reinforces the utility of linking genomic data to EHRs and provides a framework toward fine-scale monitoring of population health.
Affiliation(s)
- Gillian M Belbin
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Sinead Cullina
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Stephane Wenric
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Emily R Soper
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Benjamin S Glicksberg
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Denis Torre
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Arden Moscati
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Genevieve L Wojcik
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| | - Ruhollah Shemirani
- Information Science Institute, University of Southern California, Marina del Rey, CA 90089, USA
| | - Noam D Beckmann
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Ariella Cohain
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Elena P Sorokin
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| | - Danny S Park
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
| | - Jose-Luis Ambite
- Information Science Institute, University of Southern California, Marina del Rey, CA 90089, USA
| | - Steve Ellis
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Adam Auton
- Department of Genetics, Albert Einstein College of Medicine, New York, NY 10461, USA
| | - Erwin P Bottinger
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Judy H Cho
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Ruth J F Loos
- Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Noura S Abul-Husn
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Noah A Zaitlen
- Department of Neurology, University of California, Los Angeles, Los Angeles, CA 90033, USA
| | - Christopher R Gignoux
- Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
|
28
|
Battey CJ, Coffing GC, Kern AD. Visualizing population structure with variational autoencoders. G3 (Bethesda, Md.) 2021; 11:jkaa036. [PMID: 33561250 PMCID: PMC8022710 DOI: 10.1093/g3journal/jkaa036] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 08/12/2020] [Accepted: 12/15/2020] [Indexed: 11/13/2022]
Abstract
Dimensionality reduction is a common tool for visualization and inference of population structure from genotypes, but popular methods either return too many dimensions for easy plotting (PCA) or fail to preserve global geometry (t-SNE and UMAP). Here we explore the utility of variational autoencoders (VAEs), generative machine learning models in which a pair of neural networks seek to first compress and then recreate the input data, for visualizing population genetic variation. VAEs incorporate nonlinear relationships, allow users to define the dimensionality of the latent space, and in our tests preserve global geometry better than t-SNE and UMAP. Our implementation, which we call popvae, is available as a command-line Python program at github.com/kr-colab/popvae. The approach yields latent embeddings that capture subtle aspects of population structure in humans and Anopheles mosquitoes, and can generate artificial genotypes characteristic of a given sample or population.
Affiliation(s)
- C J Battey
- Department of Biology, University of Oregon Institute of Ecology and Evolution, Eugene, Oregon, 97403
| | - Gabrielle C Coffing
- Department of Biology, University of Oregon Institute of Ecology and Evolution, Eugene, Oregon, 97403
| | - Andrew D Kern
- Department of Biology, University of Oregon Institute of Ecology and Evolution, Eugene, Oregon, 97403
|
29
|
Abstract
Population structure is a commonplace feature of genetic variation data, and it has importance in numerous application areas, including evolutionary genetics, conservation genetics, and human genetics. Understanding the structure in a sample is necessary before more sophisticated analyses are undertaken. Here we provide a protocol for running principal component analysis (PCA) and admixture proportion inference, two of the most commonly used approaches for describing population structure. Through hands-on examples using the CEPH-Human Genome Diversity Panel, together with pragmatic caveats, readers will learn to analyze and visualize population structure in their own data.
|
30
|
Santos P, Gonzàlez-Fortes G, Trucchi E, Ceolin A, Cordoni G, Guardiano C, Longobardi G, Barbujani G. More Rule than Exception: Parallel Evidence of Ancient Migrations in Grammars and Genomes of Finno-Ugric Speakers. Genes (Basel) 2020; 11:E1491. [PMID: 33322364 PMCID: PMC7763979 DOI: 10.3390/genes11121491] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/30/2020] [Revised: 11/25/2020] [Accepted: 12/09/2020] [Indexed: 11/27/2022] Open
Abstract
To reconstruct aspects of human demographic history, linguistics and genetics complement each other, reciprocally suggesting testable hypotheses on population relationships and interactions. Relying on a linguistic comparative method based on syntactic data, here we focus on the non-straightforward relation of genes and languages among Finno-Ugric (FU) speakers, in comparison to their Indo-European (IE) and Altaic (AL) neighbors. Syntactic analysis, in agreement with the indications of more traditional linguistic levels, supports at least three distinct clusters, corresponding to these three Eurasian families; yet, the outliers of the FU group show linguistic convergence with their geographical neighbors. By analyzing genome-wide data in both ancient and contemporary populations, we uncovered remarkably matching patterns, with north-western FU speakers linguistically and genetically closer in parallel degrees to their IE-speaking neighbors, and eastern FU speakers to AL speakers. Therefore, our analysis indicates that plausible cross-family linguistic interference effects were accompanied, and possibly caused, by recognizable demographic processes. In particular, based on the comparison of modern and ancient genomes, our study identified the Pontic-Caspian steppes as the possible origin of the demographic processes that led to the expansion of FU languages into Europe.
Affiliation(s)
- Patrícia Santos
- CNRS, UMR 5199—PACEA, Université de Bordeaux, Bâtiment B8, Allée Geoffroy Saint Hilaire, 33615 Pessac, France;
- Dipartimento di Scienze della Vita e Biotecnologie, Università di Ferrara, 44121 Ferrara, Italy
| | - Gloria Gonzàlez-Fortes
- Dipartimento di Scienze della Vita e Biotecnologie, Università di Ferrara, 44121 Ferrara, Italy
| | - Emiliano Trucchi
- Department of Life and Environmental Sciences, Marche Polytechnic University, 60131 Ancona, Italy
| | - Andrea Ceolin
- Dipartimento di Comunicazione ed Economia, Università di Modena e Reggio Emilia, 42121 Reggio Emilia, Italy
| | - Guido Cordoni
- School of Veterinary Medicine, University of Surrey, Guildford GU2 7AL, UK
| | - Cristina Guardiano
- Dipartimento di Comunicazione ed Economia, Università di Modena e Reggio Emilia, 42121 Reggio Emilia, Italy
| | - Giuseppe Longobardi
- Department of Language and Linguistic Science, University of York, York YO10 5DD, UK
| | - Guido Barbujani
- Dipartimento di Scienze della Vita e Biotecnologie, Università di Ferrara, 44121 Ferrara, Italy
|
31
|
Abstract
Understanding the influence of genetics on human disease is among the primary goals for biology and medicine. To this end, the direct study of natural human genetic variation has provided valuable insights into human physiology and disease as well as into the origins and migrations of humans. In this review, we discuss the foundations of population genetics, which provide a crucial context to the study of human genes and traits. In particular, genome-wide association studies and similar methods have revealed thousands of genetic loci associated with diseases and traits, providing invaluable information into the biology of these traits. Simultaneously, as the study of rare genetic variation has expanded, so-called human knockouts have elucidated the function of human genes and the therapeutic potential of targeting them.
Affiliation(s)
- Konrad J. Karczewski
- Program in Medical and Population Genetics and Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
| | - Alicia R. Martin
- Program in Medical and Population Genetics and Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
|
32
|
Scheper C, Bohlouli M, Brügemann K, Weimann C, Vanvanhossou SFU, König S, Dossa LH. The role of agro-ecological factors and transboundary transhumance in shaping the genetic diversity in four indigenous cattle populations of Benin. J Anim Breed Genet 2020; 137:622-640. [PMID: 32672901 DOI: 10.1111/jbg.12495] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/10/2019] [Revised: 05/29/2020] [Accepted: 06/17/2020] [Indexed: 01/03/2023]
Abstract
The indigenous cattle population of Benin is a diverse mix of taurine and hybrid breeds shaped by diverse ecological and climatic conditions with eight agro-ecological zones (AEZ). Presumably, the taurine breeds face current endangerment due to ongoing indicine introgression following climate change and transboundary transhumance. The aim of the study was to investigate the genetic diversity and population structure of the indigenous breeds Lagune, Somba, Pabli and Borgou considering spatial agro-ecological and socio-economic factors (transhumance) based on 50k SNP and microsatellite data. Among the four sampled breeds, six genetic clusters were identified using model-free (discriminant analysis of principal components) and model-based (TESS and ADMIXTURE) methods separating taurine from hybrid breeds. Results based on an extension with publicly available historic SNP data sets from taurine and indicine West African cattle and additional outgroups provided additional insight into changes of genetic structure in the sampled breeds over time. Both taurine breeds, Somba and Lagune, showed a stable foundation but also spatially limited partial indicine introgression associated with transhumance leading to high genetic diversity. In addition, comparing our samples with the historic samples, we found evidence for spatial diversity and changes in genetic structure over time in the Borgou breed, which could be explained by potential continuous indicine introgression into the Borgou breed in two sample regions. Results for the Pabli breed do not conclusively point to full absorption by the Borgou in comparison with all available Borgou samples. Further research is needed in this regard.
Affiliation(s)
- Carsten Scheper
- Institute of Animal Breeding and Genetics, Justus-Liebig-University of Gießen, Gießen, Germany
| | - Mehdi Bohlouli
- Institute of Animal Breeding and Genetics, Justus-Liebig-University of Gießen, Gießen, Germany
| | - Kerstin Brügemann
- Institute of Animal Breeding and Genetics, Justus-Liebig-University of Gießen, Gießen, Germany
| | - Christina Weimann
- Institute of Animal Breeding and Genetics, Justus-Liebig-University of Gießen, Gießen, Germany
| | - Sven König
- Institute of Animal Breeding and Genetics, Justus-Liebig-University of Gießen, Gießen, Germany
| | - Luc Hippolyte Dossa
- Ecole des Sciences et Techniques de Production Animale, Faculté des Sciences Agronomiques, Université d'Abomey-Calavi, Cotonou, Bénin
|
33
|
Kist NC, Lambert B, Campbell S, Katzourakis A, Lunn D, Lemey P, Iversen AKN. HIV-1 p24Gag adaptation to modern and archaic HLA-allele frequency differences in ethnic groups contributes to viral subtype diversification. Virus Evol 2020; 6:veaa085. [PMID: 33343925 PMCID: PMC7733611 DOI: 10.1093/ve/veaa085] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 01/13/2023] Open
Abstract
Pathogen-driven selection and past interbreeding with archaic human lineages have resulted in differences in human leukocyte antigen (HLA)-allele frequencies between modern human populations. Whether or not this variation affects pathogen subtype diversification is unknown. Here we show a strong positive correlation between ethnic diversity in African countries and both human immunodeficiency virus (HIV)-1 p24gag and subtype diversity. We demonstrate that ethnic HLA-allele differences between populations have influenced HIV-1 subtype diversification as the virus adapted to escape common antiviral immune responses. The evolution of HIV Subtype B (HIV-B), which does not appear to be indigenous to Africa, is strongly affected by immune responses associated with Eurasian HLA variants acquired through adaptive introgression from Neanderthals and Denisovans. Furthermore, we show that the increasing and disproportionate number of HIV infections among African Americans in the USA drives HIV-B evolution towards an Africa-centric HIV-1 state. Similar adaptation of other pathogens to HLA variants common in affected populations is likely.
Affiliation(s)
- Nicolaas C Kist
- Division of Clinical Neurology, Nuffield Department of Clinical Neurosciences, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
- Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK
| | - Ben Lambert
- Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK
- Department of Infectious Disease Epidemiology, School of Public Health, Faculty of Medicine, Imperial College London, Medical School Building St Mary’s Campus, Norfolk Place, London W2 1PG, UK
| | - Samuel Campbell
- Division of Clinical Neurology, Nuffield Department of Clinical Neurosciences, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
- Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK
| | - Aris Katzourakis
- Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK
| | - Daniel Lunn
- Department of Statistics, University of Oxford, St Giles’, Oxford OX1 3LB, UK
| | - Philippe Lemey
- Department of Microbiology and Immunology, Rega Institute for Medical Research, KU Leuven - University of Leuven, Leuven B-3000, Belgium
| | - Astrid K N Iversen
- Division of Clinical Neurology, Nuffield Department of Clinical Neurosciences, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
|
34
|
Branco C, Ray N, Currat M, Arenas M. Influence of Paleolithic range contraction, admixture and long-distance dispersal on genetic gradients of modern humans in Asia. Mol Ecol 2020; 29:2150-2159. [PMID: 32436243 DOI: 10.1111/mec.15479] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/12/2019] [Revised: 05/08/2020] [Accepted: 05/11/2020] [Indexed: 12/29/2022]
Abstract
Cavalli-Sforza and coauthors originally explored the genetic variation of modern humans throughout the world and observed an overall east-west genetic gradient in Asia. However, the specific environmental and population genetics processes causing this gradient were not formally investigated and have prompted discussion in recent studies. Here we studied the influence of diverse environmental and population genetics processes on Asian genetic gradients and identified which could have produced the observed gradient. To do so, we performed extensive spatially-explicit computer simulations of genetic data under the following scenarios: (a) variable levels of admixture between Paleolithic and Neolithic populations, (b) migration through long-distance dispersal (LDD), (c) Paleolithic range contraction induced by the last glacial maximum (LGM), and (d) Neolithic range expansions from one or two geographic origins (the Fertile Crescent and the Yangzi and Yellow River Basins). Next, we estimated genetic gradients from the simulated data and we found that they were sensitive to the analysed processes, especially to the range contraction induced by LGM and to the number of Neolithic expansions. Some scenarios were compatible with the observed east-west genetic gradient, such as the Paleolithic expansion with a range contraction induced by the LGM or two Neolithic range expansions from both the east and the west. In general, LDD increased the variance of genetic gradients among simulations. We interpreted the obtained gradients as a consequence of both allele surfing caused by range expansions and isolation by distance along the vast east-west geographic axis of this continent.
Affiliation(s)
- Catarina Branco
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain; Biomedical Research Center (CINBIO), University of Vigo, Vigo, Spain
| | - Nicolas Ray
- GeoHealth Group, Institute of Global Health, University of Geneva, Geneva, Switzerland; Institute for Environmental Sciences, University of Geneva, Geneva, Switzerland
| | - Mathias Currat
- Laboratory of Anthropology, Genetics and Peopling History, Department of Genetics and Evolution - Anthropology Unit, University of Geneva, Geneva, Switzerland; Institute of Genetics and Genomics in Geneva (IGE3), University of Geneva, Geneva, Switzerland
| | - Miguel Arenas
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain; Biomedical Research Center (CINBIO), University of Vigo, Vigo, Spain
|
35
|
Bose A, Kalantzis V, Kontopoulou EM, Elkady M, Paschou P, Drineas P. TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics 2020; 35:3679-3683. [PMID: 30957838 DOI: 10.1093/bioinformatics/btz157] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 08/08/2018] [Revised: 02/26/2019] [Accepted: 04/04/2019] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets become increasingly larger in size, traditional approaches based on loading the entire dataset in the system memory (Random Access Memory) become impractical and out-of-core implementations are the only viable alternative. RESULTS We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able to successfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and only requires a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task. AVAILABILITY AND IMPLEMENTATION Source code and documentation are both available at https://github.com/aritra90/TeraPCA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
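The randomized subspace iteration idea named in this abstract can be sketched in a few lines of NumPy. This is an in-memory illustration under stated assumptions, not TeraPCA's C++ or out-of-core implementation: the function name `randomized_pca`, the oversampling and iteration defaults, and the plain column centering (no genotype-specific normalization) are all choices made for this sketch.

```python
import numpy as np


def randomized_pca(X, k, n_iter=10, oversample=10, seed=0):
    """Approximate the top-k principal components of X (samples x markers)
    by randomized subspace iteration. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                  # center each marker (column)
    n, m = Xc.shape
    ell = min(k + oversample, n, m)          # working subspace dimension
    Q, _ = np.linalg.qr(rng.standard_normal((m, ell)))  # random orthonormal start
    for _ in range(n_iter):
        # Alternate multiplications by Xc and Xc.T, re-orthonormalizing each
        # time so the basis stays numerically stable.
        Z, _ = np.linalg.qr(Xc @ Q)
        Q, _ = np.linalg.qr(Xc.T @ Z)
    B = Xc @ Q                               # small (n x ell) projected matrix
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    components = (Q @ Vt.T)[:, :k].T         # k x m principal axes
    scores = U[:, :k] * s[:k]                # sample coordinates on those axes
    return scores, components, s[:k]
```

Each pass touches the data only through products with `Xc` and `Xc.T`, which is why methods of this family can stream a matrix from disk in blocks; the sketch keeps everything in memory for brevity.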
Affiliation(s)
- Aritra Bose
- Computer Science Department, Purdue University, West Lafayette, IN, USA
| | - Vassilis Kalantzis
- IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA
| | - Mai Elkady
- Computer Science Department, Purdue University, West Lafayette, IN, USA
| | - Peristera Paschou
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
| | - Petros Drineas
- Computer Science Department, Purdue University, West Lafayette, IN, USA
|
36
|
Feldman MW. L. Luca Cavalli-Sforza: A Renaissance Scientist. Theor Popul Biol 2020; 133:75-79. [DOI: 10.1016/j.tpb.2019.11.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/25/2019] [Accepted: 11/26/2019] [Indexed: 01/01/2023]
|
37
|
Aoki K. A three-population wave-of-advance model for the European early Neolithic. PLoS One 2020; 15:e0233184. [PMID: 32428013 PMCID: PMC7237037 DOI: 10.1371/journal.pone.0233184] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/19/2020] [Accepted: 04/29/2020] [Indexed: 11/19/2022] Open
Abstract
Ancient DNA studies have shown that early farming spread through most of Europe by the range expansion of farmers of Anatolian origin rather than by the conversion to farming of the local hunter-gatherers, and have confirmed that these hunter-gatherers continued to coexist with the incoming farmers. In this short report, I extend a previous three-population wave-of-advance model to accommodate these new findings, and derive the conditions supportive of such a scenario in terms of the relative magnitudes of the parameters. The revised model predicts that the conversion rate must, not surprisingly, be low, but also that the hunter-gatherers must compete more strongly with the converted farmers than with the alien farmers. Moreover, competition with the hunter-gatherers diminishes the speed of the wave-of-advance of the farmers. In addition, I briefly consider how the wave-of-advance approach may contribute to interpreting the results of archaeological studies using the summed probability distribution of radiocarbon dates.
Affiliation(s)
- Kenichi Aoki
- Organization for the Strategic Coordination of Research and Intellectual Properties, Meiji University, Nakano-ku, Tokyo, Japan
| |
|
38
|
Abdellaoui A, Hugh-Jones D, Yengo L, Kemper KE, Nivard MG, Veul L, Holtz Y, Zietsch BP, Frayling TM, Wray NR, Yang J, Verweij KJH, Visscher PM. Genetic correlates of social stratification in Great Britain. Nat Hum Behav 2019; 3:1332-1342. [PMID: 31636407 DOI: 10.1038/s41562-019-0757-5] [Citation(s) in RCA: 128] [Impact Index Per Article: 21.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2018] [Accepted: 09/18/2019] [Indexed: 02/07/2023]
Abstract
Human DNA polymorphisms vary across geographic regions, with the most commonly observed variation reflecting distant ancestry differences. Here we investigate the geographic clustering of common genetic variants that influence complex traits in a sample of ~450,000 individuals from Great Britain. Of 33 traits analysed, 21 showed significant geographic clustering at the genetic level after controlling for ancestry, probably reflecting migration driven by socioeconomic status (SES). Alleles associated with educational attainment (EA) showed the most clustering, with EA-decreasing alleles clustering in lower SES areas such as coal mining areas. Individuals who leave coal mining areas carry more EA-increasing alleles on average than those in the rest of Great Britain. The level of geographic clustering is correlated with genetic associations between complex traits and regional measures of SES, health and cultural outcomes. Our results are consistent with the hypothesis that social stratification leaves visible marks in geographic arrangements of common allele frequencies and gene-environment correlations.
Affiliation(s)
- Abdel Abdellaoui
- Department of Psychiatry, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands.
| | | | - Loic Yengo
- Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
| | - Kathryn E Kemper
- Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
| | - Michel G Nivard
- Department of Biological Psychology, VU University, Amsterdam, The Netherlands
| | - Laura Veul
- Department of Psychiatry, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands
| | - Yan Holtz
- Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
| | - Brendan P Zietsch
- School of Psychology, University of Queensland, Brisbane, Queensland, Australia
| | - Timothy M Frayling
- Genetics of Complex Traits, University of Exeter Medical School, University of Exeter, Exeter, UK
| | - Naomi R Wray
- Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
- Queensland Brain Institute, University of Queensland, Brisbane, Queensland, Australia
| | - Jian Yang
- Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
- Queensland Brain Institute, University of Queensland, Brisbane, Queensland, Australia
| | - Karin J H Verweij
- Department of Psychiatry, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands
| | - Peter M Visscher
- Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia.
- Queensland Brain Institute, University of Queensland, Brisbane, Queensland, Australia.
| |
|
39
|
Greenbaum G, Rubin A, Templeton AR, Rosenberg NA. Network-based hierarchical population structure analysis for large genomic data sets. Genome Res 2019; 29:2020-2033. [PMID: 31694865 PMCID: PMC6886512 DOI: 10.1101/gr.250092.119] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2019] [Accepted: 11/01/2019] [Indexed: 01/24/2023]
Abstract
Analysis of population structure in natural populations using genetic data is a common practice in ecological and evolutionary studies. With large genomic data sets of populations now appearing more frequently across the taxonomic spectrum, it is becoming increasingly possible to reveal many hierarchical levels of structure, including fine-scale genetic clusters. To analyze these data sets, methods need to be appropriately suited to the challenges of extracting multilevel structure from whole-genome data. Here, we present a network-based approach for constructing population structure representations from genetic data. The use of community-detection algorithms from network theory generates a natural hierarchical perspective on the representation that the method produces. The method is computationally efficient, and it requires relatively few assumptions regarding the biological processes that underlie the data. We show the approach by analyzing population structure in the model plant species Arabidopsis thaliana and in human populations. These examples illustrate how network-based approaches for population structure analysis are well-suited to extracting valuable ecological and evolutionary information in the era of large genomic data sets.
Affiliation(s)
- Gili Greenbaum
- Department of Biology, Stanford University, Stanford, California 94305, USA
| | - Amir Rubin
- Department of Computer Science, Ben-Gurion University of the Negev, Be'er-Sheva, 8410501, Israel
| | - Alan R Templeton
- Department of Biology, Washington University, St. Louis, Missouri 63130, USA
- Department of Evolutionary and Environmental Ecology, University of Haifa, Haifa, 31905, Israel
| | - Noah A Rosenberg
- Department of Biology, Stanford University, Stanford, California 94305, USA
| |
|
40
|
A method for genome-wide genealogy estimation for thousands of samples. Nat Genet 2019; 51:1321-1329. [PMID: 31477933 DOI: 10.1038/s41588-019-0484-x] [Citation(s) in RCA: 257] [Impact Index Per Article: 42.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2019] [Accepted: 07/15/2019] [Indexed: 01/29/2023]
Abstract
Knowledge of genome-wide genealogies for thousands of individuals would simplify most evolutionary analyses for humans and other species, but has remained computationally infeasible. We have developed a method, Relate, scaling to >10,000 sequences while simultaneously estimating branch lengths, mutational ages and variable historical population sizes, as well as allowing for data errors. Application to 1,000 Genomes Project haplotypes produces joint genealogical histories for 26 human populations. Highly diverged lineages are present in all groups, but most frequent in Africa. Outside Africa, these mainly reflect ancient introgression from groups related to Neanderthals and Denisovans, while African signals instead reflect unknown events unique to that continent. Our approach allows more powerful inferences of natural selection than has previously been possible. We identify multiple regions under strong positive selection, and multi-allelic traits including hair color, body mass index and blood pressure, showing strong evidence of directional selection, varying among human groups.
|
41
|
Abstract
Surname distribution can be a useful tool for studying the genetic structure of a human population. In South America, the Uruguay population has traditionally been considered to be of European ancestry, despite its trihybrid origin, as proved through genetics. The aim of this study was to investigate the structure of the Uruguayan population, resulting from population movements and surname drift in the country. The distribution of the surnames of 2,501,774 people on the electoral register was studied in the nineteen departments of Uruguay. Multivariate approaches were used to estimate isonymic parameters. Isolation by Distance was measured by correlating isonymic and geographic distances. In the study sample, the most frequent surnames were consistently Spanish, reflecting the fact that the first immigration waves occurred before Uruguayan independence. Only a few surnames of Native origin were recorded. The effective surname number (α) for the entire country was 302, and the average for departments was 235.8 ± 19. Inbreeding estimates were lower in the south-west of the country and in the densely populated Montevideo area. Isonymic distances between departments were significantly correlated with linear geographic distance (p < 0.001) indicating continuously increasing surname distances up to 400 km. Surnames form clusters related to geographic regions affected by different historical processes. The isonymic structure of Uruguay shows a radiation towards the east and north, with short-range migration playing a major role, while the contribution of drift, considering the small variance of α, appears to be minor.
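The effective surname number (α) mentioned in this abstract can be illustrated with a minimal sketch. It assumes one common simplified definition, the inverse of the sum of squared surname frequencies; the abstract does not state which estimator the study used, and the function name and sample data below are hypothetical.

```python
from collections import Counter


def effective_surname_number(surnames):
    """Return alpha = 1 / sum(p_k ** 2) for a list of surnames.

    Assumption: this simplified form is used for illustration only;
    the study summarized above may apply a different estimator.
    """
    counts = Counter(surnames)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one surname is required")
    # Sum of squared relative frequencies (unbiased isonymy is not applied here).
    isonymy = sum((c / total) ** 2 for c in counts.values())
    return 1.0 / isonymy


if __name__ == "__main__":
    # Hypothetical toy register: one surname carried by three people, two by one each.
    register = ["Rodriguez", "Rodriguez", "Rodriguez", "Gonzalez", "Silva"]
    print(effective_surname_number(register))
```

Under this definition, α equals the number of distinct surnames when all are equally frequent and falls toward 1 as a single surname dominates.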
|
42
|
GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis. G3-GENES GENOMES GENETICS 2019; 9:2447-2461. [PMID: 31151998 PMCID: PMC6686921 DOI: 10.1534/g3.118.200925] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Inferring subject ancestry using genetic data is an important step in genetic association studies, required for dealing with population stratification. It has become more challenging to infer subject ancestry quickly and accurately since large amounts of genotype data, collected from millions of subjects by thousands of studies using different methods, are accessible to researchers from repositories such as the database of Genotypes and Phenotypes (dbGaP) at the National Center for Biotechnology Information (NCBI). Study-reported populations submitted to dbGaP are often not harmonized across studies or may be missing. Widely-used methods for ancestry prediction assume that most markers are genotyped in all subjects, but this assumption is unrealistic if one wants to combine studies that used different genotyping platforms. To provide ancestry inference and visualization across studies, we developed a new method, GRAF-pop, of ancestry prediction that is robust to missing genotypes and allows researchers to visualize predicted population structure in color and in three dimensions. When genotypes are dense, GRAF-pop is comparable in quality and running time to existing ancestry inference methods EIGENSTRAT, FastPCA, and FlashPCA2, all of which rely on principal components analysis (PCA). When genotypes are not dense, GRAF-pop gives much better ancestry predictions than the PCA-based methods. GRAF-pop employs basic geometric and probabilistic methods; the visualized ancestry predictions have a natural geometric interpretation, which is lacking in PCA-based methods. Since February 2018, GRAF-pop has been successfully incorporated into the dbGaP quality control process to identify inconsistencies between study-reported and computationally predicted populations and to provide harmonized population values in all new dbGaP submissions amenable to population prediction, based on marker genotypes. 
Plots of summary population predictions produced by GRAF-pop are available on dbGaP study pages, and the software is available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi.
|
43
|
Josephs EB, Berg JJ, Ross-Ibarra J, Coop G. Detecting Adaptive Differentiation in Structured Populations with Genomic Data and Common Gardens. Genetics 2019; 211:989-1004. [PMID: 30679259 PMCID: PMC6404252 DOI: 10.1534/genetics.118.301786] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2018] [Accepted: 01/15/2019] [Indexed: 12/21/2022] Open
Abstract
Adaptation in quantitative traits often occurs through subtle shifts in allele frequencies at many loci, a process called polygenic adaptation. While a number of methods have been developed to detect polygenic adaptation in human populations, we lack clear strategies for doing so in many other systems. In particular, there is an opportunity to develop new methods that leverage datasets with genomic data and common garden trait measurements to systematically detect the quantitative traits important for adaptation. Here, we develop methods that do just this, using principal components of the relatedness matrix to detect excess divergence consistent with polygenic adaptation, and using a conditional test to control for confounding effects due to population structure. We apply these methods to inbred maize lines from the United States Department of Agriculture germplasm pool and maize landraces from Europe. Ultimately, these methods can be applied to additional domesticated and wild species to give us a broader picture of the specific traits that contribute to adaptation and the overall importance of polygenic adaptation in shaping quantitative trait variation.
Affiliation(s)
- Emily B Josephs
- Department of Evolution and Ecology, University of California, Davis, California 95616
- Center for Population Biology, University of California, Davis, California 95616
| | - Jeremy J Berg
- Department of Biological Sciences, Columbia University, New York, New York 10027
| | - Jeffrey Ross-Ibarra
- Department of Plant Sciences, University of California, Davis, California 95616
- Center for Population Biology, University of California, Davis, California 95616
| | - Graham Coop
- Department of Evolution and Ecology, University of California, Davis, California 95616
- Center for Population Biology, University of California, Davis, California 95616
| |
|
44
|
Patterns of genetic differentiation and the footprints of historical migrations in the Iberian Peninsula. Nat Commun 2019; 10:551. [PMID: 30710075 PMCID: PMC6358624 DOI: 10.1038/s41467-018-08272-w] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2018] [Accepted: 12/07/2018] [Indexed: 12/31/2022] Open
Abstract
The Iberian Peninsula is linguistically diverse and has a complex demographic history, including a centuries-long period of Muslim rule. Here, we study the fine-scale genetic structure of its population, and the genetic impacts of historical events, leveraging powerful, haplotype-based statistical methods to analyse 1413 individuals from across Spain. We detect extensive population structure at extremely fine scales (below 10 km) in some regions, including Galicia. We identify a major east-west axis of genetic differentiation, and evidence of historical north to south population movement. We find regionally varying fractions of north-west African ancestry (0–11%) in modern-day Iberians, related to an admixture event involving European-like and north-west African-like source populations. We date this event to 860–1120 CE, implying greater genetic impacts in the early half of Muslim rule in Iberia. Together, our results indicate clear genetic impacts of population movements associated with both the Muslim conquest and the subsequent Reconquista. The Iberian Peninsula has a complex history. Here, the authors analyse the genetic structure of the modern Iberian population at fine scale, revealing historical population movements associated with the time of Muslim rule.
|
45
|
Anadromy Redux? Genetic Analysis to Inform Development of an Indigenous American River Steelhead Broodstock. JOURNAL OF FISH AND WILDLIFE MANAGEMENT 2019. [DOI: 10.3996/072018-jfwm-063] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
The construction of dams and water diversions has severely limited access to spawning habitat for anadromous fishes. To mitigate for these impacts, hatchery programs rear and release millions of juvenile salmonids, including steelhead, the anadromous ecotype of the species Oncorhynchus mykiss. These programs sometimes use nonindigenous broodstock sources that may have negative effects on wild populations. In California, however, only one anadromous fish hatchery program currently uses nonnative broodstock: the steelhead program at Nimbus Fish Hatchery on the American River, a tributary of the Sacramento River in the California Central Valley. The goal of this study was to determine if potentially appropriate sources to replace the broodstock for the Nimbus Hatchery steelhead program exist in the Upper American River, above Nimbus and Folsom dams. We show that all Upper American River O. mykiss sampled share ancestry with other populations in the Central Valley steelhead distinct population segment, with limited introgression from out-of-basin sources in some areas. Furthermore, some Upper American River populations retain adaptive genomic variation associated with a migratory life history, supporting the hypothesis that these populations display adfluvial migratory behavior. Together, these results provide insights into the evolution of trout populations above barrier dams. We conclude that some Upper American River O. mykiss populations represent genetically appropriate sources from which fisheries managers could potentially develop a new broodstock for the Nimbus Hatchery steelhead program to reestablish a native anadromous population in the Lower American River and contribute to recovery of the threatened Central Valley steelhead distinct population segment.
|
46
|
Hansen CCR, Hvilsom C, Schmidt NM, Aastrup P, Van Coeverden de Groot PJ, Siegismund HR, Heller R. The Muskox Lost a Substantial Part of Its Genetic Diversity on Its Long Road to Greenland. Curr Biol 2018; 28:4022-4028.e5. [DOI: 10.1016/j.cub.2018.10.054] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2018] [Revised: 07/06/2018] [Accepted: 10/26/2018] [Indexed: 01/12/2023]
|
47
|
Fort J, Mercè Pareta M, Sørensen L. Estimating the relative importance of demic and cultural diffusion in the spread of the Neolithic in Scandinavia. J R Soc Interface 2018; 15:20180597. [PMID: 30464058 PMCID: PMC6283996 DOI: 10.1098/rsif.2018.0597] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2018] [Accepted: 10/29/2018] [Indexed: 12/05/2022] Open
Abstract
Using a database of early farming sites in Scandinavia, we estimate that the spread rate of the Neolithic was in the range 0.44-0.66 km yr⁻¹. This is substantially slower (by about 50%) than the rate in continental Europe. We interpret this result in the framework of a new mathematical model that includes horizontal cultural transmission (acculturation), vertical cultural transmission (interbreeding) and demic diffusion (reproduction and dispersal of farmers). To parametrize the model, we estimate reproduction rates of early farmers using archaeological data (sum-calibrated probabilities for the dates of early Neolithic Scandinavian sites) and use them in a wave-of-advance model for the first time. Comparing the model with the archaeological data, we find that the percentage of the spread rate due to cultural diffusion is below 50% (except for very extreme parameter values, and even for them it is below 54%). This strongly suggests that the spread of the Neolithic in Scandinavia was driven mainly by demic diffusion. This conclusion, obtained from archaeological data, agrees qualitatively with the implications of ancient genetic data, but the latter are yet too few in Scandinavia to produce any quantitative percentage for the spread rate due to cultural diffusion. We also find that, on average, fewer than eight hunter-gatherers were incorporated in the Neolithic communities by each group of 10 pioneering farmers, via horizontal and/or vertical cultural transmission.
Affiliation(s)
- Joaquim Fort
- Complex Systems Laboratory, University of Girona, C/. Maria Aurèlia Capmany 61, 17003 Girona, Spain
- Catalan Institution for Research and Advanced Studies (ICREA), C/. Lluís Companys 23, 08010 Barcelona, Spain
| | - Maria Mercè Pareta
- Complex Systems Laboratory, University of Girona, C/. Maria Aurèlia Capmany 61, 17003 Girona, Spain
| | - Lasse Sørensen
- Ancient Cultures of Denmark and the Mediterranean, The National Museum of Denmark, Frederiksholms Kanal 12, 1220 Copenhagen K, Denmark
| |
|
48
|
Local PCA Shows How the Effect of Population Structure Differs Along the Genome. Genetics 2018; 211:289-304. [PMID: 30459280 DOI: 10.1534/genetics.118.301747] [Citation(s) in RCA: 105] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2018] [Accepted: 11/05/2018] [Indexed: 11/18/2022] Open
Abstract
Population structure leads to systematic patterns in measures of mean relatedness between individuals in large genomic data sets, which are often discovered and visualized using dimension reduction techniques such as principal component analysis (PCA). Mean relatedness is an average of the relationships across locus-specific genealogical trees, which can be strongly affected on intermediate genomic scales by linked selection and other factors. We show how to use local PCA to describe this intermediate-scale heterogeneity in patterns of relatedness, and apply the method to genomic data from three species, finding in each that the effect of population structure can vary substantially across only a few megabases. In a global human data set, localized heterogeneity is likely explained by polymorphic chromosomal inversions. In a range-wide data set of Medicago truncatula, factors that produce heterogeneity are shared between chromosomes, correlate with local gene density, and may be caused by linked selection, such as background selection or local adaptation. In a data set of primarily African Drosophila melanogaster, large-scale heterogeneity across each chromosome arm is explained by known chromosomal inversions thought to be under recent selection and, after removing samples carrying inversions, remaining heterogeneity is correlated with recombination rate and gene density, again suggesting a role for linked selection. The visualization method provides a flexible new way to discover biological drivers of genetic variation, and its application to data highlights the strong effects that linked selection and chromosomal inversions can have on observed patterns of genetic variation.
|
49
|
Škarić-Jurić T, Tomas Ž, Zajc Petranović M, Božina N, Smolej Narančić N, Janićijević B, Salihović MP. Characterization of ADME genes variation in Roma and 20 populations worldwide. PLoS One 2018; 13:e0207671. [PMID: 30452466 PMCID: PMC6242375 DOI: 10.1371/journal.pone.0207671] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2018] [Accepted: 11/05/2018] [Indexed: 12/13/2022] Open
Abstract
The products of the polymorphic ADME genes are involved in Absorption, Distribution, Metabolism, and Excretion of drugs. The pharmacogenetic data have been studied extensively due to their clinical importance in the appropriate drug prescription, but such data from the isolated populations are rather scarce. We analyzed the distribution of 95 polymorphisms in 31 core ADME genes in 20 populations worldwide and in newly genotyped samples from the Roma (Gypsy) population living in Croatia. Global distribution of ADME core gene loci differentiated three major clusters: (1) African, (2) East Asian, and (3) joint European, South Asian and South American cluster. The SLCO1B3 (rs4149117) and CYP3A4 (rs2242480) genes differentiated at the highest level the African group of populations, while NAT2 gene loci (rs1208, rs1801280, and rs1799929) and VKORC1 (rs9923231) differentiated East Asian populations. The VKORC1 rs9923231 was among the investigated loci the one with the largest global minor allele frequency (MAF) range; its MAF ranged from 0.027 in Nigeria to 0.924 in Han Chinese. The distribution of the investigated gene loci positions Roma population within the joined European and South Asian clusters, suggesting that their ADME gene pool is a combination of ancestral (Indian) and more recent (European) surrounding, as it was already implied by other genetic markers. However, when compared to the populations worldwide, the Croatian Roma have extreme MAF values in 10 out of the 95 investigated ADME core gene loci. Among loci which have extraordinary MAFs in Roma population two have strong proof of clinical importance: rs1799853 (CYP2C9) for warfarin dosage, and rs12248560 (CYP2C19) for clopidogrel dosage, efficacy and toxicity. This finding confirms the importance of taking the Roma as well as the other isolated populations' genetic profiles into account in pharmaco-therapeutic practice.
Affiliation(s)
| | - Željka Tomas
- Institute for Anthropological Research, Zagreb, Croatia
| | | | - Nada Božina
- Department for Pharmacogenomics and Therapy Individualization, University Hospital Center Zagreb, Department of Pharmacology, University of Zagreb School of Medicine, Zagreb, Croatia
| | | | | | | |
|
50
|
Wangkumhang P, Hellenthal G. Statistical methods for detecting admixture. Curr Opin Genet Dev 2018; 53:121-127. [PMID: 30245220 DOI: 10.1016/j.gde.2018.08.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2018] [Revised: 08/03/2018] [Accepted: 08/09/2018] [Indexed: 10/28/2022]
Abstract
The increasing availability of large-scale autosomal genetic variation data sampled from world-wide geographic areas, coupled with advances in the statistical methodology to analyse these data, is showcasing the power of DNA as a major tool to gain insights into the demographic history of humans and other organisms. Here we review statistical techniques that shed light on a specific aspect of demography: the detection and description of admixture events where two or more genetically distinct groups intermixed at one or more times in the past. In particular we give an overview of some of the widely used methods to identify and describe admixture events using autosomal DNA from unrelated individuals, with a particular focus on analysing biallelic Single-Nucleotide-Polymorphism (SNP) markers.
Affiliation(s)
- Pongsakorn Wangkumhang
- University College London Genetics Institute (UGI), Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
| | - Garrett Hellenthal
- University College London Genetics Institute (UGI), Department of Genetics, Evolution and Environment, University College London, London, United Kingdom.
| |
|