1
Smaragdov MG, Kudinov AA. Assessing the power of principal components and wright's fixation index analyzes applied to reveal the genome-wide genetic differences between herds of Holstein cows. BMC Genet 2020; 21:47. [PMID: 32345235 PMCID: PMC7189535 DOI: 10.1186/s12863-020-00848-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/25/2019] [Accepted: 03/27/2020] [Indexed: 11/30/2022] Open
Abstract
Background: With the advent of SNP array technology, genome-wide analysis of genetic differences between populations and breeds has become possible at a previously unattainable level. Wright's fixation index (Fst) and principal component analysis (PCA) are widely used methods in animal genetics studies. In this paper we compared the power of these methods, examined how they complement each other, and assessed which of them is the more powerful. Results: A comparative analysis of the power of PCA and Fst was carried out to reveal genetic differences between herds of Holsteinized cows. In total, 803 BovineSNP50 genotypes of cows from 13 herds were used in the current study. The Fst values obtained were in the range 0.002–0.012 (mean 0.0049), while for rare SNPs with MAF 0.0001–0.005 they were even smaller, in the range 0.001–0.01 (mean 0.0027). Genetic relatedness of the cows in the herds was the cause of such small Fst values. The contribution of rare alleles with MAF 0.0001–0.01 to the Fst values was much smaller than that of common alleles, and this effect depends on linkage disequilibrium (LD). Despite a substantial change in the MAF spectrum and the number of SNPs, LD-based pruning had only a small effect on the Fst data. PCA confirmed the mutual admixture and small genetic differences between herds. Moreover, visualizing the results of a single eigenvector cannot be used to differentiate herds significantly; only summed eigenvectors should be used to realize the full power of PCA to resolve small genetic differences between herds. Finally, we presented evidence that the significance of Fst data far exceeds the significance of PCA data when these methods are used to reveal genetic differences between herds. Conclusions: LD-based pruning had a small effect on the findings of the Fst and PCA analyses. Therefore, for weakly structured populations, LD-based pruning is not effective. In addition, our results show that the significance of genetic differences between herds obtained by Fst analysis exceeds that obtained by PCA. To differentiate herds or weakly structured populations, we recommend using the Fst approach first and only then PCA.
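As a rough illustration of the between-herd Fst comparison described in this abstract, the sketch below computes the textbook Wright's Fst, (H_T - H_S) / H_T, for two herds from per-SNP allele frequencies. The abstract does not state which Fst estimator the authors used, so the equal weighting of the two herds, the ratio-of-sums averaging across SNPs, and the function name `wright_fst` are assumptions made for this example only.

```python
def wright_fst(freqs_a, freqs_b):
    """Textbook Wright's Fst for two equally weighted subpopulations.

    freqs_a, freqs_b: per-SNP frequencies of the same reference allele
    in herd A and herd B (hypothetical inputs for illustration).
    Returns (H_T - H_S) / H_T, with both heterozygosities summed over SNPs.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both herds must be scored at the same SNPs")

    h_s_total = 0.0  # mean within-herd expected heterozygosity, summed over SNPs
    h_t_total = 0.0  # expected heterozygosity of the pooled herds, summed over SNPs
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2
        h_s_total += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        h_t_total += 2 * p_bar * (1 - p_bar)

    if h_t_total == 0:
        # Every SNP is fixed for the same allele in both herds: no differentiation.
        return 0.0
    return (h_t_total - h_s_total) / h_t_total


if __name__ == "__main__":
    # One SNP at frequency 0.5 in herd A and 0.6 in herd B.
    print(round(wright_fst([0.5], [0.6]), 4))
```

A frequency shift from 0.5 to 0.6 at a single SNP already gives an Fst of about 0.01, which is the same order of magnitude as the between-herd values reported in the abstract.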
Affiliation(s)
- M G Smaragdov
  - Russian Research Institute of Farm Animal Genetics and Breeding - Branch of the L.K. Ernst Federal Science Center for Animal Husbandry, St. Petersburg, Pushkin, Russia
- A A Kudinov
  - Russian Research Institute of Farm Animal Genetics and Breeding - Branch of the L.K. Ernst Federal Science Center for Animal Husbandry, St. Petersburg, Pushkin, Russia; Department of Agricultural Science, University of Helsinki, FI-00014, Helsinki, Finland
2
Yahya P, Sulong S, Harun A, Wangkumhang P, Wilantho A, Ngamphiw C, Tongsima S, Zilfalil BA. Ancestry-informative marker (AIM) SNP panel for the Malay population. Int J Legal Med 2019; 134:123-134. [PMID: 31760471 DOI: 10.1007/s00414-019-02184-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/24/2019] [Accepted: 10/15/2019] [Indexed: 10/25/2022]
Abstract
Ancestry-informative markers (AIMs) can be used to infer the ancestry of an individual to minimize the inaccuracy of self-reported ethnicity in biomedical research. In this study, we describe three methods for selecting AIM SNPs for the Malay population (Malay AIM panel) using different approaches based on pairwise FST, informativeness for assignment (In), and PCA-correlated SNPs (PCAIMs). These Malay AIM panels were extracted from genotype data stored in SNP arrays hosted by the Malaysian node of the Human Variome Project (MyHVP) and the Singapore Genome Variation Project (SGVP). In particular, genotype data from a total of 165 Malay individuals were analyzed, comprising data on 117 individual genotypes from the Affymetrix SNP-6 SNP array platform and data on 48 individual genotypes from the OMNI 2.5 Illumina SNP array platform. The HapMap phase 3 database (1397 individuals from 11 populations) was used as a reference for comparison with the Malay genotype data. The accuracy of each resulting Malay AIM panel was evaluated using a machine learning "ancestry-predictive model" constructed by using WEKA, a comprehensive machine learning platform written in Java. A total of 1250 SNPs were finally selected, which successfully identified Malay individuals from other world populations with an accuracy of 90%, but the accuracy decreased to 80% using 157 SNPs according to the pairwise FST method, while a panel of 200 SNPs selected using In and PCAIMs could be used to identify Malay individuals with an accuracy of approximately 80%.
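One of the three selection criteria named in this abstract, informativeness for assignment (In), can be sketched as follows. The formula used here is the commonly cited definition, In = sum over alleles j of [ -p_j ln p_j + (1/K) sum over populations i of p_ij ln p_ij ], where p_j is the mean frequency of allele j across the K populations. The abstract does not give the authors' implementation, so the input layout and the helper names `informativeness_for_assignment` and `rank_snps` are hypothetical.

```python
import math


def _xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def informativeness_for_assignment(allele_freqs):
    """Informativeness for assignment (In) of one marker.

    allele_freqs[i][j] is the frequency of allele j in population i;
    each row is assumed to sum to 1.
    """
    k = len(allele_freqs)
    n_alleles = len(allele_freqs[0])
    total = 0.0
    for j in range(n_alleles):
        p_j = sum(pop[j] for pop in allele_freqs) / k  # mean frequency of allele j
        total += -_xlogx(p_j) + sum(_xlogx(pop[j]) for pop in allele_freqs) / k
    return total


def rank_snps(snp_freqs, top_n):
    """Return the top_n SNP identifiers ordered by decreasing In.

    snp_freqs maps a SNP identifier to its per-population frequency table.
    """
    ranked = sorted(
        snp_freqs,
        key=lambda snp: informativeness_for_assignment(snp_freqs[snp]),
        reverse=True,
    )
    return ranked[:top_n]


if __name__ == "__main__":
    # Made-up frequencies for two populations and two biallelic SNPs.
    panel = {
        "snp_fixed": [[1.0, 0.0], [0.0, 1.0]],   # fixed difference
        "snp_flat": [[0.5, 0.5], [0.5, 0.5]],    # no difference
    }
    print(rank_snps(panel, 1))
```

A marker fixed for different alleles in two populations reaches the two-population maximum of ln 2, while a marker with identical frequencies scores 0, so sorting by In puts the most ancestry-informative SNPs first.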
Affiliation(s)
- Padillah Yahya
  - Department of Paediatrics, School of Medical Sciences, Universiti Sains Malaysia, 16150, Kubang Kerian, Kelantan, Malaysia
- Sarina Sulong
  - Human Genome Centre, School of Medical Sciences, Universiti Sains Malaysia, 16150, Kubang Kerian, Kelantan, Malaysia
- Azian Harun
  - Department of Medical Microbiology and Parasitology, School of Medical Sciences, Universiti Sains Malaysia, 16150, Kubang Kerian, Kelantan, Malaysia
- Pongsakorn Wangkumhang
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand Science Park, Khlong Luang District, Pathum Thani, 12120, Thailand
- Alisa Wilantho
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand Science Park, Khlong Luang District, Pathum Thani, 12120, Thailand
- Chumpol Ngamphiw
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand Science Park, Khlong Luang District, Pathum Thani, 12120, Thailand
- Sissades Tongsima
  - National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand Science Park, Khlong Luang District, Pathum Thani, 12120, Thailand
- Bin Alwi Zilfalil
  - Department of Paediatrics, School of Medical Sciences, Universiti Sains Malaysia, 16150, Kubang Kerian, Kelantan, Malaysia
3
Chaichoompu K, Abegaz F, Cavadas B, Fernandes V, Müller-Myhsok B, Pereira L, Van Steen K. A different view on fine-scale population structure in Western African populations. Hum Genet 2019; 139:45-59. [PMID: 31630246 PMCID: PMC6942040 DOI: 10.1007/s00439-019-02069-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/16/2018] [Accepted: 10/09/2019] [Indexed: 01/03/2023]
Abstract
Due to their long genetic evolutionary history, Africans exhibit more genetic variation than any other population in the world. Their genetic diversity further lends itself to subdivisions of Africans into groups of individuals with a genetic similarity of varying degrees of granularity. It remains challenging to detect fine-scale structure in a computationally efficient and meaningful way. In this paper, we present a proof-of-concept of a novel fine-scale population structure detection tool with Western African samples. These samples consist of 1396 individuals from 25 ethnic groups (two groups are African American descendants). The strategy is based on a recently developed tool called IPCAPS. IPCAPS, or Iterative Pruning to CApture Population Structure, is a genetic divisive clustering strategy that enhances iterative pruning PCA, is robust to outliers and does not require a priori computation of haplotypes. Our strategy identified 12 groups in total, of which 6 were revealed as fine-scale structure detected in the samples from Cameroon, Gambia, Mali, Southwest USA, and Barbados. Our findings helped to explain evolutionary processes in the analyzed West African samples and raise awareness of fine-scale structure resolution when conducting genome-wide association and interaction studies.
Affiliation(s)
- Kridsadakorn Chaichoompu
  - GIGA-R Medical Genomics-BIO3, University of Liege, Avenue de l'Hôpital 11, 4000, Liege, Belgium; Max Planck Institute of Psychiatry, 80804, Munich, Germany
- Fentaw Abegaz
  - GIGA-R Medical Genomics-BIO3, University of Liege, Avenue de l'Hôpital 11, 4000, Liege, Belgium
- Bruno Cavadas
  - Instituto de Investigação e Inovação em Saúde, Universidade do Porto (i3S), Rua Alfredo Allen, 208, 4200-135, Porto, Portugal; Instituto de Patologia e Imunologia Molecular da Universidade do Porto (IPATIMUP), Rua Júlio Amaral de Carvalho, 45, 4200-135, Porto, Portugal
- Verónica Fernandes
  - Instituto de Investigação e Inovação em Saúde, Universidade do Porto (i3S), Rua Alfredo Allen, 208, 4200-135, Porto, Portugal; Instituto de Patologia e Imunologia Molecular da Universidade do Porto (IPATIMUP), Rua Júlio Amaral de Carvalho, 45, 4200-135, Porto, Portugal
- Luísa Pereira
  - Instituto de Investigação e Inovação em Saúde, Universidade do Porto (i3S), Rua Alfredo Allen, 208, 4200-135, Porto, Portugal; Instituto de Patologia e Imunologia Molecular da Universidade do Porto (IPATIMUP), Rua Júlio Amaral de Carvalho, 45, 4200-135, Porto, Portugal
- Kristel Van Steen
  - GIGA-R Medical Genomics-BIO3, University of Liege, Avenue de l'Hôpital 11, 4000, Liege, Belgium; WELBIO (Walloon Excellence in Lifesciences and Biotechnology), Avenue Pasteur 6, 1300, Wavre, Belgium
4
Tvedebrink T, Eriksen PS. Inference of admixed ancestry with Ancestry Informative Markers. Forensic Sci Int Genet 2019; 42:147-153. [DOI: 10.1016/j.fsigen.2019.06.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 12/20/2018] [Revised: 05/29/2019] [Accepted: 06/18/2019] [Indexed: 01/26/2023]
5
Chaichoompu K, Abegaz F, Tongsima S, Shaw PJ, Sakuntabhai A, Pereira L, Van Steen K. IPCAPS: an R package for iterative pruning to capture population structure. Source Code for Biology and Medicine 2019; 14:2. [PMID: 30936940 PMCID: PMC6427891 DOI: 10.1186/s13029-019-0072-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 07/25/2017] [Accepted: 02/21/2019] [Indexed: 01/29/2023]
Abstract
Background: Resolving population genetic structure is challenging, especially when dealing with closely related or geographically confined populations. Although Principal Component Analysis (PCA)-based methods and genomic variation with single nucleotide polymorphisms (SNPs) are widely used to describe shared genetic ancestry, improvements can be made, especially when fine-scale population structure is the target. Results: This work presents an R package called IPCAPS, which uses SNP information for resolving possibly fine-scale population structure. The IPCAPS routines are built on the iterative pruning Principal Component Analysis (ipPCA) framework that systematically assigns individuals to genetically similar subgroups. In each iteration, our tool is able to detect and eliminate outliers, thereby avoiding severe misclassification errors. Conclusions: IPCAPS supports different measurement scales for variables used to identify substructure. Hence, panels of gene expression and methylation data can be accommodated as well. The tool can also be applied in patient sub-phenotyping contexts. IPCAPS is developed in R and is freely available from http://bio3.giga.ulg.ac.be/ipcaps.
Affiliation(s)
- Kridsadakorn Chaichoompu
  - GIGA-R Medical Genomics - BIO3, University of Liege, Avenue de l'Hôpital 11, 4000 Liege, Belgium
- Fentaw Abegaz
  - GIGA-R Medical Genomics - BIO3, University of Liege, Avenue de l'Hôpital 11, 4000 Liege, Belgium
- Sissades Tongsima
  - Genome Technology Research Unit, National Center for Genetic Engineering and Biotechnology, 113 Thailand Science Park, Phahonyothin Road, Khlong Neung, Khlong Luang, Pathum Thani 12120 Thailand
- Philip James Shaw
  - Medical Molecular Biology Research Unit, National Center for Genetic Engineering and Biotechnology, 113 Thailand Science Park, Phahonyothin Road, Khlong Neung, Khlong Luang, Pathum Thani 12120 Thailand
- Anavaj Sakuntabhai
  - Functional Genetics of Infectious Diseases Unit, Institut Pasteur, 25-28, rue du Docteur Roux, 75015 Paris, France; Centre National de la Recherche Scientifique, URA3012, Paris, France
- Luísa Pereira
  - Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Rua Alfredo Allen, 208, 4200-135 Porto, Portugal; Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Rua Júlio Amaral de Carvalho, 45, 4200-135 Porto, Portugal
- Kristel Van Steen
  - GIGA-R Medical Genomics - BIO3, University of Liege, Avenue de l'Hôpital 11, 4000 Liege, Belgium; WELBIO (Walloon Excellence in Lifesciences and Biotechnology), Avenue Pasteur 6, 1300 Wavre, Belgium
6
Cheung EY, Gahan ME, McNevin D. Prediction of biogeographical ancestry in admixed individuals. Forensic Sci Int Genet 2018; 36:104-111. [DOI: 10.1016/j.fsigen.2018.06.013] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 08/22/2017] [Revised: 05/09/2018] [Accepted: 06/20/2018] [Indexed: 12/14/2022]
7
Alhusain L, Hafez AM. Nonparametric approaches for population structure analysis. Hum Genomics 2018; 12:25. [PMID: 29743099 PMCID: PMC5944014 DOI: 10.1186/s40246-018-0156-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 03/06/2018] [Accepted: 04/24/2018] [Indexed: 12/28/2022] Open
Abstract
The analysis of population structure has many applications in medical and population genetic research. Such analysis is used to provide clear insight into the underlying genetic population substructure and is a crucial prerequisite for any analysis of genetic data. The analysis involves grouping individuals into subpopulations based on shared genetic variations. The most widely used markers to study the variation of DNA sequences between populations are single nucleotide polymorphisms. Data preprocessing is a necessary step to assess the quality of the data and to determine which markers or individuals can reasonably be included in the analysis. After preprocessing, several methods can be utilized to uncover population substructure, which can be categorized into two broad approaches: parametric and nonparametric. Parametric approaches use statistical models to infer population structure and assign individuals into subpopulations. However, these approaches suffer from many drawbacks that make them impractical for large datasets. In contrast, nonparametric approaches do not suffer from these drawbacks, making them more viable than parametric approaches for analyzing large datasets. Consequently, nonparametric approaches are increasingly used to reveal population substructure. Thus, this paper reviews and discusses the nonparametric approaches that are available for population structure analysis along with some implications to resolve challenges.
Affiliation(s)
- Luluah Alhusain
  - College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Alaaeldin M Hafez
  - College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
8
Yahya P, Sulong S, Harun A, Wan Isa H, Ab Rajab NS, Wangkumhang P, Wilantho A, Ngamphiw C, Tongsima S, Zilfalil BA. Analysis of the genetic structure of the Malay population: Ancestry-informative marker SNPs in the Malay of Peninsular Malaysia. Forensic Sci Int Genet 2017; 30:152-159. [DOI: 10.1016/j.fsigen.2017.07.005] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 01/03/2017] [Revised: 06/23/2017] [Accepted: 07/10/2017] [Indexed: 12/27/2022]
9
A comparison of DMET Plus microarray and genome-wide technologies by assessing population substructure. Pharmacogenet Genomics 2016; 26:147-153. [PMID: 26731477 DOI: 10.1097/fpc.0000000000000200] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 01/23/2023]
Abstract
OBJECTIVE The capacity of the Affymetrix drug metabolism enzymes and transporters (DMET) Plus pharmacogenomics genotyping chip to estimate population substructure and cryptic relatedness was evaluated. The results were compared with estimates using genome-wide HapMap data for the same individuals. METHODS For 301 unrelated individuals, spanning three continental populations and one admixed population, genotypic data were collected using the Affymetrix DMET Plus microarray. Genome-wide data on these individuals were obtained from HapMap release 3. Population substructure was assessed using Eigenstrat and ADMIXTURE software for both platforms. Cryptic relatedness was explored by inbreeding coefficient estimation. Nonparametric tests were used to determine correlations of the analytical results of the two genotyping platforms. RESULTS Principal components analysis identified population substructure for both datasets, with 15.8 and 16.6% of the total variance explained in the first two principal components for DMET Plus and HapMap data, respectively. ADMIXTURE results correctly identified four subpopulations within each dataset. Nonparametric rank correlations indicated significant associations between analyses with an average ρ=0.7272 (P<10) across the three continental populations and ρ=0.4888 for the admixed population. Concordance correlation coefficients (average ρc=0.9693 across all four subpopulations) strongly indicate concordance between ADMIXTURE results. Inbreeding coefficients were slightly inflated (16 individuals>0.15) using DMET Plus data and no cryptic relatedness was indicated using HapMap data. The inflated inbreeding estimation could be because of the limited number of markers provided by DMET as a random sample of 1832 markers from HapMap also yielded inflated estimates of cryptic relatedness (39 individuals>0.15). 
Furthermore, use of single nucleotide polymorphisms located in genes involved in metabolism and transport may have different allele frequencies in subpopulations than single nucleotide polymorphisms sampled from the whole genome. CONCLUSION The DMET Plus pharmacogenomics genotyping chip is effective in quantifying population substructure across the three continental populations and inferring the presence of an admixed population. On the basis of our results, these microarrays offer sufficient depth for covariate adjustment of population substructure in genomic association studies.
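The concordance correlation coefficient mentioned in this abstract can be illustrated with Lin's standard formula, rho_c = 2 s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2), using 1/n variances and covariance. The sketch below is only a minimal example on made-up ancestry proportions; the abstract does not say how the authors computed the statistic, and the function name `concordance_ccc` and the sample values are assumptions.

```python
def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for two paired samples.

    Unlike a rank or Pearson correlation, this penalizes both scatter
    and any systematic shift between the two sets of measurements.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("both samples are constant and equal; CCC is undefined")
    return 2 * cov_xy / denominator


if __name__ == "__main__":
    # Hypothetical ancestry proportions for the same individuals on two platforms.
    platform_a = [0.10, 0.45, 0.80, 0.95]
    platform_b = [0.12, 0.40, 0.83, 0.97]
    print(round(concordance_ccc(platform_a, platform_b), 4))
```

Two perfectly correlated series that differ by a constant offset still score below 1, which is why this coefficient suits platform-agreement checks of the kind described above.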
10
Duforet-Frebourg N, Gattepaille LM, Blum MGB, Jakobsson M. HaploPOP: a software that improves population assignment by combining markers into haplotypes. BMC Bioinformatics 2015; 16:242. [PMID: 26227424 PMCID: PMC4521458 DOI: 10.1186/s12859-015-0661-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 11/12/2014] [Accepted: 07/03/2015] [Indexed: 01/27/2023] Open
Abstract
Background In ecology and forensics, some population assignment techniques use molecular markers to assign individuals to known groups. However, assigning individuals to known populations can be difficult if the level of genetic differentiation among populations is small. Most assignment studies handle independent markers, often by pruning markers in Linkage Disequilibrium (LD), ignoring the information contained in the correlation among markers due to LD. Results To improve the accuracy of population assignment, we present an algorithm, implemented in the HaploPOP software, that combines markers into haplotypes, without requiring independence. The algorithm is based on the Gain of Informativeness for Assignment that provides a measure to decide if a pair of markers should be combined into haplotypes, or not, in order to improve assignment. Because complete exploration of all possible solutions for constructing haplotypes is computationally prohibitive, our approach uses a greedy algorithm based on windows of fixed sizes. We evaluate the performance of HaploPOP to assign individuals to populations using a split-validation approach. We investigate both simulated SNPs data and dense genotype data from individuals from Spain and Portugal. Conclusions Our results show that constructing haplotypes with HaploPOP can substantially reduce assignment error. The HaploPOP software is freely available as a command-line software at www.ieg.uu.se/Jakobsson/software/HaploPOP/.
Affiliation(s)
- Nicolas Duforet-Frebourg
  - Univ. Grenoble Alpes, TIMC-IMAG, Grenoble, F-38000, France; CNRS, TIMC-IMAG, Grenoble, F-38000, France; Department of Integrative Biology, University of California Berkeley, Berkeley, 94720-3140, California, USA
- Lucie M Gattepaille
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
- Michael G B Blum
  - Univ. Grenoble Alpes, TIMC-IMAG, Grenoble, F-38000, France; CNRS, TIMC-IMAG, Grenoble, F-38000, France
- Mattias Jakobsson
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden; Science for Life Laboratory, Uppsala University, Uppsala, Sweden
11
Waters EK, Sidhu HS, Sidhu LA, Mercer GN. Extended Lotka–Volterra equations incorporating population heterogeneity: Derivation and analysis of the predator–prey case. Ecol Modell 2015. [DOI: 10.1016/j.ecolmodel.2014.11.019] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/15/2022]
12
Putman AI, Carbone I. Challenges in analysis and interpretation of microsatellite data for population genetic studies. Ecol Evol 2014; 4:4399-428. [PMID: 25540699 PMCID: PMC4267876 DOI: 10.1002/ece3.1305] [Citation(s) in RCA: 237] [Impact Index Per Article: 23.7] [Received: 05/02/2014] [Revised: 10/02/2014] [Accepted: 10/03/2014] [Indexed: 12/14/2022] Open
Abstract
Advancing technologies have facilitated the ever-widening application of genetic markers such as microsatellites into new systems and research questions in biology. In light of the data and experience accumulated from several years of using microsatellites, we present here a literature review that synthesizes the limitations of microsatellites in population genetic studies. With a focus on population structure, we review the widely used fixation (FST) statistics and Bayesian clustering algorithms and find that the former can be confusing and problematic for microsatellites and that the latter may be confounded by complex population models and lack power in certain cases. Clustering, multivariate analyses, and diversity-based statistics are increasingly being applied to infer population structure, but in some instances these methods lack formalization with microsatellites. Migration-specific methods perform well only under narrow constraints. We also examine the use of microsatellites for inferring effective population size, changes in population size, and deeper demographic history, and find that these methods are untested and/or highly context-dependent. Overall, each method possesses important weaknesses for use with microsatellites, and there are significant constraints on inferences commonly made using microsatellite markers in the areas of population structure, admixture, and effective population size. To ameliorate and better understand these constraints, researchers are encouraged to analyze simulated datasets both prior to and following data collection and analysis, the latter of which is formalized within the approximate Bayesian computation framework. We also examine trends in the literature and show that microsatellites continue to be widely used, especially in non-human subject areas. This review assists with study design and molecular marker selection, facilitates sound interpretation of microsatellite data while fostering respect for their practical limitations, and identifies lessons that could be applied toward emerging markers and high-throughput technologies in population genetics.
Affiliation(s)
- Alexander I Putman
  - Department of Plant Pathology, North Carolina State University, Raleigh, North Carolina, 27695-7616
- Ignazio Carbone
  - Department of Plant Pathology, North Carolina State University, Raleigh, North Carolina, 27695-7616
13
Limpiti T, Amornbunchornvej C, Intarapanich A, Assawamakin A, Tongsima S. iNJclust: Iterative Neighbor-Joining Tree Clustering Framework for Inferring Population Structure. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:903-914. [PMID: 26356862 DOI: 10.1109/tcbb.2014.2322372] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/05/2023]
Abstract
Understanding genetic differences among populations is one of the most important issues in population genetics. Genetic variations, e.g., single nucleotide polymorphisms, are used to characterize commonality and difference of individuals from various populations. This paper presents an efficient graph-based clustering framework which operates iteratively on the Neighbor-Joining (NJ) tree called the iNJclust algorithm. The framework uses well-known genetic measurements, namely the allele-sharing distance, the neighbor-joining tree, and the fixation index. The behavior of the fixation index is utilized in the algorithm's stopping criterion. The algorithm provides an estimated number of populations, individual assignments, and relationships between populations as outputs. The clustering result is reported in the form of a binary tree, whose terminal nodes represent the final inferred populations and the tree structure preserves the genetic relationships among them. The clustering performance and the robustness of the proposed algorithm are tested extensively using simulated and real data sets from bovine, sheep, and human populations. The result indicates that the number of populations within each data set is reasonably estimated, the individual assignment is robust, and the structure of the inferred population tree corresponds to the intrinsic relationships among populations within the data.
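The allele-sharing distance that this framework takes as input can be sketched as below. The example assumes biallelic SNP genotypes coded as 0, 1, or 2 copies of one allele, with `None` marking a missing call, and defines the per-locus distance as |g1 - g2| / 2, that is, one minus the fraction of alleles shared identical by state. The abstract does not spell out the exact definition used by iNJclust, so this coding, the missing-data handling, and the function names are assumptions for illustration.

```python
def allele_sharing_distance(genotypes_a, genotypes_b):
    """Mean allele-sharing distance between two individuals.

    Genotypes are allele counts (0, 1, 2); loci where either call is
    None are skipped. Returns a value between 0 (identical) and 1.
    """
    if len(genotypes_a) != len(genotypes_b):
        raise ValueError("both individuals must be typed at the same loci")

    total = 0.0
    compared = 0
    for g_a, g_b in zip(genotypes_a, genotypes_b):
        if g_a is None or g_b is None:
            continue  # missing call: locus contributes nothing
        total += abs(g_a - g_b) / 2
        compared += 1

    if compared == 0:
        raise ValueError("no locus is called in both individuals")
    return total / compared


def distance_matrix(individuals):
    """Symmetric matrix of pairwise allele-sharing distances."""
    n = len(individuals)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = allele_sharing_distance(individuals[i], individuals[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    # Three hypothetical individuals typed at four SNPs.
    samples = [[0, 1, 2, 0], [0, 1, 2, 1], [2, 1, 0, None]]
    for row in distance_matrix(samples):
        print([round(value, 3) for value in row])
```

A matrix of this kind is the usual starting point for building a neighbor-joining tree, which is the structure the abstract says the algorithm then splits iteratively.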
14
Wangkumhang P, Shaw PJ, Chaichoompu K, Ngamphiw C, Assawamakin A, Nuinoon M, Sripichai O, Svasti S, Fucharoen S, Praphanphoj V, Tongsima S. Insight into the peopling of Mainland Southeast Asia from Thai population genetic structure. PLoS One 2013; 8:e79522. [PMID: 24223962 PMCID: PMC3817124 DOI: 10.1371/journal.pone.0079522] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 06/20/2013] [Accepted: 09/23/2013] [Indexed: 12/22/2022] Open
Abstract
There is considerable ethno-linguistic and genetic variation among human populations in Asia, although tracing the origins of this diversity is complicated by migration events. Thailand is at the center of Mainland Southeast Asia (MSEA), a region within Asia that has not been extensively studied. Genetic substructure may exist in the Thai population, since waves of migration from southern China throughout its recent history may have contributed to substantial gene flow. Autosomal SNP data were collated for 438,503 markers from 992 Thai individuals. Using the available self-reported regional origin, four Thai subpopulations genetically distinct from each other and from other Asian populations were resolved by Neighbor-Joining analysis using a 41,569 marker subset. Using an independent Principal Components-based unsupervised clustering approach, four major MSEA subpopulations were resolved in which regional bias was apparent. A major ancestry component was common to these MSEA subpopulations and distinguishes them from other Asian subpopulations. On the other hand, these MSEA subpopulations were admixed with other ancestries, in particular one shared with Chinese. Subpopulation clustering using only Thai individuals and the complete marker set resolved four subpopulations, which are distributed differently across Thailand. A Sino-Thai subpopulation was concentrated in the Central region of Thailand, although this constituted a minority in an otherwise diverse region. Among the most highly differentiated markers which distinguish the Thai subpopulations, several map to regions known to affect phenotypic traits such as skin pigmentation and susceptibility to common diseases. The subpopulation patterns elucidated have important implications for evolutionary and medical genetics. The subpopulation structure within Thailand may reflect the contributions of different migrants throughout the history of MSEA. 
The information will also be important for genetic association studies to account for population-structure confounding effects.
Affiliation(s)
- Pongsakorn Wangkumhang
  - National Center for Genetic Engineering and Biotechnology (BioTeC), Khlong Luang, Pathum Thani, Thailand
  - Inter-Department Program of Biomedical Sciences, Chulalongkorn University, Pathumwan, Bangkok, Thailand
- Philip James Shaw
  - National Center for Genetic Engineering and Biotechnology (BioTeC), Khlong Luang, Pathum Thani, Thailand
- Kridsadakorn Chaichoompu
  - National Center for Genetic Engineering and Biotechnology (BioTeC), Khlong Luang, Pathum Thani, Thailand
- Chumpol Ngamphiw
  - National Center for Genetic Engineering and Biotechnology (BioTeC), Khlong Luang, Pathum Thani, Thailand
  - Inter-Department Program of Biomedical Sciences, Chulalongkorn University, Pathumwan, Bangkok, Thailand
- Manit Nuinoon
  - School of Allied Health Sciences and Public Health, Walailak University, Thai Buri, Nakhon Sri Thammarat, Thailand
- Orapan Sripichai
  - Thalassemia Research Center, Mahidol University, Salaya, Nakhon Pathom, Thailand
- Saovaros Svasti
  - Thalassemia Research Center, Mahidol University, Salaya, Nakhon Pathom, Thailand
- Suthat Fucharoen
  - Thalassemia Research Center, Mahidol University, Salaya, Nakhon Pathom, Thailand
- Verayuth Praphanphoj
  - Center for Medical Genetics Research, Rajanukul Institute, Dindaeng, Bangkok, Thailand
- Sissades Tongsima
  - National Center for Genetic Engineering and Biotechnology (BioTeC), Khlong Luang, Pathum Thani, Thailand
15
Liu Y, Nyunoya T, Leng S, Belinsky SA, Tesfaigzi Y, Bruse S. Softwares and methods for estimating genetic ancestry in human populations. Hum Genomics 2013; 7:1. [PMID: 23289408 PMCID: PMC3542037 DOI: 10.1186/1479-7364-7-1] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.6] [Received: 09/11/2012] [Accepted: 11/26/2012] [Indexed: 01/10/2023] Open
Abstract
The estimation of genetic ancestry in human populations has important applications in medical genetic studies. Genetic ancestry is used to control for population stratification in genetic association studies, and is used to understand the genetic basis for ethnic differences in disease susceptibility. In this review, we present an overview of genetic ancestry estimation in human disease studies, followed by a review of popular softwares and methods used for this estimation.
Affiliation(s)
- Yushi Liu
- Lovelace Respiratory Research Institute, Albuquerque, NM 87108, USA
|
16
|
Neuditschko M, Khatkar MS, Raadsma HW. NetView: a high-definition network-visualization approach to detect fine-scale population structures from genome-wide patterns of variation. PLoS One 2012; 7:e48375. [PMID: 23152744 PMCID: PMC3485224 DOI: 10.1371/journal.pone.0048375] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.8] [Received: 03/24/2012] [Accepted: 09/25/2012] [Indexed: 02/06/2023] Open
Abstract
High-throughput sequencing and single nucleotide polymorphism (SNP) genotyping can be used to infer complex population structures. Fine-scale population structure analysis tracing individual ancestry remains one of the major challenges. Based on network theory and recent advances in SNP chip technology, we investigated an unsupervised network clustering method called Super Paramagnetic Clustering (Spc). When applied to whole-genome marker data it identifies the natural divisions of groups of individuals into population clusters without use of prior ancestry information. Furthermore, we optimised an analysis pipeline called NetView, a high-definition network visualization, starting with computation of genetic distance, followed by clustering using Spc and finally visualization of clusters with Cytoscape. We compared NetView against commonly used methodologies including Principal Component Analyses (PCA) and a model-based algorithm, Admixture, on whole-genome-wide SNP data derived from three previously described data sets: simulated (2.5 million SNPs, 5 populations), human (1.4 million SNPs, 11 populations) and cattle (32,653 SNPs, 19 populations). We demonstrate that individuals can be effectively allocated to their correct population whilst simultaneously revealing fine-scale structure within the populations. Analyzing the human HapMap populations, we identified unexpected genetic relatedness among individuals, and population stratification within the Indian, African and Mexican samples. In the cattle data set, we correctly assigned all individuals to their respective breeds and detected fine-scale population sub-structures reflecting different sample origins and phenotypes. The NetView pipeline is computationally extremely efficient and can be easily applied on large-scale genome-wide data sets to assign individuals to particular populations and to reproduce fine-scale population structures without prior knowledge of individual ancestry. 
NetView can be used on any data from which a genetic relationship/distance between individuals can be calculated.
Affiliation(s)
- Markus Neuditschko
- Reprogen-Animal Bioscience, Faculty of Veterinary Science, University of Sydney, Camden, New South Wales, Australia.
|
17
|
Abstract
A large number of algorithms have been developed to classify individuals into discrete populations using genetic data. Recent results show that the information used by both model-based clustering methods and principal components analysis can be summarized by a matrix of pairwise similarity measures between individuals. Similarity matrices have been constructed in a number of ways, usually treating markers as independent but differing in the weighting given to polymorphisms of different frequencies. Additionally, methods are now being developed that take linkage into account. We review several such matrices and evaluate their information content. A two-stage approach for population identification is to first construct a similarity matrix and then perform clustering. We review a range of common clustering algorithms and evaluate their performance through a simulation study. The clustering step can be performed either on the matrix or by first using a dimension-reduction technique; we find that the latter approach substantially improves the performance of most algorithms. Based on these results, we describe the population structure signal contained in each similarity matrix and find that accounting for linkage leads to significant improvements for sequence data. We also perform a comparison on real data, where we find that population genetics models outperform generic clustering approaches, particularly with regard to robustness for features such as relatedness between individuals.
Affiliation(s)
- Daniel John Lawson
- Heilbronn Institute for Mathematical Research, School of Mathematics, University of Bristol, Bristol BS8 1TW, UK.
|
18
|
Lenstra JA, Groeneveld LF, Eding H, Kantanen J, Williams JL, Taberlet P, Nicolazzi EL, Sölkner J, Simianer H, Ciani E, Garcia JF, Bruford MW, Ajmone-Marsan P, Weigend S. Molecular tools and analytical approaches for the characterization of farm animal genetic diversity. Anim Genet 2012; 43:483-502. [DOI: 10.1111/j.1365-2052.2011.02309.x] [Citation(s) in RCA: 86] [Impact Index Per Article: 7.2] [Accepted: 09/15/2011] [Indexed: 12/30/2022]
Affiliation(s)
- J. A. Lenstra
- Faculty of Veterinary Medicine; Utrecht University; Utrecht; The Netherlands
- L. F. Groeneveld
- Institute of Farm Animal Genetics; Friedrich-Loeffler-Institut; Hoeltystr. 10; 31535; Neustadt; Germany
- H. Eding
- Animal Evaluations Unit; CRV; Arnhem; The Netherlands
- J. Kantanen
- Biotechnology and Food Research; MTT Agrifood Research Finland; FI-31600; Jokioinen; Finland
- J. L. Williams
- Parco Tecnologico Padano; via Einstein; 2600; Lodi; Italy
- P. Taberlet
- Laboratoire d'Ecologie Alpine; Université Joseph Fourier; BP 53; Grenoble; France
- E. L. Nicolazzi
- Istituto di Zootecnica and BioDNA Research Centre; Università Cattolica del Sacro Cuore; Piacenza; Italy
- J. Sölkner
- Department of Sustainable Agricultural Systems; Animal Breeding Group; BOKU - University of Natural Resources and Life Sciences; Vienna; Austria
- H. Simianer
- Department of Animal Sciences; Animal Breeding and Genetics Group; Georg-August-University Göttingen; 37075; Göttingen; Germany
- E. Ciani
- Department of General and Environmental Physiology; University of Bari “Aldo Moro”; Bari; Italy
- J. F. Garcia
- Universidade Estadual Paulista; Araçatuba; Brazil
- M. W. Bruford
- Organisms and Environment Division; School of Biosciences; Cardiff University; Cardiff; UK
- P. Ajmone-Marsan
- Istituto di Zootecnica and BioDNA Research Centre; Università Cattolica del Sacro Cuore; Piacenza; Italy
- S. Weigend
- Institute of Farm Animal Genetics; Friedrich-Loeffler-Institut; Hoeltystr. 10; 31535; Neustadt; Germany
|
19
|
Limpiti T, Intarapanich A, Assawamakin A, Shaw PJ, Wangkumhang P, Piriyapongsa J, Ngamphiw C, Tongsima S. Study of large and highly stratified population datasets by combining iterative pruning principal component analysis and structure. BMC Bioinformatics 2011; 12:255. [PMID: 21699684 PMCID: PMC3148578 DOI: 10.1186/1471-2105-12-255] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 10/08/2010] [Accepted: 06/23/2011] [Indexed: 01/20/2023] Open
Abstract
Background The ever increasing sizes of population genetic datasets pose great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Non-parametric, Principal Component Analysis (PCA) based methods for resolving structure have been developed which rely on the TW test. Although PCA-based methods can resolve structure, they cannot infer ancestry. Model-based methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. This includes a new heuristic for detecting structure and incorporation of the structure patterns inferred by a PCA method to complement STRUCTURE analysis. Results A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus better-suited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA. Conclusions The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. 
The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca.
Affiliation(s)
- Tulaya Limpiti
- Faculty of Engineering, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
|