Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Wu B, Liu N, Zhao H. PSMIX: an R package for population structure inference via maximum likelihood method. BMC Bioinformatics 2006;7:317. [PMID: 16792813 PMCID: PMC1550430 DOI: 10.1186/1471-2105-7-317] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.2] [Received: 01/23/2006] [Accepted: 06/22/2006] [Indexed: 01/06/2023] Open

Cited by Other Article(s)

Sethuraman A, Janzen FJ, Weisrock DW, Obrycki JJ. Insights from Population Genomics to Enhance and Sustain Biological Control of Insect Pests. Insects 2020;11:E462. [PMID: 32708047 PMCID: PMC7469154 DOI: 10.3390/insects11080462] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 06/18/2020] [Revised: 07/15/2020] [Accepted: 07/17/2020] [Indexed: 01/25/2023]

Chen Y, Liang KY, Tong P, Beaty TH, Barnes KC, Linda Kao WH. A pseudolikelihood approach for assessing genetic association in case-control studies with unmeasured population structure. Stat Methods Med Res 2020;29:3153-3165. [PMID: 32393154 DOI: 10.1177/0962280220921212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]

Alhusain L, Hafez AM. Nonparametric approaches for population structure analysis. Hum Genomics 2018;12:25. [PMID: 29743099 PMCID: PMC5944014 DOI: 10.1186/s40246-018-0156-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 03/06/2018] [Accepted: 04/24/2018] [Indexed: 12/28/2022] Open

de Los Campos G, Veturi Y, Vazquez AI, Lehermeier C, Pérez-Rodríguez P. Incorporating Genetic Heterogeneity in Whole-Genome Regressions Using Interactions. J Agric Biol Environ Stat 2015;20:467-490. [PMID: 26660276 PMCID: PMC4666286 DOI: 10.1007/s13253-015-0222-5] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 02/19/2015] [Accepted: 09/16/2015] [Indexed: 11/22/2022]

Abstract

Naturally and artificially selected populations usually exhibit some degree of stratification. In Genome-Wide Association Studies and in Whole-Genome Regressions (WGR) analyses, population stratification has been either ignored or dealt with as a potential confounder. However, systematic differences in allele frequency and in patterns of linkage disequilibrium can induce sub-population-specific effects. From this perspective, structure acts as an effect modifier rather than as a confounder. In this article, we extend WGR models commonly used in plant and animal breeding to allow for sub-population-specific effects. This is achieved by decomposing marker effects into main effects and interaction components that describe group-specific deviations. The model can be used both with variable selection and shrinkage methods and can be implemented using existing software for genomic selection. Using a wheat and a pig breeding data set, we compare parameter estimates and the prediction accuracy of the interaction WGR model with WGR analysis ignoring population stratification (across-group analysis) and with a stratified (i.e., within-sub-population) WGR analysis. The interaction model renders trait-specific estimates of the average correlation of effects between sub-populations; we find that such correlation not only depends on the extent of genetic differentiation in allele frequencies between groups but also varies among traits. The evaluation of prediction accuracy shows a modest superiority of the interaction model relative to the other two approaches. This superiority is the result of better stability in performance of the interaction models across data sets and traits; indeed, in almost all cases, the interaction model was either the best performing model or it performed close to the best performing model.
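The decomposition described in this abstract (a main effect per marker plus a group-specific deviation) can be sketched as a design-matrix expansion. This is a minimal illustration, not the authors' implementation: the function names `interaction_design` and `predict`, the 0/1/2 marker coding, and the toy coefficients are all assumptions introduced here. Each row keeps its markers in a shared main-effect block and repeats them in the block of its own sub-population, so the effective effect of marker j in group g is main[j] + deviation[g][j].

```python
def interaction_design(X, groups):
    """Expand an n x p marker matrix into [X | X_g1 | X_g2 | ...].

    The first p columns carry the main effects. Each later block of p
    columns holds the same markers, but only for rows in that group
    (zeros elsewhere), so its coefficients act as group deviations.
    """
    labels = sorted(set(groups))
    design = []
    for x_row, g in zip(X, groups):
        row = list(x_row)  # main-effect block, shared by all groups
        for label in labels:
            row.extend(x_row if g == label else [0] * len(x_row))
        design.append(row)
    return design, labels


def predict(design_row, main, deviations, labels):
    """Linear predictor for one expanded row.

    `main` is a list of p main effects; `deviations` maps each group
    label to its list of p deviations (hypothetical values here).
    """
    coef = list(main)
    for label in labels:
        coef.extend(deviations[label])
    return sum(x * c for x, c in zip(design_row, coef))


# Toy data (assumed): three individuals, two markers, two sub-populations.
X = [[0, 1], [2, 1], [1, 0]]
groups = ["A", "B", "A"]
design, labels = interaction_design(X, groups)
```

Because the expanded matrix is an ordinary regression design, any shrinkage or variable-selection routine that accepts a marker matrix could in principle be applied to it, which is consistent with the abstract's remark that existing genomic-selection software suffices.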

Lehermeier C, Schön CC, de Los Campos G. Assessment of Genetic Heterogeneity in Structured Plant Populations Using Multivariate Whole-Genome Regression Models. Genetics 2015;201:323-37. [PMID: 26122758 DOI: 10.1534/genetics.115.177394] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 04/14/2015] [Accepted: 06/25/2015] [Indexed: 01/27/2023] Open

Abstract

Plant breeding populations exhibit varying levels of structure and admixture; these features are likely to induce heterogeneity of marker effects across subpopulations. Traditionally, structure has been dealt with as a potential confounder, and various methods exist to "correct" for population stratification. However, these methods induce a mean correction that does not account for heterogeneity of marker effects. The animal breeding literature offers a few recent studies that consider modeling genetic heterogeneity in multibreed data, using multivariate models. However, these methods have received little attention in plant breeding where population structure can have different forms. In this article we address the problem of analyzing data from heterogeneous plant breeding populations, using three approaches: (a) a model that ignores population structure [A-genome-based best linear unbiased prediction (A-GBLUP)], (b) a stratified (i.e., within-group) analysis (W-GBLUP), and (c) a multivariate approach that uses multigroup data and accounts for heterogeneity (MG-GBLUP). The performance of the three models was assessed on three different data sets: a diversity panel of rice (Oryza sativa), a maize (Zea mays L.) half-sib panel, and a wheat (Triticum aestivum L.) data set that originated from plant breeding programs. The estimated genomic correlations between subpopulations varied from null to moderate, depending on the genetic distance between subpopulations and traits. Our assessment of prediction accuracy features cases where ignoring population structure leads to a parsimonious more powerful model as well as others where the multivariate and stratified approaches have higher predictive power. In general, the multivariate approach appeared slightly more robust than either the A- or the W-GBLUP.

Parry RM, Wang MD. A fast least-squares algorithm for population inference. BMC Bioinformatics 2013;14:28. [PMID: 23343408 PMCID: PMC3602075 DOI: 10.1186/1471-2105-14-28] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/15/2012] [Accepted: 11/06/2012] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Population inference is an important problem in genetics used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual's genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling methods such as Markov Chain Monte Carlo methods or faster gradient based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model motivated by a Euclidean interpretation of the genotype feature space. This results in a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning.

RESULTS

We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The Least-squares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, small number of samples, or greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than Least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. Significantly, the least-squares approach nearly always converges 1.5- to 6-times faster.

CONCLUSIONS

The computational advantage of the least-squares approach along with its good estimation performance warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.
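To make the two objectives in this abstract concrete, the sketch below evaluates a binomial log-likelihood and a least-squares loss for given Q and P. It is an illustration under stated assumptions, not the paper's algorithm: genotypes are assumed to be coded as 0, 1 or 2 copies of an allele, the expected allele frequency for individual i at marker j is taken as the sum over k of Q[i][k] * P[k][j], and the least-squares loss is assumed to be the squared difference between the genotype and twice that frequency. The function names are hypothetical, and no optimisation step is shown.

```python
import math


def expected_allele_freq(Q, P):
    """M[i][j] = sum_k Q[i][k] * P[k][j] (membership-weighted frequency)."""
    n_markers = len(P[0])
    return [
        [sum(q * P[k][j] for k, q in enumerate(q_row)) for j in range(n_markers)]
        for q_row in Q
    ]


def binomial_log_likelihood(G, Q, P):
    """Log-likelihood with each genotype g ~ Binomial(2, M[i][j]).

    Assumes every M[i][j] lies strictly between 0 and 1.
    """
    M = expected_allele_freq(Q, P)
    total = 0.0
    for g_row, m_row in zip(G, M):
        for g, m in zip(g_row, m_row):
            total += (math.log(math.comb(2, g))
                      + g * math.log(m)
                      + (2 - g) * math.log(1.0 - m))
    return total


def least_squares_loss(G, Q, P):
    """Sum of squared differences between genotypes and 2 * M."""
    M = expected_allele_freq(Q, P)
    return sum(
        (g - 2.0 * m) ** 2
        for g_row, m_row in zip(G, M)
        for g, m in zip(g_row, m_row)
    )


# Toy inputs (assumed): 2 individuals, 2 ancestral populations, 2 markers.
Q = [[1.0, 0.0], [0.5, 0.5]]   # rows sum to 1
P = [[0.2, 0.8], [0.6, 0.4]]   # allele frequency per population and marker
G = [[0, 2], [1, 1]]           # allele counts
```

The least-squares form avoids logarithms entirely, which hints at why a squared-error formulation can be cheaper to optimise than the likelihood.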

Kumar R, Williams LK, Kato A, Peterson EL, Favoreto S, Hulse K, Wang D, Beckman K, Thyne S, LeNoir M, Meade K, Lanfear DE, Levin AM, Favro D, Yang JJ, Weiss K, Boushey HA, Grammer L, Avila PC, Burchard EG, Schleimer R. Genetic variation in B cell-activating factor of the TNF family (BAFF) and asthma exacerbations among African American subjects. J Allergy Clin Immunol 2012;130:996-9.e6. [PMID: 22728080 DOI: 10.1016/j.jaci.2012.04.047] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 06/20/2011] [Revised: 01/30/2012] [Accepted: 04/11/2012] [Indexed: 10/28/2022]

Blair C, Weigel DE, Balazik M, Keeley ATH, Walker FM, Landguth E, Cushman S, Murphy M, Waits L, Balkenhol N. A simulation-based evaluation of methods for inferring linear barriers to gene flow. Mol Ecol Resour 2012;12:822-33. [PMID: 22551194 DOI: 10.1111/j.1755-0998.2012.03151.x] [Citation(s) in RCA: 116] [Impact Index Per Article: 9.7] [Indexed: 11/28/2022]

Abstract

Different analytical techniques used on the same data set may lead to different conclusions about the existence and strength of genetic structure. Therefore, reliable interpretation of the results from different methods depends on the efficacy and reliability of different statistical methods. In this paper, we evaluated the performance of multiple analytical methods to detect the presence of a linear barrier dividing populations. We were specifically interested in determining if simulation conditions, such as dispersal ability and genetic equilibrium, affect the power of different analytical methods for detecting barriers. We evaluated two boundary detection methods (Monmonier's algorithm and WOMBLING), two spatial Bayesian clustering methods (TESS and GENELAND), an aspatial clustering approach (STRUCTURE), and two recently developed, non-Bayesian clustering methods [PSMIX and discriminant analysis of principal components (DAPC)]. We found that clustering methods had higher success rates than boundary detection methods and also detected the barrier more quickly. All methods detected the barrier more quickly when dispersal was long distance in comparison to short-distance dispersal scenarios. Bayesian clustering methods performed best overall, both in terms of highest success rates and lowest time to barrier detection, with GENELAND showing the highest power. None of the methods suggested a continuous linear barrier when the data were generated under an isolation-by-distance (IBD) model. However, the clustering methods had higher potential for leading to incorrect barrier inferences under IBD unless strict criteria for successful barrier detection were implemented. Based on our findings and those of previous simulation studies, we discuss the utility of different methods for detecting linear barriers to gene flow.

Ding L, Wiener H, Abebe T, Altaye M, Go RCP, Kercsmar C, Grabowski G, Martin LJ, Khurana Hershey GK, Chakorborty R, Baye TM. Comparison of measures of marker informativeness for ancestry and admixture mapping. BMC Genomics 2011;12:622. [PMID: 22185208 PMCID: PMC3276602 DOI: 10.1186/1471-2164-12-622] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Received: 07/13/2011] [Accepted: 12/20/2011] [Indexed: 11/10/2022] Open

Choi SC, Hey J. Joint inference of population assignment and demographic history. Genetics 2011;189:561-77. [PMID: 21775468 DOI: 10.1534/genetics.111.129205] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 11/18/2022] Open

Onogi A, Nurimoto M, Morita M. Characterization of a Bayesian genetic clustering algorithm based on a Dirichlet process prior and comparison among Bayesian clustering methods. BMC Bioinformatics 2011;12:263. [PMID: 21708038 PMCID: PMC3161044 DOI: 10.1186/1471-2105-12-263] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 10/16/2010] [Accepted: 06/28/2011] [Indexed: 11/16/2022] Open

Abstract

Background

A Bayesian approach based on a Dirichlet process (DP) prior is useful for inferring genetic population structures because it can infer the number of populations and the assignment of individuals simultaneously. However, the properties of the DP prior method are not well understood, and therefore, the use of this method is relatively uncommon. We characterized the DP prior method to increase its practical use.

Results

First, we evaluated the usefulness of the sequentially-allocated merge-split (SAMS) sampler, which is a technique for improving the mixing of Markov chain Monte Carlo algorithms. Although this sampler has been implemented in a preceding program, HWLER, its effectiveness has not been investigated. We showed that this sampler was effective for population structure analysis. Implementation of this sampler was useful with regard to the accuracy of inference and computational time. Second, we examined the effect of a hyperparameter for the prior distribution of allele frequencies and showed that the specification of this parameter was important and could be resolved by considering the parameter as a variable. Third, we compared the DP prior method with other Bayesian clustering methods and showed that the DP prior method was suitable for data sets with unbalanced sample sizes among populations. In contrast, although current popular algorithms for population structure analysis, such as those implemented in STRUCTURE, were suitable for data sets with uniform sample sizes, inferences with these algorithms for unbalanced sample sizes tended to be less accurate than those with the DP prior method.

Conclusions

The clustering method based on the DP prior was found to be useful because it can infer the number of populations and simultaneously assign individuals into populations, and it is suitable for data sets with unbalanced sample sizes among populations. Here we presented a novel program, DPART, that implements the SAMS sampler and can consider the hyperparameter for the prior distribution of allele frequencies to be a variable.

Gould W, Peterson EL, Karungi G, Zoratti A, Gaggin J, Toma G, Yan S, Levin AM, Yang JJ, Wells K, Wang M, Burke RR, Beckman K, Popadic D, Land SJ, Kumar R, Seibold MA, Lanfear DE, Burchard EG, Williams LK. Factors predicting inhaled corticosteroid responsiveness in African American patients with asthma. J Allergy Clin Immunol 2010;126:1131-8. [PMID: 20864153 DOI: 10.1016/j.jaci.2010.08.002] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 06/16/2010] [Revised: 07/30/2010] [Accepted: 08/02/2010] [Indexed: 01/13/2023]

Liu N, Zhao H, Patki A, Limdi NA, Allison DB. Controlling Population Structure in Human Genetic Association Studies with Samples of Unrelated Individuals. Stat Interface 2011;4:317-326. [PMID: 22308192 PMCID: PMC3269890 DOI: 10.4310/sii.2011.v4.n3.a6] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 05/31/2023]

Baye TM, Wilke RA. Mapping genes that predict treatment outcome in admixed populations. Pharmacogenomics J 2010;10:465-77. [PMID: 20921971 PMCID: PMC2991422 DOI: 10.1038/tpj.2010.71] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Received: 04/09/2010] [Revised: 07/07/2010] [Accepted: 08/05/2010] [Indexed: 01/19/2023]

Jin Y, Hu D, Peterson EL, Eng C, Levin AM, Wells K, Beckman K, Kumar R, Seibold MA, Karungi G, Zoratti A, Gaggin J, Campbell J, Galanter J, Chapela R, Rodríguez-Santana JR, Watson HG, Meade K, Lenoir M, Rodríguez-Cintrón W, Avila PC, Lanfear DE, Burchard EG, Williams LK. Dual-specificity phosphatase 1 as a pharmacogenetic modifier of inhaled steroid response among asthmatic patients. J Allergy Clin Immunol 2010;126:618-25.e1-2. [PMID: 20673984 DOI: 10.1016/j.jaci.2010.06.007] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Received: 11/17/2009] [Revised: 06/03/2010] [Accepted: 06/08/2010] [Indexed: 11/15/2022]

Kumar R, Seibold MA, Aldrich MC, Williams LK, Reiner AP, Colangelo L, Galanter J, Gignoux C, Hu D, Sen S, Choudhry S, Peterson EL, Rodriguez-Santana J, Rodriguez-Cintron W, Nalls MA, Leak TS, O'Meara E, Meibohm B, Kritchevsky SB, Li R, Harris TB, Nickerson DA, Fornage M, Enright P, Ziv E, Smith LJ, Liu K, Burchard EG. Genetic ancestry in lung-function predictions. N Engl J Med 2010;363:321-30. [PMID: 20647190 PMCID: PMC2922981 DOI: 10.1056/nejmoa0907897] [Citation(s) in RCA: 191] [Impact Index Per Article: 13.6] [Indexed: 01/10/2023]

Abstract

BACKGROUND

Self-identified race or ethnic group is used to determine normal reference standards in the prediction of pulmonary function. We conducted a study to determine whether the genetically determined percentage of African ancestry is associated with lung function and whether its use could improve predictions of lung function among persons who identified themselves as African American.

METHODS

We assessed the ancestry of 777 participants self-identified as African American in the Coronary Artery Risk Development in Young Adults (CARDIA) study and evaluated the relation between pulmonary function and ancestry by means of linear regression. We performed similar analyses of data for two independent cohorts of subjects identifying themselves as African American: 813 participants in the Health, Aging, and Body Composition (HABC) study and 579 participants in the Cardiovascular Health Study (CHS). We compared the fit of two types of models to lung-function measurements: models based on the covariates used in standard prediction equations and models incorporating ancestry. We also evaluated the effect of the ancestry-based models on the classification of disease severity in two asthma-study populations.

RESULTS

African ancestry was inversely related to forced expiratory volume in 1 second (FEV(1)) and forced vital capacity in the CARDIA cohort. These relations were also seen in the HABC and CHS cohorts. In predicting lung function, the ancestry-based model fit the data better than standard models. Ancestry-based models resulted in the reclassification of asthma severity (based on the percentage of the predicted FEV(1)) in 4 to 5% of participants.

CONCLUSIONS

Current predictive equations, which rely on self-identified race alone, may misestimate lung function among subjects who identify themselves as African American. Incorporating ancestry into normative equations may improve lung-function estimates and more accurately categorize disease severity. (Funded by the National Institutes of Health and others.)

François O, Durand E. Spatially explicit Bayesian clustering models in population genetics. Mol Ecol Resour 2010;10:773-84. [PMID: 21565089 DOI: 10.1111/j.1755-0998.2010.02868.x] [Citation(s) in RCA: 221] [Impact Index Per Article: 15.8] [Indexed: 02/02/2023]

Intarapanich A, Shaw PJ, Assawamakin A, Wangkumhang P, Ngamphiw C, Chaichoompu K, Piriyapongsa J, Tongsima S. Iterative pruning PCA improves resolution of highly structured populations. BMC Bioinformatics 2009;10:382. [PMID: 19930644 PMCID: PMC2790469 DOI: 10.1186/1471-2105-10-382] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Received: 04/30/2009] [Accepted: 11/23/2009] [Indexed: 12/12/2022] Open

Abstract

Background

Non-random patterns of genetic variation exist among individuals in a population owing to a variety of evolutionary factors. Therefore, populations are structured into genetically distinct subpopulations. As genotypic datasets become ever larger, it is increasingly difficult to correctly estimate the number of subpopulations and assign individuals to them. The computationally efficient non-parametric, chiefly Principal Components Analysis (PCA)-based methods are thus becoming increasingly relied upon for population structure analysis. Current PCA-based methods can accurately detect structure; however, the accuracy in resolving subpopulations and assigning individuals to them is wanting. When subpopulations are closely related to one another, they overlap in PCA space and appear as a conglomerate. This problem is exacerbated when some subpopulations in the dataset are genetically far removed from others. We propose a novel PCA-based framework which addresses this shortcoming.

Results

A novel population structure analysis algorithm called iterative pruning PCA (ipPCA) was developed which assigns individuals to subpopulations and infers the total number of subpopulations present. Genotypic data from simulated and real population datasets with different degrees of structure were analyzed. For datasets with simple structures, the subpopulation assignments of individuals made by ipPCA were largely consistent with the STRUCTURE, BAPS and AWclust algorithms. On the other hand, highly structured populations containing many closely related subpopulations could be accurately resolved only by ipPCA, and not by other methods.

Conclusion

The algorithm is computationally efficient and not constrained by the dataset complexity. This systematic subpopulation assignment approach removes the need for prior population labels, which could be advantageous when cryptic stratification is encountered in datasets containing individuals otherwise assumed to belong to a homogenous population.

Rodríguez-Ramilo ST, Toro MA, Fernández J. Assessing population genetic structure via the maximisation of genetic distance. Genet Sel Evol 2009;41:49. [PMID: 19900278 PMCID: PMC2776585 DOI: 10.1186/1297-9686-41-49] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 03/13/2009] [Accepted: 11/09/2009] [Indexed: 01/23/2023] Open

Abstract

Background

The inference of the hidden structure of a population is an essential issue in population genetics. Recently, several methods have been proposed to infer population structure in population genetics.

Methods

In this study, a new method to infer the number of clusters and to assign individuals to the inferred populations is proposed. This approach does not make any assumption on Hardy-Weinberg and linkage equilibrium. The implemented criterion is the maximisation (via a simulated annealing algorithm) of the averaged genetic distance between a predefined number of clusters. The performance of this method is compared with two Bayesian approaches: STRUCTURE and BAPS, using simulated data and also a real human data set.

Results

The simulations show that with a reduced number of markers, BAPS overestimates the number of clusters and presents a reduced proportion of correct groupings. The accuracy of the new method is approximately the same as for STRUCTURE. Also, in Hardy-Weinberg and linkage disequilibrium cases, BAPS performs incorrectly. In these situations, STRUCTURE and the new method show an equivalent behaviour with respect to the number of inferred clusters, although the proportion of correct groupings is slightly better with the new method. Re-establishing equilibrium with the randomisation procedures improves the precision of the Bayesian approaches. All methods have a good precision for F_ST ≥ 0.03, but only STRUCTURE estimates the correct number of clusters for F_ST as low as 0.01. In situations with a high number of clusters or a more complex population structure, MGD performs better than STRUCTURE and BAPS. The results for a human data set analysed with the new method are congruent with the geographical regions previously found.

Conclusion

This new method used to infer the hidden structure in a population, based on the maximisation of the genetic distance and not taking into consideration any assumption about Hardy-Weinberg and linkage equilibrium, performs well under different simulated scenarios and with real data. Therefore, it could be a useful tool to determine genetically homogeneous groups, especially in those situations where the number of clusters is high, with complex population structure and where Hardy-Weinberg and/or linkage disequilibrium are present.
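The criterion described in this abstract, maximising the averaged genetic distance between a predefined number of clusters by simulated annealing, can be sketched as follows. This is a hedged illustration only: the pairwise distance matrix `D`, the way the between-cluster average is computed, the single-individual move, and the cooling schedule are all assumptions made here, and the published method may differ in every one of them. The sketch requires at least two clusters and at least as many individuals as clusters.

```python
import math
import random


def mean_between_cluster_distance(D, labels, k):
    """Average, over cluster pairs, of the mean distance between members."""
    members = [[i for i, c in enumerate(labels) if c == g] for g in range(k)]
    if any(not m for m in members):
        return float("-inf")  # reject partitions with an empty cluster
    pair_means = []
    for a in range(k):
        for b in range(a + 1, k):
            dists = [D[i][j] for i in members[a] for j in members[b]]
            pair_means.append(sum(dists) / len(dists))
    return sum(pair_means) / len(pair_means)


def anneal(D, k, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over cluster labels; returns the best partition seen."""
    rng = random.Random(seed)
    n = len(D)
    labels = [i % k for i in range(n)]  # start with every cluster non-empty
    score = mean_between_cluster_distance(D, labels, k)
    best_labels, best_score = labels[:], score
    t = t0
    for _ in range(steps):
        i = rng.randrange(n)
        old = labels[i]
        labels[i] = rng.randrange(k)  # propose moving one individual
        new_score = mean_between_cluster_distance(D, labels, k)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature t falls.
        if new_score >= score or rng.random() < math.exp((new_score - score) / t):
            score = new_score
            if score > best_score:
                best_labels, best_score = labels[:], score
        else:
            labels[i] = old  # undo the rejected move
        t *= cooling
    return best_labels, best_score


# Toy distance matrix (assumed): two tight pairs that are far from each other.
D = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
best_labels, best_score = anneal(D, k=2)
```

Tracking the best partition separately from the current one means the returned score can never be worse than the starting partition, even though the chain itself is allowed to move downhill.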

Vaughan LK, Divers J, Padilla M, Redden DT, Tiwari HK, Pomp D, Allison DB. The use of plasmodes as a supplement to simulations: A simple example evaluating individual admixture estimation methodologies. Comput Stat Data Anal 2009;53:1755-1766. [PMID: 20161321 DOI: 10.1016/j.csda.2008.02.032] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022]

Bingham E, Mannila H. Complexity control in a mixture model by the Hardy–Weinberg equilibrium. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.07.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]

Arrigo N, Tuszynski JW, Ehrich D, Gerdes T, Alvarez N. Evaluating the impact of scoring parameters on the structure of intra-specific genetic variation using RawGeno, an R package for automating AFLP scoring. BMC Bioinformatics 2009;10:33. [PMID: 19171029 PMCID: PMC2656475 DOI: 10.1186/1471-2105-10-33] [Citation(s) in RCA: 125] [Impact Index Per Article: 8.3] [Received: 08/18/2008] [Accepted: 01/26/2009] [Indexed: 11/10/2022] Open

Abstract

Background

Since the transfer and application of modern sequencing technologies to the analysis of amplified fragment-length polymorphisms (AFLP), evolutionary biologists have included an increasing number of samples and markers in their studies. Although justified in this context, the use of automated scoring procedures may result in technical biases that weaken the power and reliability of further analyses.

Results

Using a new scoring algorithm, RawGeno, we show that scoring errors – in particular "bin oversplitting" (i.e. when variant sizes of the same AFLP marker are not considered as homologous) and "technical homoplasy" (i.e. when two AFLP markers that differ slightly in size are mistakenly considered as being homologous) – induce a loss of discriminatory power, decrease the robustness of results and, in extreme cases, introduce erroneous information in genetic structure analyses. In the present study, we evaluate several descriptive statistics that can be used to optimize the scoring of the AFLP analysis, and we describe a new statistic, the information content per bin (I_bin) that represents a valuable estimator during the optimization process. This statistic can be computed at any stage of the AFLP analysis without requiring the inclusion of replicated samples. Finally, we show that downstream analyses are not equally sensitive to scoring errors. Indeed, although a reasonable amount of flexibility is allowed during the optimization of the scoring procedure without causing considerable changes in the detection of genetic structure patterns, notable discrepancies are observed when estimating genetic diversities from differently scored datasets.

Conclusion

Our algorithm appears to perform as well as a commercial program in automating AFLP scoring, at least in the context of population genetics or phylogeographic studies. To our knowledge, RawGeno is the only freely available public-domain software for fully automated AFLP scoring, from electropherogram files to user-defined working binary matrices. RawGeno was implemented in an R CRAN package (with a user-friendly GUI) and can be found at .


Sazonova N, Harner EJ. Haplotype inference and block partitioning in mixed population samples. J Bioinform Comput Biol 2008;6:1177-92. [PMID: 19090023 DOI: 10.1142/s0219720008003898] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2007] [Revised: 07/25/2008] [Accepted: 07/26/2008] [Indexed: 11/18/2022]

Yang JJ, Burchard EG, Choudhry S, Johnson CC, Ownby DR, Favro D, Chen J, Akana M, Ha C, Kwok PY, Krajenta R, Havstad SL, Joseph CL, Seibold MA, Shriver MD, Williams LK. Differences in allergic sensitization by self-reported race and genetic ancestry. J Allergy Clin Immunol 2008;122:820-827.e9. [PMID: 19014772 DOI: 10.1016/j.jaci.2008.07.044] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2008] [Revised: 07/29/2008] [Accepted: 07/30/2008] [Indexed: 01/10/2023]

Aldrich MC, Selvin S, Hansen HM, Barcellos LF, Wrensch MR, Sison JD, Quesenberry CP, Kittles RA, Silva G, Buffler PA, Seldin MF, Wiencke JK. Comparison of statistical methods for estimating genetic admixture in a lung cancer study of African Americans and Latinos. Am J Epidemiol 2008;168:1035-46. [PMID: 18791191 DOI: 10.1093/aje/kwn224] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

NADIR ALVAREZ, NILS ARRIGO, CONSORTIUM INTRABIODIV. SIMIL: an r (CRAN) scripts collection for computing genetic structure similarities based on structure 2 outputs. Mol Ecol Resour 2008;8:757-62. [DOI: 10.1111/j.1755-0998.2007.02076.x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]

Santafé G, Lozano JA, Larrañaga P. Inference of population structure using genetic markers and a Bayesian model averaging approach for clustering. J Comput Biol 2008;15:207-20. [PMID: 18312151 DOI: 10.1089/cmb.2007.0051] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open

Tiwari HK, Barnholtz-Sloan J, Wineinger N, Padilla MA, Vaughan LK, Allison DB. Review and evaluation of methods correcting for population stratification with a focus on underlying statistical principles. Hum Hered 2008;66:67-86. [PMID: 18382087 PMCID: PMC2803696 DOI: 10.1159/000119107] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open

NADIR ALVAREZ, NILS ARRIGO, CONSORTIUM INTRABIODIV. SIMIL: an r (CRAN) scripts collection for computing genetic structure similarities based on structure 2 outputs. Mol Ecol Resour 2008. [DOI: 10.1111/j.1471-8286.2007.02076.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]

Gao X, Starmer JD. AWclust: point-and-click software for non-parametric population structure analysis. BMC Bioinformatics 2008;9:77. [PMID: 18237431 PMCID: PMC2253519 DOI: 10.1186/1471-2105-9-77] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2007] [Accepted: 01/31/2008] [Indexed: 01/22/2023] Open

Hu D, Ziv E. Confounding in genetic association studies and its solutions. Methods Mol Biol 2008;448:31-39. [PMID: 18370229 DOI: 10.1007/978-1-59745-205-2_3] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/26/2023]

Bonin A, Ehrich D, Manel S. Statistical analysis of amplified fragment length polymorphism data: a toolbox for molecular ecologists and evolutionists. Mol Ecol 2007;16:3737-58. [PMID: 17850542 DOI: 10.1111/j.1365-294x.2007.03435.x] [Citation(s) in RCA: 300] [Impact Index Per Article: 17.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]

Nievergelt CM, Libiger O, Schork NJ. Generalized analysis of molecular variance. PLoS Genet 2007;3:e51. [PMID: 17411342 PMCID: PMC1847693 DOI: 10.1371/journal.pgen.0030051] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2006] [Accepted: 02/22/2007] [Indexed: 01/21/2023] Open

Abstract

Many studies in the fields of genetic epidemiology and applied population genetics are predicated on, or require, an assessment of the genetic background diversity of the individuals chosen for study. A number of strategies have been developed for assessing genetic background diversity. These strategies typically focus on genotype data collected on the individuals in the study, based on a panel of DNA markers. However, many of these strategies are either rooted in cluster analysis techniques, and hence suffer from problems inherent to the assignment of the biological and statistical meaning to resulting clusters, or have formulations that do not permit easy and intuitive extensions. We describe a very general approach to the problem of assessing genetic background diversity that extends the analysis of molecular variance (AMOVA) strategy introduced by Excoffier and colleagues some time ago. As in the original AMOVA strategy, the proposed approach, termed generalized AMOVA (GAMOVA), requires a genetic similarity matrix constructed from the allelic profiles of individuals under study and/or allele frequency summaries of the populations from which the individuals have been sampled. The proposed strategy can be used to either estimate the fraction of genetic variation explained by grouping factors such as country of origin, race, or ethnicity, or to quantify the strength of the relationship of the observed genetic background variation to quantitative measures collected on the subjects, such as blood pressure levels or anthropometric measures. Since the formulation of our test statistic is rooted in multivariate linear models, sets of variables can be related to genetic background in multiple regression-like contexts. GAMOVA can also be used to complement graphical representations of genetic diversity such as tree diagrams (dendrograms) or heatmaps. 
We examine features, advantages, and power of the proposed procedure and showcase its flexibility by using it to analyze a wide variety of published data sets, including data from the Human Genome Diversity Project, classical anthropometry data collected by Howells, and the International HapMap Project.

Humans exhibit great genetic diversity. Understanding the factors that contribute to and sustain this diversity is an important research area. Not only can such understanding shed light on human origins, but it can also assist in the discovery of genes and genetic factors that contribute to debilitating diseases. Statistical analysis methods that can facilitate the identification of factors contributing to or associated with human genetic diversity are growing in number as new high-throughput molecular genetic assays and technologies are developed. We consider the use of an analysis method termed generalized analysis of molecular variance (GAMOVA), which builds off of previously proposed analysis methods for testing hypotheses about the factors associated with genetic background diversity. We apply the method in a wide variety of settings and show that it is both flexible and powerful. GAMOVA has great potential to assist in population-based human genetic studies, as it can be used to address questions such as: Is a sample of affected cases and unaffected controls from a homogeneous population, or is there evidence of heterogeneity that could affect the results of an association study? Is there reason to believe that the ancestry of a set of individuals influences the traits that they have?
