1
Mathew B, Léon J, Dadshani S, Pillen K, Sillanpää MJ, Naz AA. Importance of correcting genomic relationships in single-locus QTL mapping model with an advanced backcross population. G3 GENES|GENOMES|GENETICS 2021; 11:6211194. [PMID: 33822941 PMCID: PMC8495747 DOI: 10.1093/g3journal/jkab105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2021] [Accepted: 03/18/2021] [Indexed: 11/29/2022]
Abstract
Advanced backcross (AB) populations have been widely used to identify and utilize beneficial alleles in various crops such as rice, tomato, wheat, and barley. For the development of an AB population, a controlled crossing scheme is used, and this controlled crossing, along with the selection (both natural and artificial) of agronomically adapted alleles during the development of the AB population, may lead to unbalanced allele frequencies in the population. However, it is commonly believed that interval mapping of traits in experimental crosses such as AB populations is immune to deviations from the expected frequencies under Mendelian segregation. Using two AB populations and simulated data sets as examples, we describe the severity of the problem caused by unbalanced allele frequencies in quantitative trait loci mapping and demonstrate how it can be corrected using the linear mixed model having a polygenic effect with the covariance structure (genomic relationship matrix) calculated from molecular markers.
Affiliation(s)
- Boby Mathew
  - Institute of Crop Science and Resource Conservation, Department of Plant Breeding, University of Bonn, 53115 Bonn, Germany
- Jens Léon
  - Institute of Crop Science and Resource Conservation, Department of Plant Breeding, University of Bonn, 53115 Bonn, Germany
- Said Dadshani
  - Institute of Crop Science and Resource Conservation, Department of Plant Breeding, University of Bonn, 53115 Bonn, Germany
- Klaus Pillen
  - Department of Plant Breeding, Institute of Agricultural and Nutritional Sciences, Martin-Luther University Halle-Wittenberg, 06120 Halle (Saale), Germany
- Ali Ahmad Naz
  - Institute of Crop Science and Resource Conservation, Department of Plant Breeding, University of Bonn, 53115 Bonn, Germany
2
Kontio JAJ, Sillanpää MJ. Scalable Nonparametric Prescreening Method for Searching Higher-Order Genetic Interactions Underlying Quantitative Traits. Genetics 2019; 213:1209-1224. [PMID: 31585953 PMCID: PMC6893368 DOI: 10.1534/genetics.119.302658] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 08/23/2019] [Accepted: 09/27/2019] [Indexed: 02/07/2023]
Abstract
Gaussian process (GP)-based automatic relevance determination (ARD) is known to be an efficient technique for identifying determinants of gene-by-gene interactions important to trait variation. However, the estimation of GP models is feasible only for low-dimensional datasets (∼200 variables), which severely limits application of the GP-based ARD method for high-throughput sequencing data. In this paper, we provide a nonparametric prescreening method that preserves virtually all the major benefits of the GP-based ARD method and extends its scalability to the typical high-dimensional datasets used in practice. In several simulated test scenarios, the proposed method compared favorably with existing nonparametric dimension reduction/prescreening methods suitable for higher-order interaction searches. As a real-data example, the proposed method was applied to a high-throughput dataset downloaded from The Cancer Genome Atlas (TCGA) with measured expression levels of 16,976 genes (after preprocessing) from patients diagnosed with acute myeloid leukemia.
Affiliation(s)
- Juho A J Kontio
  - Research Unit of Mathematical Sciences, Biocenter Oulu, University of Oulu, 90014, Finland
- Mikko J Sillanpää
  - Research Unit of Mathematical Sciences, Biocenter Oulu, University of Oulu, 90014, Finland
  - Infotech Oulu, University of Oulu, 90014, Finland
3
Toosi A, Fernando RL, Dekkers JCM. Genome-wide mapping of quantitative trait loci in admixed populations using mixed linear model and Bayesian multiple regression analysis. Genet Sel Evol 2018; 50:32. [PMID: 29914353 PMCID: PMC6006859 DOI: 10.1186/s12711-018-0402-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 10/10/2017] [Accepted: 06/01/2018] [Indexed: 12/18/2022]
Abstract
Background: Population stratification and cryptic relationships have been the main sources of excessive false-positives and false-negatives in population-based association studies. Many methods have been developed to model these confounding factors and minimize their impact on the results of genome-wide association studies. In most of these methods, a two-stage approach is applied where: (1) methods are used to determine if there is a population structure in the sample dataset and (2) the effects of population structure are corrected either by modeling it or by running a separate analysis within each sub-population. The objective of this study was to evaluate the impact of population structure on the accuracy and power of genome-wide association studies using a Bayesian multiple regression method.
Methods: We conducted a genome-wide association study in a stochastically simulated admixed population. The genome was composed of six chromosomes, each with 1000 markers. Fifteen segregating quantitative trait loci contributed to the genetic variation of a quantitative trait with heritability of 0.30. The impact of genetic relationships and breed composition (BC) on three analysis methods was evaluated: single marker simple regression (SMR), single marker mixed linear model (MLM) and Bayesian multiple-regression analysis (BMR). Each method was fitted with and without BC. Accuracy, power, false-positive rate and the positive predictive value of each method were calculated and used for comparison.
Results: SMR and BMR, both without BC, were ranked as the worst and the best performing approaches, respectively. Our results showed that, while explicit modeling of genetic relationships and BC is essential for models SMR and MLM, BMR can disregard them and yet result in a higher power without compromising its false-positive rate.
Conclusions: This study showed that the Bayesian multiple-regression analysis is robust to population structure and to relationships among study subjects and performs better than a single marker mixed linear model approach.
Affiliation(s)
- Ali Toosi
  - Cobb-Vantress Inc., 4703 US HWY 412 E, Siloam Springs, AR, 72761, USA
- Rohan L Fernando
  - Department of Animal Science, Iowa State University, Ames, IA, 50010, USA
- Jack C M Dekkers
  - Department of Animal Science, Iowa State University, Ames, IA, 50010, USA
4
Genetic heterogeneity underlying variation in a locally adaptive clinal trait in Pinus sylvestris revealed by a Bayesian multipopulation analysis. Heredity (Edinb) 2016; 118:413-423. [PMID: 27901510 DOI: 10.1038/hdy.2016.115] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 03/30/2016] [Revised: 08/30/2016] [Accepted: 10/11/2016] [Indexed: 11/08/2022]
Abstract
Local adaptation is a common feature of plant and animal populations. Adaptive phenotypic traits are genetically differentiated along environmental gradients, but the genetic basis of such adaptation is still poorly known. Genetic association studies of local adaptation combine data over populations. Correcting for population structure in these studies can be problematic since both selection and neutral demographic events can create similar allele frequency differences between populations. Correcting for demography with traditional methods may lead to eliminating some true associations. We developed a new Bayesian approach for identifying the loci underlying an adaptive trait in a multipopulation situation in the presence of possible double confounding due to population stratification and adaptation. With this method we studied the genetic basis of timing of bud set, a surrogate trait for timing of yearly growth cessation that confers local adaptation to the populations of Scots pine (Pinus sylvestris). Population means of timing of bud set were highly correlated with latitude. Most effects at individual loci were small. Interestingly, we found genetic heterogeneity (that is, different sets of loci associated with the trait) between the northern and central European parts of the cline. We also found indications of stronger stabilizing selection toward the northern part of the range. The harsh northern conditions may impose greater selective pressure on timing of growth cessation, and the relative importance of different environmental cues used for tracking the seasons might differ depending on latitude of origin.
5
Bhattacharjee M, Rajeevan MS, Sillanpää MJ. Prediction of complex human diseases from pathway-focused candidate markers by joint estimation of marker effects: case of chronic fatigue syndrome. Hum Genomics 2015; 9:8. [PMID: 26063326 PMCID: PMC4479222 DOI: 10.1186/s40246-015-0030-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2013] [Accepted: 05/28/2015] [Indexed: 11/16/2022]
Abstract
Background: The current practice of using only a few strongly associated genetic markers in regression models results in generally low power in prediction or accounting for heritability of complex human traits.
Purpose: We illustrate here a Bayesian joint estimation of single nucleotide polymorphism (SNP) effects principle to improve prediction of phenotype status from pathway-focused sets of SNPs. Chronic fatigue syndrome (CFS), a complex disease of unknown etiology with no laboratory methods for diagnosis, was chosen to demonstrate the power of this Bayesian method. For CFS, such a genetic predictive model in combination with clinical evidence might lead to an earlier diagnosis than one based solely on clinical findings.
Methods: One of our goals is to model disease status using Bayesian statistics which perform variable selection and parameter estimation simultaneously and which can induce the sparseness and smoothness of the SNP effects. Smoothness of the SNP effects is obtained by explicit modeling of the covariance structure of the SNP effects.
Results: The Bayesian model achieved perfect goodness of fit when tested within the sampled data. Tenfold cross-validation resulted in 80% accuracy, one of the best so far for CFS in comparison to previous prediction models. Model reduction aspects were investigated in a computationally feasible manner. Additionally, genetic variation estimates provided by the model identified specific genetic markers for their biological role in the disease pathophysiology.
Conclusions: This proof-of-principle study provides a powerful approach combining Bayesian methods, SNPs representing multiple pathways and rigorous case ascertainment for accurate genetic risk prediction modeling of complex diseases like CFS and other chronic diseases.
Affiliation(s)
- Mangalathu S Rajeevan
  - Division of High-Consequence Pathogens & Pathology, Centers for Disease Control and Prevention, Atlanta, 30333, USA
- Mikko J Sillanpää
  - Departments of Mathematical Sciences, Biocenter Oulu, University of Oulu, Oulu, FIN-90014, Finland
6
Würschum T, Kraft T. Evaluation of multi-locus models for genome-wide association studies: a case study in sugar beet. Heredity (Edinb) 2014; 114:281-90. [PMID: 25351864 DOI: 10.1038/hdy.2014.98] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 04/15/2014] [Revised: 07/01/2014] [Accepted: 08/26/2014] [Indexed: 01/14/2023]
Abstract
Association mapping has become a widely applied genomic approach to dissect the genetic architecture of complex traits. A major issue for association mapping is the need to control for the confounding effects of population structure, which is commonly done by mixed models incorporating kinship information. In this case study, we employed experimental data from a large sugar beet population to evaluate multi-locus models for association mapping. As in linkage mapping, markers are selected as cofactors to control for population structure and genetic background variation. We compared different biometric models with regard to important quantitative trait locus (QTL) mapping parameters like the false-positive rate, the QTL detection power and the predictive power for the proportion of explained genotypic variance. Employing different approaches we show that the multi-locus model, that is, incorporating cofactors, outperforms the other models, including the mixed model used as a reference model. Thus, multi-locus models are an attractive alternative for association mapping to efficiently detect QTL for knowledge-based breeding.
Affiliation(s)
- T Würschum
  - University of Hohenheim, State Plant Breeding Institute, Stuttgart, Germany
- T Kraft
  - Syngenta Seeds AB, Landskrona, Sweden
7
Pikkuhookana P, Sillanpää MJ. Combined linkage disequilibrium and linkage mapping: Bayesian multilocus approach. Heredity (Edinb) 2013; 112:351-60. [PMID: 24253936 DOI: 10.1038/hdy.2013.111] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 04/22/2013] [Revised: 09/02/2013] [Accepted: 09/27/2013] [Indexed: 01/24/2023]
Abstract
Quantitative trait loci (QTL) affecting the phenotype of interest can be detected using linkage analysis (LA), linkage disequilibrium (LD) mapping or a combination of both (LDLA). The LA approach uses information from recombination events within the observed pedigree and LD mapping from the historical recombinations within the unobserved pedigree. We propose the Bayesian variable selection approach for combined LDLA analysis for single-nucleotide polymorphism (SNP) data. The novel approach uses both sources of information simultaneously as is commonly done in plant and animal genetics, but it makes fewer assumptions about population demography than previous LDLA methods. This differs from approaches in human genetics, where LDLA methods use LA information conditional on LD information or the other way round. We argue that the multilocus LDLA model is more powerful for the detection of phenotype-genotype associations than single-locus LDLA analysis. To illustrate the performance of the Bayesian multilocus LDLA method, we analyzed simulation replicates based on real SNP genotype data from small three-generational CEPH families and compared the results with commonly used quantitative transmission disequilibrium test (QTDT). This paper is intended to be conceptual in the sense that it is not meant to be a practical method for analyzing high-density SNP data, which is more common. Our aim was to test whether this approach can function in principle.
Affiliation(s)
- P Pikkuhookana
  - [1] Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland [2] Department of Biology, University of Oulu, Oulu, Finland [3] Department of Mathematical Sciences, University of Oulu, Oulu, Finland [4] Biocenter Oulu, University of Oulu, Oulu, Finland
- M J Sillanpää
  - [1] Department of Biology, University of Oulu, Oulu, Finland [2] Department of Mathematical Sciences, University of Oulu, Oulu, Finland [3] Biocenter Oulu, University of Oulu, Oulu, Finland
8
Knürr T, Läärä E, Sillanpää MJ. Impact of prior specifications in a shrinkage-inducing Bayesian model for quantitative trait mapping and genomic prediction. Genet Sel Evol 2013; 45:24. [PMID: 23834140 PMCID: PMC3750442 DOI: 10.1186/1297-9686-45-24] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 11/12/2012] [Accepted: 06/10/2013] [Indexed: 01/24/2023]
Abstract
Background: In quantitative trait mapping and genomic prediction, Bayesian variable selection methods have gained popularity in conjunction with the increase in marker data and computational resources. Whereas shrinkage-inducing methods are common tools in genomic prediction, rigorous decision making in mapping studies using such models is not well established and the robustness of posterior results is subject to misspecified assumptions because of weak biological prior evidence.
Methods: Here, we evaluate the impact of prior specifications in a shrinkage-based Bayesian variable selection method which is based on a mixture of uniform priors applied to genetic marker effects that we presented in a previous study. Unlike most other shrinkage approaches, the use of a mixture of uniform priors provides a coherent framework for inference based on Bayes factors. To evaluate the robustness of genetic association under varying prior specifications, Bayes factors are compared as signals of positive marker association, whereas genomic estimated breeding values are considered for genomic selection. The impact of specific prior specifications is reduced by calculation of combined estimates from multiple specifications. A Gibbs sampler is used to perform Markov chain Monte Carlo (MCMC) estimation and a generalized expectation-maximization algorithm as a faster alternative for maximum a posteriori point estimation. The performance of the method is evaluated by using two publicly available data examples: the simulated QTLMAS XII data set and a real data set from a population of pigs.
Results: Combined estimates of Bayes factors were very successful in identifying quantitative trait loci, and the ranking of Bayes factors was fairly stable among markers with positive signals of association under varying prior assumptions, but their magnitudes varied considerably. Genomic estimated breeding values using the mixture of uniform priors compared well to other approaches for both data sets and loss of accuracy with the generalized expectation-maximization algorithm was small as compared to that with MCMC.
Conclusions: Since no error-free method to specify priors is available for complex biological phenomena, exploring a wide variety of prior specifications and combining results provides some solution to this problem. For this purpose, the mixture of uniform priors approach is especially suitable, because it comprises a wide and flexible family of distributions and computationally intensive estimation can be carried out in a reasonable amount of time.
Affiliation(s)
- Timo Knürr
  - Department of Mathematics and Statistics, P.O. Box 68, University of Helsinki, Helsinki, FIN-00014, Finland
- Esa Läärä
  - Department of Mathematical Sciences/Statistics, P.O. Box 3000, University of Oulu, Oulu, FIN-90014, Finland
- Mikko J Sillanpää
  - Department of Mathematics and Statistics, P.O. Box 68, University of Helsinki, Helsinki, FIN-00014, Finland
  - Department of Mathematical Sciences/Statistics, P.O. Box 3000, University of Oulu, Oulu, FIN-90014, Finland
  - Department of Biology and Biocenter Oulu, P.O. Box 3000, University of Oulu, Oulu, FIN-90014, Finland
  - Department of Agricultural Sciences, P.O. Box 27, University of Helsinki, Helsinki, FIN-00014, Finland
9
Technow F, Riedelsheimer C, Schrag TA, Melchinger AE. Genomic prediction of hybrid performance in maize with models incorporating dominance and population specific marker effects. TAG. THEORETICAL AND APPLIED GENETICS. THEORETISCHE UND ANGEWANDTE GENETIK 2012; 125:1181-94. [PMID: 22733443 DOI: 10.1007/s00122-012-1905-8] [Citation(s) in RCA: 95] [Impact Index Per Article: 7.3] [Received: 03/07/2012] [Accepted: 05/16/2012] [Indexed: 05/05/2023]
Abstract
Identifying high-performing hybrids is an essential part of every maize breeding program. Genomic prediction of maize hybrid performance makes it possible to identify promising hybrids when they themselves or other hybrids produced from their parents were not tested in field trials. Using simulations, we investigated the effects of marker density (10, 1, or 0.3 markers per megabase pair, Mbp⁻¹), convergent or divergent parental populations, number of parents tested in other combinations (2, 1, 0), genetic model (including population-specific and/or dominance marker effects or not), and estimation method (GBLUP or BayesB) on the prediction accuracy. We based our simulations on marker genotypes of Central European flint and dent inbred lines from an ongoing maize breeding program. To simulate convergent or divergent parent populations, we generated phenotypes by assigning QTL to markers with similar or very different allele frequencies in both pools, respectively. Prediction accuracies increased with marker density and number of parents tested and were higher under divergent compared with convergent parental populations. Modeling marker effects as population-specific slightly improved prediction accuracy under lower marker densities (1 and 0.3 Mbp⁻¹). This indicated that modeling marker effects as population-specific will be most beneficial under low linkage disequilibrium. Incorporating dominance effects improved prediction accuracies considerably for convergent parent populations, where dominance results in major contributions of SCA effects to the genetic variance among inter-population hybrids. While the general trends regarding the effects of the aforementioned influence factors on prediction accuracy were similar for GBLUP and BayesB, the latter method produced significantly higher accuracies for models incorporating dominance.
Affiliation(s)
- Frank Technow
  - Department of Applied Genetics, Institute of Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, 70599, Stuttgart, Germany
10
Kärkkäinen HP, Sillanpää MJ. Robustness of Bayesian multilocus association models to cryptic relatedness. Ann Hum Genet 2012; 76:510-23. [PMID: 22971009 DOI: 10.1111/j.1469-1809.2012.00729.x] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 01/16/2023]
Abstract
Population-based association analyses are more powerful than within-family analyses in identifying genetic loci associated with a phenotype of interest. However, if the population or sample structure is omitted from the model, population stratification and cryptic relatedness may lead to false positive and negative signals caused by relatedness between individuals, rather than association due to close linkage of the marker and the trait loci. Therefore it is important to correct or account for these confounders in population-based association analyses. However, there is cumulative evidence that when fitting a multilocus association model, the genetic relationships between the individuals can be captured by the markers themselves, bringing about a possibility to use the models without an additional correction for the population or sample structure. In this work we have further investigated this possibility in the Bayesian multilocus association model context using the extended Bayesian LASSO and the indicator-based variable selection. In particular, we have studied whether these multilocus models benefit from an insertion of an additional polygenic term representing the genetic variation not captured by the markers and taking account of the residual dependencies between the individuals. We have found that although the models may benefit from the insertion of the polygenic component, omitting the component does not damage the model performance severely.
Affiliation(s)
- Hanni P Kärkkäinen
  - Department of Agricultural Sciences, University of Helsinki, Helsinki FIN-00014, Finland
12
Mutshinda CM, Noykova N, Sillanpää MJ. A hierarchical Bayesian approach to multi-trait clinical quantitative trait locus modeling. Front Genet 2012; 3:97. [PMID: 22685451 PMCID: PMC3368303 DOI: 10.3389/fgene.2012.00097] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/30/2011] [Accepted: 05/12/2012] [Indexed: 02/04/2023]
Abstract
Recent advances in high-throughput genotyping and transcript profiling technologies have enabled the inexpensive production of genome-wide dense marker maps in tandem with huge amounts of expression profiles. These large-scale data encompass valuable information about the genetic architecture of important phenotypic traits. Comprehensive models that combine molecular markers and gene transcript levels are increasingly advocated as an effective approach to dissecting the genetic architecture of complex phenotypic traits. The simultaneous utilization of marker and gene expression data to explain the variation in clinical quantitative trait, known as clinical quantitative trait locus (cQTL) mapping, poses challenges that are both conceptual and computational. Nonetheless, the hierarchical Bayesian (HB) modeling approach, in combination with modern computational tools such as Markov chain Monte Carlo (MCMC) simulation techniques, provides much versatility for cQTL analysis. Sillanpää and Noykova (2008) developed a HB model for single-trait cQTL analysis in inbred line cross-data using molecular markers, gene expressions, and marker-gene expression pairs. However, clinical traits generally relate to one another through environmental correlations and/or pleiotropy. A multi-trait approach can improve on the power to detect genetic effects and on their estimation precision. A multi-trait model also provides a framework for examining a number of biologically interesting hypotheses. In this paper we extend the HB cQTL model for inbred line crosses proposed by Sillanpää and Noykova to a multi-trait setting. We illustrate the implementation of our new model with simulated data, and evaluate the multi-trait model performance with regard to its single-trait counterpart. 
The data simulation process was based on the multi-trait cQTL model, assuming three traits with uncorrelated and correlated cQTL residuals, with the simulated data under uncorrelated cQTL residuals serving as our test set for comparing the performances of the multi-trait and single-trait models. The simulated data under correlated cQTL residuals were essentially used to assess how well our new model can estimate the cQTL residual covariance structure. The model fitting to the data was carried out by MCMC simulation through OpenBUGS. The multi-trait model outperformed its single-trait counterpart in identifying cQTLs, with a consistently lower false discovery rate. Moreover, the covariance matrix of cQTL residuals was typically estimated to an appreciable degree of precision under the multi-trait cQTL model, making our new model a promising approach to addressing a wide range of issues facing the analysis of correlated clinical traits.
Affiliation(s)
- Crispin M Mutshinda
  - Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
13
Abstract
Numerous Bayesian methods of phenotype prediction and genomic breeding value estimation based on multilocus association models have been proposed. Computationally the methods have been based either on Markov chain Monte Carlo or on faster maximum a posteriori estimation. The demand for more accurate and more efficient estimation has led to the rapid emergence of workable methods, unfortunately at the expense of well-defined principles for Bayesian model building. In this article we go back to the basics and build a Bayesian multilocus association model for quantitative and binary traits with carefully defined hierarchical parameterization of Student's t and Laplace priors. In this treatment we consider alternative model structures, using indicator variables and polygenic terms. We make the most of the conjugate analysis, enabled by the hierarchical formulation of the prior densities, by deriving the fully conditional posterior densities of the parameters and using the acquired known distributions in building fast generalized expectation-maximization estimation algorithms.
|
14
|
Mutshinda CM, Sillanpää MJ. Bayesian shrinkage analysis of QTLs under shape-adaptive shrinkage priors, and accurate re-estimation of genetic effects. Heredity (Edinb) 2011; 107:405-12. [PMID: 21712846 PMCID: PMC3199931 DOI: 10.1038/hdy.2011.37] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 11/12/2010] [Revised: 04/01/2011] [Accepted: 04/07/2011] [Indexed: 12/15/2022]
Abstract
The successful implementation of Bayesian shrinkage analysis of high-dimensional regression models, as often encountered in quantitative trait locus (QTL) mapping, is contingent upon the choice of suitable sparsity-inducing priors. In practice, the shape (that is, the rate of tail decay) of such priors is typically preset, with no regard for the range of plausible alternatives and the fact that the most appropriate shape may depend on the data at hand. This study is presumably the first attempt to tackle this oversight through the shape-adaptive shrinkage prior (SASP) approach, with a focus on the mapping of QTLs in experimental crosses. Simulation results showed that the separation between genuine QTL effects and spurious ones can be made clearer using the SASP-based approach as compared with existing competitors. This feature makes our new method a promising approach to QTL mapping, where good separation is the ultimate goal. We also discuss a re-estimation procedure intended to improve the accuracy of the estimated genetic effects of detected QTLs with regard to shrinkage-induced bias, which may be particularly important in large-scale models with collinear predictors. The re-estimation procedure is relevant to any shrinkage method, and is potentially valuable for many scientific disciplines such as bioinformatics and quantitative genetics, where oversaturated models are booming.
Affiliation(s)
- C M Mutshinda
- Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland.
|
15
|
Sillanpää MJ, Pikkuhookana P, Abrahamsson S, Knürr T, Fries A, Lerceteau E, Waldmann P, García-Gil MR. Simultaneous estimation of multiple quantitative trait loci and growth curve parameters through hierarchical Bayesian modeling. Heredity (Edinb) 2011; 108:134-46. [PMID: 21792229 DOI: 10.1038/hdy.2011.56] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Indexed: 02/01/2023]
Abstract
A novel hierarchical quantitative trait locus (QTL) mapping method using a polynomial growth function and a multiple-QTL model (with no dependence in time) in a multitrait framework is presented. The method considers a population-based sample where individuals have been phenotyped (over time) with respect to some dynamic trait and genotyped at a given set of loci. A specific feature of the proposed approach is that, instead of an average functional curve, each individual has its own functional curve. Moreover, each QTL can modify the dynamic characteristics of the trait value of an individual through its influence on one or more growth curve parameters. Apparent advantages of the approach include: (1) assumption of time-independent QTL and environmental effects, (2) alleviating the necessity for an autoregressive covariance structure for residuals and (3) the flexibility to use variable selection methods. As a by-product of the method, heritabilities and genetic correlations can also be estimated for individual growth curve parameters, which are considered as latent traits. For selecting trait-associated loci in the model, we use a modified version of the well-known Bayesian adaptive shrinkage technique. We illustrate our approach by analysing a subsample of 500 individuals from the simulated QTLMAS 2009 data set, as well as simulation replicates and a real Scots pine (Pinus sylvestris) data set, using temporal measurements of height as the dynamic trait of interest.
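As a rough illustration of the idea that each individual has its own functional curve, the sketch below fits a separate quadratic to each individual's height series by ordinary least squares. It is not the hierarchical Bayesian model described in the abstract: the simulated data, the polynomial degree, and the helper name `fit_individual_curves` are all assumptions made for illustration. The per-individual coefficients stand in for the latent growth curve parameters on which QTL would act.

```python
import numpy as np


def fit_individual_curves(times, heights, degree=2):
    """Fit one polynomial per individual by least squares.

    times:   array of shape (n_timepoints,), shared by all individuals
    heights: array of shape (n_individuals, n_timepoints)
    Returns an array of shape (n_individuals, degree + 1) with the
    coefficients in increasing order of power (intercept first).
    """
    # polyfit accepts a 2-D y whose columns are separate data sets
    # that share the same x values.
    coefs = np.polynomial.polynomial.polyfit(times, heights.T, degree)
    return coefs.T


# Hypothetical example: three individuals measured at seven time points.
rng = np.random.default_rng(0)
times = np.arange(1.0, 8.0)
true_coefs = np.array([
    [10.0, 2.0, 0.5],
    [12.0, 1.5, 0.3],
    [8.0, 2.5, 0.1],
])
design = np.vstack([times**0, times, times**2])  # shape (3, n_timepoints)
heights = true_coefs @ design                    # noiseless curves
heights_noisy = heights + rng.normal(scale=0.01, size=heights.shape)

est = fit_individual_curves(times, heights_noisy)
print(est.round(2))  # one row of (intercept, linear, quadratic) per individual
```

In a full analysis these coefficients would not be estimated in a separate first step; the abstract describes estimating them jointly with the QTL effects.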
Affiliation(s)
- M J Sillanpää
- Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland.
|
16
|
Genetic analysis of complex traits via Bayesian variable selection: the utility of a mixture of uniform priors. Genet Res (Camb) 2011; 93:303-18. [PMID: 21767461 DOI: 10.1017/s0016672311000164] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 02/02/2023]
Abstract
A new estimation-based Bayesian variable selection approach is presented for genetic analysis of complex traits based on linear or logistic regression. By assigning a mixture of uniform priors (MU) to genetic effects, the approach provides an intuitive way of specifying hyperparameters controlling the selection of multiple influential loci. It aims at avoiding the difficulty of interpreting assumptions made in the specifications of priors. The method is compared in two real datasets with two other approaches, stochastic search variable selection (SSVS) and a re-formulation of Bayes B utilizing indicator variables and adaptive Student's t-distributions (IAt). The Markov Chain Monte Carlo (MCMC) sampling performance of the three methods is evaluated using the publicly available software OpenBUGS (model scripts are provided in the Supplementary material). The sensitivity of MU to the specification of hyperparameters is assessed in one of the data examples.
|
18
|
Mutshinda CM, Sillanpää MJ. Extended Bayesian LASSO for multiple quantitative trait loci mapping and unobserved phenotype prediction. Genetics 2010; 186:1067-75. [PMID: 20805559 PMCID: PMC2975286 DOI: 10.1534/genetics.110.119586] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Received: 06/04/2010] [Accepted: 08/27/2010] [Indexed: 11/18/2022]
Abstract
The Bayesian LASSO (BL) has been pointed out to be an effective approach to sparse model representation and successfully applied to quantitative trait loci (QTL) mapping and genomic breeding value (GBV) estimation using genome-wide dense sets of markers. However, the BL relies on a single parameter known as the regularization parameter to simultaneously control the overall model sparsity and the shrinkage of individual covariate effects. This may be idealistic when dealing with a large number of predictors whose effect sizes may differ by orders of magnitude. Here we propose the extended Bayesian LASSO (EBL) for QTL mapping and unobserved phenotype prediction, which introduces an additional level to the hierarchical specification of the BL to explicitly separate out these two model features. Compared to the adaptiveness of the BL, the EBL is "doubly adaptive" and thus, more robust to tuning. In simulations, the EBL outperformed the BL in regard to the accuracy of both effect size estimates and phenotypic value predictions, with comparable computational time. Moreover, the EBL proved to be less sensitive to tuning than the related Bayesian adaptive LASSO (BAL), which introduces locus-specific regularization parameters as well, but involves no mechanism for distinguishing between model sparsity and parameter shrinkage. Consequently, the EBL seems to point to a new direction for QTL mapping, phenotype prediction, and GBV estimation.
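For orientation, the standard Bayesian LASSO hierarchy that the abstract takes as its starting point can be written as below. This is the textbook scale-mixture form, not the EBL itself; the notation ($y_i$, $x_{ij}$, $\beta_j$, $\sigma_j^2$, $\sigma_0^2$, $\lambda$) is assumed here and may differ from the paper's. The additional hierarchical level introduced by the EBL is not specified in the abstract and is not reproduced.

```latex
\begin{align}
  y_i &= \mu + \sum_{j=1}^{p} x_{ij}\,\beta_j + e_i,
      & e_i &\sim \mathcal{N}(0, \sigma_0^2), \\
  \beta_j \mid \sigma_j^2 &\sim \mathcal{N}(0, \sigma_j^2),
      & \sigma_j^2 \mid \lambda &\sim \operatorname{Exp}\!\left(\tfrac{\lambda^2}{2}\right), \\
  p(\beta_j \mid \lambda)
      &= \int_0^\infty \mathcal{N}(\beta_j \mid 0, \sigma_j^2)\,
         \tfrac{\lambda^2}{2}\, e^{-\lambda^2 \sigma_j^2 / 2}\, d\sigma_j^2
       = \tfrac{\lambda}{2}\, e^{-\lambda \lvert \beta_j \rvert}.
\end{align}
```

The last line shows why a single $\lambda$ carries two roles in the BL: it is the only quantity governing the Laplace prior on every $\beta_j$, so it sets both how sparse the model is overall and how strongly each individual effect is shrunk.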
Affiliation(s)
- Crispin M. Mutshinda
- Department of Mathematics and Statistics, University of Helsinki, Helsinki FIN-00014, Finland and Department of Agricultural Sciences, University of Helsinki, Helsinki FIN-00014, Finland
- Mikko J. Sillanpää
- Department of Mathematics and Statistics, University of Helsinki, Helsinki FIN-00014, Finland and Department of Agricultural Sciences, University of Helsinki, Helsinki FIN-00014, Finland
|
19
|
Sillanpää MJ. Overview of techniques to account for confounding due to population stratification and cryptic relatedness in genomic data association analyses. Heredity (Edinb) 2010; 106:511-9. [PMID: 20628415 DOI: 10.1038/hdy.2010.91] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.3] [Indexed: 12/17/2022]
Abstract
Population-based genomic association analyses are more powerful than within-family analyses. However, population stratification (unknown or ignored origin of individuals from multiple source populations) and cryptic relatedness (unknown or ignored covariance between individuals because of their relatedness) are confounding factors in population-based genomic association analyses, which inflate the false-positive rate. As a consequence, false association signals may arise in genomic data association analyses for reasons other than true association between the tested genomic factor (marker genotype, gene or protein expression) and the study phenotype. It is therefore important to correct or account for these confounders in population-based genomic data association analyses. The common correction techniques for population stratification and cryptic relatedness problems are presented here in the phenotype-marker association analysis context, and comments on their suitability for other types of genomic association analyses (for example, phenotype-expression association) are also provided. Even though many of these techniques have originally been developed in the context of human genetics, most of them are also applicable to model organisms and breeding populations.
Affiliation(s)
- M J Sillanpää
- Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland.
|