1. GWAS advancements to investigate disease associations and biological mechanisms. CLINICAL AND TRANSLATIONAL DISCOVERY 2024; 4:e296. [PMID: 38737752 PMCID: PMC11086745 DOI: 10.1002/ctd2.296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Accepted: 04/16/2024] [Indexed: 05/14/2024]
Abstract
Genome-wide association studies (GWAS) have been instrumental in elucidating the genetic architecture of various traits and diseases. Despite the success of GWAS, inherent limitations, such as difficulty identifying rare and ultra-rare variants, the potential for spurious associations, and difficulty pinpointing causative agents, can undermine diagnostic capabilities. This review provides an overview of GWAS and highlights recent advances in genetics that employ a range of methodologies, including Whole Genome Sequencing (WGS), Mendelian Randomization (MR), the Pangenome's high-quality T2T-CHM13 panel, and the Human BioMolecular Atlas Program (HuBMAP), as potential enablers of current and future GWAS research. The current literature demonstrates the capabilities of these techniques in enhancing the statistical power of GWAS. WGS, with its comprehensive approach, captures the entire genome, surpassing the capabilities of the traditional GWAS technique focused on predefined Single Nucleotide Polymorphism (SNP) sites. The Pangenome's T2T-CHM13 panel, with its holistic approach, aids in the analysis of regions with high sequence identity, such as segmental duplications (SDs). Mendelian Randomization has advanced causative inference, improving clinical diagnostics and facilitating definitive conclusions. Furthermore, spatial biology techniques like HuBMAP enable 3D molecular mapping of tissues at single-cell resolution, offering insights into the pathology of complex traits. This study aims to elucidate and advocate for the increased application of these technologies, highlighting their potential to shape the future of GWAS research.

2. Can polygenic risk scores help explain disease prevalence differences around the world? A worldwide investigation. BMC Genom Data 2023; 24:70. [PMID: 37986041 PMCID: PMC10662565 DOI: 10.1186/s12863-023-01168-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Accepted: 10/20/2023] [Indexed: 11/22/2023]
Abstract
Complex disorders are caused by a combination of genetic, environmental and lifestyle factors, and their prevalence can vary greatly across different populations. The extent to which genetic risk, as identified by Genome Wide Association Study (GWAS), correlates with disease prevalence in different populations has not been investigated systematically. Here, we studied 14 different complex disorders and explored whether polygenic risk scores (PRS) based on current GWAS correlate with disease prevalence within Europe and around the world. A clear variation in GWAS-based genetic risk was observed based on ancestry and we identified populations that have a higher genetic liability for developing certain disorders. We found that for four out of the 14 studied disorders, PRS significantly correlates with disease prevalence within Europe. We also found significant correlations between worldwide disease prevalence and PRS for eight of the studied disorders, with Multiple Sclerosis genetic risk having the highest correlation with disease prevalence. Based on current GWAS results, cross-population differences in genetic risk for certain disorders can potentially be used to understand differences in disease prevalence and identify populations with the highest genetic liability. The study highlights both the limitations of PRS based on current GWAS and the fact that in some cases, PRS may already have high predictive power. This could be due to the genetic architecture of specific disorders or increased GWAS power in some cases.

3. Biobank-scale methods and projections for sparse polygenic prediction from machine learning. Sci Rep 2023; 13:11662. [PMID: 37468507 PMCID: PMC10356957 DOI: 10.1038/s41598-023-37580-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Accepted: 06/23/2023] [Indexed: 07/21/2023]
Abstract
In this paper we characterize the performance of linear models trained via widely-used sparse machine learning algorithms. We build polygenic scores and examine performance as a function of training set size, genetic ancestral background, and training method. We show that predictor performance is most strongly dependent on size of training data, with smaller gains from algorithmic improvements. We find that LASSO generally performs as well as the best methods, judged by a variety of metrics. We also investigate performance characteristics of predictors trained on one genetic ancestry group when applied to another. Using LASSO, we develop a novel method for projecting AUC and correlation as a function of data size (i.e., for new biobanks) and characterize the asymptotic limit of performance. Additionally, for LASSO (compressed sensing) we show that performance metrics and predictor sparsity are in agreement with theoretical predictions from the Donoho-Tanner phase transition. Specifically, a future predictor trained in the Taiwan Precision Medicine Initiative for asthma can achieve an AUC of [Formula: see text] and for height a correlation of [Formula: see text] for a Taiwanese population. This is above the measured values of [Formula: see text] and [Formula: see text], respectively, for UK Biobank trained predictors applied to a European population.

4. Statistical Methods for Disease Risk Prediction with Genotype Data. Methods Mol Biol 2023; 2629:331-347. [PMID: 36929084 DOI: 10.1007/978-1-0716-2986-4_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/18/2023]
Abstract
The single-nucleotide polymorphism (SNP) is the basic unit for understanding the heritability of complex traits. One attractive application of susceptibility SNPs is to construct prediction models for assessing disease risk. Here, we introduce prediction methods for human traits using SNP data, including the polygenic risk score (PRS), linear mixed models (LMMs), penalized regressions, and methods for controlling population stratification.
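As a rough illustration of the simplest method named above, a polygenic risk score is commonly computed as a weighted sum of effect-allele dosages. The sketch below is not taken from the chapter; the function name, the individuals, and the effect sizes are hypothetical and chosen only to show the arithmetic.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1, or 2 per SNP) for one individual."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must cover the same SNPs")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-SNP effect sizes (e.g. GWAS betas); not real estimates.
weights = [0.5, -0.25, 1.0]

# Hypothetical genotypes coded as effect-allele counts.
genotypes = {
    "ind_a": [0, 1, 2],
    "ind_b": [2, 2, 0],
}

scores = {name: polygenic_risk_score(g, weights) for name, g in genotypes.items()}
print(scores)
```

In practice the weights would come from an independent GWAS, and the SNP set would usually be pruned or shrunk first; this sketch omits both steps.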

5. Sparse kernel models provide optimization of training set design for genomic prediction in multiyear wheat breeding data. THE PLANT GENOME 2022; 15:e20254. [PMID: 36043341 DOI: 10.1002/tpg2.20254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2021] [Accepted: 07/17/2022] [Indexed: 06/15/2023]
Abstract
The success of genomic selection (GS) in breeding schemes relies on its ability to provide accurate predictions of unobserved lines at early stages. Multigeneration data provides opportunities to increase the training data size and thus, the likelihood of extracting useful information from ancestors to improve prediction accuracy. The genomic best linear unbiased predictions (GBLUPs) are performed by borrowing information through kinship relationships between individuals. Multigeneration data usually becomes heterogeneous with complex family relationship patterns that are increasingly entangled with each generation. Under these conditions, historical data may not be optimal for model training as the accuracy could be compromised. The sparse selection index (SSI) is a method for training set (TRN) optimization, in which training individuals provide predictions to some but not all predicted subjects. We added an additional trimming process to the original SSI (trimmed SSI) to remove less important training individuals for prediction. Using a large multigeneration (8 yr) wheat (Triticum aestivum L.) grain yield dataset (n = 68,836), we found increases in accuracy as more years are included in the TRN, with improvements of ∼0.05 in the GBLUP accuracy when using 5 yr of historical data relative to when using only 1 yr. The SSI method showed a small gain over the GBLUP accuracy but with an important reduction on the TRN size. These reduced TRNs were formed with a similar number of subjects from each training generation. Our results suggest that the SSI provides a more stable ranking of genotypes than the GBLUP as the TRN becomes larger.

6. A meta-analysis of the gap between pedigree-based and genomic heritability estimates for production traits in dairy cows. Livest Sci 2022. [DOI: 10.1016/j.livsci.2022.105000] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]

7. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. FRONTIERS IN BIOINFORMATICS 2022; 2:927312. [PMID: 36304293 PMCID: PMC9580915 DOI: 10.3389/fbinf.2022.927312] [Citation(s) in RCA: 38] [Impact Index Per Article: 19.0] [Received: 04/24/2022] [Accepted: 06/03/2022] [Indexed: 01/14/2023]
Abstract
Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One of the promising applications of machine learning is in precision medicine, where disease risk is predicted using patient genetic data. However, creating an accurate prediction model based on genotype data remains challenging due to the so-called “curse of dimensionality” (i.e., extensively larger number of features compared to the number of samples). Therefore, the generalizability of machine learning models benefits from feature selection, which aims to extract only the most “informative” features and remove noisy “non-informative,” irrelevant and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.

8. Integrating genome-wide association mapping of additive and dominance genetic effects to improve genomic prediction accuracy in Eucalyptus. THE PLANT GENOME 2022; 15:e20208. [PMID: 35441826 DOI: 10.1002/tpg2.20208] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/22/2021] [Accepted: 03/16/2022] [Indexed: 06/14/2023]
Abstract
The genome-wide association study (GWAS) is a powerful and widely used approach to decipher the genetic control of complex traits. Still, a significant challenge for dissecting quantitative traits in forest trees is statistical power. This study uses a population consisting of 1,123 samples derived from two successive generations of crosses between Eucalyptus grandis (W. Hill) and E. urophylla (S.T. Blake). All samples have been phenotyped for growth and wood property traits and genotyped using the EuChip60K chip, yielding 37,832 informative single nucleotide polymorphisms (SNPs). We use multi-locus GWAS models to assess additive and dominance effects to identify markers associated with growth and wood property traits in the eucalypt hybrids. Additive and dominance association models identified 78 and 82 significant SNPs across all traits, respectively, which captured between 39 and 86% of the genomic-based heritability. We also used SNPs identified from the GWAS and SNPs using less stringent significance thresholds to evaluate predictive abilities in a genomic selection framework. Genomic selection models based on the top 1% SNPs captured a substantially greater proportion of the genetic variance of traits compared with when we used all SNPs for model training. The prediction ability of estimated breeding values improved significantly for all traits when using either the top 1% SNPs or SNPs identified using a relaxed p value threshold ($p < 10^{-3}$). This study also highlights the added value of incorporating dominance effects for identifying genomic regions controlling growth traits in trees. Moreover, integrating GWAS results into genomic selection methods provides enhanced power relative to discrete associations for identifying genomic variation potentially valuable for forest tree breeding.

9. Average semivariance directly yields accurate estimates of the genomic variance in complex trait analyses. G3 GENES|GENOMES|GENETICS 2022; 12:6571389. [PMID: 35442424 PMCID: PMC9157152 DOI: 10.1093/g3journal/jkac080] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/27/2021] [Accepted: 03/17/2022] [Indexed: 11/23/2022]
Abstract
Many important traits in plants, animals, and microbes are polygenic and challenging to improve through traditional marker-assisted selection. Genomic prediction addresses this by incorporating all genetic data in a mixed model framework. The primary method for predicting breeding values is genomic best linear unbiased prediction, which uses the realized genomic relationship or kinship matrix ($K$) to connect genotype to phenotype. Genomic relationship matrices share information among entries to estimate the observed entries’ genetic values and predict unobserved entries’ genetic values. One of the main parameters of such models is genomic variance ($\sigma_g^2$), or the variance of a trait associated with a genome-wide sample of DNA polymorphisms, and genomic heritability ($h_g^2$); however, the seminal papers introducing different forms of $K$ often do not discuss their effects on the model estimated variance components despite their importance in genetic research and breeding. Here, we discuss the effect of several standard methods for calculating the genomic relationship matrix on estimates of $\sigma_g^2$ and $h_g^2$. With current approaches, we found that the genomic variance tends to be either overestimated or underestimated depending on the scaling and centering applied to the marker matrix ($Z$), the value of the average diagonal element of $K$, and the assortment of alleles and heterozygosity ($H$) in the observed population. Using the average semivariance, we propose a new matrix, $K_{ASV}$, that directly yields accurate estimates of $\sigma_g^2$ and $h_g^2$ in the observed population and produces best linear unbiased predictors equivalent to routine methods in plants and animals.
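To make the role of centering and scaling concrete, the sketch below builds one widely used genomic relationship matrix (the first method of VanRaden, 2008): markers coded 0/1/2 are centered by twice the allele frequency and the cross-product is scaled by 2Σp(1−p). This is an illustration of a standard construction only, not the $K_{ASV}$ matrix proposed in the abstract; the function name and the toy genotypes are assumptions.

```python
def vanraden_grm(genotypes):
    """Genomic relationship matrix K = Z Z' / (2 * sum(p * (1 - p))).

    genotypes: list of rows (individuals), each a list of allele counts 0/1/2.
    Allele frequencies are estimated from the sample itself (an assumption).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Sample allele frequency of the counted allele at each marker.
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    denom = 2 * sum(p * (1 - p) for p in freqs)
    if denom == 0:
        raise ValueError("all markers are monomorphic; K is undefined")
    # Center each marker column by twice its allele frequency.
    z = [[row[j] - 2 * freqs[j] for j in range(m)] for row in genotypes]
    return [
        [sum(z[a][j] * z[b][j] for j in range(m)) / denom for b in range(n)]
        for a in range(n)
    ]


# Toy data: three individuals, three markers (hypothetical values).
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1]]
k = vanraden_grm(genotypes)
mean_diag = sum(k[i][i] for i in range(len(k))) / len(k)
print(k)
print("average diagonal element:", mean_diag)
```

The average diagonal element printed at the end is one of the quantities the abstract identifies as influencing whether the genomic variance is over- or underestimated.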

10. Performance of Bayesian and BLUP alphabets for genomic prediction: analysis, comparison and results. Heredity (Edinb) 2022; 128:519-530. [PMID: 35508540 DOI: 10.1038/s41437-022-00539-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 10/31/2021] [Revised: 04/19/2022] [Accepted: 04/19/2022] [Indexed: 11/09/2022]
Abstract
We evaluated the performance of three BLUP and five Bayesian methods for genomic prediction by using nine actual and 54 simulated datasets. The genomic prediction accuracy was measured using Pearson's correlation coefficient between the genomic estimated breeding value (GEBV) and the observed phenotypic data using a fivefold cross-validation approach with 100 replications. The Bayesian alphabets performed better for the traits governed by a few genes/QTLs with relatively larger effects. In contrast, the BLUP alphabets (GBLUP and CBLUP) exhibited higher genomic prediction accuracy for the traits controlled by several small-effect QTLs. Additionally, Bayesian methods performed better for the highly heritable traits and, for other traits, performed on par with the BLUP methods. Further, genomic BLUP (GBLUP) was identified as the least biased method for the GEBV estimation. Among the Bayesian methods, the Bayesian ridge regression and Bayesian LASSO were less biased than other Bayesian alphabets. Nonetheless, genomic prediction accuracy increased with an increase in trait heritability, irrespective of the sample size, marker density, and the QTL type (major/minor effect). In sum, this study provides valuable information regarding the choice of the selection method for genomic prediction in different breeding programs.

11. A Bayesian hierarchical score for structure learning from related data sets. Int J Approx Reason 2022. [DOI: 10.1016/j.ijar.2021.11.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]

12
Abstract
Traditional tree improvement is cumbersome and costly. Our main objective was to assess the extent to which genomic data can currently accelerate and improve decision making in this field. We used diameter at breast height (DBH) and wood density (WD) data for 4430 tree genotypes and single-nucleotide polymorphism (SNP) data for 2446 tree genotypes. Pedigree reconstruction was performed using a combination of maximum likelihood parentage assignment and matching based on identity-by-state (IBS) similarity. In addition, we used best linear unbiased prediction (BLUP) methods to predict phenotypes using SNP markers (GBLUP), recorded pedigree information (ABLUP), and single-step “blended” BLUP (HBLUP) combining SNP and pedigree information. We substantially improved the accuracy of pedigree records, resolving the inconsistent parental information of 506 tree genotypes. This led to substantially increased predictive ability (i.e., by up to 87%) in HBLUP analyses compared to a baseline from ABLUP. Genomic prediction was possible across populations and within previously untested families with moderately large training populations (N = 800–1200 tree genotypes) and using as few as 2000–5000 SNP markers. HBLUP was generally more effective than traditional ABLUP approaches, particularly after dealing appropriately with pedigree uncertainties. Our study provides evidence that genome-wide marker data can significantly enhance tree improvement. The operational implementation of genomic selection has started in radiata pine breeding in New Zealand, but further reductions in DNA extraction and genotyping costs may be required to realise the full potential of this approach.

13. Genetic Bases of Complex Traits: From Quantitative Trait Loci to Prediction. Methods Mol Biol 2022; 2467:1-44. [PMID: 35451771 DOI: 10.1007/978-1-0716-2205-6_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Conceived as a general introduction to the book, this chapter is a reminder of the core concepts of genetic mapping and molecular marker-based prediction. It provides an overview of the principles and the evolution of methods for mapping the variation of complex traits, and methods for QTL-based prediction of human disease risk and animal and plant breeding value. The principles of linkage-based and linkage disequilibrium-based QTL mapping methods are described in the context of the simplest, single-marker, methods. Methodological evolutions are analysed in relation with their ability to account for the complexity of the genotype-phenotype relations. Main characteristics of the genetic architecture of complex traits, drawn from QTL mapping works using large populations of unrelated individuals, are presented. Methods combining marker-QTL association data into polygenic risk score that captures part of an individual's susceptibility to complex diseases are reviewed. Principles of best linear mixed model-based prediction of breeding value in animal- and plant-breeding programs using phenotypic and pedigree data, are summarized and methods for moving from BLUP to marker-QTL BLUP are presented. Factors influencing the additional genetic progress achieved by using molecular data and rules for their optimization are discussed.

14. eQTLs as causal instruments for the reconstruction of hormone linked gene networks. Front Endocrinol (Lausanne) 2022; 13:949061. [PMID: 36060942 PMCID: PMC9428692 DOI: 10.3389/fendo.2022.949061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2022] [Accepted: 07/25/2022] [Indexed: 11/17/2022]
Abstract
Hormones act within highly dynamic systems, and much of the phenotypic response to variation in hormone levels is mediated by changes in gene expression. The increase in the number and power of large genetic association studies has led to the identification of hormone linked genetic variants. However, the biological mechanisms underpinning the majority of these loci are poorly understood. The advent of affordable, high throughput next generation sequencing and readily available transcriptomic databases has shown that many of these genetic variants also associate with variation in gene expression levels as expression Quantitative Trait Loci (eQTLs). In addition to further dissecting complex genetic variation, eQTLs have been applied as tools for causal inference. Many hormone networks are driven by transcription factors, and many of these genes can be linked to eQTLs. In this mini-review, we demonstrate how causal inference and gene networks can be used to describe the impact of hormone linked genetic variation upon the transcriptome within an endocrinology context.

15. Smooth-threshold multivariate genetic prediction incorporating gene–environment interactions. G3 GENES|GENOMES|GENETICS 2021; 11:6343458. [PMID: 34849749 PMCID: PMC8664495 DOI: 10.1093/g3journal/jkab278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2021] [Accepted: 07/12/2021] [Indexed: 11/17/2022]
Abstract
We propose a genetic prediction modeling approach for genome-wide association study (GWAS) data that can include not only marginal gene effects but also gene–environment (GxE) interaction effects—i.e., multiplicative effects of environmental factors with genes rather than merely additive effects of each. The proposed approach is a straightforward extension of our previous multiple regression-based method, STMGP (smooth-threshold multivariate genetic prediction), with the new feature being that genome-wide test statistics from a GxE interaction analysis are used to weight the corresponding variants. We develop a simple univariate regression approximation to the GxE interaction effect that allows a direct fit of the STMGP framework without modification. The sparse nature of our model automatically removes irrelevant predictors (including variants and GxE combinations), and the model is able to simultaneously incorporate multiple environmental variables. Simulation studies to evaluate the proposed method in comparison with other modeling approaches demonstrate its superior performance under the presence of GxE interaction effects. We illustrate the usefulness of our prediction model through application to real GWAS data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI).

16

17. Multi-generation genomic prediction of maize yield using parametric and non-parametric sparse selection indices. Heredity (Edinb) 2021; 127:423-432. [PMID: 34564692 PMCID: PMC8551287 DOI: 10.1038/s41437-021-00474-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/19/2021] [Revised: 09/10/2021] [Accepted: 09/11/2021] [Indexed: 02/07/2023]
Abstract
Genomic prediction models are often calibrated using multi-generation data. Over time, as data accumulates, training data sets become increasingly heterogeneous. Differences in allele frequency and linkage disequilibrium patterns between the training and prediction genotypes may limit prediction accuracy. This leads to the question of whether all available data or a subset of it should be used to calibrate genomic prediction models. Previous research on training set optimization has focused on identifying a subset of the available data that is optimal for a given prediction set. However, this approach does not contemplate the possibility that different training sets may be optimal for different prediction genotypes. To address this problem, we recently introduced a sparse selection index (SSI) that identifies an optimal training set for each individual in a prediction set. Using additive genomic relationships, the SSI can provide increased accuracy relative to genomic-BLUP (GBLUP). Non-parametric genomic models using Gaussian kernels (KBLUP) have, in some cases, yielded higher prediction accuracies than standard additive models. Therefore, here we studied whether combining SSIs and kernel methods could further improve prediction accuracy when training genomic models using multi-generation data. Using four years of doubled haploid maize data from the International Maize and Wheat Improvement Center (CIMMYT), we found that when predicting grain yield the KBLUP outperformed the GBLUP, and that using SSI with additive relationships (GSSI) led to 5-17% increases in accuracy, relative to the GBLUP. However, differences in prediction accuracy between the KBLUP and the kernel-based SSI were smaller and not always significant.

18. Genetic prediction of complex traits with polygenic scores: a statistical review. Trends Genet 2021; 37:995-1011. [PMID: 34243982 PMCID: PMC8511058 DOI: 10.1016/j.tig.2021.06.004] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 03/04/2021] [Revised: 05/31/2021] [Accepted: 06/03/2021] [Indexed: 01/03/2023]
Abstract
Accurate genetic prediction of complex traits can facilitate disease screening, improve early intervention, and aid in the development of personalized medicine. Genetic prediction of complex traits requires the development of statistical methods that can properly model polygenic architecture and construct a polygenic score (PGS). We present a comprehensive review of 46 methods for PGS construction. We connect the majority of these methods through a multiple linear regression framework which can be instrumental for understanding their prediction performance for traits with distinct genetic architectures. We discuss the practical considerations of PGS analysis as well as challenges and future directions of PGS method development. We hope our review serves as a useful reference both for statistical geneticists who develop PGS methods and for data analysts who perform PGS analysis.

19
Abstract
As genome-wide association studies have continued to identify loci associated with complex traits, the implications of and necessity for proper use of these findings, including prediction of disease risk, have become apparent. Many complex diseases have numerous associated loci with detectable effects implicating risk for or protection from disease. A common contemporary approach to using this information for disease prediction is through the application of genetic risk scores. These scores estimate an individual's liability for a specific outcome by aggregating the effects of associated loci into a single measure, as described in the previous version of this article. Although genetic risk scores have traditionally included variants that meet criteria for genome-wide significance, an extension known as the polygenic risk score has been developed to include the effects of more variants across the entire genome. Here, we describe common methods and software packages for calculating and interpreting polygenic risk scores. In this revised version of the article, we detail the information needed to perform a polygenic risk score analysis, considerations for planning the analysis and interpreting results, and a discussion of the limitations based on the choices made. We also provide simulated sample data and a walkthrough for four different polygenic risk score software packages. © 2021 Wiley Periodicals LLC.

20. Leveraging Methylation Alterations to Discover Potential Causal Genes Associated With the Survival Risk of Cervical Cancer in TCGA Through a Two-Stage Inference Approach. Front Genet 2021; 12:667877. [PMID: 34149809 PMCID: PMC8206792 DOI: 10.3389/fgene.2021.667877] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/15/2021] [Accepted: 04/19/2021] [Indexed: 12/24/2022]
Abstract
BACKGROUND: Multiple genes were previously identified to be associated with cervical cancer; however, the genetic architecture of cervical cancer remains unknown and many potential causal genes are yet to be discovered.
METHODS: To explore potential causal genes related to cervical cancer, a two-stage causal inference approach was proposed within the framework of Mendelian randomization, where the gene expression was treated as exposure, with methylations located within the promoter regions of genes serving as instrumental variables. Five prediction models were first utilized to characterize the relationship between the expression and methylations for each gene; then, the methylation-regulated gene expression (MReX) was obtained and the association was evaluated via Cox mixed-effect model based on MReX. We further implemented the aggregated Cauchy association test (ACAT) combination to take advantage of respective strengths of these prediction models while accounting for dependency among the p-values.
RESULTS: A total of 14 potential causal genes were discovered to be associated with the survival risk of cervical cancer in TCGA when the five prediction models were separately employed. The total number of potential causal genes was brought to 23 when conducting ACAT. Some of the newly discovered genes may be novel (e.g., YJEFN3, SPATA5L1, IMMP1L, C5orf55, PPIP5K2, ZNF330, CRYZL1, PPM1A, ESCO2, ZNF605, ZNF225, ZNF266, FICD, and OSTC). Functional analyses showed that these genes were enriched in tumor-associated pathways. Additionally, four genes (i.e., COL6A1, SYDE1, ESCO2, and GIPC1) were differentially expressed between tumor and normal tissues.
CONCLUSION: Our study discovered promising candidate genes that were causally associated with the survival risk of cervical cancer and thus provided new insights into the genetic etiology of cervical cancer.
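The ACAT combination step mentioned in the methods can be sketched as a Cauchy combination of p-values: each p-value is mapped to a Cauchy-distributed quantity, the quantities are averaged with weights, and the average is mapped back to a p-value. The sketch below assumes equal weights by default and hypothetical input p-values; it is not the authors' implementation.

```python
import math


def acat(p_values, weights=None):
    """Combine p-values with the Cauchy combination (ACAT) statistic.

    Equal weights are used when none are supplied (an assumption here).
    Inputs must lie strictly between 0 and 1.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    total_weight = sum(weights)
    # Weighted average of Cauchy-transformed p-values.
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    t /= total_weight
    # Tail probability of the standard Cauchy distribution at t.
    return 0.5 - math.atan(t) / math.pi


# Hypothetical p-values for one gene from five prediction models.
model_p_values = [0.04, 0.20, 0.008, 0.51, 0.12]
combined = acat(model_p_values)
print(combined)
```

Because the tail of the Cauchy distribution is heavy, the combined p-value is driven mainly by the smallest inputs, which is why the approach tolerates correlated p-values reasonably well.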

21. Statistical power and heritability in whole-genome association studies for quantitative traits. Meta Gene 2021. [DOI: 10.1016/j.mgene.2021.100869] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022]
|
22
|
Genetic architecture affecting maize agronomic traits identified by variance heterogeneity association mapping. Genomics 2021; 113:1681-1688. [PMID: 33839267 DOI: 10.1016/j.ygeno.2021.04.009] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/30/2020] [Revised: 03/16/2021] [Accepted: 04/05/2021] [Indexed: 11/22/2022]
Abstract
Conventional genome-wide association studies (GWAS) focused on the phenotypic mean differences (mGWAS) but often ignored genetic variants influencing differences in the variance between genotypes. In this study, we performed variance heterogeneity GWAS (vGWAS) analysis for 13 previously measured agronomic traits in a maize population. We discovered a total of 129 significant SNPs. We demonstrated that the genetic loci influencing mean differences and variance heterogeneity formed distinct groups, suggesting that breeders were able to independently select for phenotype mean and variance values. Moreover, vGWAS served as a tractable approach to effectively identify 214 epistatic interaction pairs. In addition, we documented four agronomic traits with decreasing phenotype variance during modern maize breeding history and identified the potential genetic variants influencing this process. In summary, we discovered additional non-additive effects contributing to missing heritability and valuable genetic variants used for breeding varieties with desired phenotypic variance.
|
23
|
Optimal breeding-value prediction using a sparse selection index. Genetics 2021; 218:6179494. [PMID: 33748861 PMCID: PMC8128408 DOI: 10.1093/genetics/iyab030] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 09/30/2020] [Accepted: 02/13/2021] [Indexed: 02/06/2023]
Abstract
Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimum for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset from the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a sparse selection index (SSI) that integrates selection index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); the G-Best Linear Unbiased Predictor (G-BLUP) (the prediction method most commonly used in plant and animal breeding) appears as a special case which happens when λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in 10 different environments) that the SSI can achieve significant (anywhere between 5 and 10%) gains in prediction accuracy relative to the G-BLUP.
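A small sketch may help make the role of λ concrete. It assumes the index weights for one prediction individual minimize ½β'(G + kI)β − g'β + λ‖β‖₁, where G is the genomic relationship matrix among training individuals, g holds the relationships between the target and the training individuals, and k is the residual-to-genetic variance ratio. This objective is an assumption chosen to be consistent with the abstract's statement that λ = 0 recovers G-BLUP; the paper's exact formulation and solver may differ, and all names below are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the lasso proximal step)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def sparse_selection_index(G_trn, g_tst, ratio, lam, n_iter=500):
    """Coordinate descent for 0.5*b'(G + ratio*I)b - g'b + lam*|b|_1."""
    n = len(g_tst)
    # A = G_trn + ratio * I (ratio = residual variance / genetic variance).
    A = [[G_trn[i][j] + (ratio if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    beta = [0.0] * n
    for _ in range(n_iter):
        for j in range(n):
            # Partial residual for coordinate j with all other weights fixed.
            partial = g_tst[j] - sum(A[j][k] * beta[k]
                                     for k in range(n) if k != j)
            beta[j] = soft_threshold(partial, lam) / A[j][j]
    return beta


def predict(beta, y_trn):
    """Predicted genetic value: weighted sum of (centered) training phenotypes."""
    return sum(b * y for b, y in zip(beta, y_trn))


# Toy relationship matrix and target-to-training relationships (made-up values).
G = [[1.0, 0.5, 0.1], [0.5, 1.0, 0.2], [0.1, 0.2, 1.0]]
g = [0.6, 0.3, 0.05]
dense = sparse_selection_index(G, g, ratio=1.0, lam=0.0)   # all training points used
sparse = sparse_selection_index(G, g, ratio=1.0, lam=0.2)  # only the closest one kept
```

With λ = 0 the weights solve the linear system (G + kI)β = g, so every training individual contributes. Raising λ zeroes out weakly related individuals, leaving the "support points" the abstract describes.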
|
24
|
Genetic correlations between traits associated with hyperuricemia, gout, and comorbidities. Eur J Hum Genet 2021; 29:1438-1445. [PMID: 33637890 DOI: 10.1038/s41431-021-00830-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/25/2020] [Revised: 12/06/2020] [Accepted: 02/10/2021] [Indexed: 01/26/2023]
Abstract
Hypertension, obesity, chronic kidney disease and type 2 diabetes are comorbidities that have very high prevalence among persons with hyperuricemia (serum urate > 6.8 mg/dL) and gout. Here we use multivariate genetic models to test the hypothesis that the co-association of traits representing hyperuricemia and its comorbidities is genetically based. Using Bayesian whole-genome regression models, we estimated the genetic marker-based variance and the covariance between serum urate, serum creatinine, systolic blood pressure (SBP), blood glucose and body mass index (BMI) from two independent family-based studies: the Framingham Heart Study (FHS) and the Hypertension Genetic Epidemiology Network study (HyperGEN). The main genetic findings that replicated in both FHS and HyperGEN were that (1) creatinine was genetically correlated only with urate and (2) BMI was genetically correlated with urate, SBP, and glucose. The environmental covariance among the traits was generally highest for trait pairs involving BMI. The genetic overlap of traits representing the comorbidities of hyperuricemia and gout appears to cluster in two separate axes of genetic covariance. Because creatinine is genetically correlated with urate but not with metabolic traits, this suggests there is one genetic module of shared loci associated with hyperuricemia and chronic kidney disease. Another module of shared loci may account for the association of hyperuricemia and metabolic syndrome. This study provides a clear quantitative genetic basis for the clustering of comorbidities with hyperuricemia.
|
25
|
Comparative accuracies of genetic values predicted for economically important milk traits, genome-wide association, and linkage disequilibrium patterns of Canadian Holstein cows. J Dairy Sci 2020; 104:1900-1916. [PMID: 33358789 DOI: 10.3168/jds.2020-18489] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/09/2020] [Accepted: 08/10/2020] [Indexed: 11/19/2022]
Abstract
Genomic selection methodologies and genome-wide association studies use powerful statistical procedures that correlate large amounts of high-density SNP genotypes and phenotypic data. Actual 305-d milk (MY), fat (FY), and protein (PY) yield data on 695 cows and 76,355 genotyping-by-sequencing-generated SNP marker genotypes from Canadian Holstein dairy cows were used to characterize linkage disequilibrium (LD) structure of Canadian Holstein cows. Also, the comparison of pedigree-based BLUP, genomic BLUP (GBLUP), and Bayesian (BayesB) statistical methods in the genomic selection methodologies and the comparison of Bayesian ridge regression and BayesB statistical methods in the genome-wide association studies were carried out for MY, FY, and PY. Results from LD analysis revealed that as marker distance decreases, LD increases through chromosomes. However, unexpected high peaks in LD were observed between marker pairs with larger marker distances on all chromosomes. The GBLUP and BayesB models resulted in similar heritability estimates through 10-fold cross-validation for MY and PY; however, the GBLUP model resulted in higher heritability estimates than BayesB model for FY. The predictive ability of GBLUP model was significantly lower than that of BayesB for MY, FY, and PY. Association analyses indicated that 28 high-effect markers and markers on Bos taurus autosome 14 located within 6 genes (DOP1B, TONSL, CPSF1, ADCK5, PARP10, and GRINA) associated significantly with FY.
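The LD analysis described above rests on a pairwise LD measure between markers. As a rough illustration, the sketch below computes r² as the squared Pearson correlation between two markers' genotype dosages (0, 1, 2). This genotype-based r² is a common approximation to haplotype-based r² and is an assumption here; the abstract does not state which estimator the study used.

```python
def ld_r2(geno_a, geno_b):
    """Squared correlation between two markers' genotype dosages (0/1/2)."""
    n = len(geno_a)
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        # A monomorphic marker carries no LD information.
        return 0.0
    return cov * cov / (var_a * var_b)


# Made-up dosages: the second marker mirrors the first, the third is unrelated.
marker_1 = [0, 0, 2, 2]
marker_2 = [2, 2, 0, 0]
marker_3 = [0, 2, 0, 2]
```

Plotting such r² values against the physical distance between marker pairs gives the LD decay pattern the abstract refers to.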
|
26
|
Leveraging Multiple Layers of Data To Predict Drosophila Complex Traits. G3 (BETHESDA, MD.) 2020; 10:4599-4613. [PMID: 33106232 PMCID: PMC7718734 DOI: 10.1534/g3.120.401847] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 08/31/2020] [Accepted: 10/12/2020] [Indexed: 02/07/2023]
Abstract
The ability to accurately predict complex trait phenotypes from genetic and genomic data is critical for the implementation of personalized medicine and precision agriculture; however, prediction accuracy for most complex traits is currently low. Here, we used data on whole genome sequences, deep RNA sequencing, and high quality phenotypes for three quantitative traits in the ∼200 inbred lines of the Drosophila melanogaster Genetic Reference Panel (DGRP) to compare the prediction accuracies of gene expression and genotypes for three complex traits. We found that expression levels (r = 0.28 and 0.38, for females and males, respectively) provided higher prediction accuracy than genotypes (r = 0.07 and 0.15, for females and males, respectively) for starvation resistance, similar prediction accuracy for chill coma recovery (null for both models and sexes), and lower prediction accuracy for startle response (r = 0.15 and 0.14 for female and male genotypes, respectively; and r = 0.12 and 0.11 for female and male transcripts, respectively). Models including both genotype and expression levels did not outperform the best single component model. However, accuracy increased considerably for all three traits when we included gene ontology (GO) category as an additional layer of information for both genomic variants and transcripts. We found strongly predictive GO terms for each of the three traits, some of which had a clear plausible biological interpretation. For example, for starvation resistance in females, GO:0033500 (r = 0.39 for transcripts) and GO:0032870 (r = 0.40 for transcripts) have been implicated in carbohydrate homeostasis and cellular response to hormone stimulus (including the insulin receptor signaling pathway), respectively. In summary, this study shows that integrating different sources of information improved prediction accuracy and helped elucidate the genetic architecture of three Drosophila complex phenotypes.
|
27
|
Adoption and Optimization of Genomic Selection To Sustain Breeding for Apricot Fruit Quality. G3-GENES GENOMES GENETICS 2020; 10:4513-4529. [PMID: 33067307 PMCID: PMC7718743 DOI: 10.1534/g3.120.401452] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
Abstract
Genomic selection (GS) is a breeding approach which exploits genome-wide information and whose unprecedented success has shaped several animal and plant breeding schemes through delivering their genetic progress. This is the first study assessing the potential of GS in apricot (Prunus armeniaca) to enhance postharvest fruit quality attributes. Genomic predictions were based on a F1 pseudo-testcross population, comprising 153 individuals with contrasting fruit quality traits. They were phenotyped for physical and biochemical fruit metrics in contrasting climatic conditions over two years. Prediction accuracy (PA) varied from 0.31 for glucose content with the Bayesian LASSO (BL) to 0.78 for ethylene production with RR-BLUP, which yielded the most accurate predictions in comparison to Bayesian models and only 10% out of 61,030 SNPs were sufficient to reach accurate predictions. Useful insights were provided on the genetic architecture of apricot fruit quality whose integration in prediction models improved their performance, notably for traits governed by major QTL. Furthermore, multivariate modeling yielded promising outcomes in terms of PA within training partitions partially phenotyped for target traits. This provides a useful framework for the implementation of indirect selection based on easy-to-measure traits. Thus, we highlighted the main levers to take into account for the implementation of GS for fruit quality in apricot, but also to improve the genetic gain in perennial species.
|
28
|
Extended application of genomic selection to screen multiomics data for prognostic signatures of prostate cancer. Brief Bioinform 2020; 22:5902820. [PMID: 32898860 DOI: 10.1093/bib/bbaa197] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 05/26/2020] [Revised: 06/29/2020] [Accepted: 08/02/2020] [Indexed: 12/30/2022]
Abstract
Prognostic tests using expression profiles of several dozen genes help provide treatment choices for prostate cancer (PCa). However, these tests require improvement to meet the clinical need for resolving overtreatment, which continues to be a pervasive problem in PCa management. Genomic selection (GS) methodology, which utilizes whole-genome markers to predict agronomic traits, was adopted in this study for PCa prognosis. We leveraged The Cancer Genome Atlas (TCGA) database to evaluate the prediction performance of six GS methods and seven omics data combinations, which showed that the Best Linear Unbiased Prediction (BLUP) model outperformed the other methods regarding predictability and computational efficiency. Leveraging the BLUP-HAT method, an accelerated version of BLUP, we demonstrated that using expression data of a large number of disease-relevant genes and with an integration of other omics data (i.e. miRNAs) significantly increased outcome predictability when compared with panels consisting of a small number of genes. Finally, we developed a novel stepwise forward selection BLUP-HAT method to facilitate searching multiomics data for predictor variables with prognostic potential. The new method was applied to the TCGA data to derive mRNA and miRNA expression signatures for predicting relapse-free survival of PCa, which were validated in six independent cohorts. This is a transdisciplinary adoption of the highly efficient BLUP-HAT method and its derived algorithms to analyze multiomics data for PCa prognosis. The results demonstrated the efficacy and robustness of the new methodology in developing prognostic models in PCa, suggesting a potential utility in managing other types of cancer.
|
29
|
How Can Gene-Expression Information Improve Prognostic Prediction in TCGA Cancers: An Empirical Comparison Study on Regularization and Mixed Cox Models. Front Genet 2020; 11:920. [PMID: 32973875 PMCID: PMC7472843 DOI: 10.3389/fgene.2020.00920] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/03/2020] [Accepted: 07/23/2020] [Indexed: 12/30/2022]
Abstract
Background: Previous cancer prognostic prediction models often consider only the most important transcriptomic expressions, and their power is limited. It is unknown whether prediction power can be further improved when additional transcriptomic information is incorporated. Methods: To integrate transcriptomes, four models are compared based on 32 types of cancer in the Cancer Genome Atlas, including the general Cox model with only clinical covariates, the Cox model with a lasso penalty (coxlasso), the Cox model with an elastic net penalty (coxenet), and the mixed-effects Cox model (coxlmm). Furthermore, we partition the survival variance into the relative contribution of clinical and transcriptomic components within the framework of coxlmm. Finally, the influence of different numbers of genes was evaluated in the context of coxlmm. Results: Compared with the clinical covariates–only Cox model, the average prediction gain was 2.4% for coxlasso, 4.2% for coxenet, and 7.2% for coxlmm across 16 low-censored cancers; a significant elevation of prediction power was observed for SARC, SKCM, LGG, PAAD, and HNSC. Similar findings were observed for all 32 cancers with the average prediction gain of 2.7, 3.8, and 5.8% for coxlasso, coxenet, and coxlmm. Coxlmm always had comparable or better prediction performance relative to coxlasso and coxenet with an average of 2.8% prediction improvement across the 16 low-censored cancers. In addition, it is shown that the predictive accuracy of coxlmm generally increases with the number of genes included. The survival variance partition analysis demonstrates that the transcriptomic contribution was higher for some cancers (e.g., LGG, CESC, PAAD, SKCM, and SARC) and lower for others (e.g., BRCA, COAD, KIRC, and STAD). Conclusion: This study demonstrates that the integration of transcriptomic information can substantially improve prognostic prediction accuracy, but the prediction performance is cancer-specific and varies across cancer types. It further reveals that gene expression exhibits distinct contributions to survival variation across cancers.
|
30
|
Genetic variants and underlying mechanisms influencing variance heterogeneity in maize. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2020; 103:1089-1102. [PMID: 32344461 DOI: 10.1111/tpj.14786] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/11/2019] [Revised: 04/04/2020] [Accepted: 04/20/2020] [Indexed: 06/11/2023]
Abstract
Traditional genetic studies focus on identifying genetic variants associated with the mean difference in a quantitative trait. Because genetic variants also influence phenotypic variation via heterogeneity, we conducted a variance-heterogeneity genome-wide association study to examine the contribution of variance heterogeneity to oil-related quantitative traits. We identified 79 unique variance-controlling single nucleotide polymorphisms (vSNPs) from the sequences of 77 candidate variance-heterogeneity genes for 21 oil-related traits using the Levene test (P < 1.0 × 10⁻⁵). About 30% of the candidate genes encode enzymes that work in lipid metabolic pathways, most of which define clear expression variance quantitative trait loci. Of the vSNPs specifically associated with the genetic variance heterogeneity of oil concentration, 89% can be explained by additional linked mean-effects genetic variants. Furthermore, we demonstrated that gene × gene interactions play important roles in the formation of variance heterogeneity for fatty acid compositional traits. The interaction pattern was validated for one gene pair (GRMZM2G035341 and GRMZM2G152328) using yeast two-hybrid and bimolecular fluorescent complementation analyses. Our findings have implications for uncovering the genetic basis of hidden additive genetic effects and epistatic interaction effects, and we indicate opportunities to stabilize efficient breeding and selection of high-oil maize (Zea mays L.).
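The Levene test named in the abstract above asks whether phenotypic spread differs between genotype classes at a SNP. The sketch below computes the classic mean-centered Levene W statistic for a single marker. The helper names are hypothetical, and it is an assumption that the mean-centered form (rather than the median-centered Brown-Forsythe variant) matches the study. Converting W to a p-value requires an F distribution with (k − 1, N − k) degrees of freedom, which is omitted here.

```python
from collections import defaultdict


def group_by_genotype(genotypes, phenotypes):
    """Split phenotypes into one list per genotype class."""
    groups = defaultdict(list)
    for g, y in zip(genotypes, phenotypes):
        groups[g].append(y)
    return [groups[g] for g in sorted(groups)]


def levene_w(groups):
    """Mean-centered Levene W statistic for equality of variances."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group mean.
    z = []
    for g in groups:
        mean_g = sum(g) / len(g)
        z.append([abs(y - mean_g) for y in g])
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Made-up data: genotype class 1 is more variable than class 0.
genotypes = [0, 0, 0, 1, 1, 1]
phenotypes = [1.0, 2.0, 3.0, 0.0, 2.0, 4.0]
w_stat = levene_w(group_by_genotype(genotypes, phenotypes))
```

In a vGWAS this statistic would be computed once per marker, with larger W indicating stronger evidence that the marker affects trait variance rather than (or in addition to) the trait mean.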
|
31
|
Genetic Basis of Blood-Based Traits and Their Relationship With Performance and Environment in Beef Cattle at Weaning. Front Genet 2020; 11:717. [PMID: 32719722 PMCID: PMC7350949 DOI: 10.3389/fgene.2020.00717] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/15/2020] [Accepted: 06/12/2020] [Indexed: 12/16/2022]
Abstract
The objectives of this study were to explore the usefulness of blood-based traits as indicators of health and performance in beef cattle at weaning and identify the genetic basis underlying the different blood parameters obtained from complete blood counts (CBCs). Disease costs represent one of the main factors determining profitability in animal production. Previous research has observed associations between blood cell counts and an animal’s health status in some species. CBCs were recorded from approximately 570 Angus-based, crossbred beef calves at weaning born between 2015 and 2016 and raised on toxic or novel tall fescue. The calves (N = ∼600) were genotyped at a density of 50k SNPs and the genotypes (N = 1160) were imputed to a density of 270k SNPs. Genetic parameters were estimated for 15 blood and 4 production traits. Finally, with the objective of identifying the genetic basis underlying the different blood-based traits, genome-wide association studies (GWAS) were performed for all traits. Heritability estimates ranged from 0.11 to 0.60, and generally weak phenotypic correlations and strong genetic correlations were observed among blood-based traits only. Genome-wide association study identified ninety-one 1-Mb windows that accounted for 0.5% or more of the estimated genetic variance for at least 1 trait with 21 windows overlapping across two or more traits (explaining more than 0.5% of estimated genetic variance for two or more traits). Five candidate genes have been identified in the most interesting overlapping regions related to blood-based traits. Overall, this study represents one of the first efforts reported in the scientific literature to identify the genetic basis of blood cell traits in beef cattle. The results presented in this study allow us to conclude that (1) blood-based traits have weak phenotypic correlations but strong genetic correlations among themselves; (2) blood-based traits have moderate to high heritability; and (3) there is evidence of an important overlap of genetic control among similar blood-based traits which will allow for their use in improvement programs in beef cattle.
|
32
|
Re-evaluating the relationship between missing heritability and the microbiome. MICROBIOME 2020; 8:87. [PMID: 32513310 PMCID: PMC7282175 DOI: 10.1186/s40168-020-00839-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/30/2019] [Accepted: 04/15/2020] [Indexed: 06/07/2023]
Abstract
Human genome-wide association studies (GWASs) have recurrently produced lower heritability estimates than familial studies. Many explanations have been suggested for these lower estimates, including that a substantial proportion of genetic variation and gene-by-environment interactions are unmeasured in typical GWASs. The human microbiome is potentially related to both of these explanations, but it has been more commonly considered as a source of unmeasured genetic variation. In particular, it has recently been argued that the genetic variation within the human microbiome should be included when estimating trait heritability. We outline issues with this argument, which in its strictest form depends on the holobiont model of human-microbiome interactions. Instead, we argue that the microbiome could be leveraged to help control for environmental variation across a population, although that remains to be determined. We discuss potential approaches that could be explored to determine whether integrating microbiome sequencing data into GWASs is useful.
|
33
|
Multi-trait Genomic Selection Methods for Crop Improvement. Genetics 2020; 215:931-945. [PMID: 32482640 DOI: 10.1534/genetics.120.303305] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 05/01/2020] [Accepted: 05/26/2020] [Indexed: 11/18/2022]
Abstract
Plant breeders make selection decisions based on multiple traits, such as yield, plant height, flowering time, and disease resistance. A commonly used approach in multi-trait genomic selection is index selection, which assigns weights to different traits relative to their economic importance. However, classical index selection only optimizes genetic gain in the next generation, requires some experimentation to find weights that lead to desired outcomes, and has difficulty optimizing nonlinear breeding objectives. Multi-objective optimization has also been used to identify the Pareto frontier of selection decisions, which represents different trade-offs across multiple traits. We propose a new approach, which maximizes certain traits while keeping others within desirable ranges. Optimal selection decisions are made using a new version of the look-ahead selection (LAS) algorithm, which was recently proposed for single-trait genomic selection, and achieved superior performance with respect to other state-of-the-art selection methods. To demonstrate the effectiveness of the new method, a case study is developed using a realistic data set where our method is compared with conventional index selection. Results suggest that the multi-trait LAS is more effective at balancing multiple traits compared with index selection.
|
34
|
Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets. Am J Hum Genet 2020; 106:679-693. [PMID: 32330416 PMCID: PMC7212266 DOI: 10.1016/j.ajhg.2020.03.013] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 12/12/2019] [Accepted: 03/30/2020] [Indexed: 01/24/2023]
Abstract
Accurate construction of polygenic scores (PGS) can enable early diagnosis of diseases and facilitate the development of personalized medicine. Accurate PGS construction requires prediction models that are both adaptive to different genetic architectures and scalable to biobank scale datasets with millions of individuals and tens of millions of genetic variants. Here, we develop such a method called Deterministic Bayesian Sparse Linear Mixed Model (DBSLMM). DBSLMM relies on a flexible modeling assumption on the effect size distribution to achieve robust and accurate prediction performance across a range of genetic architectures. DBSLMM also relies on a simple deterministic search algorithm to yield an approximate analytic estimation solution using summary statistics only. The deterministic search algorithm, when paired with further algebraic innovations, results in substantial computational savings. With simulations, we show that DBSLMM achieves scalable and accurate prediction performance across a range of realistic genetic architectures. We then apply DBSLMM to analyze 25 traits in UK Biobank. For these traits, compared to existing approaches, DBSLMM achieves an average of 2.03%-101.09% accuracy gain in internal cross-validations. In external validations on two separate datasets, including one from BioBank Japan, DBSLMM achieves an average of 14.74%-522.74% accuracy gain. In these real data applications, DBSLMM is 1.03-28.11 times faster and uses only 7.4%-24.8% of physical memory as compared to other multiple regression-based PGS methods. Overall, DBSLMM represents an accurate and scalable method for constructing PGS in biobank scale datasets.
|
35
|
Multikernel linear mixed model with adaptive lasso for complex phenotype prediction. Stat Med 2020; 39:1311-1327. [PMID: 31985088 DOI: 10.1002/sim.8477] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/21/2019] [Revised: 11/17/2019] [Accepted: 12/24/2019] [Indexed: 12/15/2022]
Abstract
Linear mixed models (LMMs) and their extensions have been widely used for high-dimensional genomic data analyses. While LMMs hold great promise for risk prediction research, the high dimensionality of the data and different effect sizes of genomic regions bring great analytical and computational challenges. In this work, we present a multikernel linear mixed model with adaptive lasso (KLMM-AL) to predict phenotypes using high-dimensional genomic data. We develop two algorithms for estimating parameters from our model and also establish the asymptotic properties of LMM with adaptive lasso when only one dependent observation is available. The proposed KLMM-AL can account for heterogeneous effect sizes from different genomic regions, capture both additive and nonadditive genetic effects, and adaptively and efficiently select predictive genomic regions and their corresponding effects. Through simulation studies, we demonstrate that KLMM-AL outperforms most existing methods. Moreover, KLMM-AL achieves high sensitivity and specificity in selecting predictive genomic regions. KLMM-AL is further illustrated by an application to the sequencing dataset obtained from the Alzheimer's Disease Neuroimaging Initiative.
|
36
|
Improving predictive models for Alzheimer's disease using GWAS data by incorporating misclassified samples modeling. PLoS One 2020; 15:e0232103. [PMID: 32324812 PMCID: PMC7179850 DOI: 10.1371/journal.pone.0232103] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 09/19/2019] [Accepted: 04/07/2020] [Indexed: 01/14/2023]
Abstract
Late-onset Alzheimer’s Disease (LOAD) is the most common form of dementia in the elderly. Genome-wide association studies (GWAS) for LOAD have opened new avenues to identify genetic causes and to provide diagnostic tools for early detection. Although several predictive models have been proposed using the few detected GWAS markers, there is still a need for improvement and identification of potential markers. Commonly, polygenic risk scores are used for prediction. Nevertheless, other methods to generate predictive models have been suggested. In this research, we compared three machine learning methods that have been shown to construct powerful predictive models (genetic algorithms, LASSO, and step-wise) and propose the inclusion of markers from misclassified samples to improve overall prediction accuracy. Our results show that the addition of markers from an initial model plus the markers of the model fitted to misclassified samples improves the area under the receiver operating characteristic curve by around 5%, reaching ~0.84, which is highly competitive using only genetic information. The computational strategy used here can help to devise better methods to improve classification models for AD. Our results could have a positive impact on the early diagnosis of Alzheimer’s disease.
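The headline metric in the abstract above is the area under the ROC curve (AUC). As a reminder of what that number measures, the sketch below computes AUC through its rank interpretation: the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The function name and the toy scores are illustrative and not taken from the study.

```python
def auc(case_scores, control_scores):
    """AUC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Made-up predicted risks: three of the four case/control pairs are ordered correctly.
example_auc = auc([0.9, 0.4], [0.5, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value near 0.84 means roughly 84% of case/control pairs are ordered correctly by the model.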
|
37
|
Comparison of the Efficiency of BLUP and GBLUP in Genomic Prediction of Immune Traits in Chickens. Animals (Basel) 2020; 10:ani10030419. [PMID: 32138151 PMCID: PMC7142406 DOI: 10.3390/ani10030419] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/13/2020] [Revised: 03/01/2020] [Accepted: 03/01/2020] [Indexed: 11/17/2022]
Abstract
Poultry diseases pose a large threat to poultry production. Selection to improve immune traits is a feasible way to prevent and control avian diseases. The objective of this study was to investigate the efficiency of estimation of genetic parameters for antibody response to avian influenza virus (Ab-AIV), antibody response to Newcastle disease virus (Ab-NDV), sheep red blood cell antibody titer (SRBC), the ratio of heterophils to lymphocytes (H/L), immunoglobulin G (IgG), the spleen immune index (SII), thymus immune index (TII), thymus weight at 100 d (TW) and the spleen weight at 100 d (SW) in Beijing oil chickens, by using the best linear unbiased prediction (BLUP) method and genomic best linear unbiased prediction (GBLUP) method. The phenotypic data used in the two methods were the same and were from 519 individuals. With the BLUP model, Ab-AIV, Ab-NDV, SRBC, H/L, IgG, TII, and TW had low heritability ranging from 0.000 to 0.281, whereas SII and SW had high heritability of 0.631 and 0.573. With the GBLUP model, all individuals were genotyped with Illumina 60K SNP chips, and Ab-AIV, Ab-NDV, SRBC, H/L and IgG had low heritability ranging from 0.000 to 0.266, whereas SII, TII, TW and SW had moderate heritability ranging from 0.300 to 0.472. We compared the prediction accuracy obtained from BLUP and GBLUP through 50 repetitions of 5-fold cross-validation (CV), and the results indicated that BLUP provided a slightly higher accuracy of prediction than GBLUP in this population.
|
38
|
A Multiple-Trait Bayesian Lasso for Genome-Enabled Analysis and Prediction of Complex Traits. Genetics 2020; 214:305-331. [PMID: 31879318 PMCID: PMC7017027 DOI: 10.1534/genetics.119.302934] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/05/2019] [Accepted: 12/20/2019] [Indexed: 12/21/2022] Open
Abstract
A multiple-trait Bayesian LASSO (MBL) for genome-based analysis and prediction of quantitative traits is presented and applied to two real data sets. The data-generating model is a multivariate linear Bayesian regression on possibly a huge number of molecular markers, with a Gaussian residual distribution. Each (one per marker) of the [Formula: see text] vectors of regression coefficients (T: number of traits) is assigned the same T-variate Laplace prior distribution, with a null mean vector and unknown scale matrix Σ. The multivariate prior reduces to that of the standard univariate Bayesian LASSO when [Formula: see text] The covariance matrix of the residual distribution is assigned a multivariate Jeffreys prior, and Σ is given an inverse-Wishart prior. The unknown quantities in the model are learned using a Markov chain Monte Carlo sampling scheme constructed using a scale-mixture of normal distributions representation. MBL is demonstrated in a bivariate context employing two publicly available data sets, using a bivariate genomic best linear unbiased prediction model (GBLUP) for benchmarking results. The first data set is one where wheat grain yields in two different environments are treated as distinct traits. The second data set comes from genotyped Pinus trees, with each individual measured for two traits: rust bin and gall volume. In MBL, the bivariate marker effects are shrunk differentially, i.e., "short" vectors are more strongly shrunk toward the origin than in GBLUP; conversely, "long" vectors are shrunk less. A predictive comparison was carried out as well in wheat, where the comparators of MBL were bivariate GBLUP and bivariate Bayes Cπ, a variable selection procedure. A training-testing layout was used, with 100 random reconstructions of training and testing sets. For the wheat data, all methods produced similar predictions. In Pinus, MBL gave better predictions than either a Bayesian bivariate GBLUP or the single trait Bayesian LASSO.
MBL has been implemented in the Julia language package JWAS, and is now available for the scientific community to explore with different traits, species, and environments. It is well known that there is no universally best prediction machine, and MBL represents a new resource in the armamentarium for genome-enabled analysis and prediction of complex traits.
|
39
|
Performance of pedigree and various forms of marker-derived relationship coefficients in genomic prediction and their correlations. J Anim Breed Genet 2020; 137:423-437. [PMID: 32003127 DOI: 10.1111/jbg.12467] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/20/2019] [Revised: 12/08/2019] [Accepted: 12/29/2019] [Indexed: 11/27/2022]
Abstract
In recent years, with the development and validation of different genotyping panels, several methods have been proposed to build efficient similarity matrices among individuals to be used for genomic selection. Consequently, the genetic parameters estimated from such information may deviate from their counterparts based on traditional family information. In this study, we used a pedigree-based numerator relationship matrix (A) and three types of marker-based relationship matrices (G), including two identical by descent (IBD), that is G_K and G_M, and one identical by state (IBS), G_V, as well as four Gaussian kernel (GK) similarity matrices with different smoothing parameters, to predict yet-to-be-observed phenotypes. We also used different kinship matrices that are a linear combination of marker-derived IBD or IBS matrices with A, constructed as K = λG + (1 - λ)A, where the weight (λ) assigned to each source of information varied over a grid of values. A Bayesian multiple-trait Gaussian model was fitted to estimate the genetic parameters and compare the prediction accuracy in terms of predictive correlation, mean square error and unbiasedness. Results show that the estimated genetic parameters (heritability and correlations) are affected by the source of the information used to create kinship or the weight placed on the sources of genomic and pedigree information. The superiority of the GK-based model depends on the smoothing parameter (θ), so that with an optimum θ value, the GK-based model statistically yielded better performance (higher predictive correlation, lower MSE and unbiased estimates) and more stable correlations and heritability than the model with IBD, IBS or A kinship matrices or any of the linear combinations.
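The two kinds of similarity matrix named in this abstract, the blended kinship K = λG + (1 - λ)A and a Gaussian kernel with smoothing parameter θ, can be sketched as follows. This is a minimal illustration, not the study's code: the function names are hypothetical, and scaling the squared distance by the number of markers is an assumption made here so that θ is comparable across panel sizes.

```python
import numpy as np

def blend_kinship(G, A, lam):
    """Blend genomic (G) and pedigree (A) relationships: K = lam*G + (1-lam)*A."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * G + (1.0 - lam) * A

def gaussian_kernel(M, theta):
    """Gaussian kernel on marker rows of M: K_ij = exp(-theta * d_ij^2).

    d_ij^2 is the squared Euclidean distance between individuals i and j,
    divided by the number of markers (a scaling choice assumed here).
    """
    sq_norms = (M ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (M @ M.T)
    d2 = np.maximum(d2, 0.0) / M.shape[1]   # clamp tiny negative round-off
    return np.exp(-theta * d2)
```

A grid search over `lam` (or `theta`) inside a cross-validation loop would mirror the abstract's comparison over a grid of weights; a larger `theta` makes similarity decay faster with marker distance.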
|
40
|
Influence of Genetic Interactions on Polygenic Prediction. G3 (Bethesda) 2020; 10:109-115. [PMID: 31649046 PMCID: PMC6945032 DOI: 10.1534/g3.119.400812] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 07/11/2019] [Accepted: 10/23/2019] [Indexed: 01/01/2023]
Abstract
Prediction of phenotypes from genotypes is an important objective to fulfill the promises of genomics, precision medicine and agriculture. Although it's now possible to account for the majority of genetic variation through model fitting, prediction of phenotypes remains a challenge, especially across populations that have diverged in the past. In this study, we designed simulation experiments to specifically investigate the role of genetic interactions in failure of polygenic prediction. We found that non-additive genetic interactions can significantly reduce the accuracy of polygenic prediction. Our study demonstrated the importance of considering genetic interactions in genetic prediction.
|
41
|
Optimizing genomic selection for blight resistance in American chestnut backcross populations: A trade-off with American chestnut ancestry implies resistance is polygenic. Evol Appl 2020; 13:31-47. [PMID: 31892942 PMCID: PMC6935594 DOI: 10.1111/eva.12886] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 01/24/2019] [Revised: 09/27/2019] [Accepted: 10/02/2019] [Indexed: 01/04/2023] Open
Abstract
American chestnut was once a foundation species of eastern North American forests, but was rendered functionally extinct in the early 20th century by an exotic fungal blight (Cryphonectria parasitica). Over the past 30 years, the American Chestnut Foundation (TACF) has pursued backcross breeding to generate hybrids that combine the timber-type form of American chestnut with the blight resistance of Chinese chestnut based on a hypothesis of major gene resistance. To accelerate selection within two backcross populations that descended from two Chinese chestnuts, we developed genomic prediction models for five presence/absence blight phenotypes of 1,230 BC3F2 selection candidates and average canker severity of their BC3F3 progeny. We also genotyped pure Chinese and American chestnut reference panels to estimate the proportion of BC3F2 genomes inherited from parent species. We found that genomic prediction from a method that assumes an infinitesimal model of inheritance (HBLUP) has similar accuracy to a method that tends to perform well for traits controlled by major genes (Bayes C). Furthermore, the proportion of BC3F2 trees' genomes inherited from American chestnut was negatively correlated with the blight resistance of these trees and their progeny. On average, selected BC3F2 trees inherited 83% of their genome from American chestnut and have blight resistance that is intermediate between F1 hybrids and American chestnut. Results suggest polygenic inheritance of blight resistance. The blight resistance of restoration populations will be enhanced through recurrent selection, by advancing additional sources of resistance through fewer backcross generations, and potentially by breeding with transgenic blight-tolerant trees.
|
42
|
SNP and Haplotype-Based Genomic Selection of Quantitative Traits in Eucalyptus globulus. Plants 2019; 8:plants8090331. [PMID: 31492041 PMCID: PMC6783840 DOI: 10.3390/plants8090331] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/25/2019] [Revised: 09/02/2019] [Accepted: 09/03/2019] [Indexed: 01/02/2023]
Abstract
Eucalyptus globulus (Labill.) is one of the most important cultivated eucalypts in temperate and subtropical regions and has been successfully subjected to intensive breeding. In this study, Bayesian genomic models that include the effects of haplotype and single nucleotide polymorphisms (SNP) were assessed to predict quantitative traits related to wood quality and tree growth in a 6-year-old breeding population. To this end, the following markers were considered: (a) ~14 K SNP markers (SNP), (b) ~3 K haplotypes (HAP), and (c) haplotypes and SNPs that were not assigned to a haplotype (HAP-SNP). Predictive ability values (PA) were dependent on the genomic prediction models and markers. On average, Bayesian ridge regression (BRR) and Bayes C had the highest PA for the majority of traits. Notably, genomic models that included the haplotype effect (either HAP or HAP-SNP) significantly increased the PA of low-heritability traits. For instance, BRR based on HAP had the highest PA (0.58) for stem straightness. Consistently, the heritability estimates from genomic models were higher than the pedigree-based estimates for these traits. The results provide additional perspectives for the implementation of genomic selection in Eucalyptus breeding programs, which could be especially beneficial for improving traits with low heritability.
|
43
|
Abstract
Motivation: Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide single nucleotide polymorphism (SNP) variation have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets. Linear mixed models (LMMs) have emerged as a key tool for heritability estimation, where the parameters of the LMMs, i.e. the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens. Results: We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moments estimator that has a runtime complexity of O(NMB) for N individuals and M SNPs (where B is a parameter that controls the number of random matrix-vector multiplications). Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to O(NMB / max(log_3 N, log_3 M)). We demonstrate the scalability and accuracy of our method on simulated as well as on empirical data. On standard hardware, our method computes heritability on a dataset of 500 000 individuals and 100 000 SNPs in 38 min. Availability and implementation: The RHE-reg software is made freely available to the research community at: https://github.com/sriramlab/RHE-reg.
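The idea of a randomized method-of-moments estimator with O(NMB) cost can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the RHE-reg implementation: it assumes the model cov(y) = σ_g² K + σ_e² I with K = XXᵀ/M for a standardized genotype matrix X, estimates tr(K²) with B random Gaussian vectors, and then solves the two moment equations. The function name and defaults are hypothetical, and monomorphic SNPs (zero variance) are assumed to have been removed.

```python
import numpy as np

def randomized_he_heritability(X, y, n_vectors=10, seed=0):
    """Method-of-moments SNP heritability with a randomized trace estimate.

    X: (N, M) genotype matrix, y: (N,) phenotype.
    K is never formed; every product costs O(N*M) per random vector.
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize each SNP
    y = y - y.mean()

    # tr(K^2) ~= (1/B) * sum_b ||K z_b||^2 for z_b ~ N(0, I)
    Z = rng.standard_normal((N, n_vectors))
    KZ = X @ (X.T @ Z) / M
    tr_K2 = (KZ ** 2).sum() / n_vectors

    tr_K = (X ** 2).sum() / M                   # exact trace of K
    yKy = ((X.T @ y) ** 2).sum() / M            # y' K y without forming K
    yy = y @ y

    # Moment equations:
    #   tr(K^2)*sg2 + tr(K)*se2 = y'Ky
    #   tr(K)*sg2   + N*se2     = y'y
    lhs = np.array([[tr_K2, tr_K], [tr_K, float(N)]])
    rhs = np.array([yKy, yy])
    sigma_g2, sigma_e2 = np.linalg.solve(lhs, rhs)
    return sigma_g2 / (sigma_g2 + sigma_e2)
```

Increasing `n_vectors` (the abstract's B) lowers the variance of the trace estimate at a proportional increase in runtime. The further speed-up mentioned in the abstract, which exploits the discrete structure of the genotype matrix, is not attempted here.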
|
44
|
Abstract
Genetics provides two major opportunities for understanding human disease: as a transformative line of etiological inquiry and as a biomarker for heritable diseases. In psychiatry, biomarkers are very much needed for both research and treatment, given the heterogeneous populations identified by current phenomenologically based diagnostic systems. To date, however, useful and valid biomarkers have been scant owing to the inaccessibility and complexity of human brain tissue and consequent lack of insight into disease mechanisms. Genetic biomarkers are therefore especially promising for psychiatric disorders. Genome-wide association studies of common diseases have matured over the last decade, generating the knowledge base for increasingly informative individual-level genetic risk prediction. In this review, we discuss fundamental concepts involved in computing genetic risk with current methods, strengths and weaknesses of various approaches, assessments of utility, and applications to various psychiatric disorders and related traits. Although genetic risk prediction has become increasingly straightforward to apply and common in published studies, there are important pitfalls to avoid. At present, the clinical utility of genetic risk prediction is still low; however, there is significant promise for future clinical applications as the ancestral diversity and sample sizes of genome-wide association studies increase. We discuss emerging data and methods aimed at improving the value of genetic risk prediction for disentangling disease mechanisms and stratifying subjects for epidemiological and clinical studies. For all applications, it is absolutely critical that polygenic risk prediction is applied with appropriate methodology and control for confounding to avoid repeating some mistakes of the candidate gene era.
|
45
|
Reference Trait Analysis Reveals Correlations Between Gene Expression and Quantitative Traits in Disjoint Samples. Genetics 2019; 212:919-929. [PMID: 31113812 PMCID: PMC6614885 DOI: 10.1534/genetics.118.301865] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 12/11/2018] [Accepted: 05/14/2019] [Indexed: 12/21/2022] Open
Abstract
Systems genetic analysis of complex traits involves the integrated analysis of genetic, genomic, and disease-related measures. However, these data are often collected separately across multiple study populations, rendering direct correlation of molecular features to complex traits impossible. Recent transcriptome-wide association studies (TWAS) have harnessed gene expression quantitative trait loci (eQTL) to associate unmeasured gene expression with a complex trait in genotyped individuals, but this approach relies primarily on strong eQTL. We propose a simple and powerful alternative strategy for correlating independently obtained sets of complex traits and molecular features. In contrast to TWAS, our approach gains precision by correlating complex traits through a common set of continuous phenotypes instead of genetic predictors, and can identify transcript-trait correlations for which the regulation is not genetic. In our approach, a set of multiple quantitative "reference" traits is measured across all individuals, while measures of the complex trait of interest and transcriptional profiles are obtained in disjoint subsamples. A conventional multivariate statistical method, canonical correlation analysis, is used to relate the reference traits and traits of interest to identify gene expression correlates. We evaluate power and sample size requirements of this methodology, as well as performance relative to other methods, via extensive simulation and analysis of a behavioral genetics experiment in 258 Diversity Outbred mice involving two independent sets of anxiety-related behaviors and hippocampal gene expression. After splitting the data set and hiding one set of anxiety-related traits in half the samples, we identified transcripts correlated with the hidden traits using the other set of anxiety-related traits and exploiting the highest canonical correlation (R = 0.69) between the trait data sets. 
We demonstrate that this approach outperforms TWAS in identifying associated transcripts. Together, these results demonstrate the validity, reliability, and power of reference trait analysis for identifying relations between complex traits and their molecular substrates.
|
46
|
Correlations between relatives: From Mendelian theory to complete genome sequence. Genet Epidemiol 2019; 43:577-591. [PMID: 31045279 PMCID: PMC6559867 DOI: 10.1002/gepi.22206] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/20/2018] [Revised: 03/04/2019] [Accepted: 03/25/2019] [Indexed: 12/19/2022]
Abstract
It is 100 years since R. A. Fisher proposed that a Mendelian model of genetic variant effects, additive over loci, could explain the patterns of observed phenotypic correlations between relatives. His loci were hypothetical and his model theoretical. It is only about 50 years since the first genetic markers allowed the detection of even variants with major effects on phenotype, and only 20 years since the development of single-nucleotide polymorphism technology provided dense markers over the genome. Then both mapping in defined pedigrees and population-based genome-wide association study samples allowed the detection of multiple contributing variants of smaller effect. Finally, with methods based on genotypic correlations between individuals, or on allelic associations between loci, the additive heritability contributions of the genome can be estimated from large population samples. In this review we trace, from 1918 to 2018, the analysis of observed phenotypic correlations between relatives to estimate underlying genetic components of traits in human populations. As with studies from 1918 onward, we use height as the example trait, where not only are data readily available, but Fisher's model of large numbers of variants of infinitesimal effect also appears to provide a good approximation to reality. However, we also trace the use of phenotypic and genotypic correlations between relatives in mapping causal variants and resolving genetic contributions to more complex human traits. With the availability of DNA sequence data, we can hope not only to estimate the total genetic contribution to a trait, but also to resolve effects of individual genetic variants on biological function.
|
47
|
Genome-Wide Mapping of Gene-Phenotype Relationships in Experimentally Evolved Populations. Mol Biol Evol 2019; 35:2085-2095. [PMID: 29860403 DOI: 10.1093/molbev/msy113] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/14/2022] Open
Abstract
Model organisms subjected to sustained experimental evolution often show levels of phenotypic differentiation that dramatically exceed the phenotypic differences observed in natural populations. Genome-wide sequencing of pooled populations then offers the opportunity to make inferences about the genes that are the cause of these phenotypic differences. We tested, through computer simulations, the efficacy of a statistical learning technique called the "fused lasso additive model" (FLAM). We focused on the ability of FLAM to distinguish between genes which are differentiated and directly affect a phenotype from differentiated genes which have no effect on the phenotype. FLAM can separate these two classes of genes even with relatively small samples (10 populations, in total). The efficacy of FLAM is improved with increased number of populations, reduced environmental phenotypic variation, and increased within-treatment among-replicate variation. FLAM was applied to SNP variation measured in both twenty-population and thirty-population studies of Drosophila subjected to selection for age-at-reproduction, to illustrate the application of the method.
|
48
|
Estimation of metabolic syndrome heritability in three large populations including full pedigree and genomic information. Hum Genet 2019; 138:739-748. [PMID: 31154530 DOI: 10.1007/s00439-019-02024-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/07/2018] [Accepted: 04/29/2019] [Indexed: 01/02/2023]
Abstract
Metabolic syndrome is a complex human disorder characterized by a cluster of conditions (increased blood pressure, hyperglycemia, excessive body fat around the waist, and abnormal cholesterol or triglyceride levels). Any of these conditions increases the risk of serious disorders such as diabetes or cardiovascular disease. Currently, the degree of genetic regulation of this syndrome is under debate and partially unknown. The principal aim of this study was to estimate the genetic component and the common environmental effects in different populations using full pedigree and genomic information. We used three large populations (Gubbio, ARIC, and Ogliastra cohorts) to estimate the heritability of metabolic syndrome. Because both pedigree and genotype data were available, different approaches were applied to summarize relatedness. Linear mixed models (LMM) using the average information restricted maximum likelihood (AIREML) algorithm were applied to partition the variances and estimate heritability (h2) and the common sib-household effect (c2). Globally, results obtained from pedigree information showed a significant heritability (h2: 0.286 and 0.271 in Gubbio and Ogliastra, respectively), whereas a lower, but still significant, heritability was found using SNP data ([Formula: see text]: 0.167 and 0.254 in ARIC and Ogliastra). The remaining heritability between h2 and [Formula: see text] ranged between 0.031 and 0.237. Finally, the common environmental effects (c2) in Gubbio and Ogliastra were also significant, accounting for about 11% of the phenotypic variance. The availability of different kinds of populations and data helped us to better understand what happens when the heritability of metabolic syndrome is estimated and to account for different possible confounders. Furthermore, the opportunity to compare different results provided a more precise and less biased estimation of heritability.
|
49
|
Abstract
In recent years of animal and plant breeding research, genomic selection (GS) has become a method of choice for selecting appropriate breeding candidates, as it contributes significantly to enhancing genetic gain. Various studies related to GS have been carried out in the recent past. These studies were mostly confined to a single trait. However, GS methods based on a single trait have not performed very well in the presence of pleiotropy or missing data, or when the trait under study has low heritability. Gradually, some studies were carried out to explore GS methods based on multiple traits, with a view to overcoming the above-mentioned problems of single-trait GS (STGS). Currently, multi-trait-based GS methods are gaining importance because they exploit the correlation structure among responses. In this study, we have compared various methods related to STGS, such as stepwise regression, ridge regression, least absolute shrinkage and selection operator (LASSO), Bayesian, best linear unbiased prediction, and support vector machine, and multi-trait-based GS methods, such as multivariate regression with covariance estimation, conditional Gaussian graphical models, mixed model, and LASSO. In almost all cases, multi-trait-based methods are found to be more accurate. Based on the results of this study, it may be concluded that multi-trait-based methods have great potential to increase genetic gain, as they utilize the correlation among the response variables as extra information, which helps to estimate breeding values more precisely. This study is a comprehensive review of GS methods, from single trait to multiple traits, and a comparison between these two classes.
|
50
|
Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun 2019; 10:1776. [PMID: 30992449 PMCID: PMC6467998 DOI: 10.1038/s41467-019-09718-5] [Citation(s) in RCA: 663] [Impact Index Per Article: 132.6] [Received: 11/16/2018] [Accepted: 03/25/2019] [Indexed: 01/23/2023] Open
Abstract
Polygenic risk scores (PRS) have shown promise in predicting human complex traits and diseases. Here, we present PRS-CS, a polygenic prediction method that infers posterior effect sizes of single nucleotide polymorphisms (SNPs) using genome-wide association summary statistics and an external linkage disequilibrium (LD) reference panel. PRS-CS utilizes a high-dimensional Bayesian regression framework, and is distinct from previous work by placing a continuous shrinkage (CS) prior on SNP effect sizes, which is robust to varying genetic architectures, provides substantial computational advantages, and enables multivariate modeling of local LD patterns. Simulation studies using data from the UK Biobank show that PRS-CS outperforms existing methods across a wide range of genetic architectures, especially when the training sample size is large. We apply PRS-CS to predict six common complex diseases and six quantitative traits in the Partners HealthCare Biobank, and further demonstrate the improvement of PRS-CS in prediction accuracy over alternative methods.
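Whatever method supplies the per-SNP effect sizes, the final scoring step of a polygenic risk score is a weighted sum of allele dosages. The sketch below shows only that step; it does not implement the PRS-CS shrinkage model, and the function name and array layout are assumptions for illustration.

```python
import numpy as np

def polygenic_score(dosages, effect_sizes):
    """Score each individual as sum_j dosage_ij * effect_j.

    dosages: (n_individuals, n_snps) allele counts in [0, 2]
    effect_sizes: (n_snps,) per-allele weights, e.g. posterior means
                  produced by some shrinkage method (assumed given)
    """
    dosages = np.asarray(dosages, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    if dosages.shape[1] != effect_sizes.shape[0]:
        raise ValueError("one effect size is required per SNP column")
    return dosages @ effect_sizes
```

The effect alleles in `dosages` must match the alleles the weights were estimated for; a strand or allele mismatch silently flips the sign of a SNP's contribution.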
|