1. Hu H, Rincent R, Runcie DE. MegaLMM improves genomic predictions in new environments using environmental covariates. Genetics 2025; 229:1-41. [PMID: 39471330 PMCID: PMC11708919 DOI: 10.1093/genetics/iyae171] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2024] [Revised: 09/19/2024] [Accepted: 09/25/2024] [Indexed: 11/01/2024]
Abstract
Multienvironment trials (METs) are crucial for identifying varieties that perform well across a target population of environments. However, METs are typically too small to sufficiently represent all relevant environment-types, and face challenges from changing environment-types due to climate change. Statistical methods that enable prediction of variety performance for new environments beyond the METs are needed. We recently developed MegaLMM, a statistical model that can leverage hundreds of trials to significantly improve genetic value prediction accuracy within METs. Here, we extend MegaLMM to enable genomic prediction in new environments by learning regressions of latent factor loadings on Environmental Covariates (ECs) across trials. We evaluated the extended MegaLMM using the maize Genome-To-Fields dataset, consisting of 4,402 varieties cultivated in 195 trials with 87.1% of phenotypic values missing, and demonstrated its high accuracy in genomic prediction under various breeding scenarios. Furthermore, we showcased MegaLMM's superiority over univariate GBLUP in predicting trait performance of experimental genotypes in new environments. Finally, we explored the use of higher-dimensional quantitative ECs and discussed when and how detailed environmental data can be leveraged for genomic prediction from METs. We propose that MegaLMM can be applied to plant breeding of diverse crops and different fields of genetics where large-scale linear mixed models are utilized.
Affiliation(s)
- Haixiao Hu
  - Department of Plant Sciences, University of California Davis, Davis, CA 95616, USA
- Renaud Rincent
  - GQE - Le Moulon Université Paris-Saclay, INRAE, CNRS, AgroParisTech, Gif-sur-Yvette 91190, France
- Daniel E Runcie
  - Department of Plant Sciences, University of California Davis, Davis, CA 95616, USA
2. Hou K, Xu Z, Ding Y, Mandla R, Shi Z, Boulier K, Harpak A, Pasaniuc B. Calibrated prediction intervals for polygenic scores across diverse contexts. Nat Genet 2024; 56:1386-1396. [PMID: 38886587 PMCID: PMC11465192 DOI: 10.1038/s41588-024-01792-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 05/08/2024] [Indexed: 06/20/2024]
Abstract
Polygenic scores (PGS) have emerged as the tool of choice for genomic prediction in a wide range of fields. We show that PGS performance varies broadly across contexts and biobanks. Contexts such as age, sex and income can impact PGS accuracy with similar magnitudes as genetic ancestry. Here we introduce an approach (CalPred) that models all contexts jointly to produce prediction intervals that vary across contexts to achieve calibration (include the trait with 90% probability), whereas existing methods are miscalibrated. In analyses of 72 traits across large and diverse biobanks (All of Us and UK Biobank), we find that prediction intervals required adjustment by up to 80% for quantitative traits. For disease traits, PGS-based predictions were miscalibrated across socioeconomic contexts such as annual household income levels, further highlighting the need of accounting for context information in PGS-based prediction across diverse populations.
Affiliation(s)
- Kangcheng Hou
  - Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, CA, USA
- Ziqi Xu
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, USA
- Yi Ding
  - Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, CA, USA
- Ravi Mandla
  - Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, CA, USA
- Zhuozheng Shi
  - Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, CA, USA
- Kristin Boulier
  - Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, CA, USA
- Arbel Harpak
  - Department of Population Health, The University of Texas at Austin, Austin, TX, USA
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, USA
- Bogdan Pasaniuc
  - Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, CA, USA
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA
  - Institute for Precision Health, University of California Los Angeles, Los Angeles, CA, USA
3. Tiezzi F, Goda K, Morgante F. Using lifestyle information in polygenic modeling of blood pressure traits: a simple method to reduce bias. bioRxiv 2024:2024.06.05.597631. [PMID: 38895222 PMCID: PMC11185601 DOI: 10.1101/2024.06.05.597631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
Complex traits are determined by the effects of multiple genetic variants, multiple environmental factors, and potentially their interaction. Predicting complex trait phenotypes from genotypes is a fundamental task in quantitative genetics that was pioneered in agricultural breeding for selection purposes. However, it has recently become important in human genetics. While prediction accuracy for some human complex traits is appreciable, this remains low for most traits. A promising way to improve prediction accuracy is by including not only genetic information but also environmental information in prediction models. However, environmental factors can, in turn, be genetically determined. This phenomenon gives rise to a correlation between the genetic and environmental components of the phenotype, which violates the assumption of independence between the genetic and environmental components of most statistical methods for polygenic modeling. In this work, we investigated the impact of including 27 lifestyle variables as well as genotype information (and their interaction) for predicting diastolic blood pressure, systolic blood pressure, and pulse pressure in older individuals in UK Biobank. The 27 lifestyle variables were included as either raw variables or adjusted by genetic and other non-genetic factors. The results show that including both lifestyle and genetic data improved prediction accuracy compared to using either piece of information alone. Both prediction accuracy and bias can improve substantially for some traits when the models account for the lifestyle variables after their proper adjustment. Our work confirms the utility of including environmental information in polygenic models of complex traits and highlights the importance of proper handling of the environmental variables.
Affiliation(s)
- Francesco Tiezzi
  - Department of Agriculture, Food, Environment and Forestry (DAGRI), University of Florence, Florence, Italy
  - Center for Human Genetics, Clemson University, Greenwood, SC, USA
- Khushi Goda
  - Center for Human Genetics, Clemson University, Greenwood, SC, USA
  - Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
- Fabio Morgante
  - Center for Human Genetics, Clemson University, Greenwood, SC, USA
  - Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
4. Hou K, Xu Z, Ding Y, Harpak A, Pasaniuc B. Calibrated prediction intervals for polygenic scores across diverse contexts. medRxiv 2023:2023.07.24.23293056. [PMID: 37546999 PMCID: PMC10402211 DOI: 10.1101/2023.07.24.23293056] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 08/08/2023]
Abstract
Polygenic scores (PGS) have emerged as the tool of choice for genomic prediction in a wide range of fields from agriculture to personalized medicine. We analyze data from two large biobanks in the US (All of Us) and the UK (UK Biobank) to find widespread variability in PGS performance across contexts. Many contexts, including age, sex, and income, impact PGS accuracies with similar magnitudes as genetic ancestry. PGSs trained in single versus multi-ancestry cohorts show similar context-specificity in their accuracies. We introduce trait prediction intervals that are allowed to vary across contexts as a principled approach to account for context-specific PGS accuracy in genomic prediction. We model the impact of all contexts in a joint framework to enable PGS-based trait predictions that are well-calibrated (contain the trait value with 90% probability in all contexts), whereas methods that ignore context are mis-calibrated. We show that prediction intervals need to be adjusted for all considered traits ranging from 10% for diastolic blood pressure to 80% for waist circumference. Adjustment of prediction intervals depends on the dataset; for example, prediction intervals for education years need to be adjusted by 90% in All of Us versus 8% in UK Biobank. Our results provide a path forward towards utilization of PGS as a prediction tool across all individuals regardless of their contexts while highlighting the importance of comprehensive profile of context information in study design and data collection.
Affiliation(s)
- Kangcheng Hou
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, USA
- Ziqi Xu
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Yi Ding
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, USA
- Arbel Harpak
  - Department of Population Health, The University of Texas at Austin, Austin, TX, USA
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, USA
- Bogdan Pasaniuc
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Institute for Precision Health, University of California, Los Angeles, Los Angeles