1
Anglhuber C, Edel C, Pimentel ECG, Emmerling R, Götz KU, Thaller G. Definition of metafounders based on population structure analysis. Genet Sel Evol 2024; 56:43. [PMID: 38844876 PMCID: PMC11536677 DOI: 10.1186/s12711-024-00913-7] [Received: 09/11/2023] [Accepted: 05/22/2024] [Indexed: 11/07/2024] Open
Abstract
BACKGROUND: Limitations of the concept of identity by descent in the presence of stratification within a breeding population may lead to an incomplete formulation of the conventional numerator relationship matrix (A). Combining A with the genomic relationship matrix (G) in a single-step approach for genetic evaluation may cause inconsistencies that can be a source of bias in the resulting predictions. The objective of this study was to identify stratification using genomic data and to transfer this information to matrix A, to improve the compatibility of A and G.
METHODS: Using software to detect population stratification (ADMIXTURE), we developed an iterative approach. First, we identified 2 to 40 strata (k) with ADMIXTURE, which we then introduced in a stepwise manner into matrix A to generate matrix A^Γ using the metafounder methodology. Improvements in consistency between matrices G and A^Γ were evaluated by regression analysis and by comparing the overall mean and mean diagonal values of both matrices. The approach was tested on genotype and pedigree information of European and North American Brown Swiss animals (n = 85,249). Analyses with ADMIXTURE were initially performed on the full set of genotypes (S1). In addition, we used an alternative dataset in which we avoided sampling closely related animals (S2).
RESULTS: Regressing the standard A on G gave -0.489, 0.780 and 0.647 for the intercept, slope and fit of the regression, respectively. When analysing the S1 data, the corresponding values for the regression of A^Γ on G were -0.028, 1.087 and 0.807 for k = 7, while there was no clear optimum k. Analyses of S2 gave a clear optimum at k = 24, with -0.020, 0.998 and 0.817 as results of the regression. For this k, differences in mean and mean diagonal values between both matrices were negligible.
CONCLUSIONS: The derivation of hidden stratification information based on genotyped animals and its integration into A considerably improved the compatibility of the resulting A^Γ and G compared to the initial situation. In dairy breeding populations with large half-sib families as sub-structures, it is necessary to balance the data when applying population structure analysis to obtain meaningful results.
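The consistency check described in the abstract above (regressing the elements of a pedigree-based matrix on the matching elements of G, and comparing overall means and mean diagonals) can be sketched as follows. This is only an illustration under assumptions not stated in the abstract: all elements of both matrices are used, "fit" is taken to mean R², and the function name `compare_relationship_matrices` and the toy matrices are hypothetical.

```python
from statistics import mean

def compare_relationship_matrices(A, G):
    """Regress elements of A (y) on the matching elements of G (x)."""
    n = len(A)
    # Assumption: every element (diagonal and off-diagonal) enters the regression.
    x = [G[i][j] for i in range(n) for j in range(n)]
    y = [A[i][j] for i in range(n) for j in range(n)]
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)  # assumption: "fit" means R-squared
    return {
        "intercept": intercept,
        "slope": slope,
        "r2": r2,
        # Differences in overall mean and in mean diagonal (A minus G).
        "mean_diff": my - mx,
        "mean_diag_diff": mean(A[i][i] for i in range(n))
        - mean(G[i][i] for i in range(n)),
    }

# Toy 3x3 matrices with made-up values, for illustration only.
G = [[1.00, 0.20, 0.10], [0.20, 1.10, 0.30], [0.10, 0.30, 0.90]]
A = [[1.00, 0.25, 0.00], [0.25, 1.00, 0.25], [0.00, 0.25, 1.00]]
result = compare_relationship_matrices(A, G)
```

Under this reading, two fully compatible matrices would give an intercept near 0, a slope near 1 and negligible differences in mean and mean diagonal, which is the pattern the abstract reports for k = 24 on S2.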
Affiliation(s)
- Christine Anglhuber
  - Bavarian State Research Center for Agriculture, Institute for Animal Breeding, Prof. Duerrwaechter Platz 1, 85586, Grub, Germany
  - Institute for Animal Breeding and Husbandry, Christian-Albrechts-Universität, Olshausenstraße 40, 24098, Kiel, Germany
- Christian Edel
  - Bavarian State Research Center for Agriculture, Institute for Animal Breeding, Prof. Duerrwaechter Platz 1, 85586, Grub, Germany
- Eduardo C G Pimentel
  - Bavarian State Research Center for Agriculture, Institute for Animal Breeding, Prof. Duerrwaechter Platz 1, 85586, Grub, Germany
- Reiner Emmerling
  - Bavarian State Research Center for Agriculture, Institute for Animal Breeding, Prof. Duerrwaechter Platz 1, 85586, Grub, Germany
- Kay-Uwe Götz
  - Bavarian State Research Center for Agriculture, Institute for Animal Breeding, Prof. Duerrwaechter Platz 1, 85586, Grub, Germany
- Georg Thaller
  - Institute for Animal Breeding and Husbandry, Christian-Albrechts-Universität, Olshausenstraße 40, 24098, Kiel, Germany
2
Alemu A, Åstrand J, Montesinos-López OA, Isidro Y Sánchez J, Fernández-Gónzalez J, Tadesse W, Vetukuri RR, Carlsson AS, Ceplitis A, Crossa J, Ortiz R, Chawade A. Genomic selection in plant breeding: Key factors shaping two decades of progress. Molecular Plant 2024; 17:552-578. [PMID: 38475993 DOI: 10.1016/j.molp.2024.03.007] [Received: 11/03/2023] [Revised: 01/22/2024] [Accepted: 03/08/2024] [Indexed: 03/14/2024]
Abstract
Genomic selection, the application of genomic prediction (GP) models to select candidate individuals, has significantly advanced in the past two decades, effectively accelerating genetic gains in plant breeding. This article provides a holistic overview of key factors that have influenced GP in plant breeding during this period. We delved into the pivotal roles of training population size and genetic diversity, and their relationship with the breeding population, in determining GP accuracy. Special emphasis was placed on optimizing training population size. We explored its benefits and the diminishing returns beyond an optimum size, while considering the balance between resource allocation and maximizing prediction accuracy through current optimization algorithms. The density and distribution of single-nucleotide polymorphisms, level of linkage disequilibrium, genetic complexity, trait heritability, statistical machine-learning methods, and non-additive effects are the other vital factors. Using wheat, maize, and potato as examples, we summarize the effect of these factors on the accuracy of GP for various traits. The search for high accuracy in GP (theoretically reaching one when Pearson's correlation is used as the metric) is an active research area, and accuracy is as yet far from optimal for various traits. We hypothesize that very large genotypic and phenotypic datasets, effective training population optimization methods, and support from other omics approaches (transcriptomics, metabolomics and proteomics), coupled with deep-learning algorithms, could overcome current limitations to achieve the highest possible prediction accuracy, making genomic selection an effective tool in plant breeding.
Affiliation(s)
- Admas Alemu
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden
- Johanna Åstrand
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden; Lantmännen Lantbruk, Svalöv, Sweden
- Julio Isidro Y Sánchez
  - Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223 Madrid, Spain
- Javier Fernández-Gónzalez
  - Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223 Madrid, Spain
- Wuletaw Tadesse
  - International Center for Agricultural Research in the Dry Areas (ICARDA), Rabat, Morocco
- Ramesh R Vetukuri
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden
- Anders S Carlsson
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden
- José Crossa
  - International Maize and Wheat Improvement Center (CIMMYT), Km 45, Carretera México-Veracruz, Texcoco, México 52640, Mexico
- Rodomiro Ortiz
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden
- Aakash Chawade
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, Alnarp, Sweden
3
Bermann M, Legarra A, Munera AA, Misztal I, Lourenco D. Confidence intervals for validation statistics with data truncation in genomic prediction. Genet Sel Evol 2024; 56:18. [PMID: 38459504 PMCID: PMC11234739 DOI: 10.1186/s12711-024-00883-w] [Received: 09/20/2023] [Accepted: 01/31/2024] [Indexed: 03/10/2024] Open
Abstract
BACKGROUND: Validation by data truncation is a common practice in genetic evaluations because of the interest in predicting the genetic merit of a set of young selection candidates. Two of the most widely used validation methods in genetic evaluations use a single data partition: predictivity or predictive ability (correlation between pre-adjusted phenotypes and estimated breeding values (EBV) divided by the square root of the heritability) and the linear regression (LR) method (comparison of "early" and "late" EBV). Both methods compare predictions from the whole dataset with those from a partial dataset obtained by removing the information related to a set of validation individuals. EBV obtained with the partial dataset are compared against adjusted phenotypes for the predictivity, or against EBV obtained with the whole dataset in the LR method. Confidence intervals for predictivity and the LR method can be obtained by replicating the validation for different samples (or folds), or by bootstrapping. Analytical confidence intervals would be beneficial to avoid running several validations and to test the quality of the bootstrap intervals. However, analytical confidence intervals are unavailable for predictivity and the LR method.
RESULTS: We derived standard errors and Wald confidence intervals for the predictivity and the statistics included in the LR method (bias, dispersion, ratio of accuracies, and reliability). The confidence intervals for the bias, dispersion, and reliability depend on the relationships and on the prediction error variances and covariances across the individuals in the validation set. We developed approximations for large datasets that only need the reliabilities of the individuals in the validation set. The confidence intervals for the ratio of accuracies and predictivity were obtained through the Fisher transformation. We show the adequacy of both the analytical and approximated analytical confidence intervals and compare them with bootstrap confidence intervals using two simulated examples. The analytical confidence intervals were closer to the simulated ones for both examples. Bootstrap confidence intervals tend to be narrower than the simulated ones. The approximated analytical confidence intervals were similar to those obtained by bootstrapping.
CONCLUSIONS: Estimating the sampling variation of predictivity and the statistics in the LR method without replication or bootstrap is possible for any dataset with the formulas presented in this study.
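As a rough illustration of two ideas named in the abstract above, the sketch below computes predictivity as defined there (correlation between adjusted phenotypes and EBV, divided by the square root of the heritability) and a textbook Fisher-transformation interval for a correlation. It is not the authors' derivation: the standard error 1/sqrt(n - 3) assumes independent pairs, which the study's formulas may not, and all function names and example values are hypothetical.

```python
import math
from statistics import NormalDist

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def predictivity(adjusted_phenotypes, ebv, h2):
    """Correlation of adjusted phenotypes with EBV, scaled by sqrt(h2)."""
    return pearson(adjusted_phenotypes, ebv) / math.sqrt(h2)

def fisher_ci(r, n, level=0.95):
    """Textbook Fisher z interval for a correlation r from n pairs."""
    z = math.atanh(r)                 # Fisher transformation
    se = 1.0 / math.sqrt(n - 3)       # assumption: independent pairs
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Back-transform the interval limits to the correlation scale.
    return math.tanh(z - q * se), math.tanh(z + q * se)

# Hypothetical example: correlation 0.40 observed on 200 validation animals.
low, high = fisher_ci(r=0.40, n=200)
```

Working on the transformed scale keeps the back-transformed limits inside (-1, 1) and makes the interval asymmetric around r, which a plain Wald interval on the correlation itself would not do.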
Affiliation(s)
- Matias Bermann
  - Department of Animal and Dairy Science, University of Georgia, Athens, GA, 30602, USA
- Andres Legarra
  - Council on Dairy Cattle Breeding (CDCB), Bowie, MD, 20716, USA
- Ignacy Misztal
  - Department of Animal and Dairy Science, University of Georgia, Athens, GA, 30602, USA
- Daniela Lourenco
  - Department of Animal and Dairy Science, University of Georgia, Athens, GA, 30602, USA
4
Simiqueli GF, Resende RT, Takahashi EK, de Sousa JE, Grattapaglia D. Realized genomic selection across generations in a reciprocal recurrent selection breeding program of Eucalyptus hybrids. Frontiers in Plant Science 2023; 14:1252504. [PMID: 37965018 PMCID: PMC10641691 DOI: 10.3389/fpls.2023.1252504] [Received: 07/03/2023] [Accepted: 09/29/2023] [Indexed: 11/16/2023]
Abstract
Introduction: Genomic selection (GS) experiments in forest trees have largely reported estimates of predictive abilities from cross-validation among individuals in the same breeding generation. In such conditions, no effects of recombination, selection, drift, and environmental changes are accounted for. Here, we assessed the effectively realized predictive ability (RPA) for volume growth at harvest age by GS across generations in an operational reciprocal recurrent selection (RRS) program of hybrid Eucalyptus.
Methods: Genomic best linear unbiased prediction with additive (GBLUP_G), additive plus dominance (GBLUP_G+D), and additive single-step (HBLUP) models were trained with different combinations of growth data of hybrids and pure species individuals (N = 17,462) of the G1 generation, 1,944 of which were genotyped with ~16,000 SNPs from SNP arrays. The hybrid G2 progeny trial (HPT267) was the GS target, with 1,400 selection candidates, 197 of which were genotyped while still at the seedling stage and genomically predicted for their breeding and genotypic values at the operational harvest age (6 years). Seedlings were then grown to harvest and measured, and their pedigree-based breeding and genotypic values were compared to their originally predicted genomic counterparts.
Results: Genomic RPAs ≥0.80 were obtained as the genetic relatedness between G1 and G2 increased, especially when the direct parents of selection candidates were used in training. GBLUP_G+D reached RPAs ≥0.70 only when hybrid or pure species data of G1 were included in training. HBLUP was only marginally better than GBLUP. Correlations ≥0.80 were obtained between pedigree and genomic individual ranks. Rank coincidence of the top 2.5% selections was higher for GBLUP_G (45% to 60%) than for GBLUP_G+D. To advance the pure species RRS populations, GS models performed better when trained on pure species data than on hybrid data, and HBLUP yielded ~20% higher predictive abilities than GBLUP, but was not better than ABLUP for ungenotyped trees.
Discussion: We demonstrate that genomic data effectively enable accurate ranking of eucalypt hybrid seedlings for their yet-to-be-observed volume growth at harvest age. Our results support a two-stage GS approach involving family selection by average genomic breeding value, followed by individual GS within the top families, significantly increasing selection intensity, optimizing genotyping costs, and accelerating RRS breeding.
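The "rank coincidence of the top 2.5% selections" reported above can be read as the share of individuals that appear in the top fraction under both rankings. The sketch below follows that reading, which is an assumption since the abstract does not define the statistic; the function name and the toy values are hypothetical.

```python
import math

def top_fraction_coincidence(predicted, observed, fraction=0.025):
    """Share of the top `fraction` by predicted value also in the top by observed value.

    `predicted` and `observed` map individual IDs to values (higher is better).
    """
    n_top = max(1, math.ceil(fraction * len(predicted)))
    # Sort IDs by value, best first, and keep the top n_top under each ranking.
    top_pred = set(sorted(predicted, key=predicted.get, reverse=True)[:n_top])
    top_obs = set(sorted(observed, key=observed.get, reverse=True)[:n_top])
    return len(top_pred & top_obs) / n_top

# Toy example with made-up values for four trees.
genomic = {"t1": 3.0, "t2": 2.0, "t3": 1.0, "t4": 0.0}
pedigree = {"t1": 2.5, "t2": 3.0, "t3": 0.0, "t4": 1.0}
half = top_fraction_coincidence(genomic, pedigree, fraction=0.5)
quarter = top_fraction_coincidence(genomic, pedigree, fraction=0.25)
```

Unlike a rank correlation over all candidates, this measure only looks at the selected tail, so two methods can correlate highly overall and still disagree on which individuals are selected.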
Affiliation(s)
- Rafael Tassinari Resende
  - School of Agronomy, Federal University of Goiás (UFG), Goiânia, GO, Brazil
  - Department of Forestry, University of Brasília (UnB), Brasília, DF, Brazil
- Dario Grattapaglia
  - Plant Genetics Laboratory, EMBRAPA Genetic Resources and Biotechnology, Brasilia, Brazil