1
Zhou H, McPeek MS. Overcoming the "feast or famine" effect: improved interaction testing in genome-wide association studies. bioRxiv 2025:2024.02.13.580168. [PMID: 38405994 PMCID: PMC10888770 DOI: 10.1101/2024.02.13.580168] [Indexed: 02/27/2024]
Abstract
In genetic association analysis of complex traits, detection of interaction (either GxG or GxE) can help to elucidate the genetic architecture and biological mechanisms underlying the trait. Detection of interaction in a genome-wide interaction study (GWIS) can be methodologically challenging for various reasons, including a high burden of multiple comparisons when testing for epistasis between all possible pairs of a set of genomewide variants, as well as heteroscedasticity effects occurring in the presence of GxG or GxE interaction. In this paper, we address the problem of an even more striking phenomenon that we call the "feast or famine" effect that occurs when testing interaction in a genomewide context. We show that in any given GxE GWIS, the type 1 error of standard interaction tests performed genomewide can vary widely from the nominal level, where the actual type 1 error in any given GWIS varies as a predictable function of the observed trait and environmental values. Using standard methods, some GWISs will have systematically underinflated p-values ("feast"), and others will have systematically overinflated p-values ("famine"), which can lead to false detection of interaction, reduced power, inconsistent results across studies, and failure to replicate true signal. This startling phenomenon is specific to detection of interaction in a GWIS, and it may partly explain why such detection has often proved challenging and difficult to replicate. 
We show that the feast or famine effect occurs across a wide range of GxE analysis methods, including but not limited to (1) testing interaction in a linear or linear mixed model (LMM) using standard approaches such as t-tests/Wald tests, likelihood ratio tests, or score tests; (2) doing a combined interaction-association test in a linear model or LMM using standard approaches; (3) testing interaction with multiple environments or multiple SNPs, where these are modeled as random effects in a LMM using standard approaches; (4) performing tests of interaction in a GWIS where significance is assessed using permutation of the trait residuals. We show theoretically that the key cause of this phenomenon is which variables are conditioned on in the analysis. Using this insight, we have developed (i) a diagnostic ratio to detect which GWASs are subject to a strong "feast or famine" effect and (ii) the TINGA method to adjust the interaction test statistics to make their p-values approximately uniform under the null hypothesis. In simulations we show that TINGA both controls type 1 error and improves power. TINGA allows for covariates and population structure through use of a linear mixed model and accounts for heteroscedasticity. We apply TINGA to detection of epistasis in a study of flowering time in Arabidopsis thaliana.
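For orientation, the "standard approach" named in item (1) can be sketched as an ordinary least-squares Wald t-test on the GxE coefficient. This is only an illustration of the kind of test the abstract says is affected, not the TINGA adjustment. The function name `interaction_t_stat`, the model `y ~ 1 + g + e + g*e`, and the omission of covariates and random effects are assumptions made here for brevity.

```python
import numpy as np

def interaction_t_stat(y, g, e):
    """OLS t-statistic for the G x E coefficient in y ~ 1 + g + e + g*e.

    Hypothetical helper for illustration; returns (t_statistic, residual_dof).
    """
    y, g, e = (np.asarray(v, dtype=float) for v in (y, g, e))
    # Design matrix: intercept, genotype, environment, interaction.
    X = np.column_stack([np.ones_like(y), g, e, g * e])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof              # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
    return beta[3] / np.sqrt(cov[3, 3]), dof
```

In a GWIS this statistic would be computed once per SNP and compared with a t reference distribution; the abstract's claim is that this reference can be systematically off for a given study, depending on which variables are conditioned on.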
Affiliation(s)
- Huanlin Zhou
  - Department of Statistics, The University of Chicago, Chicago, Illinois, U.S.A.
- Mary Sara McPeek
  - Department of Statistics, The University of Chicago, Chicago, Illinois, U.S.A.
  - Department of Human Genetics, The University of Chicago, Chicago, Illinois, U.S.A.
2
Washburn JD, Varela JI, Xavier A, Chen Q, Ertl D, Gage JL, Holland JB, Lima DC, Romay MC, Lopez-Cruz M, de los Campos G, Barber W, Zimmer C, Trucillo Silva I, Rocha F, Rincent R, Ali B, Hu H, Runcie DE, Gusev K, Slabodkin A, Bax P, Aubert J, Gangloff H, Mary-Huard T, Vanrenterghem T, Quesada-Traver C, Yates S, Ariza-Suárez D, Ulrich A, Wyler M, Kick DR, Bellis ES, Causey JL, Soriano Chavez E, Wang Y, Piyush V, Fernando GD, Hu RK, Kumar R, Timon AJ, Venkatesh R, Segura Abá K, Chen H, Ranaweera T, Shiu SH, Wang P, Gordon MJ, Amos BK, Busato S, Perondi D, Gogna A, Psaroudakis D, Chen CPJ, Al-Mamun HA, Danilevicz MF, Upadhyaya SR, Edwards D, de Leon N. Global genotype by environment prediction competition reveals that diverse modeling strategies can deliver satisfactory maize yield estimates. Genetics 2025; 229:iyae195. [PMID: 39576009 PMCID: PMC12054733 DOI: 10.1093/genetics/iyae195] [Received: 09/13/2024] [Accepted: 11/13/2024] [Indexed: 11/27/2024] Open
Abstract
Predicting phenotypes from a combination of genetic and environmental factors is a grand challenge of modern biology. Slight improvements in this area have the potential to save lives, improve food and fuel security, permit better care of the planet, and create other positive outcomes. In 2022 and 2023, the first open-to-the-public Genomes to Fields initiative Genotype by Environment prediction competition was held using a large dataset including genomic variation, phenotype and weather measurements, and field management notes gathered by the project over 9 years. The competition attracted registrants from around the world with representation from academic, government, industry, and nonprofit institutions as well as unaffiliated. These participants came from diverse disciplines, including plant science, animal science, breeding, statistics, computational biology, and others. Some participants had no formal genetics or plant-related training, and some were just beginning their graduate education. The teams applied varied methods and strategies, providing a wealth of modeling knowledge based on a common dataset. The winner's strategy involved 2 models combining machine learning and traditional breeding tools: 1 model emphasized environment using features extracted by random forest, ridge regression, and least squares, and 1 focused on genetics. Other high-performing teams' methods included quantitative genetics, machine learning/deep learning, mechanistic models, and model ensembles. The dataset factors used, such as genetics, weather, and management data, were also diverse, demonstrating that no single model or strategy is far superior to all others within the context of this competition.
Affiliation(s)
- Jacob D Washburn
  - USDA-ARS, MWA-PGRU, 302-A Curtis Hall, University of Missouri, Columbia, MO 65211, USA
- José Ignacio Varela
  - Department of Plant and Agroecosystem Sciences, University of Wisconsin-Madison, 1575 Linden Drive, Madison, WI 53706, USA
  - Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA 50131, USA
- Alencar Xavier
  - Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA 50131, USA
  - Department of Agronomy, Purdue University, 915 Mitch Daniels Blvd, West Lafayette, IN 47907, USA
- Qiuyue Chen
  - Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC 27695, USA
- David Ertl
  - Iowa Corn Promotion Board, Johnston, IA 50131, USA
- Joseph L Gage
  - Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC 27695, USA
- James B Holland
  - Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC 27695, USA
  - USDA-ARS, Plant Science Research Unit, Raleigh, NC 27695, USA
- Dayane Cristina Lima
  - Department of Plant and Agroecosystem Sciences, University of Wisconsin-Madison, 1575 Linden Drive, Madison, WI 53706, USA
- Maria Cinta Romay
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA
- Marco Lopez-Cruz
  - Departments of Epidemiology and Biostatistics and Statistics and Probability, and Institute for Quantitative Health Science and Engineering, Michigan State University, 775 Woodlot Dr, East Lansing, MI 48823, USA
- Gustavo de los Campos
  - Departments of Epidemiology and Biostatistics and Statistics and Probability, and Institute for Quantitative Health Science and Engineering, Michigan State University, 775 Woodlot Dr, East Lansing, MI 48823, USA
- Wesley Barber
  - Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA 50131, USA
- Fabiani Rocha
  - Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA 50131, USA
- Renaud Rincent
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190 Gif-sur-Yvette, France
- Baber Ali
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190 Gif-sur-Yvette, France
- Haixiao Hu
  - Department of Plant Sciences, University of California Davis, One Shield Drive, Davis, CA 95616, USA
- Daniel E Runcie
  - Department of Plant Sciences, University of California Davis, One Shield Drive, Davis, CA 95616, USA
- Kirill Gusev
  - Smart Agri Labs, 2055 Limestone Rd STE 200-C, Wilmington, DE 19808, USA
- Andrei Slabodkin
  - Smart Agri Labs, 2055 Limestone Rd STE 200-C, Wilmington, DE 19808, USA
- Phillip Bax
  - Smart Agri Labs, 2055 Limestone Rd STE 200-C, Wilmington, DE 19808, USA
- Julie Aubert
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120 Palaiseau, France
- Hugo Gangloff
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120 Palaiseau, France
- Tristan Mary-Huard
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190 Gif-sur-Yvette, France
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120 Palaiseau, France
- Theodore Vanrenterghem
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120 Palaiseau, France
- Carles Quesada-Traver
  - Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, Universitätstrasse 2, CH-8092 Zurich, Switzerland
- Steven Yates
  - Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, Universitätstrasse 2, CH-8092 Zurich, Switzerland
- Daniel Ariza-Suárez
  - Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, Universitätstrasse 2, CH-8092 Zurich, Switzerland
- Argeo Ulrich
  - Puregene AG, Etzmatt 273, CH-4314 Zeiningen, Switzerland
  - Institute of Agricultural Sciences, ETH Zurich, Universitätstrasse 2, CH-8092 Zürich, Switzerland
- Michele Wyler
  - MWSchmid GmbH, Hauptstrasse 34, CH-8750 Glarus, Switzerland
- Daniel R Kick
  - USDA-ARS, MWA-PGRU, 302-A Curtis Hall, University of Missouri, Columbia, MO 65211, USA
- Emily S Bellis
  - Department of Computer Science, Arkansas State University, 2105 E. Aggie Rd, Jonesboro, AR 72401, USA
- Jason L Causey
  - Department of Computer Science, Arkansas State University, 2105 E. Aggie Rd, Jonesboro, AR 72401, USA
- Emilio Soriano Chavez
  - Department of Computer Science, Arkansas State University, 2105 E. Aggie Rd, Jonesboro, AR 72401, USA
- Yixing Wang
  - Department of Computer Science, Arkansas State University, 2105 E. Aggie Rd, Jonesboro, AR 72401, USA
- Ved Piyush
  - Department of Statistics, University of Nebraska-Lincoln, 340 Hardin Hall North Wing, Lincoln, NE 68583, USA
- Gayara D Fernando
  - Department of Statistics, University of Nebraska-Lincoln, 340 Hardin Hall North Wing, Lincoln, NE 68583, USA
- Robert K Hu
  - Genomics and Computational Biology, Perelman School of Medicine at the University of Pennsylvania, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
- Rachit Kumar
  - Genomics and Computational Biology, Perelman School of Medicine at the University of Pennsylvania, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
  - Medical Scientist Training Program, Perelman School of Medicine at the University of Pennsylvania, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA 19104, USA
- Annan J Timon
  - Genomics and Computational Biology, Perelman School of Medicine at the University of Pennsylvania, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
- Rasika Venkatesh
  - Genomics and Computational Biology, Perelman School of Medicine at the University of Pennsylvania, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
- Kenia Segura Abá
  - DOE Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, MI 48824, USA
  - Genetics and Genome Sciences Graduate Program, Michigan State University, East Lansing, MI 48824, USA
- Huan Chen
  - Genetics and Genome Sciences Graduate Program, Michigan State University, East Lansing, MI 48824, USA
- Thilanka Ranaweera
  - DOE Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, MI 48824, USA
  - Department of Plant Biology, Michigan State University, East Lansing, MI 48824, USA
- Shin-Han Shiu
  - DOE Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, MI 48824, USA
  - Department of Plant Biology, Michigan State University, East Lansing, MI 48824, USA
  - Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Peiran Wang
  - NC Plant Science Initiative, North Carolina State University, 840 Oval Drive, Raleigh, NC 27606, USA
  - Department of Electrical and Computer Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC 27606, USA
- Max J Gordon
  - NC Plant Science Initiative, North Carolina State University, 840 Oval Drive, Raleigh, NC 27606, USA
  - Department of Electrical and Computer Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC 27606, USA
- B Kirtley Amos
  - NC Plant Science Initiative, North Carolina State University, 840 Oval Drive, Raleigh, NC 27606, USA
  - Department of Electrical and Computer Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC 27606, USA
- Sebastiano Busato
  - NC Plant Science Initiative, North Carolina State University, 840 Oval Drive, Raleigh, NC 27606, USA
  - Department of Electrical and Computer Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC 27606, USA
- Daniel Perondi
  - NC Plant Science Initiative, North Carolina State University, 840 Oval Drive, Raleigh, NC 27606, USA
  - Department of Electrical and Computer Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC 27606, USA
- Abhishek Gogna
  - Department of Breeding Research, Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung, Corrensstraße 3, Gatersleben 6466, Germany
- Dennis Psaroudakis
  - Department of Molecular Genetics, Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung, Corrensstraße 3, Gatersleben 6466, Germany
- Hawlader A Al-Mamun
  - School of Biological Sciences and Centre of Applied Bioinformatics, University of Western Australia, Perth, WA 6009, Australia
- Monica F Danilevicz
  - School of Biological Sciences and Centre of Applied Bioinformatics, University of Western Australia, Perth, WA 6009, Australia
- Shriprabha R Upadhyaya
  - School of Biological Sciences and Centre of Applied Bioinformatics, University of Western Australia, Perth, WA 6009, Australia
- David Edwards
  - School of Biological Sciences and Centre of Applied Bioinformatics, University of Western Australia, Perth, WA 6009, Australia
- Natalia de Leon
  - Department of Plant and Agroecosystem Sciences, University of Wisconsin-Madison, 1575 Linden Drive, Madison, WI 53706, USA
3
Hu H, Rincent R, Runcie DE. MegaLMM improves genomic predictions in new environments using environmental covariates. Genetics 2025; 229:1-41. [PMID: 39471330 PMCID: PMC11708919 DOI: 10.1093/genetics/iyae171] [Received: 06/07/2024] [Revised: 09/19/2024] [Accepted: 09/25/2024] [Indexed: 11/01/2024] Open
Abstract
Multienvironment trials (METs) are crucial for identifying varieties that perform well across a target population of environments. However, METs are typically too small to sufficiently represent all relevant environment-types, and face challenges from changing environment-types due to climate change. Statistical methods that enable prediction of variety performance for new environments beyond the METs are needed. We recently developed MegaLMM, a statistical model that can leverage hundreds of trials to significantly improve genetic value prediction accuracy within METs. Here, we extend MegaLMM to enable genomic prediction in new environments by learning regressions of latent factor loadings on Environmental Covariates (ECs) across trials. We evaluated the extended MegaLMM using the maize Genome-To-Fields dataset, consisting of 4,402 varieties cultivated in 195 trials with 87.1% of phenotypic values missing, and demonstrated its high accuracy in genomic prediction under various breeding scenarios. Furthermore, we showcased MegaLMM's superiority over univariate GBLUP in predicting trait performance of experimental genotypes in new environments. Finally, we explored the use of higher-dimensional quantitative ECs and discussed when and how detailed environmental data can be leveraged for genomic prediction from METs. We propose that MegaLMM can be applied to plant breeding of diverse crops and different fields of genetics where large-scale linear mixed models are utilized.
Affiliation(s)
- Haixiao Hu
  - Department of Plant Sciences, University of California Davis, Davis, CA 95616, USA
- Renaud Rincent
  - GQE - Le Moulon Université Paris-Saclay, INRAE, CNRS, AgroParisTech, Gif-sur-Yvette 91190, France
- Daniel E Runcie
  - Department of Plant Sciences, University of California Davis, Davis, CA 95616, USA
4
Washburn JD, Varela JI, Xavier A, Chen Q, Ertl D, Gage JL, Holland JB, Lima DC, Romay MC, Lopez-Cruz M, de los Campos G, Barber W, Zimmer C, Silva IT, Rocha F, Rincent R, Ali B, Hu H, Runcie DE, Gusev K, Slabodkin A, Bax P, Aubert J, Gangloff H, Mary-Huard T, Vanrenterghem T, Quesada-Traver C, Yates S, Ariza-Suárez D, Ulrich A, Wyler M, Kick DR, Bellis ES, Causey JL, Chavez ES, Wang Y, Piyush V, Fernando GD, Hu RK, Kumar R, Timon AJ, Venkatesh R, Abá KS, Chen H, Ranaweera T, Shiu SH, Wang P, Gordon MJ, Amos BK, Busato S, Perondi D, Gogna A, Psaroudakis D, Chen CPJ, Al-Mamun HA, Danilevicz MF, Upadhyaya SR, Edwards D, de Leon N. Global Genotype by Environment Prediction Competition Reveals That Diverse Modeling Strategies Can Deliver Satisfactory Maize Yield Estimates. bioRxiv 2024:2024.09.13.612969. [PMID: 39345633 PMCID: PMC11429743 DOI: 10.1101/2024.09.13.612969] [Indexed: 10/01/2024]
Abstract
Predicting phenotypes from a combination of genetic and environmental factors is a grand challenge of modern biology. Slight improvements in this area have the potential to save lives, improve food and fuel security, permit better care of the planet, and create other positive outcomes. In 2022 and 2023, the first open-to-the-public Genomes to Fields (G2F) initiative Genotype by Environment (GxE) prediction competition was held using a large dataset including genomic variation, phenotype and weather measurements, and field management notes gathered by the project over nine years. The competition attracted registrants from around the world with representation from academic, government, industry, and non-profit institutions as well as unaffiliated. These participants came from diverse disciplines, including plant science, animal science, breeding, statistics, computational biology, and others. Some participants had no formal genetics or plant-related training, and some were just beginning their graduate education. The teams applied varied methods and strategies, providing a wealth of modeling knowledge based on a common dataset. The winner's strategy involved two models combining machine learning and traditional breeding tools: one model emphasized environment using features extracted by Random Forest, Ridge Regression, and Least-squares, and one focused on genetics. Other high-performing teams' methods included quantitative genetics, classical machine learning/deep learning, mechanistic models, and model ensembles. The dataset factors used, such as genetics, weather, and management data, were also diverse, demonstrating that no single model or strategy is far superior to all others within the context of this competition.
Affiliation(s)
- Jacob D. Washburn
  - USDA-ARS-MWA-PGRU, 302-A Curtis Hall, U. of MO., Columbia, MO, 65211, USA
- José Ignacio Varela
  - Department of Plant and Agroecosystem Sciences, University of Wisconsin - Madison, 1575 Linden Drive, Madison, WI, 53706, USA
  - Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA, 50131, USA
- Alencar Xavier
  - Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA, 50131, USA
  - Department of Agronomy, Purdue University, 915 Mitch Daniels Blvd, West Lafayette, IN 47907, United States
- Qiuyue Chen
  - Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC, 27695, USA
- David Ertl
  - Iowa Corn Promotion Board, Johnston, IA, 50131, USA
- Joseph L. Gage
  - Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC, 27695, USA
- James B. Holland
  - Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC, 27695, USA
  - USDA-ARS Plant Science Research Unit, Raleigh, NC, 27695, USA
- Dayane Cristina Lima
  - Department of Plant and Agroecosystem Sciences, University of Wisconsin - Madison, 1575 Linden Drive, Madison, WI, 53706, USA
- Maria Cinta Romay
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY, 14853, USA
- Marco Lopez-Cruz
  - Departments of Epidemiology & Biostatistics and Statistics & Probability, and Institute for Quantitative Health Science and Engineering, Michigan State University, 775 Woodlot Dr., East Lansing, MI, 48823, USA
- Gustavo de los Campos
  - Departments of Epidemiology & Biostatistics and Statistics & Probability, and Institute for Quantitative Health Science and Engineering, Michigan State University, 775 Woodlot Dr., East Lansing, MI, 48823, USA
- Wesley Barber
  - Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA, 50131, USA
- Cristiano Zimmer
  - Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA, 50131, USA
- Fabiani Rocha
  - Corteva Agrisciences, 8305 NW 62nd Ave, Johnston, IA, 50131, USA
- Renaud Rincent
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190 Gif-sur-Yvette, France
- Baber Ali
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190 Gif-sur-Yvette, France
- Haixiao Hu
  - Department of Plant Sciences, University of California Davis, One Shield Drive, Davis, CA, 95616, USA
- Daniel E Runcie
  - Department of Plant Sciences, University of California Davis, One Shield Drive, Davis, CA, 95616, USA
- Kirill Gusev
  - Smart Agri Labs, 2055 Limestone Rd STE 200-C, Wilmington, DE, 19808, USA
- Andrei Slabodkin
  - Smart Agri Labs, 2055 Limestone Rd STE 200-C, Wilmington, DE, 19808, USA
- Phillip Bax
  - Smart Agri Labs, 2055 Limestone Rd STE 200-C, Wilmington, DE, 19808, USA
- Julie Aubert
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120, Palaiseau, France
- Hugo Gangloff
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120, Palaiseau, France
- Tristan Mary-Huard
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190 Gif-sur-Yvette, France
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120, Palaiseau, France
- Theodore Vanrenterghem
  - Université Paris-Saclay, AgroParisTech, INRAE, UMR MIA Paris-Saclay, 91120, Palaiseau, France
- Carles Quesada-Traver
  - Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, Universitätstrasse 2, CH-8092 Zurich, Switzerland
- Steven Yates
  - Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, Universitätstrasse 2, CH-8092 Zurich, Switzerland
- Daniel Ariza-Suárez
  - Molecular Plant Breeding, Institute of Agricultural Sciences, ETH Zurich, Universitätstrasse 2, CH-8092 Zurich, Switzerland
- Argeo Ulrich
  - Puregene AG, Etzmatt 273, CH-4314 Zeiningen, Switzerland
  - Institute of Agricultural Sciences, ETH Zurich, Universitätstrasse 2, CH-8092 Zürich, Switzerland
- Michele Wyler
  - MWSchmid GmbH, Hauptstrasse 34, CH-8750 Glarus, Switzerland
- Daniel R. Kick
  - USDA-ARS-MWA-PGRU, 302-A Curtis Hall, U. of MO., Columbia, MO, 65211, USA
- Emily S. Bellis
  - Department of Computer Science, Arkansas State University, 2105 E. Aggie Rd., Jonesboro, AR, 72401, USA
- Jason L. Causey
  - Department of Computer Science, Arkansas State University, 2105 E. Aggie Rd., Jonesboro, AR, 72401, USA
- Emilio Soriano Chavez
  - Department of Computer Science, Arkansas State University, 2105 E. Aggie Rd., Jonesboro, AR, 72401, USA
- Yixing Wang
  - Department of Computer Science, Arkansas State University, 2105 E. Aggie Rd., Jonesboro, AR, 72401, USA
- Ved Piyush
  - Department of Statistics, University of Nebraska - Lincoln, 340 Hardin Hall North Wing, Lincoln, NE, 68583, USA
- Gayara D. Fernando
  - Department of Statistics, University of Nebraska - Lincoln, 340 Hardin Hall North Wing, Lincoln, NE, 68583, USA
- Robert K Hu
  - Genomics and Computational Biology, Perelman School of Medicine at the University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA, 19104, USA
- Rachit Kumar
  - Genomics and Computational Biology, Perelman School of Medicine at the University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA, 19104, USA
  - Medical Scientist Training Program, Perelman School of Medicine at the University of Pennsylvania, 3400 Civic Center Blvd., Philadelphia, PA, 19104, USA
- Annan J. Timon
  - Genomics and Computational Biology, Perelman School of Medicine at the University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA, 19104, USA
- Rasika Venkatesh
  - Genomics and Computational Biology, Perelman School of Medicine at the University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA, 19104, USA
- Kenia Segura Abá
  - DOE Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, MI, 48824, USA
  - Genetics and Genome Sciences Graduate Program, Michigan State University, East Lansing, MI, 48824, USA
- Huan Chen
  - Genetics and Genome Sciences Graduate Program, Michigan State University, East Lansing, MI, 48824, USA
- Thilanka Ranaweera
  - DOE Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Plant Biology, Michigan State University, East Lansing, MI, 48824, USA
- Shin-Han Shiu
  - DOE Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Plant Biology, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI, 48824, USA
- Peiran Wang
  - NC Plant Science Initiative, North Carolina State University, 840 Oval Drive, Raleigh, NC, 27606, USA
  - Department of Electrical and Computer Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC, 27606, USA
- Max J. Gordon
  - NC Plant Science Initiative, North Carolina State University, 840 Oval Drive, Raleigh, NC, 27606, USA
  - Department of Electrical and Computer Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC, 27606, USA
- B K. Amos
  - NC Plant Science Initiative, North Carolina State University, 840 Oval Drive, Raleigh, NC, 27606, USA
  - Department of Electrical and Computer Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC, 27606, USA
- Sebastiano Busato
  - NC Plant Science Initiative, North Carolina State University, 840 Oval Drive, Raleigh, NC, 27606, USA
  - Department of Electrical and Computer Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC, 27606, USA
- Daniel Perondi
  - NC Plant Science Initiative, North Carolina State University, 840 Oval Drive, Raleigh, NC, 27606, USA
  - Department of Electrical and Computer Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC, 27606, USA
- Abhishek Gogna
  - Department of Breeding Research, Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung, Corrensstraße 3, Gatersleben, 6466, Germany
- Dennis Psaroudakis
  - Department of Molecular Genetics, Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung, Corrensstraße 3, Gatersleben, 6466, Germany
- C. P. James Chen
  - School of Animal Sciences, Virginia Tech, Blacksburg, VA, 24061, USA
- Hawlader A. Al-Mamun
  - School of Biological Sciences and Centre of Applied Bioinformatics, University of Western Australia, Perth, WA, Australia
- Monica F. Danilevicz
  - School of Biological Sciences and Centre of Applied Bioinformatics, University of Western Australia, Perth, WA, Australia
- Shriprabha R. Upadhyaya
  - School of Biological Sciences and Centre of Applied Bioinformatics, University of Western Australia, Perth, WA, Australia
- David Edwards
  - School of Biological Sciences and Centre of Applied Bioinformatics, University of Western Australia, Perth, WA, Australia
- Natalia de Leon
  - Department of Plant and Agroecosystem Sciences, University of Wisconsin - Madison, 1575 Linden Drive, Madison, WI, 53706, USA
5
Fernandes IK, Vieira CC, Dias KOG, Fernandes SB. Using machine learning to combine genetic and environmental data for maize grain yield predictions across multi-environment trials. Theoretical and Applied Genetics 2024; 137:189. [PMID: 39044035 PMCID: PMC11266441 DOI: 10.1007/s00122-024-04687-w] [Received: 02/12/2024] [Accepted: 06/29/2024] [Indexed: 07/25/2024]
Abstract
Key message: Incorporating feature-engineered environmental data into machine learning-based genomic prediction models is an efficient approach to indirectly model genotype-by-environment interactions.
Complementing phenotypic traits and molecular markers with high-dimensional data such as climate and soil information is becoming a common practice in breeding programs. This study explored new ways to combine non-genetic information in genomic prediction models using machine learning. Using the multi-environment trial data from the Genomes To Fields initiative, different models to predict maize grain yield were adjusted using various inputs: genetic, environmental, or a combination of both, either in an additive (genetic-and-environmental; G+E) or a multiplicative (genotype-by-environment interaction; GEI) manner. When including environmental data, the mean prediction accuracy of machine learning genomic prediction models increased up to 7% over the well-established Factor Analytic Multiplicative Mixed Model among the three cross-validation scenarios evaluated. Moreover, using the G+E model was more advantageous than the GEI model given the superior, or at least comparable, prediction accuracy, the lower usage of computational memory and time, and the flexibility of accounting for interactions by construction. Our results illustrate the flexibility provided by the ML framework, particularly with feature engineering. We show that the feature engineering stage offers a viable option for envirotyping and generates valuable information for machine learning-based genomic prediction models. Furthermore, we verified that the genotype-by-environment interactions may be considered using tree-based approaches without explicitly including interactions in the model. These findings support the growing interest in merging high-dimensional genotypic and environmental data into predictive modeling.
Affiliation(s)
- Igor K Fernandes
  - Department of Crop, Soil, and Environmental Sciences, Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR, USA
- Caio C Vieira
  - Department of Crop, Soil, and Environmental Sciences, University of Arkansas, Fayetteville, AR, USA
- Kaio O G Dias
  - Department of General Biology, Federal University of Viçosa, Viçosa, Brazil
- Samuel B Fernandes
  - Department of Crop, Soil, and Environmental Sciences, Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR, USA
6
Lopez-Cruz M, Pérez-Rodríguez P, de los Campos G. A fast algorithm to factorize high-dimensional tensor product matrices used in genetic models. G3 (Bethesda) 2024; 14:jkae001. [PMID: 38180089 PMCID: PMC11090460 DOI: 10.1093/g3journal/jkae001] [Received: 10/21/2023] [Revised: 12/26/2023] [Accepted: 12/28/2023] [Indexed: 01/06/2024]
Abstract
Many genetic models (including models for epistatic effects as well as genetic-by-environment) involve covariance structures that are Hadamard products of lower rank matrices. Implementing these models requires factorizing large Hadamard product matrices. The available algorithms for factorization do not scale well for big data, making the use of some of these models not feasible with large sample sizes. Here, based on properties of Hadamard products and (related) Kronecker products, we propose an algorithm that produces an approximate decomposition that is orders of magnitude faster than the standard eigenvalue decomposition. In this article, we describe the algorithm, show how it can be used to factorize large Hadamard product matrices, present benchmarks, and illustrate the use of the method by presenting an analysis of data from the northern testing locations of the G × E project from the Genomes to Fields Initiative (n ∼ 60,000). We implemented the proposed algorithm in the open-source "tensorEVD" R package.
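One well-known link between the two products mentioned above, shown here only as an illustrative sketch: the Hadamard product of two kernels expanded to the observation level equals a principal submatrix of the Kronecker product of the unexpanded kernels. Whether and how tensorEVD uses this identity is not stated in the abstract, and the names `K_g`, `K_e`, `g_idx`, and `e_idx` are hypothetical.

```python
import numpy as np

def hadamard_direct(K_g, K_e, g_idx, e_idx):
    """Hadamard product of the two kernels expanded to the observation level."""
    return K_g[np.ix_(g_idx, g_idx)] * K_e[np.ix_(e_idx, e_idx)]

def hadamard_via_kronecker(K_g, K_e, g_idx, e_idx):
    """Same matrix, read off as a principal submatrix of kron(K_g, K_e)."""
    n_e = K_e.shape[0]
    # The (genotype g, environment e) pair sits at position g * n_e + e.
    rows = np.asarray(g_idx) * n_e + np.asarray(e_idx)
    return np.kron(K_g, K_e)[np.ix_(rows, rows)]

# Toy data (assumed for illustration): 3 genotypes, 2 environments, 5 records.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(2, 2))
K_g = A @ A.T                      # genotype kernel (positive semidefinite)
K_e = B @ B.T                      # environment kernel
g_idx = np.array([0, 1, 2, 0, 2])  # genotype of each record
e_idx = np.array([0, 0, 1, 1, 1])  # environment of each record
```

The Kronecker form is attractive because its eigenvalues are simply the pairwise products of the eigenvalues of `K_g` and `K_e`, so the small factors can be decomposed instead of the large product.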
Affiliation(s)
- Marco Lopez-Cruz
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
- Paulino Pérez-Rodríguez
  - Socioeconomía, Estadística e Informática, Colegio de Postgraduados, Montecillos, Edo. de México 56230, Mexico
- Gustavo de los Campos
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA