1
Yao Z, Zou W, Zhang X, Nie P, Lv H, Wang W, Zhao X, Yang Y, Yang L. Integrating mid-infrared spectroscopy, machine learning, and graphical bias correction for fatty acid prediction in water buffalo milk. JOURNAL OF THE SCIENCE OF FOOD AND AGRICULTURE 2024. [PMID: 38501395 DOI: 10.1002/jsfa.13471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 02/25/2024] [Accepted: 03/19/2024] [Indexed: 03/20/2024]
Abstract
BACKGROUND Buffalo milk, which accounts for 15% of global milk production, has a higher fatty acid content than Holstein milk. Fourier-transform mid-infrared (FT-MIR) spectroscopy is widely used for dairy analysis, but its application to buffalo milk, which has larger fat globules, remains understudied. This study aimed to develop machine learning models based on FT-MIR spectroscopy for predicting fatty acids in buffalo milk and to assess the accuracy of commercial milk analyzers. The research provides a convenient, fast, and environmentally friendly method for determining the fatty acid composition of buffalo milk. RESULTS We employed six machine learning algorithms to build prediction models for 34 fatty acids in buffalo milk. The predictive models demonstrated robust capabilities for high-content fatty acids [C14:0, C15:0, C16:0, C17:0, C18:0, C18:1, saturated fatty acid (SFA), monounsaturated fatty acid (MUFA)], with errors within a 15% range. Traditional FT6000 detection methods exhibited limitations in measuring SFAs and polyunsaturated fatty acids (PUFA). Implementing a mean difference correction of 0.21 for MUFAs and applying regression equations (SFA × 1.0639 + 0.0705; PUFA × 0.5472 + 0.0047) significantly improved measurement accuracy. CONCLUSION This study successfully developed a predictive model for fatty acids in Mediterranean buffalo milk based on FT-MIR spectroscopy. Additionally, a correction was applied to the existing measurement device, FT6000, enabling more accurate measurements of fatty acids in buffalo milk. The findings have practical implications for the food industry, offering a faster and more reliable approach to assessing and monitoring fatty acid composition in buffalo milk, potentially influencing product development and quality control processes. © 2024 Society of Chemical Industry.
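The corrections quoted in the abstract are simple enough to sketch in a few lines. The snippet below is only an illustration: the coefficients are the ones given above, but the function names, the measurement units, and the direction in which the MUFA mean difference is applied are assumptions, since the abstract does not state them.

```python
# Coefficients quoted in the abstract. Units are not stated there, so the
# functions simply assume readings on the same scale the authors used.
SFA_SLOPE, SFA_INTERCEPT = 1.0639, 0.0705
PUFA_SLOPE, PUFA_INTERCEPT = 0.5472, 0.0047
MUFA_MEAN_DIFFERENCE = 0.21


def correct_sfa(reading: float) -> float:
    """Apply the regression correction SFA x 1.0639 + 0.0705."""
    return reading * SFA_SLOPE + SFA_INTERCEPT


def correct_pufa(reading: float) -> float:
    """Apply the regression correction PUFA x 0.5472 + 0.0047."""
    return reading * PUFA_SLOPE + PUFA_INTERCEPT


def correct_mufa(reading: float, sign: int = 1) -> float:
    """Shift a MUFA reading by the mean difference of 0.21.

    ASSUMPTION: the abstract does not say whether the difference is added
    or subtracted, so the direction is left to the caller (sign = +1 or -1).
    """
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    return reading + sign * MUFA_MEAN_DIFFERENCE
```

Keeping the MUFA direction as an explicit argument avoids silently baking in a sign convention that the source does not confirm.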
Affiliation(s)
- Zhiqiu Yao
- International Joint Research Center for Animal Genetics, Breeding and Reproduction (IJRCAGBR), Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Education, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
- Wenna Zou
- International Joint Research Center for Animal Genetics, Breeding and Reproduction (IJRCAGBR), Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Education, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
- Xinxin Zhang
- International Joint Research Center for Animal Genetics, Breeding and Reproduction (IJRCAGBR), Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Education, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
- Pei Nie
- International Joint Research Center for Animal Genetics, Breeding and Reproduction (IJRCAGBR), Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Education, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
- College of Veterinary Medicine, Hunan Agricultural University, Changsha, China
- Haimiao Lv
- International Joint Research Center for Animal Genetics, Breeding and Reproduction (IJRCAGBR), Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Education, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
- Wei Wang
- International Joint Research Center for Animal Genetics, Breeding and Reproduction (IJRCAGBR), Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Education, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
- Xuhong Zhao
- International Joint Research Center for Animal Genetics, Breeding and Reproduction (IJRCAGBR), Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Education, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
- Ying Yang
- International Joint Research Center for Animal Genetics, Breeding and Reproduction (IJRCAGBR), Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Education, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
- Liguo Yang
- International Joint Research Center for Animal Genetics, Breeding and Reproduction (IJRCAGBR), Huazhong Agricultural University, Wuhan, China
- Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Education, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, China
2
Shen J, Li H, Yu X, Bai L, Dong Y, Cao J, Lu K, Tang Z. Efficient feature extraction from highly sparse binary genotype data for cancer prognosis prediction using an auto-encoder. Front Oncol 2023; 12:1091767. [PMID: 36703783 PMCID: PMC9872139 DOI: 10.3389/fonc.2022.1091767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 12/19/2022] [Indexed: 01/11/2023] Open
Abstract
Genomics involving tens of thousands of genes is a complex system determining phenotype. An interesting and vital issue is how to integrate highly sparse genomic data with many minor effects into a prediction model to improve prediction power. We find that deep learning can work well to extract features by transforming highly sparse dichotomous data into lower-dimensional continuous data in a non-linear way. This may benefit risk prediction based on genotype data. We developed a multi-stage strategy to extract information from highly sparse binary genotype data and applied it to cancer prognosis. Specifically, we first reduced the number of binary biomarkers to a moderate size via a univariable regression model. Then, a trainable auto-encoder was used to learn compact features from the reduced data. Next, we applied LASSO to select the optimal combination of extracted features. Lastly, we applied this feature combination to real cancer prognostic models and evaluated the raw predictive effect of the models. The results indicated that these compressed transformation features could improve on the models' original predictive performance and might avoid an overfitting problem. This idea may be enlightening for everyone involved in cancer research, risk reduction, treatment, and patient care via integrating genomics data.
Affiliation(s)
- Junjie Shen
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
- Huijun Li
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
- Xinghao Yu
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
- Center for Genetic Epidemiology and Genomics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Lu Bai
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
- Yongfei Dong
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
- Jianping Cao
- School of Radiation Medicine and Protection and Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, Soochow University, Suzhou, China
- Ke Lu
- Department of Orthopedics, Affiliated Kunshan Hospital of Jiangsu University, Suzhou, China
- Correspondence: Zaixiang Tang; Ke Lu
- Zaixiang Tang
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
- Correspondence: Zaixiang Tang; Ke Lu
3
Sandhu KS, Shiv A, Kaur G, Meena MR, Raja AK, Vengavasi K, Mall AK, Kumar S, Singh PK, Singh J, Hemaprabha G, Pathak AD, Krishnappa G, Kumar S. Integrated Approach in Genomic Selection to Accelerate Genetic Gain in Sugarcane. PLANTS 2022; 11:plants11162139. [PMID: 36015442 PMCID: PMC9412483 DOI: 10.3390/plants11162139] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/10/2022] [Revised: 08/08/2022] [Accepted: 08/08/2022] [Indexed: 11/30/2022]
Abstract
Marker-assisted selection (MAS) has been widely used in the last few decades in plant breeding programs for the mapping and introgression of genes for economically important traits, which has enabled the development of a number of superior cultivars in different crops. In sugarcane, which is the most important source of sugar and bioethanol, marker development work was initiated long ago; however, marker-assisted breeding in sugarcane has been lagging, mainly due to its large complex genome, high levels of polyploidy and heterozygosity, varied number of chromosomes, and use of low/medium-density markers. Genomic selection (GS) is a proven technology in animal breeding and has recently been incorporated in plant breeding programs. GS is a potential tool for rapidly selecting superior genotypes and accelerating the breeding cycle. However, its full potential could be realized by an integrated approach combining high-throughput phenotyping, genotyping, machine learning, and speed breeding with genomic selection. For a better understanding of GS integration, we comprehensively discuss the concept of genetic gain through the breeder’s equation, GS methodology, prediction models, current status of GS in sugarcane, challenges of prediction accuracy, challenges of GS in sugarcane, integrated GS, high-throughput phenotyping (HTP), high-throughput genotyping (HTG), machine learning, and speed breeding, followed by its prospective applications in sugarcane improvement.
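The breeder's equation mentioned in the abstract is commonly written as ΔG = i · r · σ_A / L, where i is the selection intensity, r the selection accuracy, σ_A the additive genetic standard deviation, and L the length of the breeding cycle. A minimal sketch of that textbook form follows; the function name and the numbers in the usage lines are hypothetical and are not taken from the review.

```python
def genetic_gain_per_year(intensity: float, accuracy: float,
                          sigma_a: float, cycle_years: float) -> float:
    """Expected genetic gain per year under the textbook breeder's equation.

    intensity   -- selection intensity (i), in standard deviation units
    accuracy    -- correlation between predicted and true breeding value (r)
    sigma_a     -- additive genetic standard deviation of the trait
    cycle_years -- breeding cycle length in years (L)
    """
    if cycle_years <= 0:
        raise ValueError("cycle_years must be positive")
    return intensity * accuracy * sigma_a / cycle_years


# Hypothetical illustration: halving the cycle length doubles the expected
# gain per year when everything else is held constant.
slow = genetic_gain_per_year(1.755, 0.4, 2.0, cycle_years=10)
fast = genetic_gain_per_year(1.755, 0.4, 2.0, cycle_years=5)
```

The cycle length sits in the denominator, which is why the abstract pairs genomic selection with speed breeding: both shorten L.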
Affiliation(s)
- Karansher Singh Sandhu
- Department of Crop and Soil Sciences, Washington State University, Pullman, WA 99163, USA
- Aalok Shiv
- Division of Crop Improvement, ICAR-Indian Institute of Sugarcane Research, Lucknow 226002, India
- Gurleen Kaur
- Horticultural Sciences Department, University of Florida, Gainesville, FL 32611, USA
- Mintu Ram Meena
- Regional Center, ICAR-Sugarcane Breeding Institute, Karnal 132001, India
- Arun Kumar Raja
- Division of Crop Production, ICAR-Sugarcane Breeding Institute, Coimbatore 641007, India
- Krishnapriya Vengavasi
- Division of Crop Production, ICAR-Sugarcane Breeding Institute, Coimbatore 641007, India
- Ashutosh Kumar Mall
- Division of Crop Improvement, ICAR-Indian Institute of Sugarcane Research, Lucknow 226002, India
- Sanjeev Kumar
- Division of Crop Improvement, ICAR-Indian Institute of Sugarcane Research, Lucknow 226002, India
- Praveen Kumar Singh
- Division of Crop Improvement, ICAR-Indian Institute of Sugarcane Research, Lucknow 226002, India
- Jyotsnendra Singh
- Division of Crop Improvement, ICAR-Indian Institute of Sugarcane Research, Lucknow 226002, India
- Govind Hemaprabha
- Division of Crop Improvement, ICAR-Sugarcane Breeding Institute, Coimbatore 641007, India
- Ashwini Dutt Pathak
- Division of Crop Improvement, ICAR-Indian Institute of Sugarcane Research, Lucknow 226002, India
- Gopalareddy Krishnappa
- Division of Crop Improvement, ICAR-Sugarcane Breeding Institute, Coimbatore 641007, India
- Correspondence: (G.K.); (S.K.)
- Sanjeev Kumar
- Division of Crop Improvement, ICAR-Indian Institute of Sugarcane Research, Lucknow 226002, India
- Correspondence: (G.K.); (S.K.)
4
Islam MS, McCord PH, Olatoye MO, Qin L, Sood S, Lipka AE, Todd JR. Experimental evaluation of genomic selection prediction for rust resistance in sugarcane. THE PLANT GENOME 2021; 14:e20148. [PMID: 34510803 DOI: 10.1002/tpg2.20148] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/20/2021] [Accepted: 07/22/2021] [Indexed: 06/13/2023]
Abstract
Total sugarcane (Saccharum L.) production has increased worldwide; however, the rate of growth is lower than that of other major crops, mainly due to a plateauing of genetic gain. Genomic selection (GS) has been shown to substantially increase the rate of genetic gain in many crops. To investigate the utility of GS in future sugarcane breeding, a field trial was conducted with 432 sugarcane clones in an augmented design with two replications. Two major diseases of sugarcane, brown and orange rust (BR and OR), were screened artificially using the whorl inoculation method in the field over two crop cycles. The genotypic data were generated through target enrichment sequencing technologies. After filtering, a set of 8,825 single nucleotide polymorphic markers was used to assess the prediction accuracy of multiple GS models. Using fivefold cross-validation, we observed GS prediction accuracies for BR and OR that ranged from 0.28 to 0.43 and 0.13 to 0.29, respectively, across two crop cycles and combined cycles. The prediction ability further improved when a known major gene for resistance to BR was included as a fixed effect in the GS model. It also substantially reduced the minimum number of markers and training population size required for GS. The nonparametric GS models outperformed the parametric GS models, suggesting that nonadditive genetic effects could contribute to the genetic basis of BR and OR resistance. This study demonstrated that GS could potentially predict the genomic estimated breeding value for selecting the desired germplasm for sugarcane breeding for disease resistance.
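The cross-validated accuracies reported above can be reproduced in outline as follows. This is a sketch under stated assumptions: prediction accuracy is taken to be the Pearson correlation between predicted and observed phenotypes in each held-out fold (a common convention that the abstract does not spell out), and `fit_single_marker` is a hypothetical stand-in for the GS models actually used in the study.

```python
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_accuracy(features, phenotypes, fit, k=5, seed=0):
    """Mean held-out correlation over k folds.

    `fit(train_x, train_y)` must return a callable that predicts one value.
    """
    accuracies = []
    for held_out in kfold_indices(len(phenotypes), k, seed):
        held = set(held_out)
        train = [i for i in range(len(phenotypes)) if i not in held]
        predict = fit([features[i] for i in train],
                      [phenotypes[i] for i in train])
        predicted = [predict(features[i]) for i in held_out]
        observed = [phenotypes[i] for i in held_out]
        accuracies.append(pearson(predicted, observed))
    return mean(accuracies)


def fit_single_marker(xs, ys):
    """Hypothetical toy model: least-squares line on one numeric marker."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)
```

Any model can be swapped in through the `fit` argument, so the same loop serves for comparing parametric and nonparametric predictors on identical folds.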
Affiliation(s)
- Md S Islam
- Sugarcane Production Research Unit, USDA-ARS, Canal Point, FL, USA
- Per H McCord
- Sugarcane Production Research Unit, USDA-ARS, Canal Point, FL, USA
- Current address: Irrigated Agriculture Research and Extension Center, WA State Univ., Prosser, WA, USA
- Marcus O Olatoye
- Dep. of Crop Sciences, Univ. of Illinois, Urbana-Champaign, IL, USA
- Lifang Qin
- Sugarcane Production Research Unit, USDA-ARS, Canal Point, FL, USA
- Current address: Guangxi Univ., Nanning, Guangxi, China
- Sushma Sood
- Sugarcane Production Research Unit, USDA-ARS, Canal Point, FL, USA
5
Selection indexes using principal component analysis for reproductive, beef and milk traits in Simmental cattle. Trop Anim Health Prod 2021; 53:378. [PMID: 34185177 DOI: 10.1007/s11250-021-02815-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/15/2021] [Accepted: 06/18/2021] [Indexed: 10/21/2022]
Abstract
Selection indexes in dual-purpose cattle should include beef, milk and reproductive traits. The principal component analysis is a multivariate technique that allows researchers to explore relationships between explanatory variables and traits of interest. The objective of this study was to construct selection indexes for tropical dual-purpose Simmental cattle based on principal components. The evaluated traits were weight at 8 months of age; age at first calving; cumulative first-lactation milk yield at 60, 150, 210 and 305 days; and first calving interval. The selection indexes were estimated as the sum of the products of the estimated breeding values for the seven traits times their respective eigenvectors for the first three principal components. The three selection indexes from principal components analysis generated favourable expected genetic progress for all the traits. However, a selection index with a high expected genetic progress for all traits could not be obtained. The principal component analysis allows breeders to have a selection index that simultaneously improves milk, beef and reproductive traits in dual-purpose Simmental cattle. Because a selection index yielding high expected genetic progress for all traits could not be achieved, the decision to use a specific selection index will depend on the specific conditions of the market, the local needs and the farmer preference.
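The index construction described in the abstract, the sum over traits of each estimated breeding value (EBV) multiplied by the corresponding eigenvector element, can be sketched as below. The function name and every number in the usage lines are hypothetical, and whether the EBVs are standardized beforehand is an assumption left to the caller because the abstract does not say.

```python
def selection_index(ebvs, eigenvector):
    """Index for one animal: sum of EBV x eigenvector loading over traits."""
    if len(ebvs) != len(eigenvector):
        raise ValueError("one loading is needed per trait")
    return sum(ebv * loading for ebv, loading in zip(ebvs, eigenvector))


# Hypothetical EBVs for the seven traits named in the abstract, in the order:
# weight at 8 months, age at first calving, milk yield at 60, 150, 210 and
# 305 days, first calving interval.
animal_ebvs = [4.2, -10.0, 35.0, 80.0, 110.0, 150.0, -5.0]

# Hypothetical loadings for the first three principal components.
eigenvectors = [
    [0.10, -0.05, 0.40, 0.45, 0.47, 0.50, -0.08],
    [0.60, -0.30, -0.10, -0.05, 0.02, 0.05, -0.40],
    [0.20, 0.70, 0.05, 0.02, -0.01, -0.03, 0.65],
]

# One index per principal component, as in the three indexes of the study.
indexes = [selection_index(animal_ebvs, v) for v in eigenvectors]
```

Each principal component yields its own index, which matches the abstract's point that no single index gave high expected progress for all traits at once.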
6
Aono AH, Costa EA, Rody HVS, Nagai JS, Pimenta RJG, Mancini MC, Dos Santos FRC, Pinto LR, Landell MGDA, de Souza AP, Kuroshu RM. Machine learning approaches reveal genomic regions associated with sugarcane brown rust resistance. Sci Rep 2020; 10:20057. [PMID: 33208862 PMCID: PMC7676261 DOI: 10.1038/s41598-020-77063-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/18/2020] [Accepted: 08/24/2020] [Indexed: 12/18/2022] Open
Abstract
Sugarcane is an economically important crop, but its genomic complexity has hindered advances in molecular approaches for genetic breeding. New cultivars are released based on the identification of interesting traits, and for sugarcane, brown rust resistance is a desirable characteristic due to the large economic impact of the disease. Although marker-assisted selection for rust resistance has been successful, the genes involved are still unknown, and the associated regions vary among cultivars, thus restricting methodological generalization. We used genotyping by sequencing of full-sib progeny to relate genomic regions with brown rust phenotypes. We established a pipeline to identify reliable SNPs in complex polyploid data, which were used for phenotypic prediction via machine learning. We identified 14,540 SNPs, which led to a mean prediction accuracy of 50% when using different models. We also tested feature selection algorithms to increase predictive accuracy, resulting in a reduced dataset with more explanatory power for rust phenotypes. As a result of this approach, we achieved an accuracy of up to 95% with a dataset of 131 SNPs related to brown rust QTL regions and auxiliary genes. Therefore, our novel strategy has the potential to assist studies of the genomic organization of brown rust resistance in sugarcane.
Affiliation(s)
- Alexandre Hild Aono
- Molecular Biology and Genetic Engineering Center (CBMEG), University of Campinas (UNICAMP), Campinas, SP, Brazil
- Estela Araujo Costa
- Instituto de Ciência e Tecnologia (ICT), Universidade Federal de São Paulo (UNIFESP), São José dos Campos, SP, Brazil
- Hugo Vianna Silva Rody
- Instituto de Ciência e Tecnologia (ICT), Universidade Federal de São Paulo (UNIFESP), São José dos Campos, SP, Brazil
- James Shiniti Nagai
- Instituto de Ciência e Tecnologia (ICT), Universidade Federal de São Paulo (UNIFESP), São José dos Campos, SP, Brazil
- Ricardo José Gonzaga Pimenta
- Molecular Biology and Genetic Engineering Center (CBMEG), University of Campinas (UNICAMP), Campinas, SP, Brazil
- Melina Cristina Mancini
- Molecular Biology and Genetic Engineering Center (CBMEG), University of Campinas (UNICAMP), Campinas, SP, Brazil
- Luciana Rossini Pinto
- Advanced Center of Sugarcane Agrobusiness Technological Research, Agronomic Institute of Campinas (IAC), Ribeirão Preto, SP, Brazil
- Anete Pereira de Souza
- Molecular Biology and Genetic Engineering Center (CBMEG), University of Campinas (UNICAMP), Campinas, SP, Brazil.
- Department of Plant Biology, Institute of Biology (IB), University of Campinas (UNICAMP), Campinas, SP, Brazil.
- Reginaldo Massanobu Kuroshu
- Instituto de Ciência e Tecnologia (ICT), Universidade Federal de São Paulo (UNIFESP), São José dos Campos, SP, Brazil.
7
Mohino-Herranz I, Gil-Pita R, García-Gómez J, Rosa-Zurera M, Seoane F. A Wrapper Feature Selection Algorithm: An Emotional Assessment Using Physiological Recordings from Wearable Sensors. SENSORS 2020; 20:s20010309. [PMID: 31935893 PMCID: PMC6983098 DOI: 10.3390/s20010309] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/28/2019] [Revised: 12/29/2019] [Accepted: 01/03/2020] [Indexed: 11/16/2022]
Abstract
Assessing emotional state is an emerging application field that is boosting research on the analysis of non-invasive biosignals to find effective markers for accurately determining emotional state in real time. Wearable sensors can now record electrocardiogram and thoracic impedance measurements, which makes it possible to analyze cardiac and respiratory functions directly and autonomic nervous system function indirectly. Such analysis allows distinguishing between different emotional states: neutral, sadness, and disgust. This work focused on the proposal of a k-fold approach for selecting features while training the classifier that reduces the loss of generalization. The performance of the proposed algorithm used as the selection criterion was compared to the commonly used standard error function. The proposed k-fold approach outperforms the conventional method with a 4% improvement in hit success rate, reaching an accuracy close to 78%. Moreover, the proposed selection criterion method allows the classifier to produce the best performance using a lower number of features at lower computational cost. A reduced number of features reduces the risk of overfitting, while a lower computational cost contributes to implementing real-time systems using wearable electronics.
Affiliation(s)
- Inma Mohino-Herranz
- Department of Signal Theory and Communications, University of Alcalá, Alcalá de Henares, 28805 Madrid, Spain
- Corresponding author
- Roberto Gil-Pita
- Department of Signal Theory and Communications, University of Alcalá, Alcalá de Henares, 28805 Madrid, Spain
- Joaquín García-Gómez
- Department of Signal Theory and Communications, University of Alcalá, Alcalá de Henares, 28805 Madrid, Spain
- Manuel Rosa-Zurera
- Department of Signal Theory and Communications, University of Alcalá, Alcalá de Henares, 28805 Madrid, Spain
- Fernando Seoane
- Institute for Clinical Science, Intervention and Technology, Karolinska Institutet, 17177 Solna Stockholm, Sweden
- Department of Medical Care Technology, Karolinska University Hospital, 14157 Huddinge, Sweden
- Textile Materials Technology, Department of Textile Technology, Faculty of Textiles, Engineering and Business Swedish School of Textiles, University of Boras, 50190 Boras, Sweden
8
Monteverde E, Gutierrez L, Blanco P, Pérez de Vida F, Rosas JE, Bonnecarrère V, Quero G, McCouch S. Integrating Molecular Markers and Environmental Covariates To Interpret Genotype by Environment Interaction in Rice (Oryza sativa L.) Grown in Subtropical Areas. G3 (BETHESDA, MD.) 2019; 9:1519-1531. [PMID: 30877079 PMCID: PMC6505146 DOI: 10.1534/g3.119.400064] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 02/06/2019] [Accepted: 03/05/2019] [Indexed: 01/11/2023]
Abstract
Understanding the genetic and environmental basis of genotype × environment interaction (G×E) is of fundamental importance in plant breeding. If we consider G×E in the context of genotype × year interactions (G×Y), predicting which lines will have stable and superior performance across years is an important challenge for breeders. A better understanding of the factors that contribute to the overall grain yield and quality of rice (Oryza sativa L.) will lay the foundation for developing new breeding and selection strategies for combining high quality with high yield. In this study, we used molecular marker data and environmental covariates (EC) simultaneously to predict rice yield, milling quality traits and plant height in untested environments (years), using both reaction norm models and partial least squares (PLS), in two rice breeding populations (indica and tropical japonica). We also sought to explain G×E by differential quantitative trait loci (QTL) expression in relation to EC. Our results showed that PLS models trained with both molecular markers and EC gave better prediction accuracies than reaction norm models when predicting future years. We also detected milling quality QTL that showed differential expression conditional on humidity and solar radiation, providing insight into the main environmental factors affecting milling quality in subtropical and temperate rice growing areas.
Affiliation(s)
- Eliana Monteverde
- Plant Breeding and Genetics Section, School of Integrative Plant Science, Cornell University, Ithaca NY 14853
- Lucía Gutierrez
- Department of Agronomy, University of Wisconsin - Madison WI 53706
- Pedro Blanco
- Programa Nacional de Investigación en arroz, Instituto Nacional de Investigación Agropecuaria (INIA), INIA Treinta y Tres 33000, Uruguay
- Fernando Pérez de Vida
- Programa Nacional de Investigación en arroz, Instituto Nacional de Investigación Agropecuaria (INIA), INIA Treinta y Tres 33000, Uruguay
- Juan E Rosas
- Programa Nacional de Investigación en arroz, Instituto Nacional de Investigación Agropecuaria (INIA), INIA Treinta y Tres 33000, Uruguay
- Unidad de Biotecnología, Instituto Nacional de Investigación Agropecuaria (INIA), Estación Experimental Wilson Ferreira Aldunate 90200, Uruguay
- Victoria Bonnecarrère
- Unidad de Biotecnología, Instituto Nacional de Investigación Agropecuaria (INIA), Estación Experimental Wilson Ferreira Aldunate 90200, Uruguay
- Gastón Quero
- Department of Plant Biology, College of Agriculture, Universidad de la República, Montevideo, Uruguay
- Susan McCouch
- Plant Breeding and Genetics Section, School of Integrative Plant Science, Cornell University, Ithaca NY 14853
9
Sun S, Miao Z, Ratcliffe B, Campbell P, Pasch B, El-Kassaby YA, Balasundaram B, Chen C. SNP variable selection by generalized graph domination. PLoS One 2019; 14:e0203242. [PMID: 30677030 PMCID: PMC6345469 DOI: 10.1371/journal.pone.0203242] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/15/2018] [Accepted: 01/08/2019] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p≫n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models. METHODS AND FINDINGS K-dominating set, a parameterized graph-theoretic generalization model, was used to model SNP (single nucleotide polymorphism) data as a similarity network and to search for representative SNP variables. In particular, each SNP was represented as a vertex in the graph, and (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; a pair of vertices is adjacent, i.e. joined by an edge, if the pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then defined as the smallest subset such that every SNP excluded from the subset has at least k neighbors among the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated on a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples.
A C++ source code that implements SNP-SELECT and uses Gurobi optimization solver for the k-dominating set variable selection is available (https://github.com/transgenomicsosu/SNP-SELECT).
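The graph construction and the k-domination condition described in the abstract can be illustrated with a short sketch. Note what is swapped in: the authors' SNP-SELECT tool is C++ and relies on the Gurobi solver, whereas the code below is a simple greedy heuristic that returns a valid k-dominating set but not necessarily a minimum one. All function names are hypothetical.

```python
def build_graph(similarity, threshold):
    """Adjacency sets: SNPs i and j are joined if similarity[i][j] > threshold."""
    n = len(similarity)
    adj = {v: set() for v in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def is_k_dominating(adj, selected, k):
    """True if every unselected vertex has at least k selected neighbors."""
    return all(len(adj[v] & selected) >= k for v in adj if v not in selected)


def greedy_k_dominating_set(adj, k):
    """Greedy heuristic (not the exact solver-based method of the paper)."""
    selected = set()

    def need(v):
        # Number of additional selected neighbors v still requires.
        return max(0, k - len(adj[v] & selected))

    def score(v):
        # Selecting v removes its own need and helps each unsatisfied neighbor.
        return need(v) + sum(1 for u in adj[v]
                             if u not in selected and need(u) > 0)

    while not is_k_dominating(adj, selected, k):
        candidates = [v for v in adj if v not in selected]
        selected.add(max(candidates, key=score))
    return selected
```

A vertex with fewer than k neighbors can never be dominated, so the heuristic ends up selecting it; this mirrors the abstract's remark that the method retains independent variables while culling highly correlated ones.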
Affiliation(s)
- Shuzhen Sun
- Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, United States of America
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, Vancouver, B.C. Canada
- Zhuqi Miao
- Center for Health Systems Innovation, Oklahoma State University, Stillwater, United States of America
- Blaise Ratcliffe
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, Vancouver, B.C. Canada
- Polly Campbell
- Department of Integrative Biology, Oklahoma State University, Stillwater, United States of America
- Department of Evolution, Ecology and Organismal Biology, University of California, Riverside, Riverside, United States of America
- Bret Pasch
- Department of Biological Sciences, Northern Arizona University, Flagstaff, United States of America
- Yousry A. El-Kassaby
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, Vancouver, B.C. Canada
- Balabhaskar Balasundaram
- School of Industrial Engineering and Management, Oklahoma State University, Stillwater, United States of America
- Charles Chen
- Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, United States of America
- Corresponding author
10
Genomic Prediction of Growth and Stem Quality Traits in Eucalyptus globulus Labill. at Its Southernmost Distribution Limit in Chile. FORESTS 2018. [DOI: 10.3390/f9120779] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 11/16/2022]
Abstract
The present study was undertaken to examine the ability of different genomic selection (GS) models to predict growth traits (diameter at breast height, tree height and wood volume), stem straightness and branching quality of Eucalyptus globulus Labill. trees using a genome-wide Single Nucleotide Polymorphism (SNP) chip (60 K), in one of the southernmost progeny trials of the species, close to its southern distribution limit in Chile. The GS methods examined were Ridge Regression-BLUP (RRBLUP), Bayes-A, Bayes-B, Bayesian least absolute shrinkage and selection operator (BLASSO), principal component regression (PCR), supervised PCR and a variant of the RRBLUP method that involves the previous selection of predictor variables (RRBLUP-B). RRBLUP-B and supervised PCR models presented the greatest predictive ability (PA), followed by the PCR method, for most of the traits studied. The highest PA was obtained for branching quality (~0.7). For the growth traits, the maximum values of PA varied from 0.43 to 0.54, while for stem straightness, the maximum value of PA reached 0.62 (supervised PCR). The study population presented a more extended linkage disequilibrium (LD) than other populations of E. globulus previously studied. The genome-wide LD decayed rapidly within 0.76 Mbp (threshold value of r² = 0.1). The average LD on all chromosomes was r² = 0.09. In addition, 0.15% of all pairs of linked SNPs were in complete LD (r² = 1), and 3% had an r² value >0.5. Genomic prediction based on dimensionality reduction and variable selection may be a promising method, considering the early growth of the trees and the low-to-moderate values of heritability found in the traits evaluated. These findings provide new understanding of how to develop novel breeding strategies for tree improvement of E. globulus at its southernmost range limit in Chile, which could represent new opportunities for forest planting that can benefit the local economy.
|
11
|
Hosseini-Vardanjani SM, Shariati MM, Moradi Shahrebabak H, Tahmoorespur M. Incorporating Prior Knowledge of Principal Components in Genomic Prediction. Front Genet 2018; 9:289. [PMID: 30116258 PMCID: PMC6082966 DOI: 10.3389/fgene.2018.00289] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/15/2017] [Accepted: 07/11/2018] [Indexed: 12/05/2022]
Abstract
Genomic prediction using a large number of markers is challenging, due to the curse of dimensionality as well as multicollinearity arising from linkage disequilibrium between markers. Several methods have been proposed to solve these problems, such as principal component analysis (PCA), which is commonly used to reduce the dimension of predictor variables by generating orthogonal variables. Usually, the knowledge from PCA is incorporated in genomic prediction by assuming equal variance for the PCs or a variance proportional to the eigenvalues, both of which treat the variances as fixed. Here, three prior distributions including normal, scaled-t and double exponential were assumed for PC effects in a Bayesian framework with a subset of PCs. These developed PCR models (dPCRm) were compared to routine genomic prediction models (RGPM), i.e., ridge and Bayesian ridge regression, BayesA, BayesB, and PC regression with a subset of PCs but PC variances predefined as proportional to the eigenvalues (PCR-Eigen). The performance of the methods was compared by simulating a single trait with heritability of 0.25 on a genome consisting of 3,000 SNPs on three chromosomes and QTL numbers of 15, 60, and 105. After 500 generations of random mating as the historical population, a population was isolated and mated for another 15 generations. Generations 8 and 9 of the recent population were used as the reference population and the next six generations as validation populations. The accuracy and bias of predictions were evaluated within the reference population and each of the validation populations. The accuracies of dPCRm were similar to RGPM (0.536 to 0.664 vs. 0.542 to 0.671), and higher than the accuracies of PCR-Eigen (0.504 to 0.641) within the reference population over different QTL numbers. Accuracies in the validation populations declined from 0.633 to 0.310, 0.639 to 0.313, and 0.617 to 0.298 using dPCRm, RGPM and PCR-Eigen, respectively. 
Prediction biases of dPCRm and RGPM were similar and always much less than biases of PCR-Eigen. In conclusion, assuming PC variances as random variables via prior specification yielded higher accuracy than PCR-Eigen and the same accuracy as RGPM, while using fewer predictors.
Affiliation(s)
- Mohammad M. Shariati
- Department of Animal Science, Ferdowsi University of Mashhad, Mashhad, Iran
- *Correspondence: Mohammad M. Shariati
- Hossein Moradi Shahrebabak
- Department of Animal Science, University College of Agriculture and Natural Resources, University of Tehran, Tehran, Iran
|
12
|
Li B, Zhang N, Wang YG, George AW, Reverter A, Li Y. Genomic Prediction of Breeding Values Using a Subset of SNPs Identified by Three Machine Learning Methods. Front Genet 2018; 9:237. [PMID: 30023001 PMCID: PMC6039760 DOI: 10.3389/fgene.2018.00237] [Citation(s) in RCA: 79] [Impact Index Per Article: 13.2] [Received: 03/25/2018] [Accepted: 06/14/2018] [Indexed: 12/22/2022]
Abstract
The analysis of large genomic data is hampered by issues such as a small number of observations and a large number of predictive variables (commonly known as “large P small N”), high dimensionality or highly correlated data structures. Machine learning methods are renowned for dealing with these problems. To date machine learning methods have been applied in Genome-Wide Association Studies for identification of candidate genes, epistasis detection, gene network pathway analyses and genomic prediction of phenotypic values. However, the utility of two machine learning methods, Gradient Boosting Machine (GBM) and Extreme Gradient Boosting Method (XgBoost), in identifying a subset of SNP markers for genomic prediction of breeding values has never been explored before. In this study, using 38,082 SNP markers and body weight phenotypes from 2,093 Brahman cattle (1,097 bulls as a discovery population and 996 cows as a validation population), we examined the efficiency of three machine learning methods, namely Random Forests (RF), GBM and XgBoost, in (a) the identification of top 400, 1,000, and 3,000 ranked SNPs; (b) using the subsets of SNPs to construct genomic relationship matrices (GRMs) for the estimation of genomic breeding values (GEBVs). For comparison purposes, we also calculated the GEBVs from (1) 400, 1,000, and 3,000 SNPs that were randomly selected and evenly spaced across the genome, and (2) from all the SNPs. We found that RF and especially GBM are efficient methods in identifying a subset of SNPs with direct links to candidate genes affecting the growth trait. In comparison to the estimate of prediction accuracy of GEBVs from using all SNPs (0.43), the 3,000 top SNPs identified by RF (0.42) and GBM (0.46) had similar values to those of the whole SNP panel. The performance of the subsets of SNPs from RF and GBM was substantially better than that of evenly spaced subsets across the genome (0.18–0.29). 
Of the three methods, RF and GBM consistently outperformed the XgBoost in genomic prediction accuracy.
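Step (b) of this abstract, building a genomic relationship matrix from a ranked SNP subset, can be sketched as follows. The abstract does not say which GRM formula was used; the code assumes the widely used VanRaden-style construction (centring each SNP by twice its allele frequency and scaling by the summed heterozygosity). The function names and the `importance` vector, standing in for variable-importance scores from RF or GBM, are hypothetical.

```python
import numpy as np

def top_ranked(importance, n_top):
    """Indices of the n_top SNPs with the largest importance scores."""
    return np.argsort(importance)[::-1][:n_top]

def grm_from_subset(genotypes, snp_idx):
    """VanRaden-style GRM (an assumed formula) from selected SNP columns.

    genotypes: (n_animals, n_snps) array of allele dosages (0/1/2).
    """
    M = np.asarray(genotypes, dtype=float)[:, snp_idx]
    p = M.mean(axis=0) / 2.0               # allele frequency per SNP
    Z = M - 2.0 * p                        # centre each SNP column
    denom = 2.0 * np.sum(p * (1.0 - p))    # summed expected heterozygosity
    return Z @ Z.T / denom
```

The resulting matrix would then replace the full-panel GRM in a GBLUP-type model; that downstream step is not shown here.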
Affiliation(s)
- Bo Li
- CSIRO Agriculture and Food, St Lucia, QLD, Australia; Shandong Technology and Business University, School of Computer Science and Technology, YanTai, China; Shandong Co-Innovation Centre of Future Intelligent Computing, YanTai, China
- Nanxi Zhang
- Centre for Applications in Natural Resource Mathematics, University of Queensland, St Lucia, QLD, Australia
- You-Gan Wang
- School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
- Yutao Li
- CSIRO Agriculture and Food, St Lucia, QLD, Australia
|
13
|
Du C, Wei J, Wang S, Jia Z. Genomic selection using principal component regression. Heredity (Edinb) 2018; 121:12-23. [PMID: 29713089 DOI: 10.1038/s41437-018-0078-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 10/16/2017] [Revised: 03/17/2018] [Accepted: 03/21/2018] [Indexed: 01/02/2023]
Abstract
Many statistical methods are available for genomic selection (GS) through which genetic values of quantitative traits are predicted for plants and animals using whole-genome SNP data. A large number of predictors with far fewer subjects becomes a major computational challenge in GS. Principal components regression (PCR) and its derivative, i.e., partial least squares regression (PLSR), provide a solution through dimensionality reduction. In this study, we show that PCR can perform better than PLSR in cross validation. PCR often requires extracting more components to achieve the maximum predictive ability than PLSR and thus may be associated with a higher computational cost. However, application of the HAT method (a strategy of describing the relationship between the fitted and observed response variables with a hat matrix) to PCR circumvents conventional cross validation in testing predictive ability, resulting in substantially improved computational efficiency over PLSR where cross validation is mandatory. Advantages of PCR over PLSR are illustrated with a simulated trait of a hypothetical population and four agronomical traits of a rice population. The benefit of using PCR in genomic selection is further demonstrated in an effort to predict 1000 metabolomic traits and 24,973 transcriptomic traits in the same rice population.
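The hat-matrix idea mentioned in this abstract can be illustrated with the standard identity for linear smoothers: once fitted values are written as ŷ = Hy, the leave-one-out residual for sample i is e_i / (1 − h_ii), so no refitting is needed. The sketch below is one plausible reading of that approach and is not taken from the paper. It assumes the principal component scores are computed once from all samples and then held fixed; the function name is hypothetical.

```python
import numpy as np

def pcr_hat_loo(X, y, n_components):
    """PCR fit plus leave-one-out residuals obtained from the hat matrix.

    Assumption: PC scores are derived once from the full centred marker
    matrix and treated as fixed regressors.
    """
    Xc = X - X.mean(axis=0)
    # PC scores from the SVD of the centred marker matrix
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    Z = np.column_stack([np.ones(len(y)), scores])   # intercept + PCs
    H = Z @ np.linalg.pinv(Z)                        # hat matrix: y_hat = H y
    fitted = H @ y
    resid = y - fitted
    loo_resid = resid / (1.0 - np.diag(H))           # e_i / (1 - h_ii)
    return fitted, loo_resid
```

A predictive-ability estimate could then be taken as the correlation between `y` and `y - loo_resid`, at the cost of a single fit per number of components.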
Affiliation(s)
- Caroline Du
- Department of Botany and Plant Sciences, University of California, Riverside, CA, 92521, USA
- Julong Wei
- Department of Botany and Plant Sciences, University of California, Riverside, CA, 92521, USA; College of Animal Science and Technology, Nanjing Agricultural University, Nanjing, Jiangsu, China
- Shibo Wang
- Department of Botany and Plant Sciences, University of California, Riverside, CA, 92521, USA
- Zhenyu Jia
- Department of Botany and Plant Sciences, University of California, Riverside, CA, 92521, USA
|
14
|
Accuracy of Genomic Prediction in Switchgrass (Panicum virgatum L.) Improved by Accounting for Linkage Disequilibrium. G3-GENES GENOMES GENETICS 2016; 6:1049-62. [PMID: 26869619 PMCID: PMC4825640 DOI: 10.1534/g3.115.024950] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Indexed: 12/04/2022]
Abstract
Switchgrass is a relatively high-yielding and environmentally sustainable biomass crop, but further genetic gains in biomass yield must be achieved to make it an economically viable bioenergy feedstock. Genomic selection (GS) is an attractive technology to generate rapid genetic gains in switchgrass, and meet the goals of a substantial displacement of petroleum use with biofuels in the near future. In this study, we empirically assessed prediction procedures for genomic selection in two different populations, consisting of 137 and 110 half-sib families of switchgrass, tested in two locations in the United States for three agronomic traits: dry matter yield, plant height, and heading date. Marker data were produced for the families’ parents by exome capture sequencing, generating up to 141,030 polymorphic markers with available genomic-location and annotation information. We evaluated prediction procedures that varied not only by learning schemes and prediction models, but also by the way the data were preprocessed to account for redundancy in marker information. More complex genomic prediction procedures were generally not significantly more accurate than the simplest procedure, likely due to limited population sizes. Nevertheless, a highly significant gain in prediction accuracy was achieved by transforming the marker data through a marker correlation matrix. Our results suggest that marker-data transformations and, more generally, the account of linkage disequilibrium among markers, offer valuable opportunities for improving prediction procedures in GS. Some of the achieved prediction accuracies should motivate implementation of GS in switchgrass breeding programs.
|
15
|
Borel P, Desmarchelier C, Nowicki M, Bott R. Lycopene bioavailability is associated with a combination of genetic variants. Free Radic Biol Med 2015; 83:238-44. [PMID: 25772008 DOI: 10.1016/j.freeradbiomed.2015.02.033] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Received: 01/13/2015] [Revised: 02/25/2015] [Accepted: 02/26/2015] [Indexed: 10/23/2022]
Abstract
The intake of tomatoes and tomato products, which constitute the main dietary source of the red pigment lycopene (LYC), has been associated with a reduced risk of prostate cancer and cardiovascular disease, suggesting a protective role of this carotenoid. However, LYC bioavailability displays high interindividual variability. This variability may lead to varying biological effects following LYC consumption. Based on recent results obtained with two other carotenoids, we assumed that this variability was due, at least in part, to several single nucleotide polymorphisms (SNPs) in genes involved in LYC and lipid metabolism. Thus, we aimed at identifying a combination of SNPs significantly associated with the variability in LYC bioavailability. In a postprandial study, 33 healthy male volunteers consumed a test meal containing 100g tomato puree, which provided 9.7 mg all-trans LYC. LYC concentrations were measured in plasma chylomicrons (CM) isolated at regular time intervals over 8 h postprandially. For the study 1885 SNPs in 49 candidate genes, i.e., genes assumed to play a role in LYC bioavailability, were selected. Multivariate statistical analysis (partial least squares regression) was used to identify and validate the combination of SNPs most closely associated with postprandial CM LYC response. The postprandial CM LYC response to the meal was notably variable with a CV of 70%. A significant (P=0.037) and validated partial least squares regression model, which included 28 SNPs in 16 genes, explained 72% of the variance in the postprandial CM LYC response. The postprandial CM LYC response was also positively correlated to fasting plasma LYC concentrations (r=0.37, P<0.05). The ability to respond to LYC is explained, at least partly, by a combination of 28 SNPs in 16 genes. Interindividual variability in bioavailability apparently affects the long-term blood LYC status, which could ultimately modulate the biological response following LYC supplementation.
Affiliation(s)
- Patrick Borel
- INRA, UMR INRA1260, F-13005, Marseille, France; INSERM, UMR_S 1062, F-13005, Marseille, France; Aix-Marseille Université, NORT, F-13005, Marseille, France
- Charles Desmarchelier
- INRA, UMR INRA1260, F-13005, Marseille, France; INSERM, UMR_S 1062, F-13005, Marseille, France; Aix-Marseille Université, NORT, F-13005, Marseille, France
- Marion Nowicki
- INRA, UMR INRA1260, F-13005, Marseille, France; INSERM, UMR_S 1062, F-13005, Marseille, France; Aix-Marseille Université, NORT, F-13005, Marseille, France
- Romain Bott
- INRA, UMR INRA1260, F-13005, Marseille, France; INSERM, UMR_S 1062, F-13005, Marseille, France; Aix-Marseille Université, NORT, F-13005, Marseille, France
|
16
|
Multiple-breed genomic evaluation by principal component analysis in small size populations. Animal 2015; 9:738-49. [DOI: 10.1017/s1751731114002973] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/06/2022]
|
17
|
Onogi A, Ideta O, Inoshita Y, Ebana K, Yoshioka T, Yamasaki M, Iwata H. Exploring the areas of applicability of whole-genome prediction methods for Asian rice (Oryza sativa L.). TAG. THEORETICAL AND APPLIED GENETICS. THEORETISCHE UND ANGEWANDTE GENETIK 2015; 128:41-53. [PMID: 25341369 DOI: 10.1007/s00122-014-2411-y] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 07/09/2014] [Accepted: 10/03/2014] [Indexed: 05/25/2023]
Abstract
Our simulation results clarify the areas of applicability of nine prediction methods and suggest the factors that affect their accuracy at predicting empirical traits. Whole-genome prediction is used to predict genetic value from genome-wide markers. The choice of method is important for successful prediction. We compared nine methods using empirical data for eight phenological and morphological traits of Asian rice cultivars (Oryza sativa L.) and data simulated from real marker genotype data. The methods were genomic BLUP (GBLUP), reproducing kernel Hilbert spaces regression (RKHS), Lasso, elastic net, random forest (RForest), Bayesian lasso (Blasso), extended Bayesian lasso (EBlasso), weighted Bayesian shrinkage regression (wBSR), and the average of all methods (Ave). The objectives were to evaluate the predictive ability of these methods in a cultivar population, to characterize them by exploring the area of applicability of each method using simulation, and to investigate the causes of their different accuracies for empirical traits. GBLUP was the most accurate for one trait, RKHS and Ave for two, and RForest for three traits. In the simulation, Blasso, EBlasso, and Ave showed stable performance across the simulated scenarios, whereas the other methods, except wBSR, had specific areas of applicability; wBSR performed poorly in most scenarios. For each method, the accuracy ranking for the empirical traits was largely consistent with that in one of the simulated scenarios, suggesting that the simulation conditions reflected the factors that affected the method accuracy for the empirical results. This study will be useful for genomic prediction not only in Asian rice, but also in populations from other crops with relatively small training sets and strong linkage disequilibrium structures.
Affiliation(s)
- Akio Onogi
- Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ku, Tokyo, 113-8657, Japan
|
18
|
Dadousis C, Veerkamp RF, Heringstad B, Pszczola M, Calus MPL. A comparison of principal component regression and genomic REML for genomic prediction across populations. Genet Sel Evol 2014; 46:60. [PMID: 25370926 PMCID: PMC4220066 DOI: 10.1186/s12711-014-0060-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 10/16/2013] [Accepted: 09/08/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND Genomic prediction faces two main statistical problems: multicollinearity and n ≪ p (many fewer observations than predictor variables). Principal component (PC) analysis is a multivariate statistical method that is often used to address these problems. The objective of this study was to compare the performance of PC regression (PCR) for genomic prediction with that of a commonly used REML model with a genomic relationship matrix (GREML) and to investigate the full potential of PCR for genomic prediction. METHODS The PCR model used either a common or a semi-supervised approach, where PC were selected based either on their eigenvalues (i.e. proportion of variance explained by SNP (single nucleotide polymorphism) genotypes) or on their association with phenotypic variance in the reference population (i.e. the regression sum of squares contribution). Cross-validation within the reference population was used to select the optimum PCR model that minimizes mean squared error. Pre-corrected average daily milk, fat and protein yields of 1609 first lactation Holstein heifers, from Ireland, UK, the Netherlands and Sweden, which were genotyped with 50 k SNPs, were analysed. Each testing subset included animals from only one country, or from only one selection line for the UK. RESULTS In general, accuracies of GREML and PCR were similar but GREML slightly outperformed PCR. Inclusion of genotyping information of validation animals into model training (semi-supervised PCR), did not result in more accurate genomic predictions. The highest achievable PCR accuracies were obtained across a wide range of numbers of PC fitted in the regression (from one to more than 1000), across test populations and traits. Using cross-validation within the reference population to derive the number of PC, yielded substantially lower accuracies than the highest achievable accuracies obtained across all possible numbers of PC. 
CONCLUSIONS On average, PCR performed only slightly less well than GREML. When the optimal number of PC was determined based on realized accuracy in the testing population, PCR showed a higher potential in terms of achievable accuracy that was not capitalized when PC selection was based on cross-validation. A standard approach for selecting the optimal set of PC in PCR remains a challenge.
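The model-selection step described in the Methods, choosing the number of PCs by cross-validation within the reference population so as to minimise mean squared error, can be sketched roughly as below. This covers only the eigenvalue-ordered variant (leading PCs first), uses plain K-fold splits, and estimates PC loadings within each training fold; the fold count, function name, and candidate grid are assumptions and not details from the paper.

```python
import numpy as np

def choose_n_pcs(X, y, candidates, n_folds=5, seed=0):
    """Return the candidate number of PCs with the lowest K-fold CV MSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mse = {}
    for m in candidates:
        sq_err = []
        for test in folds:
            train = np.setdiff1d(np.arange(len(y)), test)
            mu_x, mu_y = X[train].mean(axis=0), y[train].mean()
            # PC loadings estimated on the training fold only
            _, _, Vt = np.linalg.svd(X[train] - mu_x, full_matrices=False)
            V = Vt[:m].T
            beta, *_ = np.linalg.lstsq((X[train] - mu_x) @ V,
                                       y[train] - mu_y, rcond=None)
            pred = mu_y + (X[test] - mu_x) @ V @ beta
            sq_err.extend((y[test] - pred) ** 2)
        mse[m] = float(np.mean(sq_err))
    best = min(mse, key=mse.get)
    return best, mse
```

As the abstract notes, a number of PCs chosen this way within the reference population need not be the one that performs best in a separate test population.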
Affiliation(s)
- Mario P L Calus
- Animal Breeding and Genomics Centre, Wageningen UR Livestock Research, Wageningen 6700 AH, The Netherlands
|
19
|
Azevedo CF, Silva FF, de Resende MDV, Lopes MS, Duijvesteijn N, Guimarães SEF, Lopes PS, Kelly MJ, Viana JMS, Knol EF. Supervised independent component analysis as an alternative method for genomic selection in pigs. J Anim Breed Genet 2014; 131:452-61. [PMID: 25039677 DOI: 10.1111/jbg.12104] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 01/29/2014] [Accepted: 06/05/2014] [Indexed: 11/28/2022]
Abstract
The objective of this work was to evaluate the efficiency of the supervised independent component regression (SICR) method for the estimation of genomic values and the SNP marker effects for boar taint and carcass traits in pigs. The methods were evaluated via the agreement between the predicted genetic values and the corrected phenotypes observed by cross-validation. These values were also compared with other methods generally used for the same purposes, such as RR-BLUP, SPCR, SPLS, ICR, PCR and PLS. The SICR method was found to have the most accurate prediction values.
Affiliation(s)
- C F Azevedo
- Departamento de Estatística, Universidade Federal de Viçosa, Viçosa, Brazil
|
20
|
Borel P, Desmarchelier C, Nowicki M, Bott R, Morange S, Lesavre N. Interindividual variability of lutein bioavailability in healthy men: characterization, genetic variants involved, and relation with fasting plasma lutein concentration. Am J Clin Nutr 2014; 100:168-75. [PMID: 24808487 DOI: 10.3945/ajcn.114.085720] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Indexed: 12/29/2022]
Abstract
BACKGROUND Lutein accumulates in the macula and brain, where it is assumed to play physiologic roles. The bioavailability of lutein is assumed to display a high interindividual variability that has been hypothesized to be attributable, at least partly, to genetic polymorphisms. OBJECTIVES We characterized the interindividual variability in lutein bioavailability in humans, assessed the relation between this variability and the fasting blood lutein concentration, and identified single nucleotide polymorphisms (SNPs) involved in this phenomenon. DESIGN In a randomized, 2-way crossover study, 39 healthy men consumed a meal that contained a lutein supplement or the same meal for which lutein was provided through a tomato puree. The lutein concentration was measured in plasma chylomicrons isolated at regular time intervals over 8 h postprandially. Multivariate statistical analyses were used to identify a combination of SNPs associated with the postprandial chylomicron lutein response (0-8-h area under the curve). A total of 1785 SNPs in 51 candidate genes were selected. RESULTS Postprandial chylomicron lutein responses to meals were very variable (CV of 75% and 137% for the lutein-supplement meal and the meal with tomato-sourced lutein, respectively). Postprandial chylomicron lutein responses measured after the 2 meals were positively correlated (r = 0.68, P < 0.0001) and positively correlated to the fasting plasma lutein concentration (r = 0.51, P < 0.005 for the lutein-supplement-containing meal). A significant (P = 1.9 × 10⁻⁴) and validated partial least-squares regression model, which included 29 SNPs in 15 genes, explained most of the variance in the postprandial chylomicron lutein response. CONCLUSIONS The ability to respond to lutein appears to be, at least in part, genetically determined. The ability is explained, in large part, by a combination of SNPs in 15 genes related to both lutein and chylomicron metabolism. 
Finally, our results suggest that the ability to respond to lutein and blood lutein status are related. This trial was registered at clinicaltrials.gov as NCT02100774.
Affiliation(s)
- Patrick Borel
- From Institut National de la Recherche Agronomique (INRA), Unité Mixte de Recherche (UMR) INRA1260, Marseille, France (PB, CD, MN, and RB); the Institut National de la Recherche Médicale (INSERM), UMR_S 1062, Marseille, France (PB, CD, MN, and RB); Aix Marseille Université, Nutrition Obésité et Risque Thrombotique, Marseille, France (PB, CD, MN, and RB); the Centre d'Investigation Clinique (CIC) Hôpital de la Conception, Marseille, France (SM); and the CIC Hôpital Nord, Marseille, France (NL)
- Charles Desmarchelier
- From Institut National de la Recherche Agronomique (INRA), Unité Mixte de Recherche (UMR) INRA1260, Marseille, France (PB, CD, MN, and RB); the Institut National de la Recherche Médicale (INSERM), UMR_S 1062, Marseille, France (PB, CD, MN, and RB); Aix Marseille Université, Nutrition Obésité et Risque Thrombotique, Marseille, France (PB, CD, MN, and RB); the Centre d'Investigation Clinique (CIC) Hôpital de la Conception, Marseille, France (SM); and the CIC Hôpital Nord, Marseille, France (NL)
- Marion Nowicki
- From Institut National de la Recherche Agronomique (INRA), Unité Mixte de Recherche (UMR) INRA1260, Marseille, France (PB, CD, MN, and RB); the Institut National de la Recherche Médicale (INSERM), UMR_S 1062, Marseille, France (PB, CD, MN, and RB); Aix Marseille Université, Nutrition Obésité et Risque Thrombotique, Marseille, France (PB, CD, MN, and RB); the Centre d'Investigation Clinique (CIC) Hôpital de la Conception, Marseille, France (SM); and the CIC Hôpital Nord, Marseille, France (NL)
- Romain Bott
- From Institut National de la Recherche Agronomique (INRA), Unité Mixte de Recherche (UMR) INRA1260, Marseille, France (PB, CD, MN, and RB); the Institut National de la Recherche Médicale (INSERM), UMR_S 1062, Marseille, France (PB, CD, MN, and RB); Aix Marseille Université, Nutrition Obésité et Risque Thrombotique, Marseille, France (PB, CD, MN, and RB); the Centre d'Investigation Clinique (CIC) Hôpital de la Conception, Marseille, France (SM); and the CIC Hôpital Nord, Marseille, France (NL)
- Sophie Morange
- From Institut National de la Recherche Agronomique (INRA), Unité Mixte de Recherche (UMR) INRA1260, Marseille, France (PB, CD, MN, and RB); the Institut National de la Recherche Médicale (INSERM), UMR_S 1062, Marseille, France (PB, CD, MN, and RB); Aix Marseille Université, Nutrition Obésité et Risque Thrombotique, Marseille, France (PB, CD, MN, and RB); the Centre d'Investigation Clinique (CIC) Hôpital de la Conception, Marseille, France (SM); and the CIC Hôpital Nord, Marseille, France (NL)
- Nathalie Lesavre
- From Institut National de la Recherche Agronomique (INRA), Unité Mixte de Recherche (UMR) INRA1260, Marseille, France (PB, CD, MN, and RB); the Institut National de la Recherche Médicale (INSERM), UMR_S 1062, Marseille, France (PB, CD, MN, and RB); Aix Marseille Université, Nutrition Obésité et Risque Thrombotique, Marseille, France (PB, CD, MN, and RB); the Centre d'Investigation Clinique (CIC) Hôpital de la Conception, Marseille, France (SM); and the CIC Hôpital Nord, Marseille, France (NL)
|
21
|
Desmarchelier C, Martin JC, Planells R, Gastaldi M, Nowicki M, Goncalves A, Valéro R, Lairon D, Borel P. The postprandial chylomicron triacylglycerol response to dietary fat in healthy male adults is significantly explained by a combination of single nucleotide polymorphisms in genes involved in triacylglycerol metabolism. J Clin Endocrinol Metab 2014; 99:E484-8. [PMID: 24423365 DOI: 10.1210/jc.2013-3962] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Indexed: 11/19/2022]
Abstract
CONTEXT The postprandial chylomicron (CM) triacylglycerol (TG) response to dietary fat, which is positively associated with atherosclerosis and cardiovascular disease risk, displays a high interindividual variability. This is assumed to be due, at least partly, to polymorphisms in genes involved in lipid metabolism. Existing studies have focused on single nucleotide polymorphisms (SNPs), resulting in only a low explained variability. OBJECTIVE We aimed to identify a combination of SNPs associated with the postprandial CM TG response. PARTICIPANTS AND METHODS Thirty-three healthy male volunteers were subjected to 4 standardized fat tolerance test meals (to correct for intraindividual variability) and genotyped using whole-genome microarrays. The plasma CM TG concentration was measured at regular time intervals after each meal. The association of SNPs in or near candidate genes (126 genes representing 6225 SNPs) with the postprandial CM TG concentration (0-8 h areas under the curve averaged for the 4 test meals) was assessed by partial least squares regression, a multivariate statistical approach. RESULTS Data obtained allowed us to generate a validated significant model (P = 1.3 × 10⁻⁷) that included 42 SNPs in 23 genes (ABCA1, APOA1, APOA5, APOB, BET1, CD36, COBLL1, ELOVL5, FRMD5, GPAM, INSIG2, IRS1, LDLR, LIPC, LPL, LYPLAL1, MC4R, NAT2, PARK2, SLC27A5, SLC27A6, TCF7L2, and ZNF664) and explained 88% of the variance. In 39 of these SNPs, univariate analysis showed that subjects with different genotypes exhibited significantly different (q < .05) postprandial CM TG responses. CONCLUSIONS Using a multivariate approach, we report a combination of SNPs that explains a significant part of the variability in the postprandial CM TG response.
Affiliation(s)
- Charles Desmarchelier
- Institut National de la Recherche Agronomique (INRA), Unité Mixte de Recherche (UMR) INRA1260, F-13005, Marseille, France; Inserm, UMR_S 1062, F-13005, Marseille, France; and Aix-Marseille Université, Nutrition, Obésité et Risques Thrombotiques, F-13005, Marseille, France
|
22
|
Smaragdov MG. Genomic selection of milk cattle. The practical application over five years. RUSS J GENET+ 2013. [DOI: 10.1134/s1022795413100104] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
23
|
Enlarging a training set for genomic selection by imputation of un-genotyped animals in populations of varying genetic architecture. Genet Sel Evol 2013; 45:12. [PMID: 23621897 PMCID: PMC3652763 DOI: 10.1186/1297-9686-45-12] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2012] [Accepted: 03/24/2013] [Indexed: 02/02/2023] Open
Abstract
Background The most common application of imputation is to infer genotypes of a high-density panel of markers on animals that are genotyped for a low-density panel. However, the increase in accuracy of genomic predictions resulting from an increase in the number of markers tends to reach a plateau beyond a certain density. Another application of imputation is to increase the size of the training set with un-genotyped animals. This strategy can be particularly successful when a set of closely related individuals are genotyped. Methods Imputation on completely un-genotyped dams was performed using known genotypes from the sire of each dam, one offspring and the offspring’s sire. Two methods were applied based on either allele or haplotype frequencies to infer genotypes at ambiguous loci. Results of these methods and of two available software packages were compared. Quality of imputation under different population structures was assessed. The impact of using imputed dams to enlarge training sets on the accuracy of genomic predictions was evaluated for different populations, heritabilities and sizes of training sets. Results Imputation accuracy ranged from 0.52 to 0.93 depending on the population structure and the method used. The method that used allele frequencies performed better than the method based on haplotype frequencies. Accuracy of imputation was higher for populations with higher levels of linkage disequilibrium and with larger proportions of markers with more extreme allele frequencies. Inclusion of imputed dams in the training set increased the accuracy of genomic predictions. Gains in accuracy ranged from close to zero to 37.14%, depending on the simulated scenario. Generally, the larger the accuracy already obtained with the genotyped training set, the lower the increase in accuracy achieved by adding imputed dams. 
Conclusions Whenever a reference population resembling the family configuration considered here is available, imputation can be used to achieve an extra increase in accuracy of genomic predictions by enlarging the training set with completely un-genotyped dams. This strategy was shown to be particularly useful for populations with lower levels of linkage disequilibrium, for genomic selection on traits with low heritability, and for species or breeds for which the size of the reference population is limited.
|
24
|
de Los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 2013; 193:327-45. [PMID: 22745228 PMCID: PMC3567727 DOI: 10.1534/genetics.112.143313] [Citation(s) in RCA: 471] [Impact Index Per Article: 42.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2012] [Accepted: 06/11/2012] [Indexed: 11/18/2022] Open
Abstract
Genomic-enabled prediction is becoming increasingly important in animal and plant breeding and is also receiving attention in human genetics. Deriving accurate predictions of complex traits requires implementing whole-genome regression (WGR) models where phenotypes are regressed on thousands of markers concurrently. Methods exist that allow implementing these large-p with small-n regressions, and genome-enabled selection (GS) is being implemented in several plant and animal breeding programs. The list of available methods is long, and the relationships between them have not been fully addressed. In this article we provide an overview of available methods for implementing parametric WGR models, discuss selected topics that emerge in applications, and present a general discussion of lessons learned from simulation and empirical data analysis in the last decade.
Affiliation(s)
- Gustavo de Los Campos
- Department of Biostatistics, School of Public Health, University of Alabama, Birmingham, AL 35294, USA.
|
25
|
Gaspa G, Pintus MA, Nicolazzi EL, Vicario D, Valentini A, Dimauro C, Macciotta NPP. Use of principal component approach to predict direct genomic breeding values for beef traits in Italian Simmental cattle. J Anim Sci 2013; 91:29-37. [DOI: 10.2527/jas.2011-5061] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- G. Gaspa
- Dipartimento di Agraria, Università di Sassari, Viale Italia 39, 07100, Sassari, Italy
- M. A. Pintus
- Dipartimento di Agraria, Università di Sassari, Viale Italia 39, 07100, Sassari, Italy
- E. L. Nicolazzi
- Istituto di Zootecnica, Università Cattolica del Sacro Cuore, Piacenza, Italy, 29100
- CRSA—Consorzio di Ricerca e Sperimentazione degli Allevatori—Via G. Tomassetti 9, Rome 00161, Italy
- D. Vicario
- Associazione Nazionale Allevatori Razza Pezzata Rossa Italiana (ANAPRI), Udine, Italy, 33100
- A. Valentini
- Dipartimento per la innovazione nei sistemi biologici, agroalimentari e forestali (DIBAF), Universita della Tuscia, via de lellis, 01100 Viterbo, Italy
- C. Dimauro
- Dipartimento di Agraria, Università di Sassari, Viale Italia 39, 07100, Sassari, Italy
- N. P. P. Macciotta
- Dipartimento di Agraria, Università di Sassari, Viale Italia 39, 07100, Sassari, Italy
|
26
|
Colombani C, Legarra A, Fritz S, Guillaume F, Croiseau P, Ducrocq V, Robert-Granié C. Application of Bayesian least absolute shrinkage and selection operator (LASSO) and BayesCπ methods for genomic selection in French Holstein and Montbéliarde breeds. J Dairy Sci 2012; 96:575-91. [PMID: 23127905 DOI: 10.3168/jds.2011-5225] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2011] [Accepted: 09/14/2012] [Indexed: 11/19/2022]
Abstract
Recently, the amount of available single nucleotide polymorphism (SNP) marker data has considerably increased in dairy cattle breeds, both for research purposes and for application in commercial breeding and selection programs. Bayesian methods are currently used in the genomic evaluation of dairy cattle to handle very large sets of explanatory variables with a limited number of observations. In this study, we applied 2 bayesian methods, BayesCπ and bayesian least absolute shrinkage and selection operator (LASSO), to 2 genotyped and phenotyped reference populations consisting of 3,940 Holstein bulls and 1,172 Montbéliarde bulls with approximately 40,000 polymorphic SNP. We compared the accuracy of the bayesian methods for the prediction of 3 traits (milk yield, fat content, and conception rate) with pedigree-based BLUP, genomic BLUP, partial least squares (PLS) regression, and sparse PLS regression, a variable selection PLS variant. The results showed that the correlations between observed and predicted phenotypes were similar in BayesCπ (including or not pedigree information) and bayesian LASSO for most of the traits and whatever the breed. In the Holstein breed, bayesian methods led to higher correlations than other approaches for fat content and were similar to genomic BLUP for milk yield and to genomic BLUP and PLS regression for the conception rate. In the Montbéliarde breed, no method dominated the others, except BayesCπ for fat content. The better performances of the bayesian methods for fat content in Holstein and Montbéliarde breeds are probably due to the effect of the DGAT1 gene. The SNP identified by the BayesCπ, bayesian LASSO, and sparse PLS regression methods, based on their effect on the different traits of interest, were located at almost the same position on the genome. 
As the bayesian methods resulted in regressions of direct genomic values on daughter trait deviations closer to 1 than for the other methods tested in this study, bayesian methods are suggested for genomic evaluations of French dairy cattle.
Affiliation(s)
- C Colombani
- INRA, UR631-SAGA, BP 52627, 31326 Castanet-Tolosan Cedex, France
|
27
|
Pintus MA, Gaspa G, Nicolazzi EL, Vicario D, Rossoni A, Ajmone-Marsan P, Nardone A, Dimauro C, Macciotta NPP. Prediction of genomic breeding values for dairy traits in Italian Brown and Simmental bulls using a principal component approach. J Dairy Sci 2012; 95:3390-400. [PMID: 22612973 DOI: 10.3168/jds.2011-4274] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2011] [Accepted: 02/13/2012] [Indexed: 01/18/2023]
Abstract
The large number of markers available compared with phenotypes represents one of the main issues in genomic selection. In this work, principal component analysis was used to reduce the number of predictors for calculating genomic breeding values (GEBV). Bulls of 2 cattle breeds farmed in Italy (634 Brown and 469 Simmental) were genotyped with the 54K Illumina beadchip (Illumina Inc., San Diego, CA). After data editing, 37,254 and 40,179 single nucleotide polymorphisms (SNP) were retained for Brown and Simmental, respectively. Principal component analysis carried out on the SNP genotype matrix extracted 2,257 and 3,596 new variables in the 2 breeds, respectively. Bulls were sorted by birth year to create reference and prediction populations. The effect of principal components on deregressed proofs in reference animals was estimated with a BLUP model. Results were compared with those obtained by using SNP genotypes as predictors with either the BLUP or Bayes_A method. Traits considered were milk, fat, and protein yields, fat and protein percentages, and somatic cell score. The GEBV were obtained for prediction population by blending direct genomic prediction and pedigree indexes. No substantial differences were observed in squared correlations between GEBV and EBV in prediction animals between the 3 methods in the 2 breeds. The principal component analysis method allowed for a reduction of about 90% in the number of independent variables when predicting direct genomic values, with a substantial decrease in calculation time and without loss of accuracy.
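The dimension-reduction idea in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a genotype matrix coded 0/1/2, keeps the leading principal components that explain a chosen share of variance, and uses ridge regression on the component scores as a stand-in for the BLUP step. The function names `pc_ridge_fit` and `pc_ridge_predict`, the 90% threshold, and the ridge value are all hypothetical choices made for this example.

```python
import numpy as np

def pc_ridge_fit(genotypes, phenotypes, var_explained=0.90, ridge=1.0):
    """Fit ridge regression on the leading principal components of a SNP matrix."""
    snp_means = genotypes.mean(axis=0)
    centered = genotypes - snp_means
    # SVD of the centered matrix: rows of vt are the principal axes.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    # Smallest number of components reaching the requested variance share.
    n_components = int(np.searchsorted(explained, var_explained) + 1)
    loadings = vt[:n_components].T              # SNPs x components
    scores = centered @ loadings                # animals x components
    pheno_mean = phenotypes.mean()
    # Ridge solution (S'S + lambda I)^-1 S'(y - mean); assumed stand-in for BLUP.
    lhs = scores.T @ scores + ridge * np.eye(n_components)
    effects = np.linalg.solve(lhs, scores.T @ (phenotypes - pheno_mean))
    return snp_means, loadings, effects, pheno_mean

def pc_ridge_predict(model, genotypes):
    """Predict phenotypes for new genotypes from a fitted model tuple."""
    snp_means, loadings, effects, pheno_mean = model
    return pheno_mean + (genotypes - snp_means) @ loadings @ effects

# Simulated data for illustration only (not from the cited study).
rng = np.random.default_rng(0)
snps = rng.integers(0, 3, size=(60, 500)).astype(float)
trait = snps[:, :10] @ rng.normal(size=10) + rng.normal(scale=0.5, size=60)
fitted = pc_ridge_fit(snps[:50], trait[:50])
print("components kept:", fitted[1].shape[1], "of", snps.shape[1], "SNPs")
print("predictions for held-out animals:", pc_ridge_predict(fitted, snps[50:]))
```

Because the number of non-trivial components cannot exceed the number of animals minus one, the regression is fitted on far fewer predictors than there are SNPs, which is the source of the computing-time saving the abstract describes.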
Affiliation(s)
- M A Pintus
- Dipartimento di Scienze Zootecniche, Università di Sassari, Sassari 07100, Italy
|
28
|
Colombani C, Croiseau P, Fritz S, Guillaume F, Legarra A, Ducrocq V, Robert-Granié C. A comparison of partial least squares (PLS) and sparse PLS regressions in genomic selection in French dairy cattle. J Dairy Sci 2012; 95:2120-31. [PMID: 22459857 DOI: 10.3168/jds.2011-4647] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2011] [Accepted: 12/09/2011] [Indexed: 01/25/2023]
Abstract
Genomic selection involves computing a prediction equation from the estimated effects of a large number of DNA markers based on a limited number of genotyped animals with phenotypes. The number of observations is much smaller than the number of independent variables, and the challenge is to find methods that perform well in this context. Partial least squares regression (PLS) and sparse PLS were used with a reference population of 3,940 genotyped and phenotyped French Holstein bulls and 39,738 polymorphic single nucleotide polymorphism markers. Partial least squares regression reduces the number of variables by projecting independent variables onto latent structures. Sparse PLS combines variable selection and modeling in a one-step procedure. Correlations between observed phenotypes and phenotypes predicted by PLS and sparse PLS were similar, but sparse PLS highlighted some genome regions more clearly. Both PLS and sparse PLS were more accurate than pedigree-based BLUP and generally provided lower correlations between observed and predicted phenotypes than did genomic BLUP. Furthermore, PLS and sparse PLS required similar computing time to genomic BLUP for the study of 6 traits.
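The projection onto latent structures mentioned in this abstract can be illustrated with a small single-trait PLS (PLS1) routine based on the classical NIPALS deflation scheme. This is a sketch under stated assumptions rather than the software used in the study: the function names `pls1_fit` and `pls1_predict` are hypothetical, the data are simulated, and the sparse variant is not covered.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 by NIPALS: returns column means, trait mean and regression coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    weights, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                    # latent score for each animal
        tt = t @ t
        p = Xk.T @ t / tt             # loading of X on the latent score
        c = (yk @ t) / tt             # regression of y on the latent score
        Xk = Xk - np.outer(t, p)      # deflate X and y before the next component
        yk = yk - c * t
        weights.append(w)
        loadings.append(p)
        y_loadings.append(c)
    W, P, q = np.array(weights).T, np.array(loadings).T, np.array(y_loadings)
    # Coefficients on the original (centered) predictors: W (P'W)^-1 q
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    """Predict the trait for new rows of X from a fitted model tuple."""
    x_mean, y_mean, coef = model
    return y_mean + (X - x_mean) @ coef

# Simulated markers and phenotypes for illustration only.
rng = np.random.default_rng(0)
markers = rng.integers(0, 3, size=(50, 200)).astype(float)
phenotype = markers[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=50)
fit = pls1_fit(markers, phenotype, n_components=3)
print("training correlation:", np.corrcoef(pls1_predict(fit, markers), phenotype)[0, 1])
```

Each component is a weighted sum of all markers, so a handful of latent scores replaces thousands of correlated predictors. In practice the number of components would be chosen by cross-validation.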
Affiliation(s)
- C Colombani
- INRA, UR631-SAGA, BP 52627, 31326 Castanet-Tolosan Cedex, France.
|
29
|
Pintus M, Nicolazzi E, Van Kaam J, Biffani S, Stella A, Gaspa G, Dimauro C, Macciotta N. Use of different statistical models to predict direct genomic values for productive and functional traits in Italian Holsteins. J Anim Breed Genet 2012; 130:32-40. [DOI: 10.1111/j.1439-0388.2012.01019.x] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2012] [Accepted: 06/16/2012] [Indexed: 01/02/2023]
Affiliation(s)
- M.A. Pintus
- Dipartimento di Agraria-Sezione Scienze Zootecniche; Università di Sassari; Sassari; Italy
- E.L. Nicolazzi
- Istituto di Zootecnica; Università Cattolica del Sacro Cuore; Piacenza; Italy
- S. Biffani
- Associazione Nazionale Allevatori Frisona Italiana (ANAFI); Cremona; Italy
- G. Gaspa
- Dipartimento di Agraria-Sezione Scienze Zootecniche; Università di Sassari; Sassari; Italy
- C. Dimauro
- Dipartimento di Agraria-Sezione Scienze Zootecniche; Università di Sassari; Sassari; Italy
- N.P.P. Macciotta
- Dipartimento di Agraria-Sezione Scienze Zootecniche; Università di Sassari; Sassari; Italy
|
30
|
Abstract
Selective breeding is very important in agricultural production, and breeding value estimation is the core of selective breeding. With the development of genetic markers, especially high-throughput genotyping technology, it has become possible to estimate breeding values at the genome level, i.e. genomic selection (GS). In this review, the methods of GS were categorized into two groups: one predicts the genomic estimated breeding value (GEBV) based on allele effects, such as least squares, random regression best linear unbiased prediction (RR-BLUP), Bayes and principal component analysis; the other predicts GEBV with a genetic relationship matrix, which is constructed from high-throughput genetic markers and then used to predict GEBV through a linear mixed model, i.e. GBLUP. The basic principles of these methods were also introduced according to the above two classifications. Factors affecting GS accuracy include marker type and density, haplotype length, the size of the reference population, the extent of linkage between markers and QTL, and so on. Among the methods of GS, Bayes and GBLUP are usually more accurate than the others, and least squares is the worst. GBLUP is time-efficient and can combine pedigree with genotypic information, hence it is superior to other methods. Although progress has been made in GS, there are still some challenges, for example, united breeding, long-term genetic gain with GS, and disentangling markers with and without contribution to the traits. GS has been applied in animal and plant breeding practice and also has the potential to predict genetic predisposition in humans and to study evolutionary dynamics. GS, which is more precise than the traditional method, is a breakthrough in measuring genetic relationships. Therefore, GS will be a revolutionary event in the history of animal and plant breeding.
|
31
|
Feng ZZ, Yang X, Subedi S, McNicholas PD. The LASSO and sparse least square regression methods for SNP selection in predicting quantitative traits. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 9:629-636. [PMID: 22025756 DOI: 10.1109/tcbb.2011.139] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
Recent work concerning quantitative traits of interest has focused on selecting a small subset of single nucleotide polymorphisms (SNPs) from amongst the SNPs responsible for the phenotypic variation of the trait. When considered as covariates, the large number of variables (SNPs) and their association with those in close proximity pose challenges for variable selection. The features of sparsity and shrinkage of regression coefficients of the least absolute shrinkage and selection operator (LASSO) method appear attractive for SNP selection. Sparse partial least squares (SPLS) is also appealing as it combines the features of sparsity in subset selection and dimension reduction to handle correlations amongst SNPs. In this paper we investigate application of the LASSO and SPLS methods for selecting SNPs that predict quantitative traits. We evaluate the performance of both methods with different criteria and under different scenarios using simulation studies. Results indicate that these methods can be effective in selecting SNPs that predict quantitative traits but are limited by some conditions. Both methods perform similarly overall, but each exhibits advantages over the other in given situations. Both methods are applied to Canadian Holstein cattle data to compare their performance.
|