1
Feldmann MJ, Piepho HP, Bridges WC, Knapp SJ. Average semivariance yields accurate estimates of the fraction of marker-associated genetic variance and heritability in complex trait analyses. PLoS Genet 2021; 17:e1009762. [PMID: 34437540 PMCID: PMC8425577 DOI: 10.1371/journal.pgen.1009762] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 03/31/2021] [Revised: 09/08/2021] [Accepted: 08/09/2021] [Indexed: 12/15/2022]
Abstract
The development of genome-informed methods for identifying quantitative trait loci (QTL) and studying the genetic basis of quantitative variation in natural and experimental populations has been driven by advances in high-throughput genotyping. For many complex traits, the underlying genetic variation is caused by the segregation of one or more ‘large-effect’ loci, in addition to an unknown number of loci with effects below the threshold of statistical detection. The large-effect loci segregating in populations are often necessary but not sufficient for predicting quantitative phenotypes. They are, nevertheless, important enough to warrant deeper study and direct modelling in genomic prediction problems. We explored the accuracy of statistical methods for estimating the fraction of marker-associated genetic variance ($p$) and heritability ($H_M^2$) for large-effect loci underlying complex phenotypes. We found that commonly used statistical methods overestimate $p$ and $H_M^2$. The source of the upward bias was traced to inequalities between the expected values of variance components in the numerators and denominators of these parameters. Algebraic solutions for bias-correcting estimates of $p$ and $H_M^2$ were found that only depend on the degrees of freedom and are constant for a given study design. We discovered that average semivariance methods, which have heretofore not been used in complex trait analyses, yielded unbiased estimates of $p$ and $H_M^2$, in addition to best linear unbiased predictors of the additive and dominance effects of the underlying loci. The cryptic bias problem described here is unrelated to selection bias, although both cause the overestimation of $p$ and $H_M^2$. The solutions we described are predicted to more accurately describe the contributions of large-effect loci to the genetic variation underlying complex traits of medical, biological, and agricultural importance.
Determining the contributions of individual genes to the phenotypic variation observed for genetically complex traits has been an ongoing and important challenge in biology, medicine, and agriculture. While many genes have statistically undetectable effects, those with large effects often warrant in-depth study and can be important predictors of complex phenotypes such as disease risk in humans or disease resistance in domesticated plants and animals. The genes identified through associations with genetic markers in complex trait analyses typically account for a fraction of the heritable variation, a genetic parameter we call ‘marker heritability’. We discovered that textbook statistical methods systematically overestimate marker heritability and thus overestimate the contributions of specific genes to the phenotypic variation observed for complex traits in natural and experimental populations. We describe the source of the upward bias, validate our findings through computer simulation, describe methods for bias-correcting estimates of marker heritability, and illustrate their application through empirical examples. The statistical methods we describe supply investigators with more accurate estimates of the contributions of specific genes or networks of interacting genes to the heritable variation observed in complex trait studies.
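The abstract above does not define average semivariance. As a rough illustration only, and assuming the usual definition (half the average variance of the pairwise differences between observations), the sketch below computes it from a variance-covariance matrix in two equivalent ways. The function names and the example matrix V are hypothetical and are not taken from the article.

```python
from itertools import combinations


def average_semivariance_pairwise(V):
    """Half the average variance of pairwise differences.

    Var(y_i - y_j) = V[i][i] + V[j][j] - 2 * V[i][j], averaged over all i < j.
    """
    pairs = list(combinations(range(len(V)), 2))
    total = sum(0.5 * (V[i][i] + V[j][j] - 2.0 * V[i][j]) for i, j in pairs)
    return total / len(pairs)


def average_semivariance_trace(V):
    """Closed form: (trace(V) - sum(V) / n) / (n - 1)."""
    n = len(V)
    trace = sum(V[i][i] for i in range(n))
    grand_sum = sum(sum(row) for row in V)
    return (trace - grand_sum / n) / (n - 1)


# Hypothetical 3 x 3 variance-covariance matrix, for illustration only.
V = [
    [2.0, 0.5, 0.3],
    [0.5, 1.5, 0.2],
    [0.3, 0.2, 1.0],
]

asv_pairwise = average_semivariance_pairwise(V)
asv_trace = average_semivariance_trace(V)
```

The two functions agree because summing the pairwise terms collapses to the trace minus the scaled grand sum. For independent observations with a common variance, both reduce to that variance.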
Affiliation(s)
- Mitchell J. Feldmann: Department of Plant Sciences, University of California, Davis, California, United States of America
- Hans-Peter Piepho: Biostatistics Unit, Institute of Crop Science, University of Hohenheim, Stuttgart, Germany
- William C. Bridges: Department of Mathematical Sciences, Clemson University, Clemson, South Carolina, United States of America
- Steven J. Knapp: Department of Plant Sciences, University of California, Davis, California, United States of America
2
Zhang J, Chen M, Wen Y, Zhang Y, Lu Y, Wang S, Chen J. A Fast Multi-Locus Ridge Regression Algorithm for High-Dimensional Genome-Wide Association Studies. Front Genet 2021; 12:649196. [PMID: 33854527 PMCID: PMC8041068 DOI: 10.3389/fgene.2021.649196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2021] [Accepted: 03/01/2021] [Indexed: 11/13/2022]
Abstract
The mixed linear model (MLM) has been widely used in genome-wide association studies (GWAS) to dissect quantitative traits in human, animal, and plant genetics. Most methodologies treat all single nucleotide polymorphism (SNP) effects as random effects under the MLM framework and therefore fail to detect the joint minor effects of multiple genetic markers on a trait. As a result, polygenes with minor effects remain largely unexplored in today’s big data era. In this study, we developed a new algorithm under the MLM framework, called the fast multi-locus ridge regression (FastRR) algorithm. The FastRR algorithm first whitens the covariance matrix of the polygenic matrix K and environmental noise, then selects, from the large marker set, the SNPs that are highly correlated with the target trait, and finally analyzes this subset of variables using a multi-locus deshrinking ridge regression to detect true quantitative trait nucleotides (QTNs). Results from the analyses of both simulated and real data show that the FastRR algorithm is more powerful for detecting both large and small QTNs, is more accurate in QTN effect estimation, and gives more stable results under various polygenic backgrounds. Moreover, compared with existing methods, the FastRR algorithm has the advantage of high computing speed. In conclusion, the FastRR algorithm provides an alternative algorithm for multi-locus GWAS in high-dimensional genomic datasets.
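The first two steps named in the abstract (whitening, then correlation-based screening) can be sketched as follows. This is a minimal illustration under stated assumptions, not the FastRR implementation: it assumes the covariance has the form lam*K + I with a known variance ratio lam, it whitens with a Cholesky factor rather than whatever decomposition the authors use, and it omits the deshrinking ridge regression step entirely. All function names, the parameter lam, and the toy data are hypothetical.

```python
import math


def cholesky(A):
    """Lower-triangular L with A = L L^T, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L x = b for lower-triangular L."""
    x = []
    for i, row in enumerate(L):
        x.append((b[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def whiten(K, lam, y, snps):
    """Multiply y and every SNP column by L^{-1}, where L L^T = lam*K + I.

    Assumption: lam (polygenic-to-residual variance ratio) is already known.
    """
    n = len(K)
    V = [[lam * K[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(V)
    return forward_solve(L, y), [forward_solve(L, col) for col in snps]


def pearson(a, b):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((z - mb) ** 2 for z in b)
    return sab / math.sqrt(saa * sbb)


def screen(y_w, snps_w, keep):
    """Indices of the `keep` SNPs with the largest |correlation| to the trait.

    Simplification: the whitened intercept is ignored when centering.
    """
    ranked = sorted(range(len(snps_w)),
                    key=lambda j: abs(pearson(y_w, snps_w[j])),
                    reverse=True)
    return ranked[:keep]


# Toy data, for illustration only: 4 individuals, 3 SNP columns.
K = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [1.0, 2.0, 3.0, 4.0]
snps = [[0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 1.0]]

y_w, snps_w = whiten(K, 0.0, y, snps)
selected = screen(y_w, snps_w, keep=2)
```

With an identity K and lam set to zero, whitening leaves the data unchanged, so the toy call only exercises the screening step. A real analysis would estimate lam from the data and pass the selected columns on to a multi-locus model.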
Affiliation(s)
- Jin Zhang: College of Science, Nanjing Agricultural University, Nanjing, China; Postdoctoral Research Station of Crop Science, Nanjing Agricultural University, Nanjing, China
- Min Chen: College of Science, Nanjing Agricultural University, Nanjing, China
- Yangjun Wen: College of Science, Nanjing Agricultural University, Nanjing, China
- Yin Zhang: College of Science, Nanjing Agricultural University, Nanjing, China
- Yunan Lu: College of Science, Nanjing Agricultural University, Nanjing, China
- Shengmeng Wang: College of Science, Nanjing Agricultural University, Nanjing, China
- Juncong Chen: College of Finance, Nanjing Agricultural University, Nanjing, China
3
Sun J, Wu Q, Shen D, Wen Y, Liu F, Gao Y, Ding J, Zhang J. TSLRF: Two-Stage Algorithm Based on Least Angle Regression and Random Forest in genome-wide association studies. Sci Rep 2019; 9:18034. [PMID: 31792302 DOI: 10.1038/s41598-019-54519-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/08/2019] [Accepted: 11/15/2019] [Indexed: 11/24/2022]
Abstract
One of the most important tasks in genome-wide association studies (GWAS) is the detection of single-nucleotide polymorphisms (SNPs) related to target traits. As sequencing technology has developed, traditional statistical methods have struggled to analyze the resulting high-dimensional SNP data. Machine learning methods have recently become more popular in high-dimensional genetic data analysis because of their fast computation speed. However, most machine learning methods have several drawbacks, such as poor generalization ability, over-fitting, unsatisfactory classification, and low detection accuracy. This study proposed a two-stage algorithm based on least angle regression and random forest (TSLRF), which first controls for population structure and polygenic effects, then selects the SNPs potentially related to target traits using least angle regression (LARS), and finally analyzes this variable subset using random forest (RF) to detect quantitative trait nucleotides (QTNs) associated with target traits. The new method showed greater detection power in both simulation experiments and real data analyses. In the simulation experiments, compared with existing approaches, the new method improved QTN detection ability and model fit and required less computation time. In addition, the new method significantly distinguished QTNs from other SNPs. The new method was then applied to five flowering-related traits in Arabidopsis, where the distinction between QTNs and unrelated SNPs was more significant than with the other methods. The new method detected 60 genes confirmed to be related to the target trait, significantly more than the other methods, and simultaneously detected multiple gene clusters associated with the target trait.
4
Zhang J, Feng JY, Ni YL, Wen YJ, Niu Y, Tamba CL, Yue C, Song Q, Zhang YM. pLARmEB: integration of least angle regression with empirical Bayes for multilocus genome-wide association studies. Heredity (Edinb) 2017; 118:517-524. [PMID: 28295030 PMCID: PMC5436030 DOI: 10.1038/hdy.2017.8] [Citation(s) in RCA: 117] [Impact Index Per Article: 16.7] [Received: 10/05/2016] [Revised: 01/14/2017] [Accepted: 01/20/2017] [Indexed: 02/06/2023]
Abstract
Multilocus genome-wide association studies (GWAS) have become the state-of-the-art procedure to identify quantitative trait nucleotides (QTNs) associated with complex traits. However, implementing a multilocus model in GWAS is still difficult. In this study, we integrated least angle regression with empirical Bayes to perform multilocus GWAS under polygenic background control. We used a model transformation algorithm that whitened the covariance matrix of the polygenic matrix K and environmental noise. Markers on one chromosome were included simultaneously in a multilocus model, and least angle regression was used to select the single-nucleotide polymorphisms (SNPs) most likely to be associated, whereas the markers on the other chromosomes were used to calculate the kinship matrix for polygenic background control. The SNPs selected in the multilocus model were then tested for association with the trait by empirical Bayes and a likelihood ratio test. We herein refer to this method as pLARmEB (polygenic-background-control-based least angle regression plus empirical Bayes). Results from simulation studies showed that pLARmEB was more powerful in QTN detection and more accurate in QTN effect estimation, had a lower false positive rate, and required less computing time than the Bayesian hierarchical generalized linear model, efficient mixed model association (EMMA), and least angle regression plus empirical Bayes. pLARmEB, the multilocus random-SNP-effect mixed linear model, and the fast multilocus random-SNP-effect EMMA method had almost equal power of QTN detection in simulation experiments. However, only pLARmEB identified 48 previously reported genes for 7 flowering time-related traits in Arabidopsis thaliana.
Affiliation(s)
- J Zhang: State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China
- J-Y Feng: State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China
- Y-L Ni: State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China
- Y-J Wen: State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China
- Y Niu: State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China
- C L Tamba: State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China
- C Yue: State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China
- Q Song: Soybean Genomics and Improvement Laboratory, Agricultural Research Service, United States Department of Agriculture, Beltsville, MD, USA
- Y-M Zhang: State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing, China; Statistical Genomics Lab, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, China
5
Bu SH, Xinwang Z, Yi C, Wen J, Jinxing T, Zhang YM. Interacted QTL mapping in partial NCII design provides evidences for breeding by design. PLoS One 2015; 10:e0121034. [PMID: 25822501 PMCID: PMC4379165 DOI: 10.1371/journal.pone.0121034] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 07/17/2014] [Accepted: 02/07/2015] [Indexed: 11/18/2022]
Abstract
The utilization of heterosis in rice, maize and rapeseed has revolutionized crop production. Although elite hybrid cultivars are mainly derived from the F1 crosses between two groups of parents, known as the NCII mating design, little is known about how interacted effects influence quantitative trait performance in the population. To bridge genetic analysis with hybrid breeding, here we integrated an interacted QTL mapping approach with breeding by design in a partial NCII mating design. All the potential main and interacted effects were included in one full model. If the number of effects was very large, bulked segregant analysis was used to test which effects were associated with the trait. All the selected effects were further shrunk by empirical Bayes, so that significant effects could be identified. A series of Monte Carlo simulations was performed to validate the new method. Furthermore, all the significant effects were used to calculate genotypic values of all the missing F1 hybrids, and all these F1 phenotypic or genotypic values were used to predict elite parents and parental combinations. Finally, the new method was adopted to dissect the genetic foundation of oil content in 441 rapeseed parents and 284 F1 hybrids. As a result, 8 main-effect QTL and 37 interacted QTL were found and used to predict 10 elite restorer lines, 10 elite sterile lines and 10 elite parental crosses. Similar results across various methods and in previous studies, together with a high correlation coefficient (0.76) between the predicted and observed phenotypes, validated the method proposed in this study.
Affiliation(s)
- Su Hong Bu: State Key Laboratory of Crop Genetics and Germplasm Enhancement / Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing Agricultural University, Nanjing, Jiangsu, China
- Zhao Xinwang: Statistical Genomics Lab, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, Hubei, China
- Can Yi: State Key Laboratory of Crop Genetics and Germplasm Enhancement / Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing Agricultural University, Nanjing, Jiangsu, China
- Jia Wen: State Key Laboratory of Crop Genetics and Germplasm Enhancement / Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing Agricultural University, Nanjing, Jiangsu, China
- Tu Jinxing: Statistical Genomics Lab, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, Hubei, China
- Yuan Ming Zhang: State Key Laboratory of Crop Genetics and Germplasm Enhancement / Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing Agricultural University, Nanjing, Jiangsu, China; Statistical Genomics Lab, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, Hubei, China
6
Abstract
Integration and modularity refer to the patterns and processes of trait interaction and independence. Both terms have complex histories with respect to both conceptualization and quantification, resulting in a plethora of integration indices in use. We review briefly the divergent definitions, uses and measures of integration and modularity and make conceptual links to allometry. We also discuss how integration and modularity might evolve. Although integration is generally thought to be generated and maintained by correlational selection, theoretical considerations suggest the relationship is not straightforward. We caution here against uncontrolled comparisons of indices across studies. In the absence of controls for trait number, dimensionality, homology, development and function, it is difficult, or even impossible, to compare integration indices across organisms or traits. We suggest that care be invested in relating measurement to underlying theory or hypotheses, and that summative, theory-free descriptors of integration generally be avoided. The papers that follow in this Theme Issue illustrate the diversity of approaches to studying integration and modularity, highlighting strengths and pitfalls that await researchers investigating integration in plants and animals.
Affiliation(s)
- W Scott Armbruster: School of Biological Sciences, University of Portsmouth, Portsmouth PO12DY, UK; Institute of Arctic Biology, University of Alaska, Fairbanks, AK 99775, USA; Department of Biology, Norwegian University of Science and Technology, 7491 Trondheim, Norway
- Christophe Pélabon: Center for Biodiversity Dynamics, Department of Biology, Norwegian University of Science and Technology, 7491 Trondheim, Norway
- Geir H Bolstad: Center for Biodiversity Dynamics, Department of Biology, Norwegian University of Science and Technology, 7491 Trondheim, Norway
- Thomas F Hansen: Centre for Ecological and Evolutionary Synthesis, Department of Biology, University of Oslo, PO Box 1066, 0316 Oslo, Norway
7
He X, Hu Z, Zhang YM. Genome-wide mapping of QTL associated with heterosis in the RIL-based NCIII design. Chin Sci Bull 2012. [DOI: 10.1007/s11434-012-5127-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]