1
Martínez CA, Khare K, Rahman S, Báez GM. Graphical Model Selection to Infer the Partial Correlation Network of Allelic Effects in Genomic Prediction With an Application in Dairy Cattle. J Anim Breed Genet 2025. [PMID: 39836058 DOI: 10.1111/jbg.12921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Revised: 11/19/2024] [Accepted: 11/22/2024] [Indexed: 01/22/2025]
Abstract
We addressed genomic prediction accounting for partial correlation of marker effects, which entails the estimation of the partial correlation network/graph (PCN) and the precision matrix of an unobservable m-dimensional random variable. To this end, we developed a set of statistical models and methods by extending the canonical model selection problem in Gaussian concentration, and directed acyclic graph models. Our frequentist formulations combined existing methods with the EM algorithm and were termed Glasso-EM, Concord-EM and CSCS-EM, whereas our Bayesian formulations corresponded to hierarchical models termed Bayes G-Sel and Bayes DAG-Sel. We implemented our methods in a real bull fertility dataset and then carried out gene annotation of seven markers having the highest degrees in the estimated PCN. Our findings brought biological evidence supporting the usefulness of identifying genomic regions that are highly connected in the inferred PCN. Moreover, a simulation study showed that some of our methods can accurately recover the PCN (accuracy up to 0.98 using Concord-EM), estimate the precision matrix (Concord-EM yielded the best results) and predict breeding values (the best reliability was 0.85 for a trait with heritability of 0.5 using Glasso-EM).
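As a rough illustration of the quantities this abstract refers to, the sketch below converts a precision matrix into partial correlations and ranks nodes by their degree in the implied graph. The toy matrix, the zero threshold `tol`, and the function names are assumptions made for this example; it is not the authors' Glasso-EM, Concord-EM or CSCS-EM code, and it takes the precision matrix as given rather than estimating it.

```python
import math


def partial_correlations(omega):
    """Partial correlation matrix implied by a precision matrix omega.

    Uses rho_ij = -omega_ij / sqrt(omega_ii * omega_jj) for i != j.
    """
    m = len(omega)
    rho = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:
                rho[i][j] = -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
    return rho


def node_degrees(omega, tol=1e-8):
    """Number of neighbours of each node: nonzero off-diagonal entries per row."""
    m = len(omega)
    return [
        sum(1 for j in range(m) if j != i and abs(omega[i][j]) > tol)
        for i in range(m)
    ]


# Hypothetical 4x4 sparse precision matrix (toy values, not from the paper).
omega = [
    [2.0, -1.0, 0.0, -0.5],
    [-1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [-0.5, 0.0, 0.0, 1.0],
]

rho = partial_correlations(omega)
degrees = node_degrees(omega)
# Node indices ordered from most to least connected.
top = sorted(range(len(degrees)), key=lambda i: degrees[i], reverse=True)
```

In this toy case node 0 has the highest degree, which mirrors the idea of annotating the most connected markers in an estimated network.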
Affiliation(s)
- Carlos A Martínez
- Departamento de Producción Animal, Universidad Nacional de Colombia, Bogotá, Colombia
- Kshitij Khare
- Department of Statistics, University of Florida, Gainesville, Florida, USA
- Giovanni M Báez
- Departamento de Ciencias Agrícolas y Pecuarias, Universidad Francisco de Paula Santander, Cúcuta, Colombia
2
Bar H, Wells MT. On Graphical Models and Convex Geometry. Comput Stat Data Anal 2023; 187:107800. [PMID: 37396752 PMCID: PMC10310290 DOI: 10.1016/j.csda.2023.107800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
A mixture-model of beta distributions framework is introduced to identify significant correlations among P features when P is large. The method relies on theorems in convex geometry, which are used to show how to control the error rate of edge detection in graphical models. The proposed 'betaMix' method does not require any assumptions about the network structure, nor does it assume that the network is sparse. The results hold for a wide class of data-generating distributions that include light-tailed and heavy-tailed spherically symmetric distributions. The results are robust for sufficiently large sample sizes and hold for non-elliptically-symmetric distributions.
Affiliation(s)
- Haim Bar
- Department of Statistics, University of Connecticut, Room 315, Philip E. Austin Building, Storrs, 06269-4120, CT, USA
- Martin T. Wells
- Department of Statistics and Data Science, Cornell University, 1190 Comstock Hall, Ithaca, 14853, NY, USA
3
Wu T, Narisetty NN, Yang Y. Statistical inference via conditional Bayesian posteriors in high-dimensional linear regression. Electron J Stat 2023. [DOI: 10.1214/23-ejs2113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/24/2023]
Affiliation(s)
- Teng Wu
- Department of Statistics, University of Illinois, Urbana Champaign
- Yun Yang
- Department of Statistics, University of Illinois, Urbana Champaign
4
Park Y, Su Z, Chung D. Envelope-based partial partial least squares with application to cytokine-based biomarker analysis for COVID-19. Stat Med 2022; 41:4578-4592. [PMID: 36111618 PMCID: PMC9350235 DOI: 10.1002/sim.9526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2021] [Revised: 05/27/2022] [Accepted: 06/27/2022] [Indexed: 11/18/2022]
Abstract
Partial least squares (PLS) regression is a popular alternative to ordinary least squares regression because of its superior prediction performance demonstrated in many cases. In various contemporary applications, the predictors include both continuous and categorical variables. A common practice in PLS regression is to treat the categorical variable as continuous. However, studies find that this practice may lead to biased estimates and invalid inferences (Schuberth et al., 2018). Based on a connection between the envelope model and PLS, we develop an envelope-based partial PLS estimator that considers the PLS regression on the conditional distributions of the response(s) and continuous predictors on the categorical predictors. Root-n consistency and asymptotic normality are established for this estimator. Numerical study shows that this approach can achieve more efficiency gains in estimation and produce better predictions. The method is applied for the identification of cytokine-based biomarkers for COVID-19 patients, which reveals the association between the cytokine-based biomarkers and patients' clinical information including disease status at admission and demographical characteristics. The efficient estimation leads to a clear scientific interpretation of the results.
Affiliation(s)
- Yeonhee Park
- Department of Biostatistics and Medical Informatics, University of Wisconsin‐Madison, Madison, Wisconsin, USA
- Zhihua Su
- Department of Statistics, University of Florida, Gainesville, Florida, USA
- Dongjun Chung
- Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio, USA
5
Zhang J, Fan X, Li Y, Ma S. Heterogeneous graphical model for non‐negative and non‐Gaussian PM2.5 data. J R Stat Soc Ser C Appl Stat 2022. [DOI: 10.1111/rssc.12575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- Jiaqi Zhang
- Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
- Xinyan Fan
- Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
- Yang Li
- Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
- RSS and China‐Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing, China
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, USA
6
Dallakyan A, Pourahmadi M. Fused-Lasso Regularized Cholesky Factors of Large Nonstationary Covariance Matrices of Replicated Time Series. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2090367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
7
Samanta S, Khare K, Michailidis G. A generalized likelihood-based Bayesian approach for scalable joint regression and covariance selection in high dimensions. STATISTICS AND COMPUTING 2022; 32:47. [PMID: 36713060 PMCID: PMC9881595 DOI: 10.1007/s11222-022-10102-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2021] [Accepted: 04/27/2022] [Indexed: 06/05/2023]
Abstract
The paper addresses joint sparsity selection in the regression coefficient matrix and the error precision (inverse covariance) matrix for high-dimensional multivariate regression models in the Bayesian paradigm. The selected sparsity patterns are crucial to help understand the network of relationships between the predictor and response variables, as well as the conditional relationships among the latter. While Bayesian methods have the advantage of providing natural uncertainty quantification through posterior inclusion probabilities and credible intervals, current Bayesian approaches either restrict to specific sub-classes of sparsity patterns and/or are not scalable to settings with hundreds of responses and predictors. Bayesian approaches which only focus on estimating the posterior mode are scalable, but do not generate samples from the posterior distribution for uncertainty quantification. Using a bi-convex regression based generalized likelihood and spike-and-slab priors, we develop an algorithm called Joint Regression Network Selector (JRNS) for joint regression and covariance selection which (a) can accommodate general sparsity patterns, (b) provides posterior samples for uncertainty quantification, and (c) is scalable and orders of magnitude faster than the state-of-the-art Bayesian approaches providing uncertainty quantification. We demonstrate the statistical and computational efficacy of the proposed approach on synthetic data and through the analysis of selected cancer data sets. We also establish high-dimensional posterior consistency for one of the developed algorithms.
8
Lee S, Kim SC, Yu D. An efficient GPU-parallel coordinate descent algorithm for sparse precision matrix estimation via scaled lasso. Comput Stat 2022. [DOI: 10.1007/s00180-022-01224-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
9
Contraction of a quasi-Bayesian model with shrinkage priors in precision matrix estimation. J Stat Plan Inference 2022. [DOI: 10.1016/j.jspi.2022.03.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
10
Some aspects of response variable selection and estimation in multivariate linear regression. J MULTIVARIATE ANAL 2022. [DOI: 10.1016/j.jmva.2021.104821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
11
Wang Y, Sun Z, Song D, Hero A. Kronecker-structured covariance models for multiway data. STATISTICS SURVEYS 2022. [DOI: 10.1214/22-ss139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Affiliation(s)
- Yu Wang
- University of Michigan, Ann Arbor, MI 48109
- Zeyu Sun
- University of Michigan, Ann Arbor, MI 48109
12
An efficient parallel block coordinate descent algorithm for large-scale precision matrix estimation using graphics processing units. Comput Stat 2021. [DOI: 10.1007/s00180-021-01127-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/20/2022]
13
Byrd M, Nghiem LH, McGee M. Bayesian regularization of Gaussian graphical models with measurement error. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2020.107085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
14
Shojaie A. Differential Network Analysis: A Statistical Perspective. WILEY INTERDISCIPLINARY REVIEWS. COMPUTATIONAL STATISTICS 2021; 13:e1508. [PMID: 37050915 PMCID: PMC10088462 DOI: 10.1002/wics.1508] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 12/09/2019] [Accepted: 03/03/2020] [Indexed: 11/06/2022]
Abstract
Networks effectively capture interactions among components of complex systems, and have thus become a mainstay in many scientific disciplines. Growing evidence, especially from biology, suggests that networks undergo changes over time, and in response to external stimuli. In biology and medicine, these changes have been found to be predictive of complex diseases. They have also been used to gain insight into mechanisms of disease initiation and progression. Primarily motivated by biological applications, this article provides a review of recent statistical machine learning methods for inferring networks and identifying changes in their structures.
Affiliation(s)
- Ali Shojaie
- Department of Biostatistics, University of Washington, Seattle WA
15
Conditional score matching for high-dimensional partial graphical models. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2020.107066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
16
17
Fei N, Yang Y, Bai X. One Core Task of Interpretability in Machine Learning — Expansion of Structural Equation Modeling. INT J PATTERN RECOGN 2020. [DOI: 10.1142/s0218001420510015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Structural equation modeling (SEM) is a system of two kinds of equations: a linear latent structural model (SM) and a linear measurement model (MM). The latent structure model is a causal model from the latent parent node to the latent child node. Meanwhile, MM’s link is from latent variable parent node to observed variable child node. However, researchers should determine the initial causal order between variables based on experience when applying SEM. The main reason is that SEM does not fully construct causal models between observed variables (OVs) from big data. When the artificial causal order is contrary to the fact, the causal inference from SEM is doubtful, and the implicit causal information between the OVs cannot be extracted and utilized. This study first objectively identifies the causal order of variables using the DirectLiNGAM method widely accepted in recent years. Then traditional SEM is converted to expanded SEM (ESEM) consisting of SM, MM and observation model (OM). Finally, through model testing and debugging, ESEM with good fit with data is obtained.
Affiliation(s)
- Nina Fei
- School of Mathematics and Statistics, Xidian University, 266 Xinglong Section of Xifeng Road, Xi’an, Shaanxi 710126, P. R. China
- Youlong Yang
- School of Mathematics and Statistics, Xidian University, 266 Xinglong Section of Xifeng Road, Xi’an, Shaanxi 710126, P. R. China
- Xuying Bai
- School of Mathematics and Statistics, Xidian University, 266 Xinglong Section of Xifeng Road, Xi’an, Shaanxi 710126, P. R. China
18
Loss function, unbiasedness, and optimality of Gaussian graphical model selection. J Stat Plan Inference 2019. [DOI: 10.1016/j.jspi.2018.11.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
19
Khare K, Oh SY, Rahman S, Rajaratnam B. A scalable sparse Cholesky based approach for learning high-dimensional covariance matrices in ordered data. Mach Learn 2019. [DOI: 10.1007/s10994-019-05810-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
20
Choi YG, Lim J, Roy A, Park J. Fixed support positive-definite modification of covariance matrix estimators via linear shrinkage. J MULTIVARIATE ANAL 2019. [DOI: 10.1016/j.jmva.2018.12.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/27/2022]
21
Fan X, Fang K, Ma S, Wang S, Zhang Q. Assisted graphical model for gene expression data analysis. Stat Med 2019; 38:2364-2380. [PMID: 30854706 DOI: 10.1002/sim.8112] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/21/2018] [Revised: 12/16/2018] [Accepted: 01/09/2019] [Indexed: 11/12/2022]
Abstract
The analysis of gene expression data has been playing a pivotal role in recent biomedical research. For gene expression data, network analysis has been shown to be more informative and powerful than individual-gene and geneset-based analysis. Despite promising successes, with the high dimensionality of gene expression data and often low sample sizes, network construction with gene expression data is still often challenged. In recent studies, a prominent trend is to conduct multidimensional profiling, under which data are collected on gene expressions as well as their regulators (copy number variations, methylation, microRNAs, SNPs, etc). With the regulation relationship, regulators contain information on gene expressions and can potentially assist in estimating their characteristics. In this study, we develop an assisted graphical model (AGM) approach, which can effectively use information in regulators to improve the estimation of gene expression graphical structure. The proposed approach has an intuitive formulation and can adaptively accommodate different regulator scenarios. Its consistency properties are rigorously established. Extensive simulations and the analysis of a breast cancer gene expression data set demonstrate the practical effectiveness of the AGM.
Affiliation(s)
- Xinyan Fan
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Kuangnan Fang
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China; Fujian Key Laboratory of Statistical Sciences, Xiamen University, Xiamen, China
- Shuangge Ma
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China; Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
- Shuaichao Wang
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Qingzhao Zhang
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China; Fujian Key Laboratory of Statistical Sciences, Xiamen University, Xiamen, China; The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
22
23
Choi YG, Lim J, Choi S. High-dimensional Markowitz portfolio optimization problem: empirical comparison of covariance matrix estimators. J STAT COMPUT SIM 2019. [DOI: 10.1080/00949655.2019.1577855] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/16/2023]
Affiliation(s)
- Young-Geun Choi
- Department of Statistics, Seoul National University, Seoul, Korea
- Johan Lim
- Department of Statistics, Seoul National University, Seoul, Korea
- Sujung Choi
- School of Business, Soongsil University, Seoul, Korea
24
Ledoit O, Wolf M. Optimal estimation of a large-dimensional covariance matrix under Stein’s loss. BERNOULLI 2018. [DOI: 10.3150/17-bej979] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Indexed: 11/19/2022]
25
Yu D, Lee SH, Lim J, Xiao G, Craddock RC, Biswal BB. Fused Lasso Regression for Identifying Differential Correlations in Brain Connectome Graphs. Stat Anal Data Min 2018; 11:203-226. [PMID: 34386148 DOI: 10.1002/sam.11382] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/11/2022]
Abstract
In this paper, we propose a procedure to find differential edges between two graphs from high-dimensional data. We estimate two matrices of partial correlations and their differences by solving a penalized regression problem. We assume sparsity only on differences between two graphs, not graphs themselves. Thus, we impose an ℓ2 penalty on partial correlations and an ℓ1 penalty on their differences in the penalized regression problem. We apply the proposed procedure to finding differential functional connectivity between healthy individuals and Alzheimer's disease patients.
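One way to write down a penalized problem of the kind this abstract describes is sketched here. The loss $L$, the tuning parameters $\lambda_1, \lambda_2 \ge 0$, the superscripts $(1)$ and $(2)$ for the two groups, and the choice of a squared (ridge-type) ℓ2 penalty are all assumptions made for illustration; the paper's exact objective may differ.

```latex
\min_{\rho^{(1)},\,\rho^{(2)}}\;
L\bigl(\rho^{(1)}, \rho^{(2)}\bigr)
+ \lambda_1 \sum_{i<j} \Bigl[ \bigl(\rho^{(1)}_{ij}\bigr)^2 + \bigl(\rho^{(2)}_{ij}\bigr)^2 \Bigr]
+ \lambda_2 \sum_{i<j} \bigl| \rho^{(1)}_{ij} - \rho^{(2)}_{ij} \bigr|
```

In a form like this, the ℓ1 term can set many differences exactly to zero, so the remaining nonzero differences mark candidate differential edges, while the ℓ2 term shrinks the partial correlations without forcing either graph to be sparse.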
Affiliation(s)
- Donghyeon Yu
- Department of Statistics, Inha University, Incheon, South Korea
- Sang Han Lee
- Center for Biomedical Imaging and Neuromodulation, Nathan Kline Institute for Psychiatric Research, Orangeburg, NY 10962, USA
- Johan Lim
- Department of Statistics, Seoul National University, Seoul, South Korea
- Guanghua Xiao
- University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- R Cameron Craddock
- Department of Diagnostic Medicine, Dell Medical School, The University of Texas at Austin, TX 78712, USA
- Bharat B Biswal
- Department of Biomedical Engineering, New Jersey Institute of Technology, Newark, NJ 07102, USA
26
Gan L, Narisetty NN, Liang F. Bayesian Regularization for Graphical Models With Unequal Shrinkage. J Am Stat Assoc 2018. [DOI: 10.1080/01621459.2018.1482755] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 10/14/2022]
Affiliation(s)
- Lingrui Gan
- Department of Statistics, University of Illinois at Urbana-Champaign, Urbana, IL
- Naveen N. Narisetty
- Department of Statistics, University of Illinois at Urbana-Champaign, Urbana, IL
- Feng Liang
- Department of Statistics, University of Illinois at Urbana-Champaign, Urbana, IL
27
Affiliation(s)
- Jacob Bien
- Data Sciences and Operations, Marshall School of Business, University of Southern California, CA
28
Khare K, Rajaratnam B, Saha A. Bayesian inference for Gaussian graphical models beyond decomposable graphs. J R Stat Soc Series B Stat Methodol 2018. [DOI: 10.1111/rssb.12276] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/26/2022]
29
Katayama S, Fujisawa H, Drton M. Robust and sparse Gaussian graphical modelling under cell-wise contamination. Stat (Int Stat Inst) 2018. [DOI: 10.1002/sta4.181] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 01/25/2023]
Affiliation(s)
- Shota Katayama
- Department of Industrial Engineering and Economics; Tokyo Institute of Technology; 2-12-1 Ookayama Meguro-ku 152-8552 Tokyo Japan
- Hironori Fujisawa
- The Institute of Statistical Mathematics; 10-3 Midori-cho, Tachikawa; 190-8562 Tokyo Japan
- Graduate School of Medicine; Nagoya University; 65 Tsurumai-cho, Showa-ku Nagoya 466-8550 Japan
- Mathias Drton
- Department of Statistics; University of Washington; Seattle 98195-4322 WA USA
30
Martínez CA, Khare K, Rahman S, Elzo MA. Modeling correlated marker effects in genome-wide prediction via Gaussian concentration graph models. J Theor Biol 2017; 437:67-78. [PMID: 29055677 DOI: 10.1016/j.jtbi.2017.10.017] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/10/2016] [Revised: 09/25/2017] [Accepted: 10/15/2017] [Indexed: 10/18/2022]
Abstract
In genome-wide prediction, independence of marker allele substitution effects is typically assumed; however, since early stages in the evolution of this technology it has been known that nature points to correlated effects. In statistics, graphical models have been identified as a useful and powerful tool for covariance estimation in high dimensional problems and it is an area that has recently experienced a great expansion. In particular, Gaussian concentration graph models (GCGM) have been widely studied. These are models in which the distribution of a set of random variables, the marker effects in this case, is assumed to be Markov with respect to an undirected graph G. In this paper, Bayesian (Bayes G and Bayes G-D) and frequentist (GML-BLUP) methods adapting the theory of GCGM to genome-wide prediction were developed. Different approaches to define the graph G based on domain-specific knowledge were proposed, and two propositions and a corollary establishing conditions to find decomposable graphs were proven. These methods were implemented in small simulated and real datasets. In our simulations, scenarios where correlations among allelic substitution effects were expected to arise due to various causes were considered, and graphs were defined on the basis of physical marker positions. Results showed improvements in correlation between phenotypes and predicted additive genetic values and accuracies of predicted additive genetic values when accounting for partially correlated allele substitution effects. Extensions to the multiallelic loci case were described and some possible refinements incorporating more flexible priors in the Bayesian setting were discussed. Our models are promising because they allow incorporation of biological information in the prediction process, and because they are more flexible and general than other models accounting for correlated marker effects that have been proposed previously.
Affiliation(s)
- Kshitij Khare
- Department of Statistics, University of Florida, Gainesville, FL, USA
- Syed Rahman
- Department of Statistics, University of Florida, Gainesville, FL, USA
- Mauricio A Elzo
- Department of Animal Sciences, University of Florida, Gainesville, FL, USA
31
Yen TJ, Lee ZR, Chen YH, Yen YM, Hwang JS. Estimating links of a network from time to event data. Ann Appl Stat 2017. [DOI: 10.1214/17-aoas1032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
32
Martínez CA, Khare K, Rahman S, Elzo MA. Gaussian covariance graph models accounting for correlated marker effects in genome-wide prediction. J Anim Breed Genet 2017; 134:412-421. [PMID: 28804930 DOI: 10.1111/jbg.12286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2016] [Accepted: 06/30/2017] [Indexed: 11/26/2022]
Abstract
Several statistical models used in genome-wide prediction assume uncorrelated marker allele substitution effects, but it is known that these effects may be correlated. In statistics, graphical models have been identified as a useful tool for covariance estimation in high-dimensional problems and it is an area that has recently experienced a great expansion. In Gaussian covariance graph models (GCovGM), the joint distribution of a set of random variables is assumed to be Gaussian and the pattern of zeros of the covariance matrix is encoded in terms of an undirected graph G. In this study, methods adapting the theory of GCovGM to genome-wide prediction were developed (Bayes GCov, Bayes GCov-KR and Bayes GCov-H). In simulated data sets, improvements in correlation between phenotypes and predicted breeding values and accuracies of predicted breeding values were found. Our models account for correlation of marker effects and can accommodate general structures, as opposed to models proposed in previous studies, which consider spatial correlation only. In addition, they allow incorporation of biological information in the prediction process through its use when constructing graph G, and their extension to the multi-allelic loci case is straightforward.
Affiliation(s)
- C A Martínez
- Department of Animal Sciences, University of Florida, Gainesville, FL, USA
- K Khare
- Department of Statistics, University of Florida, Gainesville, FL, USA
- S Rahman
- Department of Statistics, University of Florida, Gainesville, FL, USA
- M A Elzo
- Department of Animal Sciences, University of Florida, Gainesville, FL, USA
33
Dalal O, Rajaratnam B. Sparse Gaussian graphical model estimation via alternating minimization. Biometrika 2017. [DOI: 10.1093/biomet/asx003] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022]
34
Su Z, Zhu G, Chen X, Yang Y. Sparse envelope model: efficient estimation and response variable selection in multivariate linear regression. Biometrika 2016. [DOI: 10.1093/biomet/asw036] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Indexed: 11/13/2022]
35
Yu D. A study on bias effect of LASSO regression for model selection criteria. KOREAN JOURNAL OF APPLIED STATISTICS 2016. [DOI: 10.5351/kjas.2016.29.4.643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
36
Hero AO, Rajaratnam B. Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining. PROCEEDINGS OF THE IEEE. INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS 2016; 104:93-110. [PMID: 27087700 PMCID: PMC4827453 DOI: 10.1109/jproc.2015.2494178] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity however has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exascale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
Affiliation(s)
- Alfred O Hero
- University of Michigan, Ann Arbor, MI 48109-2122, USA
37
Lin L, Drton M, Shojaie A. Estimation of High-Dimensional Graphical Models Using Regularized Score Matching. Electron J Stat 2016; 10:806-854. [PMID: 28638498 PMCID: PMC5476334 DOI: 10.1214/16-ejs1126] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Indexed: 02/06/2023]
Abstract
Graphical models are widely used to model stochastic dependences among large collections of variables. We introduce a new method of estimating undirected conditional independence graphs based on the score matching loss, introduced by Hyvärinen (2005), and subsequently extended in Hyvärinen (2007). The regularized score matching method we propose applies to settings with continuous observations and allows for computationally efficient treatment of possibly non-Gaussian exponential family models. In the well-explored Gaussian setting, regularized score matching avoids issues of asymmetry that arise when applying the technique of neighborhood selection, and compared to existing methods that directly yield symmetric estimates, the score matching approach has the advantage that the considered loss is quadratic and gives piecewise linear solution paths under ℓ1 regularization. Under suitable irrepresentability conditions, we show that ℓ1-regularized score matching is consistent for graph estimation in sparse high-dimensional settings. Through numerical experiments and an application to RNAseq data, we confirm that regularized score matching achieves state-of-the-art performance in the Gaussian case and provides a valuable tool for computationally efficient estimation in non-Gaussian graphical models.
Affiliation(s)
- Lina Lin
- Department of Statistics, University of Washington, Seattle, WA 98195, U.S.A
- Mathias Drton
- Department of Statistics, University of Washington, Seattle, WA 98195, U.S.A
- Ali Shojaie
- Department of Biostatistics, University of Washington, Seattle, WA 98195, U.S.A
38
Xiang R, Khare K, Ghosh M. High dimensional posterior convergence rates for decomposable graphical models. Electron J Stat 2015. [DOI: 10.1214/15-ejs1084] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/19/2022]