1. TARO: tree-aggregated factor regression for microbiome data integration. Bioinformatics 2024:btae321. PMID: 38788190. DOI: 10.1093/bioinformatics/btae321.
Abstract
MOTIVATION Although the human microbiome plays a key role in health and disease, the biological mechanisms underlying the interaction between the microbiome and its host are incompletely understood. Integration with other molecular profiling data offers an opportunity to characterize the role of the microbiome and elucidate therapeutic targets. However, this remains challenging due to the high dimensionality, compositionality, and rare features found in microbiome profiling data. These challenges necessitate the use of methods that can achieve structured sparsity in learning cross-platform association patterns. RESULTS We propose Tree-Aggregated factor RegressiOn (TARO) for the integration of microbiome and metabolomic data. We leverage information on the taxonomic tree structure to flexibly aggregate rare features. We demonstrate through simulation studies that TARO accurately recovers a low-rank coefficient matrix and identifies relevant features. We applied TARO to microbiome and metabolomic profiles gathered from subjects being screened for colorectal cancer to understand how gut microorganisms shape intestinal metabolite abundances. AVAILABILITY AND IMPLEMENTATION The R package TARO implementing the proposed methods is available online at https://github.com/amishra-stats/taro-package.
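As a rough illustration of the tree-aggregation idea (not TARO's actual algorithm), rare taxa can be rolled up into their parent node on the taxonomic tree so that sparse leaf-level counts become better-populated aggregate features. The tree, taxon names, and prevalence threshold below are invented for the sketch.

```python
# Sketch only: merge taxa observed in fewer than `threshold` samples into
# their parent node on the taxonomic tree. Names and data are illustrative.

def aggregate_rare_taxa(counts, parent, threshold):
    """counts: dict taxon -> per-sample count vector; parent: taxon -> parent name."""
    aggregated = {}
    for taxon, vec in counts.items():
        prevalence = sum(1 for c in vec if c > 0)   # samples where taxon is present
        key = taxon if prevalence >= threshold else parent[taxon]
        if key in aggregated:
            aggregated[key] = [a + b for a, b in zip(aggregated[key], vec)]
        else:
            aggregated[key] = list(vec)
    return aggregated

counts = {
    "g__A;s__x": [5, 0, 3, 2],   # prevalent species: kept as-is
    "g__A;s__y": [0, 0, 1, 0],   # rare: rolled up into genus g__A
    "g__A;s__z": [1, 0, 0, 0],   # rare: rolled up into genus g__A
}
parent = {k: "g__A" for k in counts}
result = aggregate_rare_taxa(counts, parent, threshold=2)
print(result)  # {'g__A;s__x': [5, 0, 3, 2], 'g__A': [1, 0, 1, 0]}
```

TARO additionally learns where to cut the tree as part of the regression fit; this sketch only shows the fixed-threshold version of the aggregation step.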
2. A Dynamic Spatial Factor Model to Describe the Opioid Syndemic in Ohio. Epidemiology 2023; 34:487-494. PMID: 37155617. PMCID: PMC10591492. DOI: 10.1097/ede.0000000000001617.
Abstract
BACKGROUND The opioid epidemic has been ongoing for over 20 years in the United States. As opioid misuse has shifted increasingly toward injection of illicitly produced opioids, it has been associated with HIV and hepatitis C transmission. These epidemics interact to form the opioid syndemic. METHODS We obtain annual county-level counts of opioid overdose deaths, treatment admissions for opioid misuse, and newly diagnosed cases of acute and chronic hepatitis C and newly diagnosed HIV from 2014 to 2019. Aligned with the conceptual framework of syndemics, we develop a dynamic spatial factor model to describe the opioid syndemic for counties in Ohio and estimate the complex synergies between each of the epidemics. RESULTS We estimate three latent factors characterizing variation of the syndemic across space and time. The first factor reflects overall burden and is greatest in southern Ohio. The second factor describes harms and is greatest in urban counties. The third factor highlights counties with higher than expected hepatitis C rates and lower than expected HIV rates, which suggests elevated localized risk for future HIV outbreaks. CONCLUSIONS Through the estimation of dynamic spatial factors, we are able to estimate the complex dependencies and characterize the synergy across outcomes that underlie the syndemic. The latent factors summarize shared variation across multiple spatial time series and provide new insights into the relationships between the epidemics within the syndemic. Our framework provides a coherent approach for synthesizing complex interactions and estimating underlying sources of variation that can be applied to other syndemics.
3. A Bayesian hierarchical sparse factor model for estimating simultaneous covariance matrices for gestational outcomes in consecutive pregnancies. Stat Med 2023. PMID: 37276864. DOI: 10.1002/sim.9809.
Abstract
Covariance estimation for multiple groups is a key feature for drawing inference from a heterogeneous population. One should seek to share information about common features in the dependence structures across the various groups. In this paper, we introduce a novel approach for estimating the covariance matrices for multiple groups using a hierarchical latent factor model that shrinks the factor loadings across groups toward a global value. Using a sparse spike-and-slab model on these loading coefficients allows for a sparse formulation of our model. Parameter estimation is accomplished through a Markov chain Monte Carlo scheme, and a model selection approach is used to select the number of factors. We validate our model through extensive simulation studies. Finally, we apply our methodology to the NICHD Consecutive Pregnancies Study to estimate the correlations between birth weights and gestational ages of three consecutive births within four different subgroups (underweight, normal, overweight, and obese) of women.
4. Summary Intervals for Model-Based Classification Accuracy and Consistency Indices. Educational and Psychological Measurement 2023; 83:240-261. PMID: 36866072. PMCID: PMC9972125. DOI: 10.1177/00131644221092347.
Abstract
When scores are used to make decisions about respondents, it is of interest to estimate classification accuracy (CA), the probability of making a correct decision, and classification consistency (CC), the probability of making the same decision across two parallel administrations of the measure. Model-based estimates of CA and CC computed from the linear factor model have been recently proposed, but parameter uncertainty of the CA and CC indices has not been investigated. This article demonstrates how to estimate percentile bootstrap confidence intervals and Bayesian credible intervals for CA and CC indices, which have the added benefit of incorporating the sampling variability of the parameters of the linear factor model into the summary intervals. Results from a small simulation study suggest that percentile bootstrap confidence intervals have appropriate confidence interval coverage, although they display a small negative bias. Bayesian credible intervals with diffuse priors, however, have poor interval coverage, but their coverage improves once empirical, weakly informative priors are used. The procedures are illustrated by estimating CA and CC indices from a measure used to identify individuals low on mindfulness for a hypothetical intervention, and R code is provided to facilitate the implementation of the procedures.
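The percentile bootstrap resampling scheme the article builds on can be sketched as follows. For simplicity the statistic here is the mean rather than a CA/CC index computed from a fitted factor model; the data, replicate count, and seed are illustrative.

```python
# Minimal percentile bootstrap CI sketch: resample the data with replacement,
# recompute the statistic, and take empirical quantiles of the replicates.
import random

def percentile_bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05, seed=1):
    rng = random.Random(seed)
    n = len(data)
    boots = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        boots.append(statistic(resample))
    boots.sort()
    lo = boots[int((alpha / 2) * n_boot)]            # 2.5th percentile
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]    # 97.5th percentile
    return lo, hi

scores = [3.1, 2.4, 4.0, 3.6, 2.9, 3.3, 3.8, 2.7, 3.0, 3.5]
mean = lambda xs: sum(xs) / len(xs)
lo, hi = percentile_bootstrap_ci(scores, mean)
print(round(lo, 2), round(hi, 2))  # an interval around the sample mean
```

In the article's setting, `statistic` would refit the linear factor model to each resample and recompute the CA or CC index, so the interval reflects parameter uncertainty in the model itself.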
5. Heywood Cases in Unidimensional Factor Models and Item Response Models for Binary Data. Applied Psychological Measurement 2023; 47:141-154. PMID: 36875295. PMCID: PMC9979198. DOI: 10.1177/01466216231151701.
Abstract
Heywood cases are known from the linear factor analysis literature as variables with communalities larger than 1.00; in present-day factor models, the problem also shows up as negative residual variances. For binary data, factor models for ordinal data can be applied with either a delta parameterization or a theta parameterization. The former is more common than the latter and can yield Heywood cases when limited-information estimation is used. The same problem shows up as nonconvergence in theta-parameterized factor models and as extremely large discriminations in item response theory (IRT) models. In this study, we explain why the same problem appears in different forms depending on the method of analysis. We first discuss this issue using equations and then illustrate our conclusions using a small simulation study, where all three methods, delta- and theta-parameterized ordinal factor models (with estimation based on polychoric correlations and thresholds) and an IRT model (with full-information estimation), are used to analyze the same datasets. The results generalize across the WLS, WLSMV, and ULS estimators for the factor models for ordinal data. Finally, we analyze real data with the same three approaches. The results of the simulation study and the analysis of real data confirm the theoretical conclusions.
6. James-Stein for the leading eigenvector. Proc Natl Acad Sci U S A 2023; 120:e2207046120. PMID: 36603029. PMCID: PMC9926287. DOI: 10.1073/pnas.2207046120.
Abstract
Recent research identifies and corrects bias, such as excess dispersion, in the leading sample eigenvector of a factor-based covariance matrix estimated from a high-dimension low sample size (HL) data set. We show that eigenvector bias can have a substantial impact on variance-minimizing optimization in the HL regime, while bias in estimated eigenvalues may have little effect. We describe a data-driven eigenvector shrinkage estimator in the HL regime called "James-Stein for eigenvectors" (JSE) and its close relationship with the James-Stein (JS) estimator for a collection of averages. We show, both theoretically and with numerical experiments, that, for certain variance-minimizing problems of practical importance, efforts to correct eigenvalues have little value in comparison to the JSE correction of the leading eigenvector. When certain extra information is present, JSE is a consistent estimator of the leading eigenvector.
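The JSE estimator is presented as a close relative of the classical James-Stein estimator for a collection of averages, which is easy to sketch. Below is the positive-part James-Stein rule shrinking noisy group means toward their grand mean under a known noise variance; the numbers are illustrative and the eigenvector version in the paper is more involved.

```python
# Classical positive-part James-Stein shrinkage of k >= 4 noisy means toward
# their grand mean (known noise variance sigma2). Illustrative data only.

def james_stein_means(x, sigma2):
    k = len(x)
    grand = sum(x) / k
    s = sum((xi - grand) ** 2 for xi in x)          # dispersion around grand mean
    shrink = max(0.0, 1.0 - (k - 3) * sigma2 / s)   # positive-part shrink factor
    return [grand + shrink * (xi - grand) for xi in x]

x = [1.0, 2.0, 3.0, 10.0]            # noisy averages, one apparent outlier
shrunk = james_stein_means(x, sigma2=4.0)
print([round(v, 2) for v in shrunk])  # [1.24, 2.16, 3.08, 9.52]
```

The rule pulls every mean toward the grand mean by a data-driven amount, exactly the kind of "excess dispersion" correction that JSE applies to the entries of the leading sample eigenvector.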
7. Bayesian Factor-adjusted Sparse Regression. Journal of Econometrics 2022; 230:3-19. PMID: 35754940. PMCID: PMC9223477. DOI: 10.1016/j.jeconom.2020.06.012.
Abstract
Many sparse regression methods are based on the assumption that covariates are weakly correlated, which unfortunately does not hold in many economic and financial datasets. To address this challenge, we model the strongly-correlated covariates by a factor structure: strong correlations among covariates are explained by common factors and the remaining variations are interpreted as idiosyncratic components. We then propose a factor-adjusted sparse regression model with both common factors and idiosyncratic components as decorrelated covariates and develop a semi-Bayesian method. Parameter estimation rate-optimality and model selection consistency are established by non-asymptotic analyses. We show on simulated data that the semi-Bayesian method outperforms its Lasso analogue, manifests insensitivity to overestimates of the number of common factors, pays a negligible price when covariates are not correlated, scales up well with increasing sample size, dimensionality and sparsity, and converges fast to the equilibrium of the posterior distribution. Numerical results on a real dataset of U.S. bond risk premia and macroeconomic indicators also lend strong support to the proposed method.
8. Using sufficient direction factor model to analyze latent activities associated with breast cancer survival. Biometrics 2020; 76:1340-1350. PMID: 31860141. PMCID: PMC7305041. DOI: 10.1111/biom.13208.
Abstract
High-dimensional gene expression data often exhibit intricate correlation patterns as the result of coordinated genetic regulation. In practice, however, it is difficult to directly measure these coordinated underlying activities. Analysis of breast cancer survival data with gene expressions motivates us to use a two-stage latent factor approach to estimate these unobserved coordinated biological processes. Compared to existing approaches, our proposed procedure has several unique characteristics. In the first stage, an important distinction is that our procedure incorporates prior biological knowledge about gene-pathway membership into the analysis and explicitly models the effects of genetic pathways on the latent factors. Second, to characterize the molecular heterogeneity of breast cancer, our approach provides estimates specific to each cancer subtype. Finally, our proposed framework incorporates a sparsity condition, reflecting the fact that genetic networks are often sparse. In the second stage, we investigate the relationship between latent factor activity levels and survival time with censoring using a general dimension reduction model in the survival analysis context. Combining the factor model and sufficient direction model provides an efficient way of analyzing high-dimensional data and reveals some interesting relations in the breast cancer gene expression data.
9. A multivariate spatio-temporal model of the opioid epidemic in Ohio: A factor model approach. Health Services and Outcomes Research Methodology 2020; 21:42-53. PMID: 34305443. DOI: 10.1007/s10742-020-00227-3.
Abstract
Opioid misuse is a significant public health issue and a national epidemic with a high prevalence of associated morbidity and mortality. The epidemic is particularly severe in Ohio which has some of the highest overdose rates in the country. It is important to understand spatial and temporal trends of the opioid epidemic to learn more about areas that are most affected and to inform potential community interventions and resource allocation. We propose a multivariate spatio-temporal model to leverage existing surveillance measures, opioid-associated deaths and treatment admissions, to learn about the underlying epidemic for counties in Ohio. We do this using a temporally varying spatial factor that synthesizes information from both counts to estimate common underlying risk which we interpret as the burden of the epidemic. We demonstrate the use of this model with county-level data from 2007-2018 in Ohio. Through our model estimates, we identify counties with above and below average burden and examine how those regions have shifted over time given overall statewide trends. Specifically, we highlight the sustained above average burden of the opioid epidemic on southern Ohio throughout the 12 years examined.
10. A Projection-based Conditional Dependence Measure with Applications to High-dimensional Undirected Graphical Models. Journal of Econometrics 2020; 218:119-139. PMID: 33208987. PMCID: PMC7668417. DOI: 10.1016/j.jeconom.2019.12.016.
Abstract
Measuring conditional dependence is an important topic in econometrics with broad applications including graphical models. Under a factor model setting, a new conditional dependence measure based on projection is proposed. The corresponding conditional independence test is developed with the asymptotic null distribution unveiled where the number of factors could be high-dimensional. It is also shown that the new test has control over the asymptotic type I error and can be calculated efficiently. A generic method for building dependency graphs without Gaussian assumption using the new test is elaborated. We show the superiority of the new method, implemented in the R package pgraph, through simulation and real data studies.
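The projection idea behind such measures can be sketched in its simplest linear form: project X and Y onto the conditioning variable Z (i.e., regress each on Z), then measure the association of the residuals. This is an ordinary partial correlation, not the paper's more general measure (which works in a factor-model setting without Gaussian assumptions); the data below are synthetic.

```python
# Sketch: partial correlation of x and y given a single z, via projection
# (residualization). Illustrative linear special case with synthetic data.

def residualize(y, z):
    """Residuals of a simple least-squares fit of y on z (with intercept)."""
    n = len(y)
    zbar, ybar = sum(z) / n, sum(y) / n
    num = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
    den = sum((zi - zbar) ** 2 for zi in z)
    beta = num / den
    alpha = ybar - beta * zbar
    return [yi - (alpha + beta * zi) for yi, zi in zip(y, z)]

def pearson(a, b):
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    cov = sum((u - abar) * (v - bbar) for u, v in zip(a, b))
    va = sum((u - abar) ** 2 for u in a)
    vb = sum((v - bbar) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

z = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x = [2 * zi + e for zi, e in zip(z, [0.1, -0.2, 0.0, 0.2, -0.1, 0.0])]   # driven by z
y = [-zi + e for zi, e in zip(z, [0.0, 0.1, -0.1, 0.0, 0.1, -0.1])]      # driven by z
print(round(pearson(x, y), 3))   # strongly negative: shared dependence on z
print(round(pearson(residualize(x, z), residualize(y, z)), 3))  # much weaker after projection
```

The spurious marginal correlation disappears (up to noise) once the common driver z is projected out, which is the intuition the paper generalizes to high-dimensional factor-driven conditioning sets.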
11. All Happy Emotions Are Alike but Every Unhappy Emotion Is Unhappy in Its Own Way: A Network Perspective to Academic Emotions. Front Psychol 2020; 11:742. PMID: 32425855. PMCID: PMC7203500. DOI: 10.3389/fpsyg.2020.00742.
Abstract
Quantitative research into the nature of academic emotions has thus far been dominated by factor analyses of questionnaire data. Recently, psychometric network analysis has arisen as an alternative method of conceptualizing the composition of psychological phenomena such as emotions: while factor models view emotions as underlying causes of affects, cognitions and behavior, in network models psychological phenomena are viewed as arising from the interactions of their component parts. We argue that the network perspective is of interest to studies of academic emotions due to its compatibility with the theoretical assumptions of the control value theory of academic emotions. In this contribution we assess the structure of a Finnish questionnaire of academic emotions using both network analysis and exploratory factor analysis on cross-sectional data obtained during a single course. The global correlational structure of the network, investigated using the spinglass community detection analysis, differed from the results of the factor analysis mainly in that positive emotions were grouped in one community but loaded on different factors. Local associations between pairs of variables in the network model may arise due to different reasons, such as variable A causing variation in variable B or vice versa, or due to a latent variable affecting both. We view the relationship between feelings of self-efficacy and the other emotions as causal hypotheses, and argue that strengthening the students' self-efficacy may have a beneficial effect on the rest of the emotions they experienced on the course. Other local associations in the network model are argued to arise due to unmodeled latent variables. Future psychometric studies may benefit from combining network models and factor models in researching the structure of academic emotions.
12. De-Biased Graphical Lasso for High-Frequency Data. Entropy 2020; 22:e22040456. PMID: 33286230. PMCID: PMC7516938. DOI: 10.3390/e22040456.
Abstract
This paper develops a new statistical inference theory for the precision matrix of high-frequency data in a high-dimensional setting. The focus is not only on point estimation but also on interval estimation and hypothesis testing for entries of the precision matrix. To accomplish this purpose, we establish an abstract asymptotic theory for the weighted graphical Lasso and its de-biased version without specifying the form of the initial covariance estimator. We also extend the scope of the theory to the case that a known factor structure is present in the data. The developed theory is applied to the concrete situation where we can use the realized covariance matrix as the initial covariance estimator, and we obtain a feasible asymptotic distribution theory to construct (simultaneous) confidence intervals and (multiple) testing procedures for entries of the precision matrix.
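A core ingredient here is the de-biasing (desparsifying) step applied to an initial precision-matrix estimate. For a symmetric initial estimate Theta and covariance estimate Sigma, the desparsified estimator takes the form 2·Theta − Theta·Sigma·Theta; the sketch below shows this algebraic step on 2×2 toy matrices (it is the generic de-biasing formula, not this paper's full high-frequency construction).

```python
# Sketch of the de-biasing step for a precision-matrix estimate:
# debiased = 2*Theta - Theta @ Sigma @ Theta (symmetric Theta).
# 2x2 toy matrices, pure Python; data are illustrative.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def debias(theta, sigma):
    tst = matmul(matmul(theta, sigma), theta)
    p = len(theta)
    return [[2 * theta[i][j] - tst[i][j] for j in range(p)] for i in range(p)]

sigma = [[1.0, 0.5], [0.5, 1.0]]            # covariance estimate
theta = [[4 / 3, -2 / 3], [-2 / 3, 4 / 3]]  # exact inverse of sigma
print(debias(theta, sigma))  # unchanged (up to rounding) when theta inverts sigma
```

The sanity check is that an exact inverse is a fixed point of the de-biasing map; in practice the initial graphical-Lasso estimate is biased by the penalty, and the correction restores an asymptotically normal entrywise estimator suitable for confidence intervals.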
13. A comprehensive psychometric analysis of autism-spectrum quotient factor models using two large samples: Model recommendations and the influence of divergent traits on total-scale scores. Autism Res 2019; 13:45-60. PMID: 31464106. DOI: 10.1002/aur.2198.
Abstract
The Autism-Spectrum Quotient (AQ) is a psychometric scale that is commonly used to assess autistic-like traits and behaviors expressed by neurotypical individuals. A potential strength of the AQ is that it provides subscale scores that are specific to certain dimensions associated with autism such as social difficulty and restricted interests. However, multiple psychometric evaluations of the AQ have led to substantial disagreement as to how many factors exist in the scale, and how these factors are defined. These challenges have been exacerbated by limitations in study designs, such as insufficient sample sizes as well as a reliance on Pearson, rather than polychoric, correlations. In addition, several proposed models of the AQ suggest that some factors are uncorrelated, or negatively correlated, which has ramifications for whether total-scale scores are meaningfully interpretable-an issue not raised by previous work. The aims of the current study were to provide: (a) guidance as to which models of the AQ are viable for research purposes, and (b) evidence as to whether total-scale scores are adequately interpretable for research purposes. We conducted a comprehensive series of confirmatory factor analyses on 11 competing AQ models using two large samples drawn from an undergraduate population (n = 1,702) and the general population (n = 1,280). Psychometric evidence largely supported using the three-factor model described by Russell-Smith et al. [Personality and Individual Differences 51(2), 128-132 (2011)], but did not support the use of total-scale scores. We recommend that researchers consider using AQ subscale scores instead of total-scale scores. Autism Res 2020, 13: 45-60. © 2019 International Society for Autism Research, Wiley Periodicals, Inc. LAY SUMMARY: We examined 11 different ways of scoring subscales in the popular Autism-Spectrum Quotient (AQ) questionnaire in two large samples of participants (i.e., general population and undergraduate students). 
We found that a three-subscale model that used "Social Skill," "Patterns/Details," and "Communication/Mindreading" subscales was the best way to examine specific types of autistic traits in the AQ. We also found some weak associations between the three subscales-for example, being high on the "Patterns/Details" subscale was not predictive of scores on the other subscales. This means that meaningful interpretation of overall scores on the AQ is limited.
14. Collaborative Problem Solving: Processing Actions, Time, and Performance. Front Psychol 2019; 10:1280. PMID: 31231281. PMCID: PMC6566913. DOI: 10.3389/fpsyg.2019.01280.
Abstract
This study is based on one collaborative problem solving task from an international assessment: the Xandar task. It was developed and delivered by the Organization for Economic Co-operation and Development Program for International Student Assessment (OECD PISA) 2015. We have investigated the relationship of problem solving performance with invested time and number of actions in collaborative episodes for the four parts of the Xandar task. The parts require the respondent to collaboratively plan a process for problem solving, implement the process, reach a solution, and evaluate the solution (For a full description, see the Materials and Methods section, "Parts of the Xandar Task.") Examples of an action include posting to a chat log, accessing a shared resource, or conducting a search on a map tool. Actions taken in each part of the task were identified by PISA and recorded in the data set numerically. A confirmatory factor analysis (CFA) model looks at two types of relationship: at the level of latent variables (the factors) and at extra dependencies, which here are direct effects and correlated residuals (independent of the factors). The model, which is well-fitting, has three latent variables: actions (A), times (T), and level of performance (P). Evidence for the uni-dimensionality of performance level is also found in a separate analysis of the binary items. On the whole for the entire task, participants with more activities are less successful and faster, based on the United States data set employed in the analysis. By contrast, successful participants take more time. By task part, the model also investigates relationships between activities, time, and performance level within the parts. This was done because one can expect dependencies within parts of such a complex task. Results indicate some general and some specific relationships within the parts, see the full manuscript for more detail. 
We conclude with a discussion of what the investigated relationships may reveal. We also describe why such investigations may be important to consider when preparing students for improved skills in collaborative problem solving, considered a key aspect of successful 21st century skills in the workplace and in everyday life in many countries.
15.
Abstract
We consider forecasting a single time series when there is a large number of predictors and a possible nonlinear effect. The dimensionality was first reduced via a high-dimensional (approximate) factor model implemented by the principal component analysis. Using the extracted factors, we develop a novel forecasting method called the sufficient forecasting, which provides a set of sufficient predictive indices, inferred from high-dimensional predictors, to deliver additional predictive power. The projected principal component analysis will be employed to enhance the accuracy of inferred factors when a semi-parametric (approximate) factor model is assumed. Our method is also applicable to cross-sectional sufficient regression using extracted factors. The connection between the sufficient forecasting and the deep learning architecture is explicitly stated. The sufficient forecasting correctly estimates projection indices of the underlying factors even in the presence of a nonparametric forecasting function. The proposed method extends the sufficient dimension reduction to high-dimensional regimes by condensing the cross-sectional information through factor models. We derive asymptotic properties for the estimate of the central subspace spanned by these projection directions as well as the estimates of the sufficient predictive indices. We further show that the natural method of running multiple regression of target on estimated factors yields a linear estimate that actually falls into this central subspace. Our method and theory allow the number of predictors to be larger than the number of observations. We finally demonstrate that the sufficient forecasting improves upon the linear forecasting in both simulation studies and an empirical study of forecasting macroeconomic variables.
16. Using a latent variable model with non-constant factor loadings to examine PM2.5 constituents related to secondary inorganic aerosols. Stat Model 2016; 16:91-113. PMID: 27528825. PMCID: PMC4982519. DOI: 10.1177/1471082x15627004.
Abstract
Factor analysis is a commonly used method of modelling correlated multivariate exposure data. Typically, the measurement model is assumed to have constant factor loadings. However, from our preliminary analyses of the Environmental Protection Agency's (EPA's) PM2.5 fine speciation data, we have observed that the factor loadings for four constituents change considerably in stratified analyses. Since invariance of factor loadings is a prerequisite for valid comparison of the underlying latent variables, we propose a factor model with non-constant factor loadings that change over time and space, modelled with P-splines whose smoothing parameters are selected by the generalized cross-validation (GCV) criterion. The model is implemented using the expectation-maximization (EM) algorithm, and we select the multiple spline smoothing parameters by minimizing the GCV criterion with Newton's method during each iteration of the EM algorithm. The algorithm is applied to a one-factor model that includes four constituents. Through bootstrap confidence bands, we find that the factor loading for total nitrate changes across seasons and geographic regions.
17. Covariance adjustment for batch effect in gene expression data. Stat Med 2014; 33:2681-95. PMID: 24687561. PMCID: PMC4065794. DOI: 10.1002/sim.6157.
Abstract
Batch bias has been found in many microarray gene expression studies that involve multiple batches of samples. A serious batch effect can alter not only the distribution of individual genes but also the inter-gene relationships. Even though some efforts have been made to remove such bias, there has been relatively less development on a multivariate approach, mainly because of the analytical difficulty due to the high-dimensional nature of gene expression data. We propose a multivariate batch adjustment method that effectively eliminates inter-gene batch effects. The proposed method utilizes high-dimensional sparse covariance estimation based on a factor model and a hard thresholding. Another important aspect of the proposed method is that if it is known that one of the batches is produced in a superior condition, the other batches can be adjusted so that they resemble the target batch. We study high-dimensional asymptotic properties of the proposed estimator and compare the performance of the proposed method with some popular existing methods with simulated data and gene expression data sets.
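The estimator combines a factor model with hard thresholding of the residual covariance. The thresholding step on its own is easy to sketch: entries of a covariance estimate whose magnitude falls below a cutoff are set to zero, with the diagonal kept. The matrix and cutoff below are illustrative, not from the paper.

```python
# Sketch of hard thresholding a covariance matrix: small off-diagonal
# entries are zeroed, the diagonal is preserved. Toy 3x3 example.

def hard_threshold(cov, cutoff):
    p = len(cov)
    return [[cov[i][j] if i == j or abs(cov[i][j]) >= cutoff else 0.0
             for j in range(p)] for i in range(p)]

cov = [[1.00, 0.40, 0.05],
       [0.40, 1.00, -0.02],
       [0.05, -0.02, 1.00]]
print(hard_threshold(cov, cutoff=0.10))
# [[1.0, 0.4, 0.0], [0.4, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

In the paper's two-part estimator, the low-rank factor component captures the strong shared structure while thresholding sparsifies the residual part, which is what makes the high-dimensional batch adjustment tractable.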
18. Dissecting high-dimensional phenotypes with Bayesian sparse factor analysis of genetic covariance matrices. Genetics 2013; 194:753-67. PMID: 23636737. PMCID: PMC3697978. DOI: 10.1534/genetics.113.151217.
Abstract
Quantitative genetic studies that model complex, multivariate phenotypes are important for both evolutionary prediction and artificial selection. For example, changes in gene expression can provide insight into developmental and physiological mechanisms that link genotype and phenotype. However, classical analytical techniques are poorly suited to quantitative genetic studies of gene expression where the number of traits assayed per individual can reach many thousand. Here, we derive a Bayesian genetic sparse factor model for estimating the genetic covariance matrix (G-matrix) of high-dimensional traits, such as gene expression, in a mixed-effects model. The key idea of our model is that we need consider only G-matrices that are biologically plausible. An organism's entire phenotype is the result of processes that are modular and have limited complexity. This implies that the G-matrix will be highly structured. In particular, we assume that a limited number of intermediate traits (or factors, e.g., variations in development or physiology) control the variation in the high-dimensional phenotype, and that each of these intermediate traits is sparse - affecting only a few observed traits. The advantages of this approach are twofold. First, sparse factors are interpretable and provide biological insight into mechanisms underlying the genetic architecture. Second, enforcing sparsity helps prevent sampling errors from swamping out the true signal in high-dimensional data. We demonstrate the advantages of our model on simulated data and in an analysis of a published Drosophila melanogaster gene expression data set.
19. Influence analysis for high-dimensional time series with an application to epileptic seizure onset zone detection. J Neurosci Methods 2013; 214:80-90. PMID: 23354014. PMCID: PMC3719213. DOI: 10.1016/j.jneumeth.2012.12.025.
Abstract
Granger causality is a useful concept for studying causal relations in networks. However, numerical problems occur when applying the corresponding methodology to high-dimensional time series showing co-movement, e.g. EEG recordings or economic data. In order to deal with these shortcomings, we propose a novel method for the causal analysis of such multivariate time series based on Granger causality and factor models. We present the theoretical background, successfully assess our methodology with the help of simulated data and show a potential application in EEG analysis of epileptic seizures.
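The pairwise Granger idea underlying the method can be sketched directly: y "Granger-causes" x if adding lagged y to an autoregression of x reduces the residual sum of squares. The sketch below uses one lag, no intercept, and synthetic series in which y leads x by one step; the paper's contribution is combining this test with factor models so it remains usable for high-dimensional, co-moving series like EEG channels.

```python
# Sketch of pairwise Granger causality: compare RSS of x[t] ~ x[t-1]
# (restricted) against x[t] ~ x[t-1] + y[t-1] (unrestricted).
# One lag, no intercept; synthetic illustrative data.

def rss_ar(x, y=None):
    """RSS of regressing x[t] on x[t-1] (and y[t-1] if given), no intercept."""
    X = [[x[t - 1]] + ([y[t - 1]] if y is not None else [])
         for t in range(1, len(x))]
    target = x[1:]
    k = len(X[0])
    # normal equations (X'X) b = X'target, solved by Gaussian elimination
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, target)) for i in range(k)]
    for i in range(k):                      # forward elimination
        for r in range(i + 1, k):
            f = xtx[r][i] / xtx[i][i]
            for c in range(k):
                xtx[r][c] -= f * xtx[i][c]
            xty[r] -= f * xty[i]
    b = [0.0] * k
    for i in reversed(range(k)):            # back substitution
        b[i] = (xty[i] - sum(xtx[i][j] * b[j] for j in range(i + 1, k))) / xtx[i][i]
    return sum((t - sum(bi * xi for bi, xi in zip(b, r))) ** 2
               for r, t in zip(X, target))

y = [1.0, -1.0, 2.0, -2.0, 3.0, -3.0, 4.0, -4.0]
x = [0.0] + [0.5 * yi for yi in y[:-1]]     # x lags y by one step
print(rss_ar(x), rss_ar(x, y))  # unrestricted RSS is far smaller
```

Since x here is an exact function of lagged y, the unrestricted fit drives the RSS to (numerically) zero, while the restricted autoregression cannot; a formal test would turn this RSS reduction into an F-statistic.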