1
|
Abstract
Long-term forecasting involves predicting a horizon that is far ahead of the last observation. It is a problem of high practical relevance, for instance for companies deciding on expensive long-term investments. Despite the recent progress and success of Gaussian processes (GPs) based on spectral mixture kernels, long-term forecasting remains a challenging problem for these kernels because they decay exponentially at large horizons. This is mainly due to their use of a mixture of Gaussians to model spectral densities. Characteristics of the signal important for long-term forecasting can be unravelled by investigating the distribution of the Fourier coefficients of (the training part of) the signal, which is non-smooth, heavy-tailed, sparse, and skewed. The heavy tail and skewness of such distributions in the spectral domain make it possible to capture long-range covariance of the signal in the time domain. Motivated by these observations, we propose to model spectral densities using a skewed Laplace spectral mixture (SLSM), chosen for its skewed peaks, sparsity, non-smoothness, and heavy tails. By applying the inverse Fourier transform to this spectral density we obtain a new GP kernel for long-term forecasting. In addition, we adapt the lottery ticket method, originally developed to prune weights of a neural network, to GPs in order to automatically select the number of kernel components. Results of extensive experiments, including a multivariate time series, show the beneficial effect of the proposed SLSM kernel for long-term extrapolation and robustness to the choice of the number of mixture components.
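The step from a spectral density to a kernel described in this abstract can be illustrated numerically. The sketch below is not the authors' implementation: it assumes a generic asymmetric Laplace parametrisation (location `loc`, rate `rate`, asymmetry `kappa`), symmetrises the density so the kernel is real, and approximates the inverse Fourier transform by a Riemann sum. The function names, the parametrisation, and the grid settings `s_max` and `n` are all illustrative assumptions.

```python
import numpy as np

def asymmetric_laplace_pdf(s, loc, rate, kappa):
    """Asymmetric Laplace density (assumed parametrisation).

    kappa = 1 gives the symmetric Laplace; other positive values skew the peak.
    """
    d = s - loc
    norm = rate / (kappa + 1.0 / kappa)
    # Different exponential decay rates to the right and left of the peak.
    exponent = np.where(d >= 0, -rate * kappa * d, rate * d / kappa)
    return norm * np.exp(exponent)

def kernel_from_density(taus, weights, locs, rates, kappas, s_max=50.0, n=200001):
    """Approximate k(tau) as the integral of S(s) * cos(2*pi*s*tau) over s."""
    s = np.linspace(-s_max, s_max, n)  # odd n keeps the grid symmetric about 0
    ds = s[1] - s[0]
    density = np.zeros_like(s)
    for w, m, r, k in zip(weights, locs, rates, kappas):
        peak = asymmetric_laplace_pdf(s, m, r, k)
        # Symmetrise: S(s) = S(-s), so the resulting kernel is real-valued.
        density += 0.5 * w * (peak + peak[::-1])
    return np.array([np.sum(density * np.cos(2.0 * np.pi * s * t)) * ds
                     for t in taus])

# Hypothetical two-component mixture evaluated at a few lags.
taus = np.array([0.0, 0.5, 1.0, 5.0])
k = kernel_from_density(taus, weights=[0.6, 0.4], locs=[1.0, 3.0],
                        rates=[5.0, 8.0], kappas=[0.7, 1.5])
```

With weights that sum to one, `k[0]` should be close to 1, since it is simply the total mass of the spectral density on the grid.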
|
2
|
|
3
|
Partovi Nia V, Ghannad-Rezaie M. Agglomerative joint clustering of metabolic data with spike at zero: A Bayesian perspective. Biom J 2015; 58:387-96. [DOI: 10.1002/bimj.201400110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/29/2014] [Revised: 04/08/2015] [Accepted: 04/25/2015] [Indexed: 11/08/2022]
Affiliation(s)
- Vahid Partovi Nia
- GERAD research center and Department of Mathematical and Industrial Engineering; Polytechnique Montréal; 2900 Edouard-Montpetit J3T 1J4 Montréal Canada
- Mostafa Ghannad-Rezaie
- Department of Electrical Engineering; Massachusetts Institute of Technology; 77 Massachusetts Ave. Cambridge MA USA
|
4
|
Ganjali M, Baghfalaki T, Berridge D. Robust modeling of differential gene expression data using normal/independent distributions: a Bayesian approach. PLoS One 2015; 10:e0123791. [PMID: 25910040 PMCID: PMC4409222 DOI: 10.1371/journal.pone.0123791] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 11/18/2014] [Accepted: 03/07/2015] [Indexed: 11/18/2022] Open
Abstract
In this paper, the problem of identifying differentially expressed genes under different conditions using gene expression microarray data, in the presence of outliers, is discussed. For this purpose, the robust modeling of gene expression data using some powerful distributions known as normal/independent distributions is considered. These distributions include the Student’s t and normal distributions which have been used previously, but also include extensions such as the slash, the contaminated normal and the Laplace distributions. The purpose of this paper is to identify differentially expressed genes by considering these distributional assumptions instead of the normal distribution. A Bayesian approach using the Markov Chain Monte Carlo method is adopted for parameter estimation. Two publicly available gene expression data sets are analyzed using the proposed approach. The use of the robust models for detecting differentially expressed genes is investigated. This investigation shows that the choice of model for differentiating gene expression data is very important. This is due to the small number of replicates for each gene and the existence of outlying data. Comparison of the performance of these models is made using different statistical criteria and the ROC curve. The method is illustrated using some simulation studies. We demonstrate the flexibility of these robust models in identifying differentially expressed genes.
Affiliation(s)
- Mojtaba Ganjali
- School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Department of Statistics, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
- * E-mail: (MG)
- Taban Baghfalaki
- School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Department of Statistics, Faculty of Mathematical Sciences, Tarbiat Modares University, Tehran, Iran
- Damon Berridge
- Farr Institute-CIPHER, College of Medicine, Swansea University, Swansea, Wales, U.K.
|
5
|
Partovi Nia V, Davison AC. A simple model-based approach to variable selection in classification and clustering. Can J Stat 2015. [DOI: 10.1002/cjs.11241] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Affiliation(s)
- Vahid Partovi Nia
- GERAD Research Center and Department of Mathematical and Industrial Engineering; Polytechnique Montréal; 2900 Edouard-Montpetit Montréal Canada J3T 1J4
- Anthony C. Davison
- École Polytechnique Fédérale de Lausanne; EPFL-FSB-MATHAA-STAT; Station 8 1015 Lausanne Switzerland
|
6
|
Hossain A, Beyene J. Application of skew-normal distribution for detecting differential expression to microRNA data. J Appl Stat 2014. [DOI: 10.1080/02664763.2014.962490] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 10/24/2022]
|
7
|
Franczak BC, Browne RP, McNicholas PD. Mixtures of Shifted Asymmetric Laplace Distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 2014; 36:1149-1157. [PMID: 26353277 DOI: 10.1109/tpami.2013.216] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.3] [Indexed: 06/05/2023]
Abstract
A mixture of shifted asymmetric Laplace distributions is introduced and used for clustering and classification. A variant of the EM algorithm is developed for parameter estimation by exploiting the relationship with the generalized inverse Gaussian distribution. This approach is mathematically elegant and relatively computationally straightforward. Our novel mixture modelling approach is demonstrated on both simulated and real data to illustrate clustering and classification applications. In these analyses, our mixture of shifted asymmetric Laplace distributions performs favourably when compared to the popular Gaussian approach. This work, which marks an important step in the non-Gaussian model-based clustering and classification direction, concludes with discussion as well as suggestions for future work.
|
8
|
|
9
|
Noma H, Matsui S. Empirical Bayes ranking and selection methods via semiparametric hierarchical mixture models in microarray studies. Stat Med 2012; 32:1904-16. [PMID: 23281021 DOI: 10.1002/sim.5718] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 08/30/2011] [Accepted: 12/06/2012] [Indexed: 11/07/2022]
Abstract
The main purpose of microarray studies is screening of differentially expressed genes as candidates for further investigation. Because of limited resources at this stage, prioritizing genes is a relevant statistical task in microarray studies. For effective gene selection, parametric empirical Bayes methods for ranking and selection of genes with the largest effect sizes have been proposed (Noma et al., 2010; Biostatistics 11: 281-289). The hierarchical mixture model incorporates the differential and non-differential components and allows information borrowing across differential genes with separation from nuisance, non-differential genes. In this article, we develop empirical Bayes ranking methods via a semiparametric hierarchical mixture model. A nonparametric prior distribution, rather than parametric prior distributions, for effect sizes is specified and estimated using the "smoothing by roughening" approach of Laird and Louis (1991; Computational statistics and data analysis 12: 27-37). We present applications to childhood and infant leukemia clinical studies with microarrays for exploring genes related to prognosis or disease progression.
Affiliation(s)
- Hisashi Noma
- Department of Data Science, The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo, 190-8562, Japan.
|
10
|
Reverse-engineering the genetic circuitry of a cancer cell with predicted intervention in chronic lymphocytic leukemia. Proc Natl Acad Sci U S A 2012; 110:459-64. [PMID: 23267079 DOI: 10.1073/pnas.1211130110] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Indexed: 01/29/2023] Open
Abstract
Cellular behavior is sustained by genetic programs that are progressively disrupted in pathological conditions--notably, cancer. High-throughput gene expression profiling has been used to infer statistical models describing these cellular programs, and development is now needed to guide orientated modulation of these systems. Here we develop a regression-based model to reverse-engineer a temporal genetic program, based on relevant patterns of gene expression after cell stimulation. This method integrates the temporal dimension of biological rewiring of genetic programs and enables the prediction of the effect of targeted gene disruption at the system level. We tested the performance accuracy of this model on synthetic data before reverse-engineering the response of primary cancer cells to a proliferative (protumorigenic) stimulation in a multistate leukemia biological model (i.e., chronic lymphocytic leukemia). To validate the ability of our method to predict the effects of gene modulation on the global program, we performed an intervention experiment on a targeted gene. Comparison of the predicted and observed gene expression changes demonstrates the possibility of predicting the effects of a perturbation in a gene regulatory network, a first step toward an orientated intervention in a cancer cell genetic program.
|
11
|
Abstract
Gene expression data are influenced by multiple biological and technological factors leading to a wide range of dispersion scenarios, although skewed patterns are not commonly addressed in microarray analyses. In this study, the distribution pattern of several human transcriptomes has been studied on free-access microarray gene expression data. Our results showed that, even in previously normalized gene expression data, probe and differential expression within probe effects suffer from substantial departures from the commonly assumed symmetric Gaussian distribution. We developed a flexible mixed model for non-competitive microarray data analysis that accounted for asymmetric and heavy-tailed (Student’s t distribution) dispersion processes. Random effects for gene expression data were modeled under asymmetric Student’s t distributions where the asymmetry parameter (λ) took values from perfect symmetry (λ = 0) to right- (λ > 0) or left-side (λ < 0) over-expression patterns. This approach was applied to four free-access human data sets and revealed clearly better model performance compared with standard approaches that assume traditional symmetric Gaussian distribution patterns. Our analyses on human gene expression data revealed a substantial degree of right-hand asymmetry for probe effects, whereas differential gene expression showed both symmetric and left-hand asymmetric patterns. Although these results cannot be extrapolated to all microarray experiments, they highlighted the incidence of skewed dispersion patterns in the human transcriptome; moreover, we provided a new analytical approach to appropriately address this biological phenomenon. The source code of the program accommodating these analytical developments and additional information about practical aspects of running the program are freely available by request to the corresponding author of this article.
|
12
|
Punathumparambath B, Kulathinal S, George S. Asymmetric type II compound Laplace distribution and its application to microarray gene expression. Comput Stat Data Anal 2012. [DOI: 10.1016/j.csda.2011.10.026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2022]
|
13
|
Casellas J, Ibáñez-Escriche N. Bayesian recursive mixed linear model for gene expression analyses with continuous covariates. J Anim Sci 2011; 90:67-75. [PMID: 21908645 DOI: 10.2527/jas.2010-3750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
The analysis of microarray gene expression data has experienced a remarkable growth in scientific research over the last few years and is helping to decipher the genetic background of several productive traits. Nevertheless, most analytical approaches have relied on the comparison of 2 (or a few) well-defined groups of biological conditions where continuous covariates make no sense (e.g., healthy vs. cancerous cells). Continuous effects could be of special interest when analyzing gene expression in animal production-oriented studies (e.g., birth weight), although very few studies address this peculiarity in the animal science framework. Within this context, we have developed a recursive linear mixed model where linear covariates are not only accounted for during gene expression analyses but also hierarchized, and the effects of their genetic, environmental, and residual components on differential gene expression are inferred independently. This parameterization allows a step forward in the inference of differential gene expression linked to a given quantitative trait such as birth weight. The statistical performance of this recursive model was exemplified under simulation by accounting for different sample sizes (n), heritabilities for the quantitative trait (h²), and magnitudes of differential gene expression (λ). It is important to highlight that statistical power increased with n, h², and λ, and the recursive model exceeded the standard linear mixed model with linear (nonrecursive) covariates in the majority of scenarios. This new parameterization would provide new insights about gene expression in the animal science framework, opening a new research scenario where within-covariate sources of differential gene expression could be individualized and estimated. The source code of the program accommodating these analytical developments and additional information about practical aspects of running the program are freely available by request to the corresponding author of this article.
Affiliation(s)
- J Casellas
- Grup de Recerca en Remugants, Departament de Ciència Animal i dels Aliments, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain.
|
14
|
Conlon EM, Postier BL, Methé BA, Nevin KP, Lovley DR. Hierarchical Bayesian meta-analysis models for cross-platform microarray studies. J Appl Stat 2009. [DOI: 10.1080/02664760802562480] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/20/2022]
|
15
|
A flexible approximate likelihood ratio test for detecting differential expression in microarray data. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2009.03.022] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
|
16
|
|
17
|
Salas-Gonzalez D, Kuruoglu EE, Ruiz DP. A heavy-tailed empirical Bayes method for replicated microarray data. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.08.008] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
|
18
|
Khalili A, Huang T, Lin S. A Robust Unified Approach to Analyzing Methylation and Gene Expression Data. Comput Stat Data Anal 2009; 53:1701-1710. [PMID: 20161265 DOI: 10.1016/j.csda.2008.07.010] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/28/2023]
Abstract
Microarray technology has made it possible to investigate expression levels, and more recently methylation signatures, of thousands of genes simultaneously, in a biological sample. Since more and more data from different biological systems or technological platforms are being generated at an incredible rate, there is an increasing need to develop statistical methods that are applicable to multiple data types and platforms. Motivated by such a need, a flexible finite mixture model that is applicable to methylation, gene expression, and potentially data from other biological systems, is proposed. Two major thrusts of this approach are to allow for a variable number of components in the mixture to capture non-biological variation and small biases, and to use a robust procedure for parameter estimation and probe classification. The method was applied to the analysis of methylation signatures of three breast cancer cell lines. It was also tested on three sets of expression microarray data to study its power and type I error rates. Comparison with a number of existing methods in the literature yielded very encouraging results; lower type I error rates and comparable/better power were achieved based on the limited study. Furthermore, the method also leads to more biologically interpretable results for the three breast cancer cell lines.
Affiliation(s)
- Abbas Khalili
- Department of Statistics, The Ohio State University, Columbus, OH 43210, United States
|
19
|
Hao P, Zheng S, Ping J, Tu K, Gieger C, Wang-Sattler R, Zhong Y, Li Y. Human gene expression sensitivity according to large scale meta-analysis. BMC Bioinformatics 2009; 10 Suppl 1:S56. [PMID: 19208159 PMCID: PMC2648786 DOI: 10.1186/1471-2105-10-s1-s56] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 02/03/2023] Open
Abstract
Background: Genes show different sensitivities in expression corresponding to various biological conditions. A systematic study of this concept is required because of its important implications for microarray analysis. J.H. Ohn et al. first studied this gene property with yeast transcriptional profiling data. Results: Here we propose a calculation framework for gene expression sensitivity analysis. We also compared the functions, centralities and transcriptional regulations of the sensitive and robust genes. We found that the robust genes tended to be involved in essential cellular processes. In contrast, the sensitive genes perform diverse functions. Moreover, while genes from both groups show similar geometric centrality when mapped onto integrated protein networks, the robust genes have higher vertex degree and betweenness than the sensitive genes. Interestingly, unlike the sensitive genes, the robust genes shared fewer transcription factors as their regulators. Conclusion: Our study reveals different propensities of gene expression to external perturbations, demonstrates the different roles of sensitive and robust genes in the cell, and argues for incorporating gene expression sensitivity into microarray analysis.
Affiliation(s)
- Pei Hao
- Bioinformatics Center, Key Lab of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, PR China.
|
20
|
Choi D, Nadarajah S. Information matrix for a mixture of two Laplace distributions. Stat Pap (Berl) 2009. [DOI: 10.1007/s00362-007-0053-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
21
|
Extreme value theory in analysis of differential expression in microarrays where either only up- or down-regulated genes are relevant or expected. Genet Res (Camb) 2008; 90:347-61. [PMID: 18840309 DOI: 10.1017/s0016672308009427] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/05/2022] Open
Abstract
We propose an empirical Bayes method based on extreme value theory (EVT), referred to as BE, for the analysis of data from spotted microarrays where the interest of the investigator (e.g. to identify up-regulated gene markers of a disease) or the design of the experiment (e.g. in certain 'wild-type versus mutant' experiments) limits identification of differentially expressed genes to those regulated in a single direction (either up or down). In such experiments, unlike in genome-wide microarrays, analysis is restricted to the tail of the distribution (extremes) of all the genes in the genome. EVT provides a platform to account for this extreme behaviour, and is therefore a natural candidate for inference about differential expression. We compared the performance of the developed BE method with two other empirical Bayes methods on two real 'wild-type versus mutant' datasets where a single direction of regulation was expected due to experimental design, and in a simulation study. The BE method appears to have a better fit to the real data. In the analysis of simulated data, the BE method showed better accuracy and precision while being robust to different characteristics of microarray experiments. The BE method, therefore, seems promising and useful for inference about differential expression in microarrays where either only up- or down-regulated genes are relevant or expected.
|
22
|
|
23
|
Conlon EM. A Bayesian mixture model for metaanalysis of microarray studies. Funct Integr Genomics 2007; 8:43-53. [PMID: 17879102 DOI: 10.1007/s10142-007-0058-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 06/22/2007] [Revised: 08/10/2007] [Accepted: 08/11/2007] [Indexed: 10/22/2022]
Abstract
The increased availability of microarray data has been calling for statistical methods to integrate findings across studies. A common goal of microarray analysis is to determine differentially expressed genes between two conditions, such as treatment vs control. A recent Bayesian metaanalysis model used a prior distribution for the mean log-expression ratios that was a mixture of two normal distributions. This model centered the prior distribution of differential expression at zero, and separated genes into two groups only: expressed and nonexpressed. Here, we introduce a Bayesian three-component truncated normal mixture prior model that more flexibly assigns prior distributions to the differentially expressed genes and produces three groups of genes: upregulated, downregulated, and nonexpressed. We found in simulations of two and five studies that the three-component model outperformed the two-component model using three comparison measures. When analyzing biological data of Bacillus subtilis, we found that the three-component model discovered more genes and omitted fewer genes for the same levels of posterior probability of differential expression than the two-component model, and discovered more genes for fixed thresholds of Bayesian false discovery. We assumed that the data sets were produced from the same microarray platform and were prescaled.
Affiliation(s)
- Erin M Conlon
- Department of Mathematics and Statistics, University of Massachusetts, 710 North Pleasant Street, Amherst, MA 01003-9305, USA.
|
24
|
Abstract
MOTIVATION: Inference about differential expression is a typical objective when analyzing gene expression data. Recently, Bayesian hierarchical models have become increasingly popular for this type of problem. The two most common hierarchical models are the hierarchical Gamma-Gamma (GG) and Lognormal-Normal (LNN) models. However, to facilitate inference, some unrealistic assumptions have been made. One such assumption is that of a common coefficient of variation across genes, which can adversely affect the resulting inference. RESULTS: In this paper, we extend both the GG and LNN modeling frameworks to allow for gene-specific variances and propose EM based algorithms for parameter estimation. The proposed methodology is evaluated on three experimental datasets: one cDNA microarray experiment and two Affymetrix spike-in experiments. The two extended models significantly reduce the false positive rate while keeping a high sensitivity when compared to the originals. Finally, using a simulation study we show that the new frameworks are also more robust to model misspecification. AVAILABILITY: The R code for implementing the proposed methodology can be downloaded at http://www.stat.ubc.ca/~c.lo/FEBarrays. SUPPLEMENTARY INFORMATION: The supplementary material is available at http://www.stat.ubc.ca/~c.lo/FEBarrays/supp.pdf.
Affiliation(s)
- Kenneth Lo
- Department of Statistics, University of British Columbia, 333-6356 Agricultural Road, Vancouver, BC, Canada V6T 1Z2.
|