1
Nghiem LH, Hui FKC, Muller S, Welsh AH. Screening methods for linear errors-in-variables models in high dimensions. Biometrics 2022. [PMID: 35191015 DOI: 10.1111/biom.13628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2021] [Accepted: 01/11/2022] [Indexed: 11/29/2022]
Abstract
Microarray studies, in order to identify genes associated with an outcome of interest, usually produce noisy measurements for a large number of gene expression features from a small number of subjects. One common approach to analyzing such high-dimensional data is to use linear errors-in-variables models; however, current methods for fitting such models are computationally expensive. In this paper, we present two efficient screening procedures, namely corrected penalized marginal screening and corrected sure independence screening, to reduce the number of variables for final model building. Both screening procedures are based on fitting corrected marginal regression models relating the outcome to each contaminated covariate separately, which can be computed efficiently even with a large number of features. Under mild conditions, we show that these procedures achieve screening consistency and reduce the number of features substantially, even when the number of covariates grows exponentially with sample size. Additionally, if the true covariates are weakly correlated, we show that corrected penalized marginal screening can achieve full variable selection consistency. Through a simulation study and an analysis of gene expression data for bone mineral density of Norwegian women, we demonstrate that the two new screening procedures make estimation of linear errors-in-variables models computationally scalable in high dimensional settings, and improve finite sample estimation and selection performance compared with estimators that do not employ a screening stage.
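As a rough illustration of marginal screening with contaminated covariates, the sketch below applies the textbook attenuation correction for classical additive measurement error (W = X + U, with the error variance assumed known) to each marginal slope and keeps the d largest in magnitude. This is an assumption-laden sketch, not the authors' estimator: the function name `corrected_marginal_screen`, the arguments `sigma_u2` and `d`, and the small floor on the denominator are all hypothetical choices made for illustration.

```python
import numpy as np

def corrected_marginal_screen(W, y, sigma_u2, d):
    """Rank covariates by attenuation-corrected marginal slopes and keep the top d.

    Assumes classical additive error W = X + U with known error variance
    sigma_u2 (scalar or one value per column). Illustrative only.
    """
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    n = W.shape[0]
    Wc = W - W.mean(axis=0)          # center each contaminated covariate
    yc = y - y.mean()                # center the outcome
    cov_wy = Wc.T @ yc / (n - 1)     # marginal sample covariances with y
    var_w = Wc.var(axis=0, ddof=1)   # sample variances of the observed W_j
    # Subtract the error variance to undo attenuation; floor to avoid
    # dividing by a non-positive number (hypothetical safeguard).
    denom = np.maximum(var_w - sigma_u2, 1e-8)
    beta = cov_wy / denom
    keep = np.argsort(-np.abs(beta))[:d]
    return np.sort(keep), beta
```

Each slope costs one pass over a single column, which is why this kind of screening scales to a very large number of features; the retained columns would then be passed to a full errors-in-variables fit.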
Affiliation(s)
- Linh H Nghiem
- Research School of Finance, Actuarial Studies and Statistics, Australian National University, ACT 2600, Australia; School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
- Francis K C Hui
- Research School of Finance, Actuarial Studies and Statistics, Australian National University, ACT 2600, Australia
- Samuel Muller
- Department of Mathematics and Statistics, Macquarie University, NSW 2109, Australia
- A H Welsh
- Research School of Finance, Actuarial Studies and Statistics, Australian National University, ACT 2600, Australia
2
Nghiem L, Potgieter C. Simulation-selection-extrapolation: Estimation in high-dimensional errors-in-variables models. Biometrics 2019; 75:1133-1144. [PMID: 31260084 DOI: 10.1111/biom.13112] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 08/29/2018] [Accepted: 06/25/2019] [Indexed: 11/29/2022]
Abstract
Errors-in-variables models in high-dimensional settings pose two challenges in application. First, the number of observed covariates is larger than the sample size, while only a small number of covariates are true predictors under an assumption of model sparsity. Second, the presence of measurement error can result in severely biased parameter estimates, and also affects the ability of penalized methods such as the lasso to recover the true sparsity pattern. A new estimation procedure called SIMulation-SELection-EXtrapolation (SIMSELEX) is proposed. This procedure makes double use of lasso methodology. First, the lasso is used to estimate sparse solutions in the simulation step, after which a group lasso is implemented to do variable selection. The SIMSELEX estimator is shown to perform well in variable selection, and has significantly lower estimation error than naive estimators that ignore measurement error. SIMSELEX can be applied in a variety of errors-in-variables settings, including linear models, generalized linear models, and Cox survival models. It is furthermore shown in the Supporting Information how SIMSELEX can be applied to spline-based regression models. A simulation study is conducted to compare the SIMSELEX estimators to existing methods in the linear and logistic model settings, and to evaluate performance compared to naive methods in the Cox and spline models. Finally, the method is used to analyze a microarray dataset that contains gene expression measurements of favorable histology Wilms tumors.
Affiliation(s)
- Linh Nghiem
- Research School of Finance, Actuarial Studies and Statistics, College of Business and Economics, Australian National University, Canberra, Australian Capital Territory, Australia
- Cornelis Potgieter
- Department of Mathematics, Texas Christian University, Fort Worth, Texas; Department of Statistics, University of Johannesburg, Johannesburg, South Africa
3
Romeo G, Thoresen M. Model selection in high-dimensional noisy data: a simulation study. J Stat Comput Sim 2019. [DOI: 10.1080/00949655.2019.1607345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Affiliation(s)
- Giovanni Romeo
- Department of Biostatistics, Oslo Centre for Biostatistics and Epidemiology, University of Oslo, Oslo, Norway
- Magne Thoresen
- Department of Biostatistics, Oslo Centre for Biostatistics and Epidemiology, University of Oslo, Oslo, Norway
4
Affiliation(s)
- Xiangyu Luo
- Department of Statistics, The Chinese University of Hong Kong, Ma Liu Shui, Hong Kong
- Yingying Wei
- Department of Statistics, The Chinese University of Hong Kong, Ma Liu Shui, Hong Kong
5
Transcriptomic analysis of the heat stress response for a commercial baker's yeast Saccharomyces cerevisiae. Genes Genomics 2018; 40:137-150. [PMID: 29892925 DOI: 10.1007/s13258-017-0616-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/31/2017] [Accepted: 10/01/2017] [Indexed: 10/18/2022]
Abstract
The aim of this study is to explore the effects of heat stresses on global gene expression profiles and to identify the candidate genes for the heat stress response in commercial baker's yeast (Saccharomyces cerevisiae) by using microarray technology and comparative statistical data analyses. The data from all hybridizations and array normalization were analyzed using GeneSpringGX 12.1 (Agilent) and the R 2.15.2 programming language. In the analysis, all required statistical methods were performed comparatively. For the normalization step, among alternatives, the RMA (Robust Microarray Analysis) results were used. To determine differentially expressed genes under heat stress treatments, the fold-change and the hypothesis testing approaches were executed under various cut-off values via different multiple testing procedures; the up/down regulated probes were then functionally categorized via the PAMSAM clustering. The analysis showed that the transcriptome changes under heat shock. Moreover, the temperature-shift stress treatments show that the number of differentially up-regulated genes among the heat shock proteins and transcription factors changed significantly. Finally, the change in temperature is one of the important environmental conditions affecting propagation and industrial application of baker's yeast. This study statistically analyzes this effect via one-channel microarray data.
6
Liu X, Zhang L, Chen S. Modeling Exon-Specific Bias Distribution Improves the Analysis of RNA-Seq Data. PLoS One 2015; 10:e0140032. [PMID: 26448625 PMCID: PMC4598124 DOI: 10.1371/journal.pone.0140032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2015] [Accepted: 09/21/2015] [Indexed: 11/29/2022] [Open Access]
Abstract
RNA-seq technology has become an important tool for quantifying gene and transcript expression in transcriptome studies. The two major difficulties in gene and transcript expression quantification are the read mapping ambiguity and the overdispersion of the read distribution along the reference sequence. Many approaches have been proposed to deal with these difficulties. A number of existing methods use the Poisson distribution to model the read counts, which easily splits the counts into the contributions from multiple transcripts. Meanwhile, various solutions have been put forward to account for the overdispersion in the Poisson models. By checking the similarities among the variation patterns of read counts for individual genes, we found that the count variation is exon-specific and has a conserved pattern across the samples for each individual gene. We introduce Gamma-distributed latent variables to model the read sequencing preference for each exon. These variables are embedded in the rate parameter of a Poisson model to account for the overdispersion of the read distribution. The model is tractable since the Gamma priors can be integrated out in the maximum likelihood estimation. We evaluate the proposed approach, PGseq, using four real datasets and one simulated dataset, and compare its performance with other popular methods. Results show that PGseq presents competitive performance compared to other alternatives in terms of accuracy in the gene and transcript expression calculation and in the downstream differential expression analysis. In particular, we show the advantage of our method in the analysis of low expression.
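The tractability claim above rests on a standard fact: a Poisson count whose rate is multiplied by a Gamma-distributed latent variable has a closed-form (negative binomial) marginal, with variance exceeding the mean. The sketch below shows only that generic marginalization, not the PGseq model itself; the function names and the parameters `mu`, `shape`, and `rate` are hypothetical and chosen for illustration.

```python
import math
import numpy as np

def gamma_poisson_logpmf(k, mu, shape, rate):
    """log P(K = k) when K | g ~ Poisson(mu * g) and g ~ Gamma(shape, rate).

    Integrating g out analytically gives a negative binomial distribution.
    """
    return (math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
            + shape * math.log(rate / (rate + mu))
            + k * math.log(mu / (rate + mu)))

def simulate_counts(mu, shape, rate, size, seed=0):
    """Draw counts by sampling the latent Gamma variable first, then the Poisson."""
    rng = np.random.default_rng(seed)
    g = rng.gamma(shape, 1.0 / rate, size=size)  # NumPy parameterizes by scale = 1/rate
    return rng.poisson(mu * g)
```

Under this parameterization the marginal mean is `mu * shape / rate` and the variance is the mean plus the squared mean divided by `shape`, so a smaller `shape` means stronger overdispersion.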
Affiliation(s)
- Xuejun Liu
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Li Zhang
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Songcan Chen
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
7
Jow H, Boys RJ, Wilkinson DJ. Bayesian identification of protein differential expression in multi-group isobaric labelled mass spectrometry data. Stat Appl Genet Mol Biol 2014; 13:531-51. [PMID: 25153608 DOI: 10.1515/sagmb-2012-0066] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 11/15/2022]
Abstract
In this paper we develop a Bayesian statistical inference approach to the unified analysis of isobaric labelled MS/MS proteomic data across multiple experiments. An explicit probabilistic model of the log-intensity of the isobaric labels' reporter ions across multiple pre-defined groups and experiments is developed. This is then used to develop a full Bayesian statistical methodology for the identification of differentially expressed proteins, with respect to a control group, across multiple groups and experiments. This methodology is implemented and then evaluated on simulated data and on two model experimental datasets (for which the differentially expressed proteins are known) that use a TMT labelling protocol.
8
Hellton KH, Thoresen M. The Impact of Measurement Error on Principal Component Analysis. Scand Stat Theory Appl 2014. [DOI: 10.1111/sjos.12083] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 11/28/2022]
Affiliation(s)
- Magne Thoresen
- Department of Biostatistics, Institute of Basic Medical Sciences; University of Oslo
9
Wirth H, von Bergen M, Binder H. Mining SOM expression portraits: feature selection and integrating concepts of molecular function. BioData Min 2012; 5:18. [PMID: 23043905 PMCID: PMC3599960 DOI: 10.1186/1756-0381-5-18] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Received: 09/19/2011] [Accepted: 09/14/2012] [Indexed: 11/30/2022] [Open Access]
Abstract
Background Self organizing maps (SOM) enable the straightforward portraying of high-dimensional data of large sample collections in terms of sample-specific images. The analysis of their texture provides so-called spot-clusters of co-expressed genes which require subsequent significance filtering and functional interpretation. We address feature selection in terms of the gene ranking problem and the interpretation of the obtained spot-related lists using concepts of molecular function. Results Different expression scores based either on simple fold change-measures or on regularized Student’s t-statistics are applied to spot-related gene lists and compared with special emphasis on the error characteristics of microarray expression data. The spot-clusters are analyzed using different methods of gene set enrichment analysis with the focus on overexpression and/or overrepresentation of predefined sets of genes. Metagene-related overrepresentation of selected gene sets was mapped into the SOM images to assign gene function to different regions. Alternatively we estimated set-related overexpression profiles over all samples studied using a gene set enrichment score. It was also applied to the spot-clusters to generate lists of enriched gene sets. We used the tissue body index data set, a collection of expression data of human tissues as an illustrative example. We found that tissue related spots typically contain enriched populations of gene sets well corresponding to molecular processes in the respective tissues. In addition, we display special sets of housekeeping and of consistently weak and high expressed genes using SOM data filtering. Conclusions The presented methods allow the comprehensive downstream analysis of SOM-transformed expression data in terms of cluster-related gene lists and enriched gene sets for functional interpretation. 
SOM clustering implies the ability to define either new gene sets using selected SOM spots or to verify and/or to amend existing ones.
Affiliation(s)
- Henry Wirth
- Interdisciplinary Centre for Bioinformatics of Leipzig University, Härtelstr, 16-18, D-4107, Leipzig, Germany.
10
Lahti L, Elo LL, Aittokallio T, Kaski S. Probabilistic analysis of probe reliability in differential gene expression studies with short oligonucleotide arrays. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:217-225. [PMID: 21071809 DOI: 10.1109/tcbb.2009.38] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Indexed: 05/30/2023]
Abstract
Probe defects are a major source of noise in gene expression studies. While existing approaches detect noisy probes based on external information such as genomic alignments, we introduce and validate a targeted probabilistic method for analyzing probe reliability directly from expression data and independently of the noise source. This provides insights into the various sources of probe-level noise and gives tools to guide probe design.
Affiliation(s)
- Leo Lahti
- Helsinki Institute for Information Technology, Department of Information and Computer Science, Aalto University School of Science and Technology, PO Box 15400, FI-00076 Aalto, Finland.
11
Gupta R, Greco D, Auvinen P, Arjas E. Bayesian integrated modeling of expression data: a case study on RhoG. BMC Bioinformatics 2010; 11:295. [PMID: 20515463 PMCID: PMC2894040 DOI: 10.1186/1471-2105-11-295] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 06/04/2009] [Accepted: 06/01/2010] [Indexed: 11/02/2022] [Open Access]
Abstract
BACKGROUND DNA microarrays provide an efficient method for measuring the activity of genes in parallel, even covering all the known transcripts of an organism on a single array. This has to be balanced against the fact that analyzing data emerging from microarrays involves several consecutive steps, each of which is a potential source of errors. Because the steps are sequential, errors tend to accumulate when moving from the lower-level towards the higher-level analyses. Eliminating such errors does not seem feasible without completely changing the technologies, but one should nevertheless try to realistically assess the degree of uncertainty involved when drawing the final conclusions from such analyses. RESULTS We present a Bayesian hierarchical model for finding differentially expressed genes between two experimental conditions, proposing an integrated statistical approach where correcting signal saturation, systematic array effects, dye effects, and finding differentially expressed genes are all modeled jointly. The integration allows all these components, and also the associated errors, to be considered simultaneously. The inference is based on the full posterior distribution of gene expression indices and on quantities derived from them rather than on point estimates. The model was applied and tested on two different datasets. CONCLUSIONS The method presents a way of integrating various steps of microarray analysis into a single joint analysis, and thereby enables extracting information on differential expression in a manner that properly accounts for various sources of potential error in the process.
Affiliation(s)
- Rashi Gupta
- Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland.
12
Turro E, Lewin A, Rose A, Dallman MJ, Richardson S. MMBGX: a method for estimating expression at the isoform level and detecting differential splicing using whole-transcript Affymetrix arrays. Nucleic Acids Res 2009; 38:e4. [PMID: 19854940 PMCID: PMC2800219 DOI: 10.1093/nar/gkp853] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 12/22/2022] [Open Access]
Abstract
Affymetrix has recently developed whole-transcript GeneChips-'Gene' and 'Exon' arrays-which interrogate exons along the length of each gene. Although each probe on these arrays is intended to hybridize perfectly to only one transcriptional target, many probes match multiple transcripts located in different parts of the genome or alternative isoforms of the same gene. Existing statistical methods for estimating expression do not take this into account and are thus prone to producing inflated estimates. We propose a method, Multi-Mapping Bayesian Gene eXpression (MMBGX), which disaggregates the signal at 'multi-match' probes. When applied to Gene arrays, MMBGX removes the upward bias of gene-level expression estimates. When applied to Exon arrays, it can further disaggregate the signal between alternative transcripts of the same gene, providing expression estimates of individual splice variants. We demonstrate the performance of MMBGX on simulated data and a tissue mixture data set. We then show that MMBGX can estimate the expression of alternative isoforms within one experimental condition, confirming our results by RT-PCR. Finally, we show that our method for detecting differential splicing has a lower error rate than standard exon-level approaches on a previously validated colon cancer data set.
Affiliation(s)
- Ernest Turro
- Department of Epidemiology and Public Health, Imperial College London, London, UK.
13
Jailwala P, Waukau J, Glisic S, Jana S, Ehlenbach S, Hessner M, Alemzadeh R, Matsuyama S, Laud P, Wang X, Ghosh S. Apoptosis of CD4+ CD25(high) T cells in type 1 diabetes may be partially mediated by IL-2 deprivation. PLoS One 2009; 4:e6527. [PMID: 19654878 PMCID: PMC2716541 DOI: 10.1371/journal.pone.0006527] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.4] [Received: 04/21/2009] [Accepted: 07/02/2009] [Indexed: 01/26/2023] [Open Access]
Abstract
Background Type 1 diabetes (T1D) is a T-cell mediated autoimmune disease targeting the insulin-producing pancreatic β cells. Naturally occurring FOXP3+CD4+CD25high regulatory T cells (Tregs) play an important role in dominant tolerance, suppressing autoreactive CD4+ effector T cell activity. Previously, in both recent-onset T1D patients and β cell antibody-positive at-risk individuals, we observed increased apoptosis and decreased function of polyclonal Tregs in the periphery. Our objective here was to elucidate the genes and signaling pathways triggering apoptosis in Tregs from T1D subjects. Principal Findings Gene expression profiles of unstimulated Tregs from recent-onset T1D (n = 12) and healthy control subjects (n = 15) were generated. Statistical analysis was performed using a Bayesian approach that is highly efficient in determining differentially expressed genes with low number of replicate samples in each of the two phenotypic groups. Microarray analysis showed that several cytokine/chemokine receptor genes, HLA genes, GIMAP family genes and cell adhesion genes were downregulated in Tregs from T1D subjects, relative to control subjects. Several downstream target genes of the AKT and p53 pathways were also upregulated in T1D subjects, relative to controls. Further, expression signatures and increased apoptosis in Tregs from T1D subjects partially mirrored the response of healthy Tregs under conditions of IL-2 deprivation. CD4+ effector T-cells from T1D subjects showed a marked reduction in IL-2 secretion. This could indicate that prior to and during the onset of disease, Tregs in T1D may be caught up in a relatively deficient cytokine milieu. Conclusions In summary, expression signatures in Tregs from T1D subjects reflect a cellular response that leads to increased sensitivity to apoptosis, partially due to cytokine deprivation. 
Further characterization of these signaling cascades should enable the detection of genes that can be targeted for restoring Treg function in subjects predisposed to T1D.
Affiliation(s)
- Parthav Jailwala
- The Max McGee National Research Center for Juvenile Diabetes and The Human and Molecular Genetics Center, Department of Pediatrics at the Medical College of Wisconsin and the Children's Research Institute of the Children's Hospital of Wisconsin, Milwaukee, Wisconsin, United States of America
- Jill Waukau
- The Max McGee National Research Center for Juvenile Diabetes and The Human and Molecular Genetics Center, Department of Pediatrics at the Medical College of Wisconsin and the Children's Research Institute of the Children's Hospital of Wisconsin, Milwaukee, Wisconsin, United States of America
- Sanja Glisic
- The Max McGee National Research Center for Juvenile Diabetes and The Human and Molecular Genetics Center, Department of Pediatrics at the Medical College of Wisconsin and the Children's Research Institute of the Children's Hospital of Wisconsin, Milwaukee, Wisconsin, United States of America
- Srikanta Jana
- The Max McGee National Research Center for Juvenile Diabetes and The Human and Molecular Genetics Center, Department of Pediatrics at the Medical College of Wisconsin and the Children's Research Institute of the Children's Hospital of Wisconsin, Milwaukee, Wisconsin, United States of America
- Sarah Ehlenbach
- The Max McGee National Research Center for Juvenile Diabetes and The Human and Molecular Genetics Center, Department of Pediatrics at the Medical College of Wisconsin and the Children's Research Institute of the Children's Hospital of Wisconsin, Milwaukee, Wisconsin, United States of America
- Martin Hessner
- The Max McGee National Research Center for Juvenile Diabetes and The Human and Molecular Genetics Center, Department of Pediatrics at the Medical College of Wisconsin and the Children's Research Institute of the Children's Hospital of Wisconsin, Milwaukee, Wisconsin, United States of America
- Ramin Alemzadeh
- Children's Hospital of Wisconsin Diabetes Center, Pediatric Endocrinology and Metabolism, Medical College of Wisconsin, Milwaukee, Wisconsin, United States of America
- Shigemi Matsuyama
- Department of Pharmacology, Case Western Reserve University, Cleveland, Ohio, United States of America
- Purushottam Laud
- Division of Biostatistics, Medical College of Wisconsin, Milwaukee, Wisconsin, United States of America
- Xujing Wang
- Department of Physics & the Comprehensive Diabetes Center, University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Soumitra Ghosh
- The Max McGee National Research Center for Juvenile Diabetes and The Human and Molecular Genetics Center, Department of Pediatrics at the Medical College of Wisconsin and the Children's Research Institute of the Children's Hospital of Wisconsin, Milwaukee, Wisconsin, United States of America
14
Chen Z, McGee M, Liu Q, Kong M, Deng Y, Scheuermann RH. A Distribution-Free Convolution Model for background correction of oligonucleotide microarray data. BMC Genomics 2009; 10 Suppl 1:S19. [PMID: 19594878 PMCID: PMC2709262 DOI: 10.1186/1471-2164-10-s1-s19] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 12/02/2022] [Open Access]
Abstract
Introduction Affymetrix GeneChip® high-density oligonucleotide arrays are widely used in biological and medical research because of production reproducibility, which facilitates the comparison of results between experiment runs. In order to obtain high-level classification and cluster analysis that can be trusted, it is important to perform various pre-processing steps on the probe-level data to control for variability in sample processing and array hybridization. Many proposed preprocessing methods are parametric, in that they assume that the background noise generated by microarray data is a random sample from a statistical distribution, typically a normal distribution. The quality of the final results depends on the validity of such assumptions. Results We propose a Distribution Free Convolution Model (DFCM) to circumvent observed deficiencies in meeting and validating distribution assumptions of parametric methods. Knowledge of array structure and the biological function of the probes indicate that the intensities of mismatched (MM) probes that correspond to the smallest perfect match (PM) intensities can be used to estimate the background noise. Specifically, we obtain the smallest q2 percent of the MM intensities that are associated with the lowest q1 percent PM intensities, and use these intensities to estimate background. Conclusion Using the Affymetrix Latin Square spike-in experiments, we show that the background noise generated by microarray experiments typically is not well modeled by a single overall normal distribution. We further show that the signal is not exponentially distributed, as is also commonly assumed. Therefore, DFCM has better sensitivity and specificity, as measured by ROC curves and area under the curve (AUC) than MAS 5.0, RMA, RMA with no background correction (RMA-noBG), GCRMA, PLIER, and dChip (MBEI) for preprocessing of Affymetrix microarray data. 
These results hold for two spike-in data sets and one real data set that were analyzed. Comparisons with other methods on two spike-in data sets and one real data set show that our nonparametric methods are a superior alternative for background correction of Affymetrix data.
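The probe-selection rule described in the Results (take the MM intensities paired with the lowest q1 percent of PM intensities, then keep the smallest q2 percent of those) can be sketched as below. This is a minimal illustration under stated assumptions: the function name `background_mm_subset` and the default values of `q1` and `q2` are hypothetical, and the abstract does not say how the selected intensities are summarized into a background estimate, so the sketch stops at returning the subset.

```python
import numpy as np

def background_mm_subset(pm, mm, q1=30.0, q2=90.0):
    """Select MM intensities to use as a background sample.

    pm, mm: paired perfect-match and mismatch intensities (same length).
    q1, q2: percentages in [0, 100]; the defaults are placeholders only.
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    if pm.shape != mm.shape:
        raise ValueError("pm and mm must be paired arrays of equal shape")
    pm_cut = np.percentile(pm, q1)          # threshold for the lowest q1% of PM
    mm_low_pm = mm[pm <= pm_cut]            # MM probes paired with those PM probes
    mm_cut = np.percentile(mm_low_pm, q2)   # threshold for the smallest q2% of them
    return mm_low_pm[mm_low_pm <= mm_cut]
```

Because the rule uses only order statistics of the observed intensities, it makes no assumption about the shape of the noise distribution, which is the sense in which the approach is distribution-free.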
Affiliation(s)
- Zhongxue Chen
- Biostatistics Epidemiology Research Design Core, Center for Clinical and Translational Sciences, The University of Texas Health Science Center at Houston, UT Professional Building, Houston, TX 77030, USA.
15
Stochastic modelling for quantitative description of heterogeneous biological systems. Nat Rev Genet 2009; 10:122-33. [PMID: 19139763 DOI: 10.1038/nrg2509] [Citation(s) in RCA: 298] [Impact Index Per Article: 19.9] [Indexed: 12/11/2022]
Abstract
Two related developments are currently changing traditional approaches to computational systems biology modelling. First, stochastic models are being used increasingly in preference to deterministic models to describe biochemical network dynamics at the single-cell level. Second, sophisticated statistical methods and algorithms are being used to fit both deterministic and stochastic models to time course and other experimental data. Both frameworks are needed to adequately describe observed noise, variability and heterogeneity of biological systems over a range of scales of biological organization.
16
Blangiardo M, Richardson S. A Bayesian calibration model for combining different pre-processing methods in Affymetrix chips. BMC Bioinformatics 2008; 9:512. [PMID: 19046434 PMCID: PMC2639433 DOI: 10.1186/1471-2105-9-512] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/12/2008] [Accepted: 12/01/2008] [Indexed: 11/10/2022] [Open Access]
Abstract
BACKGROUND In gene expression studies a key role is played by the so called "pre-processing", a series of steps designed to extract the signal and account for the sources of variability due to the technology used rather than to biological differences between the RNA samples. At the moment there is no commonly agreed gold standard pre-processing method and each researcher has the responsibility to choose one method, incurring the risk of false positive and false negative features arising from the particular method chosen. RESULTS We propose a Bayesian calibration model that makes use of the information provided by several pre-processing methods and we show that this model gives a better assessment of the 'true' unknown differential expression between two conditions. We demonstrate how to estimate the posterior distribution of the differential expression values of interest from the combined information. CONCLUSION On simulated data and on the spike-in Latin Square dataset from Affymetrix the Bayesian calibration model proves to have more power than each pre-processing method. Its biological interest is demonstrated through an experimental example on publicly available data.
Affiliation(s)
- Marta Blangiardo
- Centre for Biostatistics, Imperial College, St Mary's Campus, Norfolk Place, London, UK.
17
Fostel J. Toxicogenomics Data and Databases. Genomics 2008. [DOI: 10.3109/9781420067064-15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
18
Astrand M, Mostad P, Rudemo M. Empirical Bayes models for multiple probe type microarrays at the probe level. BMC Bioinformatics 2008; 9:156. [PMID: 18366694 PMCID: PMC2358895 DOI: 10.1186/1471-2105-9-156] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 10/31/2007] [Accepted: 03/20/2008] [Indexed: 12/02/2022] [Open Access]
Abstract
Background When analyzing microarray data a primary objective is often to find differentially expressed genes. With empirical Bayes and penalized t-tests the sample variances are adjusted towards a global estimate, producing more stable results compared to ordinary t-tests. However, for Affymetrix type data a clear dependency between variability and intensity-level generally exists, even for logged intensities, most clearly for data at the probe level but also for probe-set summaries such as the MAS5 expression index. As a consequence, adjustment towards a global estimate results in an intensity-level dependent false positive rate. Results We propose two new methods for finding differentially expressed genes, Probe level Locally moderated Weighted median-t (PLW) and Locally Moderated Weighted-t (LMW). Both methods use an empirical Bayes model taking the dependency between variability and intensity-level into account. A global covariance matrix is also used allowing for differing variances between arrays as well as array-to-array correlations. PLW is specially designed for Affymetrix type arrays (or other multiple-probe arrays). Instead of making inference on probe-set summaries, comparisons are made separately for each perfect-match probe and are then summarized into one score for the probe-set. Conclusion The proposed methods are compared to 14 existing methods using five spike-in data sets. For RMA and GCRMA processed data, PLW has the most accurate ranking of regulated genes in four out of the five data sets, and LMW consistently performs better than all examined moderated t-tests when used on RMA, GCRMA, and MAS5 expression indexes.
Affiliation(s)
- Magnus Astrand
- Mathematical Sciences, Chalmers University of Technology, and Mathematical Sciences, Göteborg University, S-41296, Göteborg, Sweden.
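As a rough illustration of the variance adjustment mentioned in the abstract above (sample variances shrunk towards a global estimate), the following Python sketch computes a moderated two-sample t-statistic. It is not the PLW or LMW method: the function names, the global variance `prior_var`, and the prior degrees of freedom `prior_df` are assumptions made for this example, and the shrinkage step is the usual weighted average of the global and per-gene variances.

```python
import math
from statistics import mean, variance


def moderated_t(group_a, group_b, prior_var, prior_df):
    """Two-sample t-statistic with the pooled variance shrunk toward prior_var.

    prior_var and prior_df are hypothetical inputs standing in for a global
    variance estimate and the weight given to it.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Ordinary pooled sample variance for this gene.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    # Weighted average of the global estimate and the per-gene estimate.
    shrunk_var = (prior_df * prior_var + df * pooled_var) / (prior_df + df)
    std_err = math.sqrt(shrunk_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err


def ordinary_t(group_a, group_b):
    """Ordinary pooled-variance t-statistic (no shrinkage)."""
    return moderated_t(group_a, group_b, prior_var=0.0, prior_df=0)
```

With only a few replicates per group, a very small per-gene variance inflates the ordinary statistic; shrinking towards a larger global variance pulls it back. The abstract's point is that a single global target ignores the dependency between variability and intensity level, which a sketch like this does not address.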
|
19
|
Chen MH, Ibrahim JG, Chi YY. A new class of mixture models for differential gene expression in DNA microarray data. J Stat Plan Inference 2008; 138:387-404. [PMID: 19672331 DOI: 10.1016/j.jspi.2007.06.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 12/30/2022]
Abstract
One of the fundamental issues in analyzing microarray data is to determine which genes are expressed and which ones are not for a given group of subjects. In datasets where many genes are expressed and many are not expressed (i.e., underexpressed), a bimodal distribution for the gene expression levels often results, where one mode of the distribution represents the expressed genes and the other mode represents the underexpressed genes. To model this bimodality, we propose a new class of mixture models that utilize a random threshold value for accommodating bimodality in the gene expression distribution. Theoretical properties of the proposed model are carefully examined. We use this new model to examine the problem of differential gene expression between two groups of subjects, develop prior distributions, and derive a new criterion for determining which genes are differentially expressed between the two groups. Prior elicitation is carried out using empirical Bayes methodology in order to estimate the threshold value as well as elicit the hyperparameters for the two component mixture model. The new gene selection criterion is demonstrated via several simulations to have excellent false positive rate and false negative rate properties. A gastric cancer dataset is used to motivate and illustrate the proposed methodology.
Affiliation(s)
- Ming-Hui Chen
- Department of Statistics, University of Connecticut, Storrs, CT 06269, USA
|
20
|
Rattray M, Liu X, Sanguinetti G, Milo M, Lawrence ND. Propagating uncertainty in microarray data analysis. Brief Bioinform 2008; 7:37-47. [PMID: 16761363 DOI: 10.1093/bib/bbk003] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Indexed: 11/14/2022]
Abstract
Microarray technology is associated with many sources of experimental uncertainty. In this review we discuss a number of approaches for dealing with this uncertainty in the processing of data from microarray experiments. We focus here on the analysis of high-density oligonucleotide arrays, such as the popular Affymetrix GeneChip array, which contain multiple probes for each target. This set of probes can be used to determine an estimate for the target concentration and can also be used to determine the experimental uncertainty associated with this measurement. This measurement uncertainty can then be propagated through the downstream analysis using probabilistic methods. We give examples showing how these credibility intervals can be used to help identify differential expression, to combine information from replicated experiments and to improve the performance of principal component analysis.
Affiliation(s)
- Magnus Rattray
- School of Computer Science, University of Manchester, Manchester M13 9PL, UK.
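One simple way to picture the idea of combining information from replicated experiments, as mentioned in the abstract above, is a precision-weighted average of per-replicate expression estimates. The Python sketch below assumes independent Gaussian measurement errors with known standard errors; this is an illustrative assumption, not necessarily the approach taken in the review, and the function name `combine_replicates` is hypothetical.

```python
import math


def combine_replicates(estimates, std_errors):
    """Precision-weighted mean of replicate estimates and its standard error.

    Assumes independent Gaussian errors; each replicate is weighted by the
    inverse of its variance, so uncertain measurements count for less.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    combined = sum(w * x for w, x in zip(weights, estimates)) / total_weight
    combined_se = math.sqrt(1.0 / total_weight)
    return combined, combined_se
```

A replicate with twice the standard error receives a quarter of the weight, so the combined estimate stays close to the better-measured replicate.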
|
21
|
Wu Z, Irizarry RA. A statistical framework for the analysis of microarray probe-level data. Ann Appl Stat 2007. [DOI: 10.1214/07-aoas116] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
|
22
|
Turro E, Bochkina N, Hein AMK, Richardson S. BGX: a Bioconductor package for the Bayesian integrated analysis of Affymetrix GeneChips. BMC Bioinformatics 2007; 8:439. [PMID: 17997843 PMCID: PMC2216047 DOI: 10.1186/1471-2105-8-439] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 09/11/2007] [Accepted: 11/12/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND Affymetrix 3' GeneChip microarrays are widely used to profile the expression of thousands of genes simultaneously. They differ from many other microarray types in that GeneChips are hybridised using a single labelled extract and because they contain multiple 'match' and 'mismatch' sequences for each transcript. Most algorithms extract the signal from GeneChip experiments in a sequence of separate steps, including background correction and normalisation, which inhibits the simultaneous use of all available information. They principally provide a point estimate of gene expression and, in contrast to BGX, do not fully integrate the uncertainty arising from potentially heterogeneous responses of the probes. RESULTS BGX is a new Bioconductor R package that implements an integrated Bayesian approach to the analysis of 3' GeneChip data. The software takes into account additive and multiplicative error, non-specific hybridisation and replicate summarisation in the spirit of the model outlined in [1]. It also provides a posterior distribution for the expression of each gene. Moreover, BGX can take into account probe affinity effects from probe sequence information where available. The package employs a novel adaptive Markov chain Monte Carlo (MCMC) algorithm that considerably raises the efficiency with which the posterior distributions are sampled. Finally, BGX incorporates various ways to analyse the results, such as ranking genes by expression level as well as statistically based methods for estimating the number of up- and down-regulated genes between two conditions. CONCLUSION BGX performs well relative to other widely used methods at estimating expression levels and fold changes. It has the advantage that it provides a statistically sound measure of uncertainty for its estimates. BGX includes various analysis functions to visualise and exploit the rich output that is produced by the Bayesian model.
Affiliation(s)
- Ernest Turro
- Centre for Biostatistics, Imperial College London, UK.
|
23
|
Fostel JM. Future of toxicogenomics and safety signatures: balancing public access to data with proprietary drug discovery. Pharmacogenomics 2007; 8:425-30. [PMID: 17465705 DOI: 10.2217/14622416.8.5.425] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/21/2022]
|
24
|
Including probe-level uncertainty in model-based gene expression clustering. BMC Bioinformatics 2007; 8:98. [PMID: 17376221 PMCID: PMC1847531 DOI: 10.1186/1471-2105-8-98] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 09/14/2006] [Accepted: 03/21/2007] [Indexed: 11/30/2022]
Abstract
Background Clustering is an important analysis performed on microarray gene expression data since it groups genes which have similar expression patterns and enables the exploration of unknown gene functions. Microarray experiments are associated with many sources of experimental and biological variation and the resulting gene expression data are therefore very noisy. Many heuristic and model-based clustering approaches have been developed to cluster this noisy data. However, few of them include consideration of probe-level measurement error which provides rich information about technical variability. Results We augment a standard model-based clustering method to incorporate probe-level measurement error. Using probe-level measurements from a recently developed Affymetrix probe-level model, multi-mgMOS, we include the probe-level measurement error directly into the standard Gaussian mixture model. Our augmented model is shown to provide improved clustering performance on simulated datasets and a real mouse time-course dataset. Conclusion The performance of model-based clustering of gene expression data is improved by including probe-level measurement error and more biologically meaningful clustering results are obtained.
|
25
|
Ahmed FE. Microarray RNA transcriptional profiling: part II. Analytical considerations and annotation. Expert Rev Mol Diagn 2006; 6:703-15. [PMID: 17009905 DOI: 10.1586/14737159.6.5.703] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/08/2022]
Abstract
This review summarizes the various data filtration, transformation and normalization processes for different array platforms (cDNA, oligos, one- and two-color), data analysis methods and their validation, and databases and annotation for RNA transcriptional profiling microarrays. This review is intended to introduce the beginner to the analyses and interpretation of gene expression studies using a nonmathematical approach for easier comprehension. Microarray analysis is not a trivial undertaking as there is no single method that works well for all, and results obtained from these analyses should be considered as a complement to other approaches.
Affiliation(s)
- Farid E Ahmed
- Clinical Professor, East Carolina University, Department of Radiation Oncology, LSB 014, Leo W. Jenkins Cancer Center, The Brody School of Medicine, Greenville, NC 27858, USA.
|
26
|
A hierarchical Naïve Bayes Model for handling sample heterogeneity in classification problems: an application to tissue microarrays. BMC Bioinformatics 2006; 7:514. [PMID: 17125514 PMCID: PMC1698579 DOI: 10.1186/1471-2105-7-514] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.6] [Received: 07/20/2006] [Accepted: 11/24/2006] [Indexed: 11/10/2022]
Abstract
Background Uncertainty often affects molecular biology experiments and data for different reasons. Heterogeneity of gene or protein expression within the same tumor tissue is an example of biological uncertainty which should be taken into account when molecular markers are used in decision making. Tissue Microarray (TMA) experiments allow for large scale profiling of tissue biopsies, investigating protein patterns characterizing specific disease states. TMA studies deal with multiple sampling of the same patient, and therefore with multiple measurements of the same protein target, to account for possible biological heterogeneity. The aim of this paper is to provide and validate a classification model taking into consideration the uncertainty associated with measuring replicate samples. Results We propose an extension of the well-known Naïve Bayes classifier, which accounts for biological heterogeneity in a probabilistic framework, relying on Bayesian hierarchical models. The model, which can be efficiently learned from the training dataset, exploits a closed-form classification equation, thus incurring no additional computational cost with respect to the standard Naïve Bayes classifier. We validated the approach on several simulated datasets, comparing its performance with that of the Naïve Bayes classifier. Moreover, we demonstrated that explicitly dealing with heterogeneity can improve classification accuracy on a TMA prostate cancer dataset. Conclusion The proposed Hierarchical Naïve Bayes classifier can be conveniently applied in problems where within-sample heterogeneity must be taken into account, such as TMA experiments and biological contexts where several measurements (replicates) are available for the same biological sample. The performance of the new approach is better than that of the standard Naïve Bayes model, in particular when the within-sample heterogeneity differs between classes.
|
27
|
Abstract
We consider a new frequentist gene expression index for Affymetrix oligonucleotide DNA arrays, using a probe intensity model similar to the one suggested by Hein and others (2005), called the Bayesian gene expression index (BGX). According to this model, the perfect match and mismatch values are assumed to be correlated as a result of sharing a common gene expression signal. Rather than a Bayesian approach, we develop a maximum likelihood algorithm for estimating the underlying common signal. In this way, estimation is explicit and much faster than the BGX implementation. The observed Fisher information matrix, rather than a posterior credibility interval, gives an idea of the accuracy of the estimators. We evaluate our method using benchmark spike-in data sets from Affymetrix and GeneLogic by analyzing the relationship between estimated signal and concentration, i.e. true signal, and compare our results with other commonly used methods.
Affiliation(s)
- Vilda Purutçuoglu
- Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
|
28
|
Abstract
We present a Bayesian hierarchical model for detecting differentially expressing genes that includes simultaneous estimation of array effects, and show how to use the output for choosing lists of genes for further investigation. We give empirical evidence that expression-level dependent array effects are needed, and explore different nonlinear functions as part of our model-based approach to normalization. The model includes gene-specific variances but imposes some necessary shrinkage through a hierarchical structure. Model criticism via posterior predictive checks is discussed. Modeling the array effects (normalization) simultaneously with differential expression gives fewer false positive results. To choose a list of genes, we propose to combine various criteria (for instance, fold change and overall expression) into a single indicator variable for each gene. The posterior distribution of these variables is used to pick the list of genes, thereby taking into account uncertainty in parameter estimates. In an application to mouse knockout data, Gene Ontology annotations over- and underrepresented among the genes on the chosen list are consistent with biological expectations.
Affiliation(s)
- Alex Lewin
- Department of Epidemiology and Public Health, Imperial College, Norfolk Place, London W2 1PG, UK.
|
29
|
Hein AMK, Richardson S. A powerful method for detecting differentially expressed genes from GeneChip arrays that does not require replicates. BMC Bioinformatics 2006; 7:353. [PMID: 16857053 PMCID: PMC1586027 DOI: 10.1186/1471-2105-7-353] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 03/02/2006] [Accepted: 07/20/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Studies of differential expression that use Affymetrix GeneChip arrays are often carried out with a limited number of replicates. Reasons for this include financial considerations and limits on the available amount of RNA for sample preparation. In addition, failed hybridizations are not uncommon, leading to a further reduction in the number of replicates available for analysis. Most existing methods for studying differential expression rely on the availability of replicates, and the demand for alternative methods that require few or no replicates is high. RESULTS We describe a statistical procedure for performing differential expression analysis without replicates. The procedure relies on a Bayesian integrated approach (BGX) to the analysis of Affymetrix GeneChips. The BGX method estimates a posterior distribution of expression for each gene and condition, from a simultaneous consideration of the available probe intensities representing the gene in a condition. Importantly, posterior distributions of expression are obtained regardless of the number of replicates available. We exploit these posterior distributions to create ranked gene lists that take into account the estimated expression difference as well as its associated uncertainty. We estimate the proportion of non-differentially expressed genes empirically, allowing an informed choice of cut-off for the ranked gene list, adapting an approach proposed by Efron. We assess the performance of the method, and compare it to those of other methods, on publicly available spike-in data sets, as well as in a proper biological setting. CONCLUSION The method presented is a powerful tool for extracting information on differential expression from GeneChip expression studies with limited or no replicates.
Affiliation(s)
- Anne-Mette K Hein
- Dept. of Epidemiology and Public Health, Imperial College London, Norfolk Place, London, UK
- Sylvia Richardson
- Dept. of Epidemiology and Public Health, Imperial College London, Norfolk Place, London, UK
|
30
|
Nykter M, Aho T, Ahdesmäki M, Ruusuvuori P, Lehmussola A, Yli-Harja O. Simulation of microarray data with realistic characteristics. BMC Bioinformatics 2006; 7:349. [PMID: 16848902 PMCID: PMC1574357 DOI: 10.1186/1471-2105-7-349] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Received: 11/15/2005] [Accepted: 07/18/2006] [Indexed: 02/07/2023]
Abstract
Background Microarray technologies have become common tools in biological research. As a result, a need for effective computational methods for data analysis has emerged. Numerous different algorithms have been proposed for analyzing the data. However, an objective evaluation of the proposed algorithms is not possible due to the lack of biological ground truth information. To overcome this fundamental problem, the use of simulated microarray data for algorithm validation has been proposed. Results We present a microarray simulation model which can be used to validate different kinds of data analysis algorithms. The proposed model is unique in the sense that it includes all the steps that affect the quality of real microarray data. These steps include the simulation of biological ground truth data, applying biological and measurement technology specific error models, and finally simulating the microarray slide manufacturing and hybridization. After all these steps are taken into account, the simulated data has realistic biological and statistical characteristics. The applicability of the proposed model is demonstrated by several examples. Conclusion The proposed microarray simulation model is modular and can be used in different kinds of applications. It includes several error models that have been proposed earlier and it can be used with different types of input data. The model can be used to simulate both spotted two-channel and oligonucleotide based single-channel microarrays. All this makes the model a valuable tool, for example, in the validation of data analysis algorithms.
Affiliation(s)
- Matti Nykter
- Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
- Tommi Aho
- Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
- Miika Ahdesmäki
- Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
- Pekka Ruusuvuori
- Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
- Antti Lehmussola
- Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
- Olli Yli-Harja
- Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
|
31
|
Liu X, Milo M, Lawrence ND, Rattray M. Probe-level measurement error improves accuracy in detecting differential gene expression. Bioinformatics 2006; 22:2107-13. [PMID: 16820429 DOI: 10.1093/bioinformatics/btl361] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.3] [Indexed: 11/14/2022]
Abstract
MOTIVATION Finding differentially expressed genes is a fundamental objective of a microarray experiment. Numerous methods have been proposed to perform this task. Existing methods are based on point estimates of gene expression level obtained from each microarray experiment. This approach discards potentially useful information about measurement error that can be obtained from an appropriate probe-level analysis. Probabilistic probe-level models can be used to measure gene expression and also provide a level of uncertainty in this measurement. This probe-level measurement error provides useful information which can help in the identification of differentially expressed genes. RESULTS We propose a Bayesian method to incorporate probe-level measurement error into the detection of differentially expressed genes from replicated experiments. A variational approximation is used for efficient parameter estimation. We compare this approximation with MAP and MCMC parameter estimation in terms of computational efficiency and accuracy. The method is used to calculate the probability of positive log-ratio (PPLR) of expression levels between conditions. Using the measurements from a recently developed Affymetrix probe-level model, multi-mgMOS, we test PPLR on a spike-in dataset and a mouse time-course dataset. Results show that the inclusion of probe-level measurement error improves accuracy in detecting differential gene expression. AVAILABILITY The MAP approximation and variational inference described in this paper have been implemented in an R package pplr. The MCMC method is implemented in Matlab. Both software packages are available from http://umber.sbs.man.ac.uk/resources/puma.
Affiliation(s)
- Xuejun Liu
- School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK
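To make the probability of positive log-ratio (PPLR) named in the abstract above more concrete, the Python sketch below evaluates it in the simplest possible setting: the log expression level under each condition is assumed to have an independent Gaussian posterior with known mean and variance. That assumption, and the function names, are made for illustration only; the paper itself estimates the posteriors with variational, MAP, or MCMC methods, which this sketch does not reproduce.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pplr(mean_a, var_a, mean_b, var_b):
    """P(log expression in condition A exceeds that in condition B).

    Under independent Gaussian posteriors the difference is Gaussian with
    mean (mean_a - mean_b) and variance (var_a + var_b).
    """
    return normal_cdf((mean_a - mean_b) / math.sqrt(var_a + var_b))
```

Values near 1 suggest up-regulation in condition A, values near 0 suggest down-regulation, and values near 0.5 indicate no clear difference; larger posterior variances pull the value towards 0.5 for the same mean difference.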
|
32
|
Liu X, Milo M, Lawrence ND, Rattray M. A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips. Bioinformatics 2005; 21:3637-44. [PMID: 16020470 DOI: 10.1093/bioinformatics/bti583] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.1] [Indexed: 11/14/2022]
Abstract
MOTIVATION Affymetrix GeneChip arrays are currently the most widely used microarray technology. Many summarization methods have been developed to provide gene expression levels from Affymetrix probe-level data. Most of the currently popular methods do not provide a measure of uncertainty for the expression level of each gene. The use of probabilistic models can overcome this limitation. A full hierarchical Bayesian approach requires the use of computationally intensive MCMC methods that are impractical for large datasets. An alternative computationally efficient probabilistic model, mgMOS, uses Gamma distributions to model specific and non-specific binding with a latent variable to capture variations in probe affinity. Although promising, the main limitations of this model are that it does not use information from multiple chips and does not account for specific binding to the mismatch (MM) probes. RESULTS We extend mgMOS to model the binding affinity of probe-pairs across multiple chips and to capture the effect of specific binding to MM probes. The new model, multi-mgMOS, provides improved accuracy, as demonstrated on some benchmark datasets and a real time-course dataset, and is much more computationally efficient than a competing hierarchical Bayesian approach that requires MCMC sampling. We demonstrate how the probabilistic model can be used to estimate credibility intervals for expression levels and their log-ratios between conditions. AVAILABILITY Both mgMOS and the new model multi-mgMOS have been implemented in an R package, which is available at http://www.bioinf.man.ac.uk/resources/puma.
Affiliation(s)
- Xuejun Liu
- School of Computer Science, University of Manchester, Manchester M13 9PL, UK
|