1. Liang Y, Kelemen A, Kelemen A. Reproducibility of biomarker identifications from mass spectrometry proteomic data in cancer studies. Stat Appl Genet Mol Biol 2019; 18:sagmb-2018-0039. [PMID: 31077580 DOI: 10.1515/sagmb-2018-0039] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
Reproducibility of disease signatures and clinical biomarkers in multi-omics disease analysis has been a key challenge due to a multitude of factors. The heterogeneity of the limited sample, various biological factors such as environmental confounders, and the inherent experimental and technical noise, compounded with the inadequacy of statistical tools, can lead to the misinterpretation of results, and subsequently very different biology. In this paper, we investigate the biomarker reproducibility issues, potentially caused by differences of statistical methods with varied distribution assumptions or marker selection criteria, using mass spectrometry proteomic ovarian tumor data. We examine the relationship between effect sizes, p values, Cauchy p values, False Discovery Rate p values, and the rank fractions of identified proteins out of thousands in the limited heterogeneous sample. We compared the markers identified from statistical single-feature selection approaches with machine learning wrapper methods. The results reveal marked differences when selecting the protein markers from varied methods with potential selection biases and false discoveries, which may be due to the small effects, different distribution assumptions, and p value type criteria versus prediction accuracies. The alternative solutions and other related issues are discussed in supporting the reproducibility of findings for clinically actionable outcomes.
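As an illustration of how False Discovery Rate p values can reorder or thin out a marker list relative to raw p values, the following is a minimal sketch of the Benjamini-Hochberg adjustment. The abstract does not say which FDR procedure the authors used, so the choice of Benjamini-Hochberg, the function name `benjamini_hochberg`, and the example p values are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order.

    Assumption: BH is used here only as a common FDR procedure; the
    source abstract does not name a specific method.
    """
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-protein p values
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw p = {p:.3f}  adjusted = {q:.3f}")
```

A protein that passes a raw threshold of 0.05 may or may not pass after adjustment, which is one way the selection criterion alone can change the reported marker set.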
Affiliation(s)
- Yulan Liang: Department of Family and Community Health, University of Maryland, Baltimore, MD 21201-1579, USA
- Adam Kelemen: Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Arpad Kelemen: Department of Organizational Systems and Adult Health, University of Maryland, Baltimore, MD 21201-1579, USA

2. Liang Y, Kelemen A. Dynamic modeling and network approaches for omics time course data: overview of computational approaches and applications. Brief Bioinform 2019; 19:1051-1068. [PMID: 28430854 DOI: 10.1093/bib/bbx036] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 10/21/2016] [Indexed: 12/23/2022]
Abstract
Inferring networks and dynamics of genes, proteins, cells and other biological entities from high-throughput biological omics data is a central and challenging issue in computational and systems biology. This is essential for understanding the complexity of human health, disease susceptibility and pathogenesis for Predictive, Preventive, Personalized and Participatory (P4) system and precision medicine. The delineation of the possible interactions of all genes/proteins in a genome/proteome is a task for which conventional experimental techniques are ill suited. Urgently needed are rapid and inexpensive computational and statistical methods that can identify interacting candidate disease genes or drug targets out of thousands that can be further investigated or validated by experimentation. Moreover, identifying biological dynamic systems, and simultaneously estimating the important kinetic structural and functional parameters, which may not be experimentally accessible, could be important directions for drug-disease-gene network studies. In this article, we present an overview and comparison of recent developments of dynamic modeling and network approaches for time-course omics data, and their applications to various biological systems, health conditions and disease statuses. Moreover, various data reduction and analytical schemes ranging from mathematical to computational to statistical methods are compared, including their merits, drawbacks and limitations. The most recent software, associated web resources and other potentials for the compared methods are also presented and discussed in detail.
Affiliation(s)
- Yulan Liang: Department of Family and Community Health, University of Maryland, Baltimore, MD, USA
- Arpad Kelemen: Department of Family and Community Health, University of Maryland, Baltimore, MD, USA

3. Santra T, Roche S, Conlon N, O’Donovan N, Crown J, O’Connor R, Kolch W. Identification of potential new treatment response markers and therapeutic targets using a Gaussian process-based method in lapatinib insensitive breast cancer models. PLoS One 2017; 12:e0177058. [PMID: 28481952 PMCID: PMC5421758 DOI: 10.1371/journal.pone.0177058] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/12/2016] [Accepted: 04/23/2017] [Indexed: 12/15/2022]
Abstract
Molecularly targeted therapeutics hold promise of revolutionizing treatments of advanced malignancies. However, a large number of patients do not respond to these treatments. Here, we take a systems biology approach to understand the molecular mechanisms that prevent breast cancer (BC) cells from responding to lapatinib, a dual kinase inhibitor that targets human epidermal growth factor receptor 2 (HER2) and epidermal growth factor receptor (EGFR). To this end, we analysed temporal gene expression profiles of four BC cell lines, two of which respond and the remaining two do not respond to lapatinib. For this analysis, we developed a Gaussian process based algorithm which can accurately find differentially expressed genes by analysing time course gene expression profiles at a fraction of the computational cost of other state-of-the-art algorithms. Our analysis identified 519 potential genes which are characteristic of lapatinib non-responsiveness in the tested cell lines. Data from the Genomics of Drug Sensitivity in Cancer (GDSC) database suggested that the basal expressions of 120 of the above genes correlate with the response of BC cells to HER2 and/or EGFR targeted therapies. We selected 27 genes from the larger panel of 519 genes for experimental verification and 16 of these were successfully validated. Further bioinformatics analysis identified vitamin D receptor (VDR) as a potential target of interest for lapatinib non-responsive BC cells. Experimentally, calcitriol, a commonly used reagent for VDR targeted therapy, in combination with lapatinib additively inhibited proliferation in two HER2 positive cell lines, lapatinib insensitive MDA-MB-453 and lapatinib resistant HCC 1954-L cells.
Affiliation(s)
- Tapesh Santra: Systems Biology Ireland, University College Dublin, Belfield, Dublin, Ireland
- Sandra Roche: National Institute for Cellular Biotechnology, Dublin City University, Dublin, Ireland
- Neil Conlon: National Institute for Cellular Biotechnology, Dublin City University, Dublin, Ireland
- Norma O’Donovan: National Institute for Cellular Biotechnology, Dublin City University, Dublin, Ireland
- John Crown: National Institute for Cellular Biotechnology, Dublin City University, Dublin, Ireland; Department of Medical Oncology, St Vincent’s University Hospital, Dublin, Elm Park, Ireland
- Robert O’Connor: National Institute for Cellular Biotechnology, Dublin City University, Dublin, Ireland
- Walter Kolch: Systems Biology Ireland, University College Dublin, Belfield, Dublin, Ireland; Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin, Ireland; School of Medicine, University College Dublin, Belfield, Dublin, Ireland

4. Identification of Cancer Related Genes Using a Comprehensive Map of Human Gene Expression. PLoS One 2016; 11:e0157484. [PMID: 27322383 PMCID: PMC4913919 DOI: 10.1371/journal.pone.0157484] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 12/02/2015] [Accepted: 05/30/2016] [Indexed: 11/19/2022]
Abstract
Rapid accumulation and availability of gene expression datasets in public repositories have enabled large-scale meta-analyses of combined data. The richness of cross-experiment data has provided new biological insights, including identification of new cancer genes. In this study, we compiled a human gene expression dataset from ∼40,000 publicly available Affymetrix HG-U133Plus2 arrays. After strict quality control and data normalisation the data was quantified in an expression matrix of ∼20,000 genes and ∼28,000 samples. To enable different ways of sample grouping, existing annotations were subjected to systematic ontology assisted categorisation and manual curation. Groups like normal tissues, neoplasmic tissues, cell lines, homoeotic cells and incompletely differentiated cells were created. Unsupervised analysis of the data confirmed global structure of expression consistent with earlier analysis but with more details revealed due to increased resolution. A suitable mixed-effects linear model was used to further investigate gene expression in solid tissue tumours, and to compare these with the respective healthy solid tissues. The analysis identified 1,285 genes with systematic expression change in cancer. The list is significantly enriched with known cancer genes from large, public, peer-reviewed databases, whereas the remaining ones are proposed as new cancer gene candidates. The compiled dataset is publicly available in the ArrayExpress Archive. It contains the most diverse collection of biological samples, making it the largest systematically annotated gene expression dataset of its kind in the public domain.

5. Lindberg RLP, Kappos L. Transcriptional profiling of multiple sclerosis: towards improved diagnosis and treatment. Expert Rev Mol Diagn 2014; 6:843-55. [PMID: 17140371 DOI: 10.1586/14737159.6.6.843] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 11/08/2022]
Abstract
The development of high-throughput techniques, for example cDNA and oligonucleotide microarrays, for simultaneous analysis of the transcriptional expression of thousands of genes, even the entire genome, has provided new possibilities to get better insights into the pathogenesis of various diseases. This technology has also been applied to define biomarkers and, most importantly, possible new candidate targets for novel treatments. In multiple sclerosis, microarray studies have been performed on brain autopsy and biopsy specimens and peripheral blood. The effects of current treatments for multiple sclerosis, especially interferon-beta and glatiramer acetate, on transcriptional profiles, have also been investigated. We review the main findings revealed from these studies. The emerging potential of microarray technology to define gene signatures, diagnostic and prognostic markers for disease course, and treatment response in multiple sclerosis will be discussed.
Affiliation(s)
- Raija L P Lindberg: Outpatient Clinic Neurology-Neurosurgery and Department of Research, Pharmazentrum University Hospital Basel, Petersgraben 4, 4031 Basel, Switzerland.

6. Time series expression analyses using RNA-seq: a statistical approach. BioMed Research International 2013; 2013:203681. [PMID: 23586021 PMCID: PMC3622290 DOI: 10.1155/2013/203681] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 10/06/2012] [Revised: 01/10/2013] [Accepted: 01/15/2013] [Indexed: 11/29/2022]
Abstract
RNA-seq is becoming the de facto standard approach for transcriptome analysis with ever-reducing cost. It has considerable advantages over conventional technologies (microarrays) because it allows for direct identification and quantification of transcripts. Many time series RNA-seq datasets have been collected to study the dynamic regulations of transcripts. However, statistically rigorous and computationally efficient methods are needed to explore the time-dependent changes of gene expression in biological systems. These methods should explicitly account for the dependencies of expression patterns across time points. Here, we discuss several methods that can be applied to model timecourse RNA-seq data, including statistical evolutionary trajectory index (SETI), autoregressive time-lagged regression (AR(1)), and hidden Markov model (HMM) approaches. We use three real datasets and simulation studies to demonstrate the utility of these dynamic methods in temporal analysis.

7. Olex AL, Hiltbold EM, Leng X, Fetrow JS. Dynamics of dendritic cell maturation are identified through a novel filtering strategy applied to biological time-course microarray replicates. BMC Immunol 2010; 11:41. [PMID: 20682054 PMCID: PMC2928180 DOI: 10.1186/1471-2172-11-41] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 09/27/2009] [Accepted: 08/03/2010] [Indexed: 01/04/2023]
Abstract
Background: Dendritic cells (DC) play a central role in primary immune responses and become potent stimulators of the adaptive immune response after undergoing the critical process of maturation. Understanding the dynamics of DC maturation would provide key insights into this important process. Time course microarray experiments can provide unique insights into DC maturation dynamics. Replicate experiments are necessary to address the issues of experimental and biological variability. Statistical methods and averaging are often used to identify significant signals. Here a novel strategy for filtering of replicate time course microarray data, which identifies consistent signals between the replicates, is presented and applied to a DC time course microarray experiment.
Results: The temporal dynamics of DC maturation were studied by stimulating DC with poly(I:C) and following gene expression at 5 time points from 1 to 24 hours. The novel filtering strategy uses standard statistical and fold change techniques, along with the consistency of replicate temporal profiles, to identify those differentially expressed genes that were consistent in two biological replicate experiments. To address the issue of cluster reproducibility a consensus clustering method, which identifies clusters of genes whose expression varies consistently between replicates, was also developed and applied. Analysis of the resulting clusters revealed many known and novel characteristics of DC maturation, such as the up-regulation of specific immune response pathways. Intriguingly, more genes were down-regulated than up-regulated. Results identify a more comprehensive program of down-regulation, including many genes involved in protein synthesis, metabolism, and housekeeping needed for maintenance of cellular integrity and metabolism.
Conclusions: The new filtering strategy emphasizes the importance of consistent and reproducible results when analyzing microarray data and utilizes consistency between replicate experiments as a criterion in both feature selection and clustering, without averaging or otherwise combining replicate data. Observation of a significant down-regulation program during DC maturation indicates that DC are preparing for cell death and provides a path to better understand the process. This new filtering strategy can be adapted for use in analyzing other large-scale time course data sets with replicates.
Affiliation(s)
- Amy L Olex: Department of Computer Science, Wake Forest University, Winston-Salem, NC 27109, USA

8. Farnsworth A, Flaman AS, Prasad SS, Gravel C, Williams A, Yauk CL, Li X. Acetaminophen modulates the transcriptional response to recombinant interferon-beta. PLoS One 2010; 5:e11031. [PMID: 20544007 PMCID: PMC2882945 DOI: 10.1371/journal.pone.0011031] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 12/30/2009] [Accepted: 05/13/2010] [Indexed: 12/17/2022]
Abstract
Background: Recombinant interferon treatment can result in several common side effects including fever and injection-site pain. Patients are often advised to use acetaminophen or other over-the-counter pain medications as needed. Little is known regarding the transcriptional changes induced by such co-administration.
Methodology/Principal Findings: We tested whether the administration of acetaminophen causes a change in the response normally induced by interferon-β treatment. CD-1 mice were administered acetaminophen (APAP), interferon-β (IFN-β) or a combination of IFN-β+APAP and liver and serum samples were collected for analysis. Differential gene expression was determined using an Agilent 22 k whole mouse genome microarray. Data were analyzed by several methods including Gene Ontology term clustering and Gene Set Enrichment Analysis. We observed a significant change in the transcription profile of hepatic cells when APAP was co-administered with IFN-β. These transcriptional changes included a marked up-regulation of genes involved in signal transduction and cell differentiation and down-regulation of genes involved in cellular metabolism, trafficking and the IκBK/NF-κB cascade. Additionally, we observed a large decrease in the expression of several IFN-induced genes including Ifit-3, Isg-15, Oasl1, Zbp1 and predicted gene EG634650 at both early and late time points.
Conclusions/Significance: A significant change in the transcriptional response was observed following co-administration of IFN-β+APAP relative to IFN-β treatment alone. These results suggest that administration of acetaminophen has the potential to modify the efficacy of IFN-β treatment.
Affiliation(s)
- Aaron Farnsworth: Centre for Vaccine Evaluation, Biologics and Genetic Therapies Directorate, Health Canada, Ottawa, Ontario, Canada.

9.
Abstract
In this chapter, we discuss a number of approaches to network inference from large-scale functional genomics data. Our goal is to describe current methods that can be used to infer predictive networks. At present, one of the most effective methods to produce networks with predictive value is the Bayesian network approach. This approach was initially instantiated by Friedman et al. and further refined by Eric Schadt and his research group. The Bayesian network approach has the virtue of identifying predictive relationships between genes from a combination of expression and eQTL data. However, the approach does not provide a mechanistic basis for predictive relationships and is ultimately hampered by an inability to model feedback. A challenge for the future is to produce networks that are both predictive and provide mechanistic understanding. To do so, the methods described in several chapters of this book will need to be integrated. Other chapters of this book describe a number of methods to identify or predict network components such as physical interactions. At the end of this chapter, we speculate that some of the approaches from other chapters could be integrated and used to "annotate" the edges of the Bayesian networks. This would take the Bayesian networks one step closer to providing mechanistic "explanations" for the relationships between the network nodes.
Affiliation(s)
- Roger E Bumgarner: Department of Microbiology, University of Washington, Seattle, WA, USA

10.

11. Liang Y, Kelemen A. Bayesian models and meta analysis for multiple tissue gene expression data following corticosteroid administration. BMC Bioinformatics 2008; 9:354. [PMID: 18755028 PMCID: PMC2579308 DOI: 10.1186/1471-2105-9-354] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 01/02/2008] [Accepted: 08/28/2008] [Indexed: 11/29/2022]
Abstract
Background: This paper addresses key biological problems and statistical issues in the analysis of large gene expression data sets that describe systemic temporal response cascades to therapeutic doses in multiple tissues such as liver, skeletal muscle, and kidney from the same animals. Affymetrix time course gene expression data U34A are obtained from three different tissues including kidney, liver and muscle. Our goal is not only to find the concordance of genes in different tissues and identify the common differentially expressed genes over time, but also to examine the reproducibility of the findings by integrating the results through meta-analysis from multiple tissues in order to gain a significant increase in the power of detecting differentially expressed genes over time and to find the differential differences of three tissues responding to the drug.
Results and Conclusion: A Bayesian categorical model for estimating the proportion of the 'call' is used for pre-screening genes. A hierarchical Bayesian mixture model is further developed for the identification of differentially expressed genes across time and dynamic clusters. The deviance information criterion is applied to determine the number of components for model comparisons and selections. The Bayesian mixture model produces the gene-specific posterior probability of differential/non-differential expression and the 95% credible interval, which is the basis for our further Bayesian meta-inference. Meta-analysis is performed in order to identify commonly expressed genes from multiple tissues that may serve as ideal targets for novel treatment strategies and to integrate the results across separate studies. We have found the common expressed genes in the three tissues. However, the up/down/no regulations of these common genes are different at different time points. Moreover, the most differentially expressed genes were found in the liver, then in kidney, and then in muscle.
Affiliation(s)
- Yulan Liang: Department of Organizational Systems and Adult Health, University of Maryland, 655 W. Lombard Street, Baltimore, MD 21201-1579, USA
- Arpad Kelemen: Department of Neurology, Buffalo Neuroimaging Analysis Center, The Jacobs Neurological Institute, University at Buffalo, The State University of New York, 100 High Street, Buffalo, NY 14203, USA

12. Liang Y, Kelemen A. Bayesian state space models for inferring and predicting temporal gene expression profiles. Biom J 2008; 49:801-14. [PMID: 17638289 DOI: 10.1002/bimj.200610335] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/08/2022]
Abstract
Prediction of gene dynamic behavior is a challenging and important problem in genomic research, while estimating the temporal correlations and non-stationarity are the keys in this process. Unfortunately, most existing techniques used for the inclusion of the temporal correlations treat the time course as evenly distributed time intervals and use stationary models with time-invariant settings. This is an assumption that is often violated in microarray time course data since the time course expression data are at unequal time points, where the difference in sampling times varies from minutes to days. Furthermore, the unevenly spaced short time courses with sudden changes make the prediction of genetic dynamics difficult. In this paper, we develop two types of Bayesian state space models to tackle this challenge for inferring and predicting the gene expression profiles associated with diseases. In the univariate time-varying Bayesian state space models we treat both the stochastic transition matrix and the observation matrix as time-variant with a linear setting and point out that this can easily be extended to a nonlinear setting. In the multivariate Bayesian state space model we include temporal correlation structures in the covariance matrix estimations. In both models, the unevenly spaced short time courses with unseen time points are treated as hidden state variables. Bayesian approaches with various prior and hyper-prior models with MCMC algorithms are used to estimate the model parameters and hidden variables. We apply our models to multiple tissue polygenetic Affymetrix data sets. Results show that the predictions of the genomic dynamic behavior can be well captured by the proposed models.
Affiliation(s)
- Yulan Liang: Department of Biostatistics, University at Buffalo, The State University of New York, 252A2 Farber Hall, 3435 Main Street, Buffalo, NY 14214, USA.

13. Comparison of Functions for Filtering Time Course Gene Expression Data with Flat Patterns. Korean Journal of Applied Statistics 2007. [DOI: 10.5351/kjas.2007.20.2.409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]

14. Kaminski N, Bar-Joseph Z. A patient-gene model for temporal expression profiles in clinical studies. J Comput Biol 2007; 14:324-38. [PMID: 17563314 DOI: 10.1089/cmb.2007.0001] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
Pharmacogenomics and clinical studies that measure the temporal expression levels of patients can identify important pathways and biomarkers that are activated during disease progression or in response to treatment. However, researchers face a number of challenges when trying to combine expression profiles from these patients. Unlike studies that rely on lab animals or cell lines, individuals vary in their baseline expression and in their response rate. In this paper we present a generative model for such data. Our model represents patient expression data using two levels, a gene level, which corresponds to a common response pattern, and a patient level, which accounts for the patient-specific expression patterns and response rate. Using an EM algorithm, we infer the parameters of the model. We used our algorithm to analyze multiple sclerosis patient response to interferon-beta. As we show, our algorithm was able to improve upon prior methods for combining patient data. In addition, our algorithm was able to correctly identify patient-specific response patterns.
Affiliation(s)
- Naftali Kaminski: Simmons Center for Interstitial Lung Disease, University of Pittsburgh Medical School, Pittsburgh, Pennsylvania, USA

15. Post hoc pattern matching: assigning significance to statistically defined expression patterns in single channel microarray data. BMC Bioinformatics 2007; 8:240. [PMID: 17615071 PMCID: PMC1934919 DOI: 10.1186/1471-2105-8-240] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 04/13/2007] [Accepted: 07/05/2007] [Indexed: 11/28/2022]
Abstract
Background: Researchers using RNA expression microarrays in experimental designs with more than two treatment groups often identify statistically significant genes with ANOVA approaches. However, the ANOVA test does not discriminate which of the multiple treatment groups differ from one another. Thus, post hoc tests, such as linear contrasts, template correlations, and pairwise comparisons are used. Linear contrasts and template correlations work extremely well, especially when the researcher has a priori information pointing to a particular pattern/template among the different treatment groups. Further, all pairwise comparisons can be used to identify particular, treatment group-dependent patterns of gene expression. However, these approaches are biased by the researcher's assumptions, and some treatment-based patterns may fail to be detected using these approaches. Finally, different patterns may have different probabilities of occurring by chance, importantly influencing researchers' conclusions about a pattern and its constituent genes.
Results: We developed a four-step, post hoc pattern matching (PPM) algorithm to automate single channel gene expression pattern identification/significance. First, one-way Analysis of Variance (ANOVA), coupled with post hoc 'all pairwise' comparisons, is calculated for all genes. Second, for each ANOVA-significant gene, all pairwise contrast results are encoded to create unique pattern ID numbers. The number of genes found in each pattern in the data is identified as that pattern's 'actual' frequency. Third, using Monte Carlo simulations, those patterns' frequencies are estimated in random data ('random' gene pattern frequency). Fourth, a Z-score for overrepresentation of the pattern is calculated ('actual' against 'random' gene pattern frequencies). We wrote a Visual Basic program (StatiGen) that automates the PPM procedure, constructs an Excel workbook with standardized graphs of overrepresented patterns, and lists the genes comprising each pattern. The Visual Basic code, installation files for StatiGen, and sample data are available as supplementary material.
Conclusion: The PPM procedure is designed to augment current microarray analysis procedures by allowing researchers to incorporate all of the information from post hoc tests to establish unique, overarching gene expression patterns in which there is no overlap in gene membership. In our hands, PPM works well for studies using from three to six treatment groups in which the researcher is interested in treatment-related patterns of gene expression. Hardware/software limitations and the extreme number of theoretical expression patterns limit utility for larger numbers of treatment groups. Applied to a published microarray experiment, the StatiGen program successfully flagged patterns that had been manually assigned in prior work, and further identified other gene expression patterns that may be of interest. Thus, over a moderate range of treatment groups, PPM appears to work well. It allows researchers to assign statistical probabilities to patterns of gene expression that fit a priori expectations/hypotheses, it preserves the data's ability to show the researcher interesting, yet unanticipated gene expression patterns, and assigns the majority of ANOVA-significant genes to non-overlapping patterns.
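The pattern-encoding and overrepresentation steps described in this abstract can be illustrated with a short Python sketch. This is not the StatiGen implementation: the ANOVA pre-filter is omitted, the post hoc test is replaced by a simple mean-difference threshold as a stand-in, the random data are standard normal draws, and every function name and parameter below (`pairwise_calls`, `pattern_id`, `pattern_z_scores`, `threshold`, `n_sims`) is hypothetical.

```python
import random
import statistics
from collections import Counter
from itertools import combinations


def pairwise_calls(groups, threshold=1.0):
    """Stand-in for post hoc tests: one call per pair of treatment groups.

    Returns -1 / 0 / +1 (first group lower / no difference / higher),
    based only on a mean-difference threshold (an assumption, not a test).
    """
    calls = []
    for a, b in combinations(range(len(groups)), 2):
        diff = statistics.mean(groups[a]) - statistics.mean(groups[b])
        calls.append(0 if abs(diff) < threshold else (1 if diff > 0 else -1))
    return tuple(calls)


def pattern_id(calls):
    """Encode the ternary pairwise calls as one base-3 integer."""
    pid = 0
    for c in calls:
        pid = pid * 3 + (c + 1)
    return pid


def pattern_counts(dataset, threshold=1.0):
    """Count genes per pattern ID; dataset is a list of genes, each a list of groups."""
    return Counter(pattern_id(pairwise_calls(g, threshold)) for g in dataset)


def random_dataset(n_genes, n_groups, n_reps, rng):
    """Null data with no treatment effect (standard normal, an assumption)."""
    return [[[rng.gauss(0, 1) for _ in range(n_reps)]
             for _ in range(n_groups)] for _ in range(n_genes)]


def pattern_z_scores(dataset, n_sims=200, threshold=1.0, seed=0):
    """Z-score of each observed pattern's frequency against Monte Carlo frequencies."""
    rng = random.Random(seed)
    n_genes, n_groups, n_reps = len(dataset), len(dataset[0]), len(dataset[0][0])
    actual = pattern_counts(dataset, threshold)
    sims = [pattern_counts(random_dataset(n_genes, n_groups, n_reps, rng), threshold)
            for _ in range(n_sims)]
    z = {}
    for pid, count in actual.items():
        freqs = [s.get(pid, 0) for s in sims]
        mu, sd = statistics.mean(freqs), statistics.pstdev(freqs)
        if sd > 0:
            z[pid] = (count - mu) / sd
        else:
            z[pid] = 0.0 if count == mu else float("inf")
    return z
```

With k treatment groups there are k(k-1)/2 pairwise calls and up to 3 to that power possible codes, which is consistent with the abstract's remark that the number of theoretical patterns grows quickly with the number of groups.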

16. Liang Y, Kelemen A. Associating phenotypes with molecular events: recent statistical advances and challenges underpinning microarray experiments. Funct Integr Genomics 2005; 6:1-13. [PMID: 16292543 DOI: 10.1007/s10142-005-0006-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Received: 04/18/2005] [Revised: 06/22/2005] [Accepted: 08/16/2005] [Indexed: 10/25/2022]
Abstract
Progress in mapping the genome and developments in array technologies have provided large amounts of information for delineating the roles of genes involved in complex diseases and quantitative traits. Since complex phenotypes are determined by a network of interrelated biological traits typically involving multiple inter-correlated genetic and environmental factors that interact in a hierarchical fashion, microarrays hold tremendous latent information. The analysis of microarray data is, however, still a bottleneck. In this paper, we review the recent advances in statistical analyses for associating phenotypes with molecular events underpinning microarray experiments. Classical statistical procedures to analyze phenotypes in genetics are reviewed first, followed by descriptions of the statistical procedures for linking molecular events to measured gene expression phenotypes (microarray-based gene expression) and observed phenotypes such as disease status. These statistical procedures include (1) prior analysis, such as data quality controls, and normalization analyses for minimizing the effects of experimental artifacts and random noise; (2) gene selections and differentiation procedures based on inferential statistics for the class comparisons; (3) dynamic temporal patterns analysis through exploratory statistics such as unsupervised clustering and supervised classification and predictions; (4) assessing the reliability of microarray studies using real-time PCR and the reproducibility issues from many studies and multiple platforms. In addition, the post analysis to associate the discovered patterns of gene expression with pathway and functional analysis for selected genes is also considered in order to increase our understanding of interconnected gene processes.
Affiliation(s)
- Yulan Liang: Department of Biostatistics, The State University of New York at Buffalo, Buffalo, NY 14214, USA.