1
Forero DA. Available Software for Meta-analyses of Genome-wide Expression Studies. Curr Genomics 2019; 20:325-331. [PMID: 32476989 PMCID: PMC7235394 DOI: 10.2174/1389202920666190822113912] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/24/2019] [Revised: 07/24/2019] [Accepted: 08/08/2019] [Indexed: 01/24/2023] Open
Abstract
Advances in transcriptomic methods have led to a large number of published Genome-Wide Expression Studies (GWES), in humans and model organisms. For several years, GWES involved the use of microarray platforms to compare genome-expression data for two or more groups of samples of interest. Meta-analysis of GWES is a powerful approach for the identification of differentially expressed genes in biological topics or diseases of interest, combining information from multiple primary studies. In this article, the main features of available software for carrying out meta-analysis of GWES have been reviewed and seven packages from the Bioconductor platform and five packages from the CRAN platform have been described. In addition, nine previously described programs and four online programs are reviewed. Finally, advantages and disadvantages of these available programs and proposed key points for future developments have been discussed.
Affiliation(s)
- Diego A Forero
- PhD Program in Health Sciences, School of Medicine, Universidad Antonio Nariño, Bogotá, Colombia
- Laboratory of NeuroPsychiatric Genetics, Biomedical Sciences Research Group, School of Medicine, Universidad Antonio Nariño, Bogotá, Colombia
2
Chikina MD, Sealfon SC. Increasing consistency of disease biomarker prediction across datasets. PLoS One 2014; 9:e91272. [PMID: 24740471 PMCID: PMC3989170 DOI: 10.1371/journal.pone.0091272] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/24/2013] [Accepted: 02/10/2014] [Indexed: 11/18/2022] Open
Abstract
Microarray studies with human subjects often have limited sample sizes, which hampers the ability to detect reliable biomarkers associated with disease and motivates the need to aggregate data across studies. However, human gene expression measurements may be influenced by many non-random factors such as genetics, sample preparations, and tissue heterogeneity. These factors can contribute to a lack of agreement among related studies, limiting the utility of their aggregation. We show that it is feasible to carry out an automatic correction of individual datasets to reduce the effect of such ‘latent variables’ (without prior knowledge of the variables) in such a way that datasets addressing the same condition show better agreement once each is corrected. We build our approach on the method of surrogate variable analysis (SVA), but we demonstrate that the original algorithm is unsuitable for the analysis of human tissue samples that are mixtures of different cell types. We propose a modification to SVA that is crucial to obtaining the improvement in agreement that we observe. We develop our method on a compendium of multiple sclerosis data and verify it on an independent compendium of Parkinson's disease datasets. In both cases, we show that our method is able to improve agreement across varying study designs, platforms, and tissues. This approach has the potential for wide applicability to any field where lack of inter-study agreement has been a concern.
Affiliation(s)
- Maria D. Chikina
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Stuart C. Sealfon
- Department of Neurology, Center for Translational Systems Biology and Department of Neurology, Mount Sinai School of Medicine, New York, New York, United States of America
3
Almeida-de-Macedo MM, Ransom N, Feng Y, Hurst J, Wurtele ES. Comprehensive analysis of correlation coefficients estimated from pooling heterogeneous microarray data. BMC Bioinformatics 2013; 14:214. [PMID: 23822712 PMCID: PMC3765419 DOI: 10.1186/1471-2105-14-214] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 08/28/2012] [Accepted: 06/21/2013] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND: The synthesis of information across microarray studies has been performed by combining statistical results of individual studies (as in a mosaic), or by combining data from multiple studies into a large pool to be analyzed as a single data set (as in a melting pot of data). Specific issues relating to data heterogeneity across microarray studies, such as differences within and between labs or differences among experimental conditions, could lead to equivocal results in a melting pot approach. RESULTS: We applied statistical theory to determine the specific effect of different means and heteroskedasticity across 19 groups of microarray data on the sign and magnitude of gene-to-gene Pearson correlation coefficients obtained from the pool of 19 groups. We quantified the biases of the pooled coefficients and compared them to the biases of correlations estimated by an effect-size model. Mean differences across the 19 groups were the main factor determining the magnitude and sign of the pooled coefficients, which showed largest values of bias as they approached ±1. Only heteroskedasticity across the pool of 19 groups resulted in less efficient estimations of correlations than did a classical meta-analysis approach of combining correlation coefficients. These results were corroborated by simulation studies involving either mean differences or heteroskedasticity across a pool of N > 2 groups. CONCLUSIONS: The combination of statistical results is best suited for synthesizing the correlation between expression profiles of a gene pair across several microarray studies.
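The effect of between-group mean differences on a pooled correlation can be illustrated with a small toy example. The sketch below is not the authors' effect-size model: the data are made up, the function names `pearson` and `combine_fisher_z` are hypothetical, and the per-group combination uses the standard Fisher z-transform with weights n - 3 as a stand-in for a classical meta-analysis of correlation coefficients.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def combine_fisher_z(groups):
    """Combine per-group correlations via Fisher's z, weighting each group by n - 3."""
    num = 0.0
    den = 0.0
    for xs, ys in groups:
        weight = len(xs) - 3
        num += weight * math.atanh(pearson(xs, ys))
        den += weight
    return math.tanh(num / den)


# Two made-up groups (e.g. two labs): the gene pair is negatively
# correlated within each group, but group B has much higher mean
# expression for both genes.
group_a = ([0, 1, 2, 3, 4], [4, 2, 3, 0, 1])
group_b = ([10, 11, 12, 13, 14], [14, 12, 13, 10, 11])

# "Melting pot": concatenate the raw values and correlate once.
pooled_r = pearson(group_a[0] + group_b[0], group_a[1] + group_b[1])

# "Mosaic": correlate within each group, then combine the coefficients.
meta_r = combine_fisher_z([group_a, group_b])

print(f"pooled r = {pooled_r:.3f}, combined r = {meta_r:.3f}")
```

In this toy data the pooled coefficient is strongly positive even though the correlation within each group is negative, because the mean shift between the groups dominates the pooled covariance.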
Affiliation(s)
- Márcia M Almeida-de-Macedo
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
- Current address: Syngenta Seeds Inc, 2369 330th St, Slater, IA 50244, USA
- Nick Ransom
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
- Yaping Feng
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
- Jonathan Hurst
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
- Eve Syrkin Wurtele
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
4
A data similarity-based strategy for meta-analysis of transcriptional profiles in cancer. PLoS One 2013; 8:e54979. [PMID: 23383020 PMCID: PMC3558433 DOI: 10.1371/journal.pone.0054979] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/29/2012] [Accepted: 12/22/2012] [Indexed: 11/22/2022] Open
Abstract
Background: Robust transcriptional signatures in cancer can be identified by data similarity-driven meta-analysis of gene expression profiles. An unbiased data integration and interrogation strategy has not previously been available. Methods and Findings: We implemented and performed a large meta-analysis of breast cancer gene expression profiles from 223 datasets containing 10,581 human breast cancer samples using a novel data similarity-based approach (iterative EXALT). Cancer gene expression signatures extracted from individual datasets were clustered by data similarity and consolidated into a meta-signature with a recurrent and concordant gene expression pattern. A retrospective survival analysis was performed to evaluate the predictive power of a novel meta-signature deduced from transcriptional profiling studies of human breast cancer. Validation cohorts consisting of 6,011 breast cancer patients from 21 different breast cancer datasets and 1,110 patients with other malignancies (lung and prostate cancer) were used to test the robustness of our findings. During the iterative EXALT analysis, 633 signatures were grouped by their data similarity and formed 121 signature clusters. From the 121 signature clusters, we identified a unique meta-signature (BRmet50) based on a cluster of 11 signatures sharing a phenotype related to highly aggressive breast cancer. In patients with breast cancer, there was a significant association between BRmet50 and disease outcome, and the prognostic power of BRmet50 was independent of common clinical and pathologic covariates. Furthermore, the prognostic value of BRmet50 was not specific to breast cancer, as it also predicted survival in prostate and lung cancers. Conclusions: We have established and implemented a novel data similarity-driven meta-analysis strategy. Using this approach, we identified a transcriptional meta-signature (BRmet50) in breast cancer, and the prognostic performance of BRmet50 was robust and applicable across a wide range of cancer-patient populations.
5
Rudy J, Valafar F. Empirical comparison of cross-platform normalization methods for gene expression data. BMC Bioinformatics 2011; 12:467. [PMID: 22151536 PMCID: PMC3314675 DOI: 10.1186/1471-2105-12-467] [Citation(s) in RCA: 83] [Impact Index Per Article: 6.4] [Received: 05/24/2011] [Accepted: 12/07/2011] [Indexed: 12/13/2022] Open
Abstract
Background: Simultaneous measurement of gene expression on a genomic scale can be accomplished using microarray technology or by sequencing-based methods. Researchers who perform high-throughput gene expression assays often deposit their data in public databases, but heterogeneity of measurement platforms leads to challenges for the combination and comparison of data sets. Researchers wishing to perform cross-platform normalization face two major obstacles. First, a choice must be made about which method or methods to employ. Nine are currently available, and no rigorous comparison exists. Second, software for the selected method must be obtained and incorporated into a data analysis workflow. Results: Using two publicly available cross-platform testing data sets, cross-platform normalization methods are compared based on inter-platform concordance and on the consistency of gene lists obtained with transformed data. Scatter and ROC-like plots are produced and new statistics based on those plots are introduced to measure the effectiveness of each method. Bootstrapping is employed to obtain distributions for those statistics. The consistency of platform effects across studies is explored theoretically and with respect to the testing data sets. Conclusions: Our comparisons indicate that four methods, DWD, EB, GQ, and XPN, are generally effective, while the remaining methods do not adequately correct for platform effects. Of the four successful methods, XPN generally shows the highest inter-platform concordance when treatment groups are equally sized, while DWD is most robust to differently sized treatment groups and consistently shows the smallest loss in gene detection. We provide an R package, CONOR, capable of performing the nine cross-platform normalization methods considered. The package can be downloaded at http://alborz.sdsu.edu/conor and is available from CRAN.
Affiliation(s)
- Jason Rudy
- Biomedical Informatics Research Center, San Diego State University, 5500 Campanile Dr, San Diego, CA, USA
6
Meta-analysis of gene expression microarrays with missing replicates. BMC Bioinformatics 2011; 12:84. [PMID: 21435268 PMCID: PMC3224118 DOI: 10.1186/1471-2105-12-84] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/12/2010] [Accepted: 03/24/2011] [Indexed: 01/22/2023] Open
Abstract
Background: Many different microarray experiments are publicly available today. It is natural to ask whether different experiments for the same phenotypic conditions can be combined using meta-analysis, in order to increase the overall sample size. However, some genes are not measured in all experiments, hence they cannot be included or their statistical significance cannot be appropriately estimated in traditional meta-analysis. Nonetheless, these genes, which we refer to as incomplete genes, may also be informative and useful. Results: We propose a meta-analysis framework, called "Incomplete Gene Meta-analysis", which can include incomplete genes by imputing the significance of missing replicates, and computing a meta-score for every gene across all datasets. We demonstrate that the incomplete genes are worthy of being included and our method is able to appropriately estimate their significance in two groups of experiments. We first apply the Incomplete Gene Meta-analysis and several comparable methods to five breast cancer datasets with an identical set of probes. We simulate incomplete genes by randomly removing a subset of probes from each dataset and demonstrate that our method consistently outperforms two other methods in terms of their false discovery rate. We also apply the methods to three gastric cancer datasets for the purpose of discriminating diffuse and intestinal subtypes. Conclusions: Meta-analysis is an effective approach that identifies more robust sets of differentially expressed genes from multiple studies. The incomplete genes that mainly arise from the use of different platforms may also have statistical and biological importance but are ignored or are not appropriately involved by previous studies. Our Incomplete Gene Meta-analysis is able to incorporate the incomplete genes by estimating their significance. The results on both breast and gastric cancer datasets suggest that the highly ranked genes and associated GO terms produced by our method are more significant and biologically meaningful according to the previous literature.
7
Castells X, Acebes JJ, Boluda S, Moreno-Torres À, Pujol J, Julià-Sapé M, Candiota AP, Ariño J, Barceló A, Arús C. Development of a Predictor for Human Brain Tumors Based on Gene Expression Values Obtained from Two Types of Microarray Technologies. OMICS: A Journal of Integrative Biology 2010; 14:157-64. [DOI: 10.1089/omi.2009.0093] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Affiliation(s)
- Xavier Castells
- Grup d'Aplicacions Biomèdiques de la RMN (GABRMN), Facultat de Biociències, Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Barcelona, Spain
- Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Cerdanyola del Vallès, Barcelona, Spain
- Juan José Acebes
- Departament de Neurocirurgia, IDIBELL-Hospital Universitari de Bellvitge, L'Hospitalet de Llobregat, Barcelona, Spain
- Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Cerdanyola del Vallès, Barcelona, Spain
- Susana Boluda
- Institut de Neuropatologia, Servei Anatomia Patològica, IDIBELL-Hospital Universitari de Bellvitge, L'Hospitalet de Llobregat, Barcelona, Spain
- Àngel Moreno-Torres
- Research Department, Centre Diagnòstic Pedralbes, Esplugues de Llobregat, Barcelona, Spain
- Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Cerdanyola del Vallès, Barcelona, Spain
- Jesús Pujol
- Institut d'Alta Tecnologia, CRC Corporació Sanitària, Barcelona, Spain
- Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Cerdanyola del Vallès, Barcelona, Spain
- Margarida Julià-Sapé
- Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Cerdanyola del Vallès, Barcelona, Spain
- Ana Paula Candiota
- Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Cerdanyola del Vallès, Barcelona, Spain
- Joaquín Ariño
- Departament de Bioquímica i Biologia Molecular, Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Barcelona, Spain
- Anna Barceló
- Departament de Bioquímica i Biologia Molecular, Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Barcelona, Spain
- Carles Arús
- Grup d'Aplicacions Biomèdiques de la RMN (GABRMN), Facultat de Biociències, Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Barcelona, Spain
- Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Cerdanyola del Vallès, Barcelona, Spain
8
Chen L, Borozan I, Sun J, Guindi M, Fischer S, Feld J, Anand N, Heathcote J, Edwards AM, McGilvray ID. Cell-type specific gene expression signature in liver underlies response to interferon therapy in chronic hepatitis C infection. Gastroenterology 2010; 138:1123-33.e1-3. [PMID: 19900446 DOI: 10.1053/j.gastro.2009.10.046] [Citation(s) in RCA: 91] [Impact Index Per Article: 6.5] [Received: 02/28/2009] [Revised: 10/01/2009] [Accepted: 10/29/2009] [Indexed: 02/06/2023]
Abstract
BACKGROUND & AIMS: Chronic hepatitis C virus (CHC) infection is treated with interferon/ribavirin, but only a subset of patients respond. Treatment nonresponders have marked pretreatment up-regulation of a subset of interferon-stimulated genes (ISGs) in their livers, including ISG15. We here study how the nonresponder gene expression phenotype is influenced by clinical factors and uncover the cellular basis of the phenotype through ISG15 protein expression. METHODS: Seventy-eight CHC patients undergoing treatment were classified by clinical (gender, viral genotype, viral load, treatment outcome) and histologic (inflammation, fibrosis) factors and subjected to gene expression profiling on their pretreatment liver biopsies. An analysis of variance model was used to study the influence of individual factors on gene expression. ISG15 immunohistochemistry was performed on a subset of 31 liver biopsy specimens. RESULTS: One hundred twenty-three genes were differentially expressed in the 78 CHC livers when compared with 20 normal livers (P < .001; fold change, ≥1.5-fold). Of genes influenced by a single factor, genotype (1 vs 2/3) influenced more genes (17) than any other variable; when treatment outcome was included in the analysis, this became the predominant influence (24 genes), and the effect of genotype was diminished. Treatment response was linked to cell-specific activation patterns: ISG15 protein up-regulation was more pronounced in hepatocytes in treatment nonresponders but in Kupffer cells in responders. CONCLUSIONS: Genotype is a surrogate marker for the nonresponder phenotype. This phenotype manifests as differential gene expression and is driven by activation of different cell types: hepatocytes in treatment nonresponders and macrophages in treatment responders.
Affiliation(s)
- Limin Chen
- Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada
9
Kugler KG, Mueller LA, Graber A. MADAM - An open source meta-analysis toolbox for R and Bioconductor. Source Code for Biology and Medicine 2010; 5:3. [PMID: 20193058 PMCID: PMC2848045 DOI: 10.1186/1751-0473-5-3] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 10/27/2009] [Accepted: 03/01/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Meta-analysis is a major theme in biomedical research. In the present paper we introduce a package for R and Bioconductor that provides useful tools for performing this type of work. One idea behind the development of MADAM was that many meta-analysis methods, which are available in R, are not able to use the capacities of parallel computing yet. In this first version, we implemented one meta-analysis method in such a parallel manner. Additionally, we provide tools for combining the results from a set of methods in an ensemble approach. Functionality for visualization of results is also provided. RESULTS: The presented package supports meta-analysis either by providing functions directly or by wrapping existing implementations. Overall, five different meta-analysis methods are now usable through MADAM, along with another three methods for combining the corresponding results. Visualizing the results is eased by three included functions. For developing and testing meta-analysis methods, a mock-up data generator is integrated. CONCLUSIONS: The use of MADAM enables a user to focus on one package, in turn enabling them to work with the same data types across a set of methods. By making use of the snow package, MADAM can be made compatible with an existing parallel computing infrastructure. MADAM is open source and freely available within CRAN http://cran.r-project.org.
Affiliation(s)
- Karl G Kugler
- Institute for Bioinformatics and Translational Research, UMIT, Eduard Wallnöfer-Zentrum 1, Hall in Tirol, 6060, Austria.
10
Mistry M, Pavlidis P. A cross-laboratory comparison of expression profiling data from normal human postmortem brain. Neuroscience 2010; 167:384-95. [PMID: 20138973 DOI: 10.1016/j.neuroscience.2010.01.016] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 10/01/2009] [Revised: 01/08/2010] [Accepted: 01/08/2010] [Indexed: 11/29/2022]
Abstract
Expression profiling of postmortem human brain tissue has been widely used to study molecular changes associated with neuropsychiatric diseases as well as normal processes such as aging. Changes in expression associated with factors such as age, gender or postmortem interval are often more pronounced than changes associated with disease. Therefore, in addition to being of interest in their own right, careful consideration of these effects is important in the interpretation of disease studies. We performed a large meta-analysis of genome-wide expression studies of normal human cortex to more fully catalogue the effects of age, gender, postmortem interval and brain pH, yielding a "meta-signature" of gene expression changes for each factor. We validated our results by showing a significant overlap with independent gene lists extracted from the literature. Importantly, meta-analysis identifies genes which are not significant in any individual study. Finally, we show that many schizophrenia candidate genes appear in the meta-signatures, reinforcing the idea that studies must be carefully controlled for interactions between these factors and disease. In addition to the inherent value of the meta-signatures, our results provide critical information for future studies of disease effects in the human brain.
Affiliation(s)
- M Mistry
- Canadian Institute of Health Research/Michael Smith Foundation for Health Research (CIHR/MSFHR) Graduate Program in Bioinformatics, University of British Columbia, BC, Canada
11
Lai Y, Eckenrode SE, She JX. A statistical framework for integrating two microarray data sets in differential expression analysis. BMC Bioinformatics 2009; 10 Suppl 1:S23. [PMID: 19208123 PMCID: PMC2648727 DOI: 10.1186/1471-2105-10-s1-s23] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Different microarray data sets can be collected for studying the same or similar diseases. We expect to achieve a more efficient analysis of differential expression if an efficient statistical method can be developed for integrating different microarray data sets. Although many statistical methods have been proposed for data integration, the genome-wide concordance of different data sets has not been well considered in the analysis. RESULTS: Before considering data integration, it is necessary to evaluate the genome-wide concordance so that misleading results can be avoided. Based on the test results, different subsequent actions are suggested. The evaluation of genome-wide concordance and the data integration can be achieved based on normal-distribution-based mixture models. CONCLUSION: The results from our simulation study suggest that misleading results can be generated if the genome-wide concordance issue is not appropriately considered. Our method provides a rigorous parametric solution. The results also show that our method is robust to certain model misspecification and is practically useful for the integrative analysis of differential expression.
Affiliation(s)
- Yinglei Lai
- Department of Statistics and Biostatistics Center, The George Washington University, 2140 Pennsylvania Avenue, N.W., Washington, D.C. 20052, USA
- Sarah E Eckenrode
- Center for Biotechnology and Genomic Medicine, Medical College of Georgia, 1120 15th street, CA4098, GA 30912, USA
- Jin-Xiong She
- Center for Biotechnology and Genomic Medicine, Medical College of Georgia, 1120 15th street, CA4098, GA 30912, USA