1
Darbani B, Stewart CN, Noeparvar S, Borg S. Correction of gene expression data: Performance-dependency on inter-replicate and inter-treatment biases. J Biotechnol 2014; 188:100-9. [PMID: 25150216 DOI: 10.1016/j.jbiotec.2014.08.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 06/15/2014] [Revised: 07/31/2014] [Accepted: 08/12/2014] [Indexed: 11/28/2022]
Abstract
This report investigates, for the first time, cell number as a potential source of inter-treatment bias in gene expression studies. Cell-number bias can affect gene expression analysis when comparing samples with unequal total cellular RNA content or with different RNA extraction efficiencies. For maximal reliability, comparisons should therefore be performed at the cellular level. This can be accomplished with a correction method that detects and removes the inter-treatment bias for cell number. Based on inter-treatment variations of reference genes, we introduce an analytical approach to examine the suitability of correction methods by considering the inter-treatment bias as well as the inter-replicate variance, which allows selection of the correction method with minimum residual bias. Analyses of RNA sequencing and microarray data showed that the efficiencies of correction methods are influenced by both the inter-treatment bias and the inter-replicate variance. Therefore, we recommend inspecting both bias sources in order to apply the most efficient correction method. As an alternative correction strategy, sequential application of different correction approaches is also advised.
Affiliation(s)
- Behrooz Darbani
- Department of Molecular Biology and Genetics, Research Centre Flakkebjerg, Aarhus University, Forsøgsvej 1, 4200 Slagelse, Denmark; Department of Plant and Environmental Sciences, University of Copenhagen, 1871 Frederiksberg, Denmark.
- C Neal Stewart
- Department of Plant Sciences, University of Tennessee-Knoxville, Knoxville, Tennessee 37996-4561, USA
- Shahin Noeparvar
- Department of Molecular Biology and Genetics, Research Centre Flakkebjerg, Aarhus University, Forsøgsvej 1, 4200 Slagelse, Denmark
- Søren Borg
- Department of Molecular Biology and Genetics, Research Centre Flakkebjerg, Aarhus University, Forsøgsvej 1, 4200 Slagelse, Denmark
2
Potts MB, Kim HS, Fisher KW, Hu Y, Carrasco YP, Bulut GB, Ou YH, Herrera-Herrera ML, Cubillos F, Mendiratta S, Xiao G, Hofree M, Ideker T, Xie Y, Huang LJS, Lewis RE, MacMillan JB, White MA. Using functional signature ontology (FUSION) to identify mechanisms of action for natural products. Sci Signal 2013; 6:ra90. [PMID: 24129700 DOI: 10.1126/scisignal.2004657] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.3] [Indexed: 12/12/2022]
Abstract
A challenge for biomedical research is the development of pharmaceuticals that appropriately target disease mechanisms. Natural products can be a rich source of bioactive chemicals for medicinal applications but can act through unknown mechanisms and can be difficult to produce or obtain. To address these challenges, we developed a new marine-derived, renewable natural products resource and a method for linking bioactive derivatives of this library to the proteins and biological processes that they target in cells. We used cell-based screening and computational analysis to match gene expression signatures produced by natural products to those produced by small interfering RNA (siRNA) and synthetic microRNA (miRNA) libraries. With this strategy, we matched proteins and miRNAs with diverse biological processes and also identified putative protein targets and mechanisms of action for several previously undescribed marine-derived natural products. We confirmed mechanistic relationships for selected siRNAs, miRNAs, and compounds with functional roles in autophagy, chemotaxis mediated by discoidin domain receptor 2, or activation of the kinase AKT. Thus, this approach may be an effective method for screening new drugs while simultaneously identifying their targets.
Affiliation(s)
- Malia B Potts
- Department of Cell Biology, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
3
Depuydt G, Xie F, Petyuk VA, Shanmugam N, Smolders A, Dhondt I, Brewer HM, Camp DG, Smith RD, Braeckman BP. Reduced insulin/insulin-like growth factor-1 signaling and dietary restriction inhibit translation but preserve muscle mass in Caenorhabditis elegans. Mol Cell Proteomics 2013; 12:3624-39. [PMID: 24002365 PMCID: PMC3861712 DOI: 10.1074/mcp.m113.027383] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Indexed: 01/02/2023]
Abstract
Reduced signaling through the C. elegans insulin/insulin-like growth factor-1-like tyrosine kinase receptor daf-2 and dietary restriction via bacterial dilution are two well-characterized lifespan-extending interventions that operate in parallel or through (partially) independent mechanisms. Using accurate mass and time tag LC-MS/MS quantitative proteomics, we detected that the abundance of a large number of ribosomal subunits is decreased in response to dietary restriction, as well as in the daf-2(e1370) insulin/insulin-like growth factor-1-receptor mutant. In addition, general protein synthesis levels in these long-lived worms are repressed. Surprisingly, ribosomal transcript levels were not correlated to actual protein abundance, suggesting that post-transcriptional regulation determines ribosome content. Proteomics also revealed the increased presence of many structural muscle cell components in long-lived worms, which appeared to result from the prioritized preservation of muscle cell volume in nutrient-poor conditions or low insulin-like signaling. Activation of DAF-16, but not diet restriction, stimulates mRNA expression of muscle-related genes to prevent muscle atrophy. Important daf-2-specific proteome changes include overexpression of aerobic metabolism enzymes and general activation of stress-responsive and immune defense systems, whereas the increased abundance of many protein subunits of the proteasome core complex is a dietary-restriction-specific characteristic.
Affiliation(s)
- Geert Depuydt
- Biology Department, Ghent University, Proeftuinstraat 86 N1, B-9000 Ghent, Belgium
4
Linghu B, Franzosa EA, Xia Y. Construction of functional linkage gene networks by data integration. Methods Mol Biol 2013. [PMID: 23192549 DOI: 10.1007/978-1-62703-107-3_14] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 12/31/2022]
Abstract
Networks of functional associations between genes have recently been successfully used for gene function and disease-related research. A typical approach for constructing such functional linkage gene networks (FLNs) is based on the integration of diverse high-throughput functional genomics datasets. Data integration is a nontrivial task due to the heterogeneous nature of the different data sources and their variable accuracy and completeness. The presence of correlations between data sources also adds another layer of complexity to the integration process. In this chapter we discuss an approach for constructing a human FLN from data integration and a subsequent application of the FLN to novel disease gene discovery. Similar approaches can be applied to nonhuman species and other discovery tasks.
Affiliation(s)
- Bolan Linghu
- Translational Sciences Department, Novartis Institutes for BioMedical Research, Cambridge, MA, USA.
5
Kihira S, Yu EJ, Cunningham J, Cram EJ, Lee M. A novel mutation in β integrin reveals an integrin-mediated interaction between the extracellular matrix and cki-1/p27KIP1. PLoS One 2012; 7:e42425. [PMID: 22879977 PMCID: PMC3412830 DOI: 10.1371/journal.pone.0042425] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 08/23/2011] [Accepted: 07/09/2012] [Indexed: 01/20/2023]
Abstract
The cell-extracellular matrix (ECM) interaction plays an essential role in maintaining tissue shapes and regulates cell behaviors such as cell adhesion, differentiation and proliferation. The mechanism by which the ECM influences the cell cycle in vivo is poorly understood. Here we demonstrate that the β integrin PAT-3 regulates the localization and expression of CKI-1, a C. elegans homologue of the cyclin dependent kinase inhibitor p27KIP1. In nematodes expressing wild type PAT-3, CKI-1::GFP localizes primarily to nucleoli in hypodermal cells, whereas in animals expressing mutant pat-3 with a defective splice junction, CKI-1::GFP appears clumped and disorganized in nucleoplasm. RNAi analysis links cell adhesion genes to the regulation of CKI-1. RNAi of unc-52/perlecan, ina-1/α integrin, pat-4/ILK, and unc-97/PINCH resulted in abnormal CKI-1::GFP localization. Additional RNAi experiments revealed that the SCF E3 ubiquitin-ligase complex genes, skpt-1/SKP2, cul-1/CUL1 and lin-23/F-box, are required for the proper localization and expression of CKI-1, suggesting that integrin signaling and SCF E3 ligase work together to regulate the cellular distribution of CKI-1. These data also suggest that integrin plays a major role in maintaining proper CKI-1/p27KIP1 levels in the cell. Perturbed integrin signaling may lead to the inhibition of SCF ligase activity, mislocalization and elevation of CKI-1/p27KIP1. These results suggest that adhesion signaling is crucial for cell cycle regulation in vivo.
Affiliation(s)
- Shingo Kihira
- Department of Biology, Baylor University, Waco, Texas, United States of America
- Eun Jeong Yu
- Department of Biology, Baylor University, Waco, Texas, United States of America
- Jessica Cunningham
- Department of Biology, Baylor University, Waco, Texas, United States of America
- Erin J. Cram
- Department of Biology, Northeastern University, Boston, Massachusetts, United States of America
- Myeongwoo Lee
- Department of Biology, Baylor University, Waco, Texas, United States of America
6
Swarbreck SM, Defoin-Platel M, Hindle M, Saqi M, Habash DZ. New perspectives on glutamine synthetase in grasses. Journal of Experimental Botany 2011; 62:1511-22. [PMID: 21172814 DOI: 10.1093/jxb/erq356] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.7] [Indexed: 05/20/2023]
Abstract
Members of the glutamine synthetase (GS) gene family have now been characterized in many crop species such as wheat, rice, and maize. Studies have shown that cytosolic GS isoforms are involved in nitrogen remobilization during leaf senescence and emphasized a role in seed production particularly in small grain crop species. Data from the sequencing of genomes for model crops and expressed sequence tag (EST) libraries from non-model species have strengthened the idea that the cytosolic GS genes are organized in three functionally and phylogenetically conserved subfamilies. Using a bioinformatic approach, the considerable publicly available information on high throughput gene expression was mined to search for genes having patterns of expression similar to GS. Interesting new hypotheses have emerged from searching for co-expressed genes across multiple unfiltered experimental data sets in rice. This approach should inform new experimental designs and studies to explore the regulation of the GS gene family further. It is expected that understanding the regulation of GS under varied climatic conditions will emerge as an important new area considering the results from recent studies that have shown nitrogen assimilation to be critical to plant acclimation to high CO2 concentrations.
Affiliation(s)
- Stéphanie M Swarbreck
- Earth Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
7
Serb JM, Orr MC, West Greenlee MH. Using evolutionary conserved modules in gene networks as a strategy to leverage high throughput gene expression queries. PLoS One 2010; 5:e12525. [PMID: 20824082 PMCID: PMC2932711 DOI: 10.1371/journal.pone.0012525] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 01/29/2010] [Accepted: 08/04/2010] [Indexed: 12/02/2022]
Abstract
BACKGROUND Large-scale gene expression studies have not yielded the expected insight into genetic networks that control complex processes. These anticipated discoveries have been limited not by technology, but by a lack of effective strategies to investigate the data in a manageable and meaningful way. Previous work suggests that using a pre-determined seed-network of gene relationships to query large-scale expression datasets is an effective way to generate candidate genes for further study and network expansion or enrichment. Based on the evolutionary conservation of gene relationships, we test the hypothesis that a seed network derived from studies of retinal cell determination in the fly, Drosophila melanogaster, will be an effective way to identify novel candidate genes for their role in mouse retinal development. METHODOLOGY/PRINCIPAL FINDINGS Our results demonstrate that a number of gene relationships regulating retinal cell differentiation in the fly are identifiable as pairwise correlations between genes from developing mouse retina. In addition, we demonstrate that our extracted seed-network of correlated mouse genes is an effective tool for querying datasets and provides a context to generate hypotheses. Our query identified 46 genes correlated with our extracted seed-network members. Approximately 54% of these candidates had been previously linked to the developing brain and 33% had been previously linked to the developing retina. Five of six candidate genes investigated further were validated by experiments examining spatial and temporal protein expression in the developing retina. CONCLUSIONS/SIGNIFICANCE We present an effective strategy for pursuing a systems biology approach that utilizes an evolutionary comparative framework between two model organisms, fly and mouse. 
Future implementation of this strategy will be useful to determine the extent of network conservation, not just gene conservation, between species and will facilitate the use of prior biological knowledge to develop rational systems-based hypotheses.
Affiliation(s)
- Jeanne M Serb
- Department of Ecology, Evolution and Organismal Biology, Iowa State University, Ames, Iowa, United States of America.
8
Aid-Pavlidis T, Pavlidis P, Timmusk T. Meta-coexpression conservation analysis of microarray data: a "subset" approach provides insight into brain-derived neurotrophic factor regulation. BMC Genomics 2009; 10:420. [PMID: 19737418 PMCID: PMC2748098 DOI: 10.1186/1471-2164-10-420] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 12/29/2008] [Accepted: 09/08/2009] [Indexed: 11/26/2022]
Abstract
Background Alterations in brain-derived neurotrophic factor (BDNF) gene expression contribute to serious pathologies such as depression, epilepsy, cancer, Alzheimer's, Huntington's and Parkinson's disease. Therefore, exploring the mechanisms of BDNF regulation is of great clinical importance. Studying BDNF expression remains difficult due to its multiple neural activity-dependent and tissue-specific promoters. Thus, microarray data could provide insight into the regulation of this complex gene. Conventional microarray co-expression analysis is usually carried out by merging the datasets or by confirming the re-occurrence of significant correlations across datasets. However, co-expression patterns can be different under various conditions that are represented by subsets in a dataset. Therefore, assessing co-expression by measuring the correlation coefficient across merged samples of a dataset or by merging datasets might not capture all correlation patterns. Results In our study, we performed meta-coexpression analysis of publicly available microarray data using BDNF as a "guide-gene" and introducing a "subset" approach. The key steps of the analysis included: dividing datasets into subsets with biologically meaningful sample content (e.g. tissue, gender or disease state subsets); analyzing co-expression with the BDNF gene in each subset separately; and confirming co-expression links across subsets. Finally, we analyzed conservation in co-expression with BDNF between human, mouse and rat, and searched for conserved over-represented transcription factor binding sites (TFBSs) in BDNF and BDNF-correlated genes. Correlated genes discovered in this study regulate nervous system development, and are associated with various types of cancer and neurological disorders. Also, several transcription factors identified here have been reported to regulate BDNF expression in vitro and in vivo.
Conclusion The study demonstrates the potential of the "subset" approach in co-expression conservation analysis for studying the regulation of single genes and proposes novel regulators of BDNF gene expression.
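The key steps listed in this abstract (split a dataset into subsets, correlate each gene with the guide gene within each subset, keep links confirmed across subsets) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Pearson threshold of 0.9, the minimum of three samples per subset, the positive-correlation-only rule, the subset labels, and the toy expression values and gene labels (GENE_A, GENE_B, GENE_C) are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpressed_across_subsets(expr, labels, guide, r_min=0.9, min_subsets=2):
    """Return {gene: number of subsets in which it correlates with the guide gene}.

    expr   : dict mapping gene -> list of expression values, one per sample
    labels : subset label of each sample (same order as the expression values)
    Only genes confirmed in at least `min_subsets` subsets are returned.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    # Step 1: divide the samples into subsets by label.
    subsets = {}
    for index, label in enumerate(labels):
        subsets.setdefault(label, []).append(index)

    confirmed = {}
    for gene, values in expr.items():
        if gene == guide:
            continue
        count = 0
        for indices in subsets.values():
            if len(indices) < 3:  # too few samples for a meaningful correlation
                continue
            # Step 2: correlate with the guide gene inside this subset only.
            r = pearson([expr[guide][i] for i in indices],
                        [values[i] for i in indices])
            if r >= r_min:
                count += 1
        # Step 3: keep links that recur across subsets.
        if count >= min_subsets:
            confirmed[gene] = count
    return confirmed


# Toy data (made up): two tissue subsets with four samples each.
labels = ["cortex"] * 4 + ["hippocampus"] * 4
expr = {
    "BDNF":   [1, 2, 3, 4, 2, 4, 6, 8],
    "GENE_A": [2, 4, 6, 8, 1, 2, 3, 4],   # follows BDNF in both subsets
    "GENE_B": [1, 2, 3, 4, 8, 6, 4, 2],   # follows BDNF in one subset only
    "GENE_C": [4, 1, 3, 2, 5, 5, 6, 5],   # no clear relationship
}
hits = coexpressed_across_subsets(expr, labels, guide="BDNF")
print(hits)
```

In this toy example only GENE_A is retained, because GENE_B tracks the guide gene in one subset and runs opposite to it in the other. A real analysis would also need a significance test and multiple-testing control, which this sketch omits.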
Affiliation(s)
- Tamara Aid-Pavlidis
- Department of Gene Technology, Tallinn University of Technology, Akadeemia tee 15, 19086 Tallinn, Estonia.
9
Linghu B, Snitkin ES, Hu Z, Xia Y, Delisi C. Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network. Genome Biol 2009; 10:R91. [PMID: 19728866 PMCID: PMC2768980 DOI: 10.1186/gb-2009-10-9-r91] [Citation(s) in RCA: 180] [Impact Index Per Article: 12.0] [Received: 05/02/2009] [Revised: 07/09/2009] [Accepted: 09/03/2009] [Indexed: 11/16/2022]
Abstract
An evidence-weighted functional-linkage network of human genes reveals associations among diseases that share no known disease genes and have dissimilar phenotypes
We integrate 16 genomic features to construct an evidence-weighted functional-linkage network comprising 21,657 human genes. The functional-linkage network is used to prioritize candidate genes for 110 diseases, and to reliably disclose hidden associations between disease pairs having dissimilar phenotypes, such as hypercholesterolemia and Alzheimer's disease. Many of these disease-disease associations are supported by epidemiology, but with no previous genetic basis. Such associations can drive novel hypotheses on molecular mechanisms of diseases and therapies.
Affiliation(s)
- Bolan Linghu
- Bioinformatics Program, Boston University, 24 Cummington Street, Boston, MA 02215, USA.
10
Hornshøj H, Bendixen E, Conley LN, Andersen PK, Hedegaard J, Panitz F, Bendixen C. Transcriptomic and proteomic profiling of two porcine tissues using high-throughput technologies. BMC Genomics 2009; 10:30. [PMID: 19152685 PMCID: PMC2633351 DOI: 10.1186/1471-2164-10-30] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.5] [Received: 07/14/2008] [Accepted: 01/19/2009] [Indexed: 02/03/2023]
Abstract
Background The recent development within high-throughput technologies for expression profiling has allowed for parallel analysis of transcriptomes and proteomes in biological systems, such as comparative analysis of transcript and protein levels of tissue regulated genes. Until now, such studies have only included microarrays or short sequence tags for transcript profiling. Furthermore, most comparisons of transcript and protein levels have been based on absolute expression values from within the same tissue and not relative expression values based on tissue ratios. Results Presented here is a novel study of two porcine tissues based on integrative analysis of data from expression profiling of identical samples using cDNA microarray, 454-sequencing and iTRAQ-based proteomics. Sequence homology identified 2,541 unique transcripts that are detectable by both microarray hybridizations and 454-sequencing of 1.2 million cDNA tags. Both transcript-based technologies showed high reproducibility between sample replicates of the same tissue, but the correlation across these two technologies was modest. Thousands of differentially expressed genes were identified by microarray. Of the 306 differentially expressed genes identified by 454-sequencing, 198 (65%) were also found by microarray. The relationship between the regulation of transcript and protein levels was analyzed by integrating iTRAQ-based proteomics data. Protein expression ratios were determined for 354 genes, of which 148 could be mapped to both microarray and 454-sequencing data. A comparison of the expression ratios from the three technologies revealed that differences in transcript and protein levels across heart and muscle tissues are positively correlated. Conclusion We show that the reproducibility within cDNA microarray and 454-sequencing is high, but that the agreement across these two technologies is modest.
We demonstrate that the regulation of transcript and protein levels across identical tissue samples is positively correlated when the tissue expression ratios are used for comparison. The results presented are of interest in systems biology research in terms of integration and analysis of high-throughput expression data from mammalian tissues.
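The comparison based on tissue expression ratios, rather than absolute values, can be illustrated with a short Python sketch. It assumes strictly positive abundance values and uses log2(heart/muscle) as the ratio measure with a Pearson correlation; these choices, the function names, the gene labels (G1 to G4) and all numbers are hypothetical and are not taken from the study.

```python
from math import log2, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def log_ratios(tissue_a, tissue_b):
    """log2(tissue_a / tissue_b) for every gene measured in both tissues.

    Assumes all abundance values are strictly positive.
    """
    shared = sorted(set(tissue_a) & set(tissue_b))
    return {gene: log2(tissue_a[gene] / tissue_b[gene]) for gene in shared}


# Toy abundances (made up). G4 has no protein measurement and drops out.
transcript_heart = {"G1": 8.0, "G2": 2.0, "G3": 4.0, "G4": 5.0}
transcript_muscle = {"G1": 2.0, "G2": 8.0, "G3": 4.0, "G4": 1.0}
protein_heart = {"G1": 6.0, "G2": 3.0, "G3": 5.0}
protein_muscle = {"G1": 3.0, "G2": 6.0, "G3": 5.0}

transcript_ratio = log_ratios(transcript_heart, transcript_muscle)
protein_ratio = log_ratios(protein_heart, protein_muscle)

# Compare only genes that could be mapped on both the transcript and protein side.
genes = sorted(set(transcript_ratio) & set(protein_ratio))
r = pearson([transcript_ratio[g] for g in genes],
            [protein_ratio[g] for g in genes])
print(genes, r)
```

Working with ratios removes gene-specific scale differences between platforms, which is one plausible reason a ratio-based comparison can agree better than a comparison of absolute values.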
Affiliation(s)
- Henrik Hornshøj
- Department of Genetics and Biotechnology, Faculty of Agricultural Sciences, Aarhus University, Tjele, Denmark.
11
Prieto C, Risueño A, Fontanillo C, De Las Rivas J. Human gene coexpression landscape: confident network derived from tissue transcriptomic profiles. PLoS One 2008; 3:e3911. [PMID: 19081792 PMCID: PMC2597745 DOI: 10.1371/journal.pone.0003911] [Citation(s) in RCA: 187] [Impact Index Per Article: 11.7] [Received: 07/03/2008] [Accepted: 11/05/2008] [Indexed: 12/12/2022]
Abstract
Background Analysis of gene expression data using genome-wide microarrays is a technique often used in genomic studies to find coexpression patterns and locate groups of co-transcribed genes. However, most studies done at a global “omic” scale are not focused on human samples, and when they do use human samples they very often include heterogeneous datasets, mixing normal with disease-altered samples. Moreover, the technical noise present in genome-wide expression microarrays is another well reported problem that is often not addressed with robust statistical methods, and the estimation of errors in the data is not provided. Methodology/Principal Findings Human genome-wide expression data from a controlled set of normal-healthy tissues is used to build a confident human gene coexpression network avoiding both pathological and technical noise. To achieve this we describe a new method that combines several statistical and computational strategies: robust normalization and expression signal calculation; correlation coefficients obtained by parametric and non-parametric methods; random cross-validations; and estimation of the statistical accuracy and coverage of the data. All these methods provide a series of coexpression datasets where the level of error is measured and can be tuned. To define the errors, the rates of true positives are calculated by assignment to biological pathways. The results provide a confident human gene coexpression network that includes 3327 gene-nodes and 15841 coexpression-links, and a comparative analysis shows good improvement over previously published datasets. Further functional analysis of a subset core network, validated by two independent methods, shows coherent biological modules that share common transcription factors. The network reveals a map of coexpression clusters organized in well defined functional constellations.
Two major regions in this network correspond to genes involved in nuclear and mitochondrial metabolism, and investigations of their functional assignment indicate that more than 60% are house-keeping and essential genes. The network displays previously undescribed gene associations and allows some unknown, unassigned genes to be placed in a functional context based on their interactions with known gene families. Conclusions/Significance The identification of stable and reliable human gene to gene coexpression networks is essential to unravel the interactions and functional correlations between human genes at an omic scale. This work contributes to this aim, and we are making the validated human gene coexpression networks available to the scientific community, to allow further analyses on the network or on some specific gene associations. The data are available free online at http://bioinfow.dep.usal.es/coexpression/.
Affiliation(s)
- Carlos Prieto
- Bioinformatics and Functional Genomics Research Group, Cancer Research Center (CIC-IBMCC, CSIC/USAL), Salamanca, Spain
- Alberto Risueño
- Bioinformatics and Functional Genomics Research Group, Cancer Research Center (CIC-IBMCC, CSIC/USAL), Salamanca, Spain
- Celia Fontanillo
- Bioinformatics and Functional Genomics Research Group, Cancer Research Center (CIC-IBMCC, CSIC/USAL), Salamanca, Spain
- Javier De Las Rivas
- Bioinformatics and Functional Genomics Research Group, Cancer Research Center (CIC-IBMCC, CSIC/USAL), Salamanca, Spain
12
Morozova O, Morozov V, Hoffman BG, Helgason CD, Marra MA. A seriation approach for visualization-driven discovery of co-expression patterns in Serial Analysis of Gene Expression (SAGE) data. PLoS One 2008; 3:e3205. [PMID: 18787709 PMCID: PMC2527533 DOI: 10.1371/journal.pone.0003205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/30/2008] [Accepted: 08/19/2008] [Indexed: 11/20/2022]
Abstract
Background Serial Analysis of Gene Expression (SAGE) is a DNA sequencing-based method for large-scale gene expression profiling that provides an alternative to microarray analysis. Most analyses of SAGE data aimed at identifying co-expressed genes have been accomplished using various versions of clustering approaches that often result in a number of false positives. Principal Findings Here we explore the use of seriation, a statistical approach for ordering sets of objects based on their similarity, for large-scale expression pattern discovery in SAGE data. For this specific task we implement a seriation heuristic we term ‘progressive construction of contigs’ that constructs local chains of related elements by sequentially rearranging margins of the correlation matrix. We apply the heuristic to the analysis of simulated and experimental SAGE data and compare our results to those obtained with a clustering algorithm developed specifically for SAGE data. We show using simulations that the performance of seriation compares favorably to that of the clustering algorithm on noisy SAGE data. Conclusions We explore the use of a seriation approach for visualization-based pattern discovery in SAGE data. Using both simulations and experimental data, we demonstrate that seriation is able to identify groups of co-expressed genes more accurately than a clustering algorithm developed specifically for SAGE data. Our results suggest that seriation is a useful method for the analysis of gene expression data whose applicability should be further pursued.
Affiliation(s)
- Olena Morozova
- Genome Sciences Centre, BC Cancer Agency, Vancouver, British Columbia, Canada.
13
Functional annotation and identification of candidate disease genes by computational analysis of normal tissue gene expression data. PLoS One 2008; 3:e2439. [PMID: 18560577 PMCID: PMC2409962 DOI: 10.1371/journal.pone.0002439] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 02/07/2008] [Accepted: 04/24/2008] [Indexed: 11/19/2022]
Abstract
BACKGROUND High-throughput gene expression data can predict gene function through the "guilt by association" principle: coexpressed genes are likely to be functionally associated. METHODOLOGY/PRINCIPAL FINDINGS We analyzed publicly available expression data on normal human tissues. The analysis is based on the integration of data obtained with two experimental platforms (microarrays and SAGE) and of various measures of dissimilarity between expression profiles. The building blocks of the procedure are the Ranked Coexpression Groups (RCG), small sets of tightly coexpressed genes which are analyzed in terms of functional annotation. Functionally characterized RCGs are selected by means of the majority rule and used to predict new functional annotations. Functionally characterized RCGs are enriched in groups of genes associated to similar phenotypes. We exploit this fact to find new candidate disease genes for many OMIM phenotypes of unknown molecular origin. CONCLUSIONS/SIGNIFICANCE We predict new functional annotations for many human genes, showing that the integration of different data sets and coexpression measures significantly improves the scope of the results. Combining gene expression data, functional annotation and known phenotype-gene associations we provide candidate genes for several genetic diseases of unknown molecular basis.
14
Castelein N, Hoogewijs D, De Vreese A, Braeckman BP, Vanfleteren JR. Dietary restriction by growth in axenic medium induces discrete changes in the transcriptional output of genes involved in energy metabolism in Caenorhabditis elegans. Biotechnol J 2008; 3:803-12. [DOI: 10.1002/biot.200800003] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Indexed: 02/01/2023]
15
Nelson PT, Wang WX, Wilfred BR, Tang G. Technical variables in high-throughput miRNA expression profiling: much work remains to be done. Biochimica et Biophysica Acta - Gene Regulatory Mechanisms 2008; 1779:758-65. [PMID: 18439437 DOI: 10.1016/j.bbagrm.2008.03.012] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.2] [Received: 12/10/2007] [Revised: 03/24/2008] [Accepted: 03/26/2008] [Indexed: 12/11/2022]
Abstract
MicroRNA (miRNA) gene expression profiling has provided important insights into plant and animal biology. However, there has not been ample published work about pitfalls associated with technical parameters in miRNA gene expression profiling. One source of pertinent information about technical variables in gene expression profiling is the separate and more well-established literature regarding mRNA expression profiling. However, many aspects of miRNA biochemistry are unique. For example, the cellular processing and compartmentation of miRNAs, the differential stability of specific miRNAs, and aspects of global miRNA expression regulation require specific consideration. Additional possible sources of systematic bias in miRNA expression studies include the differential impact of pre-analytical variables, substrate specificity of nucleic acid processing enzymes used in labeling and amplification, and issues regarding new miRNA discovery and annotation. We conclude that greater focus on technical parameters is required to bolster the validity, reliability, and cultural credibility of miRNA gene expression profiling studies.
Affiliation(s)
- Peter T Nelson
- Department of Pathology and Sanders-Brown Center, University of Kentucky, Lexington, KY 40536, USA.
|
16
|
Hecker LA, Alcon TC, Honavar VG, Greenlee MHW. Using a seed-network to query multiple large-scale gene expression datasets from the developing retina in order to identify and prioritize experimental targets. Bioinform Biol Insights 2008; 2:401-12. [PMID: 19812791 PMCID: PMC2735966 DOI: 10.4137/bbi.s417] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open
Abstract
Understanding the gene networks that orchestrate the differentiation of retinal progenitors into photoreceptors in the developing retina is important not only due to its therapeutic applications in treating retinal degeneration but also because the developing retina provides an excellent model for studying CNS development. Although several studies have profiled changes in gene expression during normal retinal development, these studies offer at best only a starting point for functional studies focused on a smaller subset of genes. The large number of genes profiled at comparatively few time points makes it extremely difficult to reliably infer gene networks from a gene expression dataset. We describe a novel approach to identify and prioritize from multiple gene expression datasets, a small subset of the genes that are likely to be good candidates for further experimental investigation. We report progress on addressing this problem using a novel approach to querying multiple large-scale expression datasets using a 'seed network' consisting of a small set of genes that are implicated by published studies in rod photoreceptor differentiation. We use the seed network to identify and sort a list of genes whose expression levels are highly correlated with those of multiple seed network genes in at least two of the five gene expression datasets. The fact that several of the genes in this list have been demonstrated, through experimental studies reported in the literature, to be important in rod photoreceptor function provides support for the utility of this approach in prioritizing experimental targets for further experimental investigation. Based on Gene Ontology and KEGG pathway annotations for the list of genes obtained in the context of other information available in the literature, we identified seven genes or groups of genes for possible inclusion in the gene network involved in differentiation of retinal progenitor cells into rod photoreceptors. 
Our approach to querying multiple gene expression datasets using a seed network constructed from known interactions between specific genes of interest provides a promising strategy for focusing hypothesis-driven experiments using large-scale 'omics' data.
Affiliation(s)
- Laura A Hecker
- Interdepartmental Neuroscience Program, Iowa State University, Ames, IA 50011, USA
|
17
|
Lee CK, Sunkin SM, Kuan C, Thompson CL, Pathak S, Ng L, Lau C, Fischer S, Mortrud M, Slaughterbeck C, Jones A, Lein E, Hawrylycz M. Quantitative methods for genome-scale analysis of in situ hybridization and correlation with microarray data. Genome Biol 2008; 9:R23. [PMID: 18234097 PMCID: PMC2395252 DOI: 10.1186/gb-2008-9-1-r23] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2007] [Revised: 12/21/2007] [Accepted: 01/30/2008] [Indexed: 02/06/2023] Open
Abstract
With the emergence of genome-wide colorimetric in situ hybridization (ISH) data sets such as the Allen Brain Atlas, it is important to understand the relationship between this gene expression modality and those derived from more quantitative based technologies. This study introduces a novel method for standardized relative quantification of colorimetric ISH signal that enables a large-scale cross-platform expression level comparison of ISH with two publicly available microarray brain data sources.
Affiliation(s)
- Chang-Kyu Lee
- Allen Institute for Brain Science, Seattle, WA 98103, USA
|
18
|
Nearest Neighbor Networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics 2007; 8:250. [PMID: 17626636 PMCID: PMC1941745 DOI: 10.1186/1471-2105-8-250] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2006] [Accepted: 07/12/2007] [Indexed: 11/23/2022] Open
Abstract
Background The availability of microarrays measuring thousands of genes simultaneously across hundreds of biological conditions represents an opportunity to understand both individual biological pathways and the integrated workings of the cell. However, translating this amount of data into biological insight remains a daunting task. An important initial step in the analysis of microarray data is clustering of genes with similar behavior. A number of classical techniques are commonly used to perform this task, particularly hierarchical and K-means clustering, and many novel approaches have been suggested recently. While these approaches are useful, they are not without drawbacks; these methods can find clusters in purely random data, and even clusters enriched for biological functions can be skewed towards a small number of processes (e.g. ribosomes). Results We developed Nearest Neighbor Networks (NNN), a graph-based algorithm to generate clusters of genes with similar expression profiles. This method produces clusters based on overlapping cliques within an interaction network generated from mutual nearest neighborhoods. This focus on nearest neighbors rather than on absolute distance measures allows us to capture clusters with high connectivity even when they are spatially separated, and requiring mutual nearest neighbors allows genes with no sufficiently similar partners to remain unclustered. We compared the clusters generated by NNN with those generated by eight other clustering methods. NNN was particularly successful at generating functionally coherent clusters with high precision, and these clusters generally represented a much broader selection of biological processes than those recovered by other methods. Conclusion The Nearest Neighbor Networks algorithm is a valuable clustering method that effectively groups genes that are likely to be functionally related. 
It is particularly attractive due to its simplicity, its success in the analysis of large datasets, and its ability to span a wide range of biological functions with high precision.
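The mutual-nearest-neighbor requirement described in this abstract can be illustrated with a minimal sketch. It covers only the edge-building step, not the overlapping-clique search that the abstract attributes to NNN, and it is not the authors' implementation. The function names, the choice of Pearson correlation as the similarity measure, the value of k, and the toy profiles are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mutual_nearest_neighbors(profiles, k):
    """Return edges {g, h} where g is in h's top-k and h is in g's top-k.

    `profiles` maps a gene name to its expression profile (hypothetical layout).
    """
    genes = list(profiles)
    top_k = {}
    for g in genes:
        others = [h for h in genes if h != g]
        # Most similar first, using correlation as the similarity measure.
        others.sort(key=lambda h: pearson(profiles[g], profiles[h]), reverse=True)
        top_k[g] = set(others[:k])
    return {
        frozenset((g, h))
        for g, h in combinations(genes, 2)
        if h in top_k[g] and g in top_k[h]
    }


# Toy profiles, invented for illustration only.
profiles = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.1, 2.1, 2.9, 4.2],
    "c": [4.0, 3.0, 2.0, 1.0],
    "d": [3.9, 3.1, 2.2, 0.8],
    "e": [1.0, 3.0, 1.0, 3.0],
}
edges = mutual_nearest_neighbors(profiles, k=1)
```

In this toy data, gene "e" picks "b" as its nearest neighbor, but "b" picks "a", so "e" ends up with no edge. This mirrors the abstract's point that genes without sufficiently similar partners can remain unclustered.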
|
19
|
Zhang X, Wang W. An Efficient Algorithm for Mining Coherent Patterns from Heterogeneous Microarrays. SSDBM 2007. [DOI: 10.1109/ssdbm.2007.30] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
20
|
Zhu D, Li Y, Li H. Multivariate correlation estimator for inferring functional relationships from replicated genome-wide data. Bioinformatics 2007; 23:2298-305. [PMID: 17586543 DOI: 10.1093/bioinformatics/btm328] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
Estimating pairwise correlation from replicated genome-scale (a.k.a. OMICS) data is fundamental to cluster functionally relevant biomolecules to a cellular pathway. The popular Pearson correlation coefficient estimates bivariate correlation by averaging over replicates. It is not completely satisfactory since it introduces strong bias while reducing variance. We propose a new multivariate correlation estimator that models all replicates as independent and identically distributed (i.i.d.) samples from the multivariate normal distribution. We derive the estimator by maximizing the likelihood function. For small sample data, we provide a resampling-based statistical inference procedure, and for moderate to large sample data, we provide an asymptotic statistical inference procedure based on the Likelihood Ratio Test (LRT). We demonstrate advantages of the new multivariate correlation estimator over Pearson bivariate correlation estimator using simulations and real-world data analysis examples. AVAILABILITY The estimator and statistical inference procedures have been implemented in an R package 'CORREP' that is available from CRAN [http://cran.r-project.org] and Bioconductor [http://www.bioconductor.org/]. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
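The baseline this abstract contrasts itself with, Pearson correlation computed after averaging over replicates, can be sketched as follows. This is only the averaging baseline, not the multivariate estimator or the CORREP package. The function names, the nested-list data layout, and the toy values are assumptions made for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def replicate_averaged_pearson(reps_x, reps_y):
    """Average the replicates within each condition, then correlate the means.

    Each argument is a list of conditions; each condition is a list of
    replicate measurements (hypothetical layout).
    """
    mean_x = [mean(condition) for condition in reps_x]
    mean_y = [mean(condition) for condition in reps_y]
    return pearson(mean_x, mean_y)


# Toy data: three conditions, two replicates each (invented for illustration).
gene_x = [[1.0, 3.0], [2.0, 4.0], [5.0, 7.0]]
gene_y = [[2.0, 2.0], [4.0, 4.0], [10.0, 10.0]]
r = replicate_averaged_pearson(gene_x, gene_y)
```

Here the condition means of the two genes are exactly linearly related, so the averaged estimate comes out at 1 (up to floating-point error) even though the replicates of gene_x disagree by two units in every condition. The replicate spread plays no role once the means are taken, which is the information loss the abstract points to.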
Affiliation(s)
- Dongxiao Zhu
- Stowers Institute for Medical Research, 1000 E 50th Street, Kansas City, MO 64110, USA.
|
21
|
Joshi S, Davies H, Sims LP, Levy SE, Dean J. Ovarian gene expression in the absence of FIGLA, an oocyte-specific transcription factor. BMC DEVELOPMENTAL BIOLOGY 2007; 7:67. [PMID: 17567914 PMCID: PMC1906760 DOI: 10.1186/1471-213x-7-67] [Citation(s) in RCA: 88] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/11/2006] [Accepted: 06/13/2007] [Indexed: 11/10/2022]
Abstract
Background Ovarian folliculogenesis in mammals is a complex process involving interactions between germ and somatic cells. Carefully orchestrated expression of transcription factors, cell adhesion molecules and growth factors are required for success. We have identified a germ-cell specific, basic helix-loop-helix transcription factor, FIGLA (Factor In the GermLine, Alpha) and demonstrated its involvement in two independent developmental processes: formation of the primordial follicle and coordinate expression of zona pellucida genes. Results Taking advantage of Figla null mouse lines, we have used a combined approach of microarray and Serial Analysis of Gene Expression (SAGE) to identify potential downstream target genes. Using high stringent cutoffs, we find that FIGLA functions as a key regulatory molecule in coordinating expression of the NALP family of genes, genes of known oocyte-specific expression and a set of functionally un-annotated genes. FIGLA also inhibits expression of male germ cell specific genes that might otherwise disrupt normal oogenesis. Conclusion These data implicate FIGLA as a central regulator of oocyte-specific genes that play roles in folliculogenesis, fertilization and early development.
Affiliation(s)
- Saurabh Joshi
- Laboratory of Cellular and Developmental Biology, NIDDK, National Institutes of Health, Bethesda, MD 20892, USA
- Holly Davies
- Laboratory of Cellular and Developmental Biology, NIDDK, National Institutes of Health, Bethesda, MD 20892, USA
- Lauren Porter Sims
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Shawn E Levy
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Jurrien Dean
- Laboratory of Cellular and Developmental Biology, NIDDK, National Institutes of Health, Bethesda, MD 20892, USA
|
22
|
Stafford P, Brun M. Three methods for optimization of cross-laboratory and cross-platform microarray expression data. Nucleic Acids Res 2007; 35:e72. [PMID: 17478523 PMCID: PMC1904274 DOI: 10.1093/nar/gkl1133] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023] Open
Abstract
Microarray gene expression data becomes more valuable as our confidence in the results grows. Guaranteeing data quality becomes increasingly important as microarrays are being used to diagnose and treat patients (1–4). The MAQC Quality Control Consortium, the FDA's Critical Path Initiative, NCI's caBIG and others are implementing procedures that will broadly enhance data quality. As GEO continues to grow, its usefulness is constrained by the level of correlation across experiments and general applicability. Although RNA preparation and array platform play important roles in data accuracy, pre-processing is a user-selected factor that has an enormous effect. Normalization of expression data is necessary, but the methods have specific and pronounced effects on precision, accuracy and historical correlation. As a case study, we present a microarray calibration process using normalization as the adjustable parameter. We examine the impact of eight normalizations across both Agilent and Affymetrix expression platforms on three expression readouts: (1) sensitivity and power, (2) functional/biological interpretation and (3) feature selection and classification error. The reader is encouraged to measure their own discordant data, whether cross-laboratory, cross-platform or across any other variance source, and to use their results to tune the adjustable parameters of their laboratory to ensure increased correlation.
Affiliation(s)
- Phillip Stafford
- Biodesign Institute, Arizona State University, Center for Innovations in Medicine, Tempe, AZ, USA
|
23
|
Ruzanov P, Riddle DL, Marra MA, McKay SJ, Jones SM. Genes that may modulate longevity in C. elegans in both dauer larvae and long-lived daf-2 adults. Exp Gerontol 2007; 42:825-39. [PMID: 17543485 PMCID: PMC2755518 DOI: 10.1016/j.exger.2007.04.002] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2006] [Revised: 03/29/2007] [Accepted: 04/03/2007] [Indexed: 10/23/2022]
Abstract
We used Serial Analysis of Gene Expression (SAGE) to compare the global transcription profiles of long-lived mutant daf-2 adults and dauer larvae, aiming to identify aging-related genes based on similarity of expression patterns. Genes that are expressed similarly in both long-lived types potentially define a common life-extending program. Comparison of eight SAGE libraries yielded a set of 120 genes, the expression of which was significantly different in long-lived worms vs. normal adults. The gene annotations indicate a strong link between oxidative stress and life span, further supporting the hypothesis that metabolic activity is a major determinant in longevity. The SAGE data show changes in mRNA levels for electron transport chain components, elevated expression of glyoxylate shunt enzymes and significantly reduced expression for components of the TCA cycle in longer-lived nematodes. We propose a model for enhanced longevity through a cytochrome c oxidase-mediated reduction in reactive oxygen species commonly held to be a major contributor to aging.
Affiliation(s)
- Peter Ruzanov
- Genome Sciences Centre, BC Cancer Research Centre, Ste 100-570 West 7th Ave Vancouver, BC V5Z 4S6 Canada
- Donald L. Riddle
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4 Canada
- Marco A. Marra
- Genome Sciences Centre, BC Cancer Research Centre, Ste 100-570 West 7th Ave Vancouver, BC V5Z 4S6 Canada
- Sheldon J. McKay
- Genome Sciences Centre, BC Cancer Research Centre, Ste 100-570 West 7th Ave Vancouver, BC V5Z 4S6 Canada
- Steven M Jones
- Genome Sciences Centre, BC Cancer Research Centre, Ste 100-570 West 7th Ave Vancouver, BC V5Z 4S6 Canada
|
24
|
Brambilla A, Tarroni P. The GeneTrawler®: mapping potential drug targets in human and rat tissues. Expert Opin Ther Targets 2007; 11:567-80. [PMID: 17373885 DOI: 10.1517/14728222.11.4.567] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
Expression data are an important element of target identification and validation. The authors have established an automated high-throughput method based on real time quantitative polymerase chain reaction, called the GeneTrawler, for the characterization of pharmaceutical targets on an annotated collection of human tissues. The authors have conducted a variability analysis of the system, which demonstrates that the majority of the variability between expression levels determined is due to biologic variation between samples, rather than technical variation due to imprecision of the method. Gene expression maps, generated with this carefully controlled system provide a large, reliable, consistent data set. The authors have used this system to characterize the expression of > 100 genes, and here they show the expression profile of SUR1 in order to illustrate its use. The authors were able to confirm SUR1 expression in the lung, which was suggested on the basis of pharmacologic experiments but has not previously been confirmed by mRNA detection. The data also show SUR1 expression in tissues that have been associated with some of the side effects seen with SUR1 modulators. This and other examples demonstrate that the GeneTrawler is useful to gauge the suitability of a prospective therapeutic target, to fully exploit a known drug target, or to identify and help validate new hypothetical druggable targets to fuel drug discovery pipelines.
Affiliation(s)
- Andrea Brambilla
- Axxam, San Raffaele Biomedical Science Park, Via Olgettina 58, 20132 Milan, Italy
|
25
|
Wang SM. Understanding SAGE data. Trends Genet 2006; 23:42-50. [PMID: 17109989 DOI: 10.1016/j.tig.2006.11.001] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2006] [Revised: 10/05/2006] [Accepted: 11/01/2006] [Indexed: 02/08/2023]
Abstract
Serial analysis of gene expression (SAGE) is a method for identifying and quantifying transcripts from eukaryotic genomes. Since its invention, SAGE has been widely applied to analyzing gene expression in many biological and medical studies. Vast amounts of SAGE data have been collected and more than a thousand SAGE-related studies have been published since the mid-1990s. The principle of SAGE has been developed to address specific issues such as determination of normal gene structure and identification of abnormal genome structural changes. This review focuses on the general features of SAGE data, including the specificity of SAGE tags with respect to their original transcripts, the quantitative nature of SAGE data for differentially expressed genes, the reproducibility, the comparability of SAGE with microarray and the future potential of SAGE. Understanding these basic features should aid the proper interpretation of SAGE data to address biological and medical questions.
Affiliation(s)
- San Ming Wang
- Center for Functional Genomics, ENH Research Institute, Robert H. Lurie Comprehensive Cancer Center, Northwestern University, 1001 University Place, Evanston, IL 60201, USA.
|
26
|
Griffith OL, Melck A, Jones SJM, Wiseman SM. Meta-analysis and meta-review of thyroid cancer gene expression profiling studies identifies important diagnostic biomarkers. J Clin Oncol 2006; 24:5043-51. [PMID: 17075124 DOI: 10.1200/jco.2006.06.7330] [Citation(s) in RCA: 225] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023] Open
Abstract
PURPOSE An estimated 4% to 7% of the population will develop a clinically significant thyroid nodule during their lifetime. In many cases, preoperative diagnoses by needle biopsy are inconclusive. Thus, there is a clear need for improved diagnostic tests to distinguish malignant from benign thyroid tumors. The recent development of high-throughput molecular analytic techniques should allow the rapid evaluation of new diagnostic markers. However, researchers are faced with an overwhelming number of potential markers from numerous thyroid cancer expression profiling studies. MATERIALS AND METHODS To address this challenge, we have carried out a comprehensive meta-review of thyroid cancer biomarkers from 21 published studies. A gene ranking system that considers the number of comparisons in agreement, total number of samples, average fold-change and direction of change was devised. RESULTS We have observed that genes are consistently reported by multiple studies at a highly significant rate (P < .05). Comparison with a meta-analysis of studies reprocessed from raw data showed strong concordance with our method. CONCLUSION Our approach represents a useful method for identifying consistent gene expression markers when raw data are unavailable. A review of the top 12 candidates revealed well known thyroid cancer markers such as MET, TFF3, SERPINA1, TIMP1, FN1, and TPO as well as relatively novel or uncharacterized genes such as TGFA, QPCT, CRABP1, FCGBP, EPS8 and PROS1. These candidates should help to develop a panel of markers with sufficient sensitivity and specificity for the diagnosis of thyroid tumors in a clinical setting.
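A ranking of the kind this abstract describes could be sketched as a lexicographic sort. The abstract names the criteria (comparisons in agreement, total samples, average fold-change, direction of change) but not how they are combined, so the tie-break order below is an assumption, as are the record type, the field names, and the placeholder gene names. The sketch also assumes each record already counts only comparisons that agree on the direction of change.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical per-gene summary pooled across published comparisons."""
    gene: str
    comparisons_in_agreement: int  # comparisons reporting the same direction
    total_samples: int             # samples summed over those comparisons
    mean_abs_fold_change: float    # average magnitude of the reported change


def rank_genes(evidence):
    """Sort by agreement count, then sample size, then fold-change (all descending)."""
    return sorted(
        evidence,
        key=lambda e: (
            e.comparisons_in_agreement,
            e.total_samples,
            e.mean_abs_fold_change,
        ),
        reverse=True,
    )


# Placeholder records, invented for illustration only.
evidence = [
    GeneEvidence("gene_x", 3, 100, 2.0),
    GeneEvidence("gene_y", 5, 80, 1.5),
    GeneEvidence("gene_z", 3, 120, 1.2),
]
ranked_names = [e.gene for e in rank_genes(evidence)]
```

With these placeholder records, gene_y ranks first on agreement count, and gene_z precedes gene_x because the tie on agreement is broken by sample size before fold-change.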
Affiliation(s)
- Obi L Griffith
- Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, Canada
|
27
|
Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R. NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 2006; 35:D760-5. [PMID: 17099226 PMCID: PMC1669752 DOI: 10.1093/nar/gkl887] [Citation(s) in RCA: 1017] [Impact Index Per Article: 56.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
The Gene Expression Omnibus (GEO) repository at the National Center for Biotechnology Information (NCBI) archives and freely disseminates microarray and other forms of high-throughput data generated by the scientific community. The database has a minimum information about a microarray experiment (MIAME)-compliant infrastructure that captures fully annotated raw and processed data. Several data deposit options and formats are supported, including web forms, spreadsheets, XML and Simple Omnibus Format in Text (SOFT). In addition to data storage, a collection of user-friendly web-based interfaces and applications are available to help users effectively explore, visualize and download the thousands of experiments and tens of millions of gene expression patterns stored in GEO. This paper provides a summary of the GEO database structure and user facilities, and describes recent enhancements to database design, performance, submission format options, data query and retrieval utilities. GEO is accessible at
Affiliation(s)
- Tanya Barrett
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892, USA.
|
28
|
Huttenhower C, Hibbs M, Myers C, Troyanskaya OG. A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 2006; 22:2890-7. [PMID: 17005538 DOI: 10.1093/bioinformatics/btl492] [Citation(s) in RCA: 110] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022] Open
Abstract
MOTIVATION The diverse microarray datasets that have become available over the past several years represent a rich opportunity and challenge for biological data mining. Many supervised and unsupervised methods have been developed for the analysis of individual microarray datasets. However, integrated analysis of multiple datasets can provide a broader insight into genetic regulation of specific biological pathways under a variety of conditions. RESULTS To aid in the analysis of such large compendia of microarray experiments, we present Microarray Experiment Functional Integration Technology (MEFIT), a scalable Bayesian framework for predicting functional relationships from integrated microarray datasets. Furthermore, MEFIT predicts these functional relationships within the context of specific biological processes. All results are provided in the context of one or more specific biological functions, which can be provided by a biologist or drawn automatically from catalogs such as the Gene Ontology (GO). Using MEFIT, we integrated 40 Saccharomyces cerevisiae microarray datasets spanning 712 unique conditions. In tests based on 110 biological functions drawn from the GO biological process ontology, MEFIT provided a 5% or greater performance increase for 54 functions, with a 5% or more decrease in performance in only two functions.
Affiliation(s)
- Curtis Huttenhower
- Department of Computer Science, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
|
29
|
Siddiqui AS, Delaney AD, Schnerch A, Griffith OL, Jones SJM, Marra MA. Sequence biases in large scale gene expression profiling data. Nucleic Acids Res 2006; 34:e83. [PMID: 16840527 PMCID: PMC1524917 DOI: 10.1093/nar/gkl404] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
We present the results of a simple, statistical assay that measures the G+C content sensitivity bias of gene expression experiments without the requirement of a duplicate experiment. We analyse five gene expression profiling methods: Affymetrix GeneChip, Long Serial Analysis of Gene Expression (LongSAGE), LongSAGELite, 'Classic' Massively Parallel Signature Sequencing (MPSS) and 'Signature' MPSS. We demonstrate the methods have systematic and random errors leading to a different G+C content sensitivity. The relationship between this experimental error and the G+C content of the probe set or tag that identifies each gene influences whether the gene is detected and, if detected, the level of gene expression measured. LongSAGE has the least bias, while Signature MPSS shows a strong bias to G+C rich tags and Affymetrix data show different bias depending on the data processing method (MAS 5.0, RMA or GC-RMA). The bias in the Affymetrix data primarily impacts genes expressed at lower levels. Despite the larger sampling of the MPSS library, SAGE identifies significantly more genes (60% more RefSeq genes in a single comparison).
Affiliation(s)
- Marco A. Marra
- To whom correspondence should be addressed at Genome Sciences Centre, Suite 100, 570 West 7th Avenue, Vancouver BC, Canada V5Z 4S6. Tel: 604 877 6082; Fax: 604 877 6085;
|
30
|
Bukowski R, Hankins GDV, Saade GR, Anderson GD, Thornton S. Labor-associated gene expression in the human uterine fundus, lower segment, and cervix. PLoS Med 2006; 3:e169. [PMID: 16768543 PMCID: PMC1475650 DOI: 10.1371/journal.pmed.0030169] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/19/2005] [Accepted: 01/31/2006] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Preterm labor, failure to progress, and postpartum hemorrhage are the common causes of maternal and neonatal mortality or morbidity. All result from defects in the complex mechanisms controlling labor, which coordinate changes in the uterine fundus, lower segment, and cervix. We aimed to assess labor-associated gene expression profiles in these functionally distinct areas of the human uterus by using microarrays. METHODS AND FINDINGS Samples of uterine fundus, lower segment, and cervix were obtained from patients at term (mean +/- SD = 39.1 +/- 0.5 wk) prior to the onset of labor (n = 6), or in active phase of labor with spontaneous onset (n = 7). Expression of 12,626 genes was evaluated using microarrays (Human Genome U95A; Affymetrix) and compared between labor and non-labor samples. Genes with the largest labor-associated change and the lowest variability in expression are likely to be fundamental for parturition, so gene expression was ranked accordingly. From 500 genes with the highest rank we identified genes with similar expression profiles using two independent clustering techniques. Sets of genes with a probability of chance grouping by both techniques less than 0.01 represented 71.2%, 81.8%, and 79.8% of the 500 genes in the fundus, lower segment, and cervix, respectively. We identified 14, 14, and 12 those sets of genes in the fundus, lower segment, and cervix, respectively. This enabled networks of co-regulated and co-expressed genes to be discovered. Many genes within the same cluster shared similar functions or had functions pertinent to the process of labor. CONCLUSIONS Our results provide support for many of the established processes of parturition and also describe novel-to-labor genes not previously associated with this process. 
The elucidation of these mechanisms likely to be fundamental for controlling labor is an important prerequisite to the development of effective treatments for major obstetric problems--including prematurity, with its long-term consequences to the health of mother and offspring.
Affiliation(s)
- Radek Bukowski
- Department of Obstetrics and Gynecology, University of Texas Medical Branch at Galveston, Galveston, Texas, USA.
|
31
|
Blanco E, Messeguer X, Smith TF, Guigó R. Transcription factor map alignment of promoter regions. PLoS Comput Biol 2006; 2:e49. [PMID: 16733547 PMCID: PMC1464811 DOI: 10.1371/journal.pcbi.0020049] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2005] [Accepted: 03/31/2006] [Indexed: 11/18/2022] Open
Abstract
We address the problem of comparing and characterizing the promoter regions of genes with similar expression patterns. This remains a challenging problem in sequence analysis, because often the promoter regions of co-expressed genes do not show discernible sequence conservation. In our approach, thus, we have not directly compared the nucleotide sequence of promoters. Instead, we have obtained predictions of transcription factor binding sites, annotated the predicted sites with the labels of the corresponding binding factors, and aligned the resulting sequences of labels--to which we refer here as transcription factor maps (TF-maps). To obtain the global pairwise alignment of two TF-maps, we have adapted an algorithm initially developed to align restriction enzyme maps. We have optimized the parameters of the algorithm in a small, but well-curated, collection of human-mouse orthologous gene pairs. Results in this dataset, as well as in an independent much larger dataset from the CISRED database, indicate that TF-map alignments are able to uncover conserved regulatory elements, which cannot be detected by the typical sequence alignments.
Affiliation(s)
- Enrique Blanco
- Research Group in Biomedical Informatics, Institut Municipal d'Investigació Mèdica/Universitat Pompeu Fabra, Barcelona, Catalonia, Spain
|
32
|
Robertson G, Bilenky M, Lin K, He A, Yuen W, Dagpinar M, Varhol R, Teague K, Griffith OL, Zhang X, Pan Y, Hassel M, Sleumer MC, Pan W, Pleasance ED, Chuang M, Hao H, Li YY, Robertson N, Fjell C, Li B, Montgomery SB, Astakhova T, Zhou J, Sander J, Siddiqui AS, Jones SJM. cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Res 2006; 34:D68-73. [PMID: 16381958 PMCID: PMC1347438 DOI: 10.1093/nar/gkj075] [Citation(s) in RCA: 88] [Impact Index Per Article: 4.9] [Received: 08/15/2005] [Revised: 10/08/2005] [Accepted: 10/08/2005] [Indexed: 11/30/2022] Open
Abstract
We describe cisRED, a database for conserved regulatory elements that are identified and ranked by a genome-scale computational system (www.cisred.org). The database and high-throughput predictive pipeline are designed to address diverse target genomes in the context of rapidly evolving data resources and tools. Motifs are predicted in promoter regions using multiple discovery methods applied to sequence sets that include corresponding sequence regions from vertebrates. We estimate motif significance by applying discovery and post-processing methods to randomized sequence sets that are adaptively derived from target sequence sets, retain motifs with p-values below a threshold, and identify groups of similar motifs and co-occurring motif patterns. The database offers information on atomic motifs, motif groups and patterns. It is web-accessible, and can be queried directly, downloaded or installed locally.
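The step of estimating significance from randomized sequence sets and retaining motifs below a p-value threshold can be sketched as follows. This is an assumed, generic empirical p-value estimator, not the cisRED pipeline: the function names, the add-one correction, the scores, and the threshold of 0.05 are all hypothetical.

```python
from typing import Dict, Sequence


def empirical_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """Fraction of null scores at least as large as the observed score.

    The add-one correction (an assumption of this sketch) keeps the
    estimate above zero when no null score reaches the observed one.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def retain_significant(
    motif_scores: Dict[str, float],
    null_scores: Sequence[float],
    threshold: float = 0.05,  # hypothetical cutoff
) -> Dict[str, float]:
    """Keep motifs whose empirical p-value is below the threshold."""
    kept: Dict[str, float] = {}
    for name, score in motif_scores.items():
        p = empirical_p_value(score, null_scores)
        if p < threshold:
            kept[name] = p
    return kept


# Hypothetical scores: 99 motif scores obtained from randomized sequence sets.
null = list(range(1, 100))
motifs = {"motif_a": 100, "motif_b": 50}
significant = retain_significant(motifs, null)
```

In this toy run only `motif_a` survives, because its score exceeds every score obtained from the randomized sets.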
Affiliation(s)
- G Robertson
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada.
|