1
Temporal clustering analysis of endothelial cell gene expression following exposure to a conventional radiotherapy dose fraction using Gaussian process clustering. PLoS One 2018; 13:e0204960. [PMID: 30281653 PMCID: PMC6169916 DOI: 10.1371/journal.pone.0204960] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/25/2018] [Accepted: 09/15/2018] [Indexed: 12/31/2022]
Abstract
The vascular endothelium is considered a key cell compartment for the response to ionizing radiation of normal tissues and tumors, and a promising target to improve the differential effect of radiotherapy in the future. Following radiation exposure, the global endothelial cell response covers a wide range of gene, miRNA, protein and metabolite expression modifications. Changes occur at the transcriptional, translational and post-translational levels and impact cell phenotype as well as the microenvironment by the production and secretion of soluble factors such as reactive oxygen species, chemokines, cytokines and growth factors. These radiation-induced dynamic modifications of molecular networks may control the endothelial cell phenotype and govern recruitment of immune cells, stressing the importance of clearly understanding the mechanisms which underlie these temporal processes. A wide variety of time series data is commonly used in bioinformatics studies, including gene expression, protein concentrations and metabolomics data. How best to cluster these data remains an open problem. Here, we introduce kernels between Gaussian processes modeling time series, and subsequently introduce a spectral clustering algorithm. We apply the methods to the study of human primary endothelial cells (HUVECs) exposed to a radiotherapy dose fraction (2 Gy). Time windows of differential expression of 301 genes involved in key cellular processes such as angiogenesis, inflammation, apoptosis, immune response and protein kinase were determined from 12 hours to 3 weeks post-irradiation. Then, 43 temporal clusters corresponding to profiles of similar expressions, including 49 genes out of 301 initially measured, were generated according to the proposed method. Forty-seven transcription factors (TFs) responsible for the expression of clusters of genes were predicted from sequence regulatory elements using the MotifMap system. Their temporal profiles of occurrences were established and clustered. Dynamic network interactions and molecular pathways of TFs and differential genes were finally explored, revealing key node genes and putative important cellular processes involved in tissue infiltration by immune cells following exposure to a radiotherapy dose fraction.
2
Repeated Selection of Alternatively Adapted Haplotypes Creates Sweeping Genomic Remodeling in Stickleback. Genetics 2018; 209:921-939. [PMID: 29794240 PMCID: PMC6028257 DOI: 10.1534/genetics.117.300610] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Received: 12/09/2017] [Accepted: 05/21/2018] [Indexed: 01/06/2023]
Abstract
Heterogeneous genetic divergence can accumulate across the genome when populations adapt to different habitats while still exchanging alleles. How long does diversification take and how much of the genome is affected? When divergence occurs in parallel from standing genetic variation, how often are the same haplotypes involved? We explore these questions using restriction site-associated DNA sequencing genotyping data and show that broad-scale genomic repatterning, fueled by copious standing variation, can emerge in just dozens of generations in replicate natural populations of threespine stickleback fish (Gasterosteus aculeatus). After the catastrophic 1964 Alaskan earthquake, marine stickleback colonized newly created ponds on seismically uplifted islands. We find that freshwater fish in these young ponds differ from their marine ancestors across the same genomic segments previously shown to have diverged in much older lake populations. Outside of these core divergent regions the genome shows no population structure across the ocean-freshwater divide, consistent with strong local selection acting in alternative environments on stickleback populations still connected by significant gene flow. Reinforcing this inference, a majority of divergent haplotypes that are at high frequency in ponds are detectable in the sea, even across great geographic distances. Building upon previous population genomics work in this model species, our data suggest that a long history of divergent selection and gene flow among stickleback populations in oceanic and freshwater habitats has maintained polymorphisms of alternatively adapted DNA sequences that facilitate parallel evolution.
3
GEsture: an online hand-drawing tool for gene expression pattern search. PeerJ 2018; 6:e4927. [PMID: 29942676 PMCID: PMC6015481 DOI: 10.7717/peerj.4927] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/19/2018] [Accepted: 05/18/2018] [Indexed: 01/21/2023]
Abstract
Gene expression profiling data provide useful information for the investigation of biological function and process. However, identifying a specific expression pattern from extensive time series gene expression data is not an easy task. Clustering, a popular method, is often used to group genes with similar expression; however, genes with a 'desirable' or 'user-defined' pattern cannot be efficiently detected by clustering methods. To address these limitations, we developed an online tool called GEsture. Users can draw or graph a curve using a mouse instead of inputting abstract parameters of clustering methods. Taking a gene expression curve as input, GEsture searches time series datasets for genes showing similar, opposite and time-delay expression patterns. We presented three examples that illustrate the capacity of GEsture in gene hunting while following users' requirements. GEsture also provides visualization tools (such as expression pattern figure, heat map and correlation network) to display the search results. The output may provide useful information for researchers to understand the targets, function and biological processes of the involved genes.
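As a rough illustration of what searching for similar, opposite and time-delay patterns can mean, the sketch below compares a query curve with a gene profile using Pearson correlation, with and without a shift in time. This is not GEsture's actual algorithm: the function names, the correlation threshold, the maximum lag and the toy profiles are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def classify(query, profile, max_lag=2, threshold=0.9):
    """Label a profile relative to the query curve (hypothetical rule)."""
    r = pearson(query, profile)
    if r >= threshold:
        return "similar"
    if r <= -threshold:
        return "opposite"
    for lag in range(1, max_lag + 1):
        # Compare the query with the profile observed `lag` time points later.
        if pearson(query[:-lag], profile[lag:]) >= threshold:
            return "delayed"
    return "none"

# Toy data: a drawn peak and three made-up gene profiles.
query = [0, 1, 2, 3, 2, 1, 0, 0]
same_shape = [0, 2, 4, 6, 4, 2, 0, 0]   # scaled copy of the query
mirrored = [3, 2, 1, 0, 1, 2, 3, 3]     # 3 - query
shifted = [0, 0, 1, 2, 3, 2, 1, 0]      # query delayed by one time point

for name, profile in [("same_shape", same_shape),
                      ("mirrored", mirrored),
                      ("shifted", shifted)]:
    print(name, classify(query, profile))
```

Correlation ignores amplitude, so a scaled copy of the query still counts as similar; a tool that cares about expression magnitude would need a different measure.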
4
Generalized Correlation Coefficient for Non-Parametric Analysis of Microarray Time-Course Data. J Integr Bioinform 2017; 14(2). [PMID: 28753536 PMCID: PMC6042830 DOI: 10.1515/jib-2017-0011] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 03/14/2017] [Accepted: 04/04/2017] [Indexed: 11/15/2022]
Abstract
Modeling complex time-course patterns is a challenging issue in microarray studies because gene expression responds in complex ways over the course of the experiment. We introduce the generalized correlation coefficient and propose a combinatory approach for detecting, testing and clustering heterogeneous time-course gene expression patterns. Application of the method identified nonlinear time-course patterns in high agreement with parametric analysis. We conclude that the non-parametric nature of the generalized correlation analysis could make it a useful and efficient tool for analyzing microarray time-course data and for exploring the complex relationships in omics data for studying their association with disease and health.
5
Alignment of time course gene expression data and the classification of developmentally driven genes with hidden Markov models. BMC Bioinformatics 2015; 16:196. [PMID: 26084333 PMCID: PMC4472167 DOI: 10.1186/s12859-015-0634-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/07/2014] [Accepted: 06/01/2015] [Indexed: 11/26/2022]
Abstract
Background We consider data from a time course microarray experiment that was conducted on grapevines over the development cycle of the grape berries at two different vineyards in South Australia. Although the underlying biological process of berry development is the same at both vineyards, there are differences in the timing of the development due to local conditions. We aim to align the data from the two vineyards to enable an integrated analysis of the gene expression and use the alignment of the expression profiles to classify likely developmental function. Results We present a novel alignment method based on hidden Markov models (HMMs) and use the method to align the motivating grapevine data. We show that our alignment method is robust against subsets of profiles that are not suitable for alignment, investigate alignment diagnostics under the model and demonstrate the classification of developmentally driven genes. Conclusions The classification of developmentally driven genes both validates that the alignment we obtain is meaningful and also gives new evidence that can be used to identify the role of genes with unknown function. Using our alignment methodology, we find at least 1279 grapevine probe sets with no current annotated function that are likely to be controlled in a developmental manner.
6
Interpolation based consensus clustering for gene expression time series. BMC Bioinformatics 2015; 16:117. [PMID: 25888019 PMCID: PMC4407314 DOI: 10.1186/s12859-015-0541-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 09/18/2013] [Accepted: 02/27/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND Unsupervised analyses such as clustering are the essential tools required to interpret time-series expression data from microarrays. Several clustering algorithms have been developed to analyze gene expression data. Early methods such as k-means, hierarchical clustering, and self-organizing maps are popular for their simplicity. However, because of noise and uncertainty of measurement, these common algorithms have low accuracy. Moreover, because gene expression is a temporal process, the relationship between successive time points should be considered in the analyses. In addition, biological processes are generally continuous; therefore, the datasets collected from time series experiments are often found to have an insufficient number of data points and, as a result, compensation for missing data can also be an issue. RESULTS An affinity propagation-based clustering algorithm for time-series gene expression data is proposed. The algorithm explores the relationship between genes using a sliding-window mechanism to extract a large number of features. In addition, the time-course datasets are resampled with spline interpolation to predict the unobserved values. Finally, a consensus process is applied to enhance the robustness of the method. Some real gene expression datasets were analyzed to demonstrate the accuracy and efficiency of the algorithm. CONCLUSION The proposed algorithm has benefitted from the use of cubic B-splines interpolation, sliding-window, affinity propagation, gene relativity graph, and a consensus process, and, as a result, provides both appropriate and effective clustering of time-series gene expression data. The proposed method was tested with gene expression data from the Yeast galactose dataset, the Yeast cell-cycle dataset (Y5), and the Yeast sporulation dataset, and the results illustrated the relationships between the expressed genes, which may give some insights into the biological processes involved.
7
Toxicity mechanisms identification via gene set enrichment analysis of time-series toxicogenomics data: impact of time and concentration. Environmental Science & Technology 2015; 49:4618-26. [PMID: 25785649 PMCID: PMC6321746 DOI: 10.1021/es505199f] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 05/02/2023]
Abstract
The advance in high-throughput "toxicogenomics" technologies, which allow for concurrent monitoring of cellular responses globally upon exposure to chemical toxicants, holds promise for next-generation toxicity assessment. It is recognized that cellular responses to toxicants have a highly dynamic nature, and exhibit both temporal complexity and dose-response shifts. Most current gene enrichment or pathway analyses do not recognize the inherent correlation within time series data, and may potentially miss important pathways or yield biased and inconsistent results that ignore dynamic patterns and time-sensitivity. In this study, we investigated the application of two score metrics for GSEA (gene set enrichment analysis) to rank the genes in a way that considers the temporal gene expression profile. One applies a novel time series CPCA (common principal components analysis) to generate scores for genes based on their contributions to the common temporal variation among treatments for a given chemical at different concentrations. The other employs an integrated altered gene expression quantifier, TELI (transcriptional effect level index), that integrates altered gene expression magnitude over the exposure time. By comparing the GSEA results using the two different ranking metrics for examining the dynamic responses of reporter cells treated with various dose levels of three model toxicants, mitomycin C, hydrogen peroxide, and lead nitrate, the analysis identified and revealed different toxicity mechanisms of these chemicals that are chemical-specific as well as time-aware and dose-sensitive. The ability, advantages, and disadvantages of varying ranking metrics were discussed. These findings support the notion that toxicity bioassays should account for the cells' complex dynamic responses, thereby implying that both data acquisition and data analysis should look beyond simple traditional end point responses.
8
Detecting time periods of differential gene expression using Gaussian processes: an application to endothelial cells exposed to radiotherapy dose fraction. Bioinformatics 2014; 31:728-35. [DOI: 10.1093/bioinformatics/btu699] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 11/13/2022]
9
10
Abstract
Background Time-course gene expression experiments are useful tools for exploring biological processes. In this type of experiment, gene expression changes are monitored over time. Unfortunately, replication of time series is still costly and long time courses usually do not have replicates. Many approaches have been proposed to deal with this data structure, but none of them in the field of pathway analysis. Pathway analyses have acquired great relevance for helping the interpretation of gene expression data. Several methods have been proposed to this aim: from the classical enrichment to the more complex topological analysis that gains power from the topology of the pathway. None of them were devised to identify temporal variations in time course data. Results Here we present timeClip, a topology based pathway analysis specifically tailored to long time series without replicates. timeClip combines dimension reduction techniques and graph decomposition theory to explore and identify the portion of pathways that is most time-dependent. In the first step, timeClip selects the time-dependent pathways; in the second step, the most time dependent portions of these pathways are highlighted. We used timeClip on simulated data and on a benchmark dataset from a mouse muscle regeneration model. Our approach shows good performance on different simulated settings. On the real dataset, we identify 76 time-dependent pathways, most of which are known to be involved in the regeneration process. Focusing on the 'mTOR signaling pathway' we highlight the timing of key processes of the muscle regeneration: from the early pathway activation through growth factor signals to the late burst of protein production needed for the fiber regeneration. Conclusions timeClip represents a new improvement in the field of time-dependent pathway analysis. It makes it possible to isolate and dissect pathways characterized by time-dependent components. Furthermore, using timeClip on a mouse muscle regeneration dataset we were able to characterize the process of muscle fiber regeneration with its correct timing.
11
Clustering longitudinal profiles using P-splines and mixed effects models applied to time-course gene expression data. Comput Stat Data Anal 2014. [DOI: 10.1016/j.csda.2013.04.001] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 11/26/2022]
12
Bayesian clustering of replicated time-course gene expression data with weak signals. Ann Appl Stat 2013. [DOI: 10.1214/13-aoas650] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
13
Proximity measures for clustering gene expression microarray data: a validation methodology and a comparative analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:845-857. [PMID: 24334380 DOI: 10.1109/tcbb.2013.9] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 06/03/2023]
Abstract
Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, up to date, there are no comprehensive guidelines concerning how to choose proximity measures for clustering microarray data. Pearson is the most used proximity measure, whereas characteristics of other ones remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures in 52 data sets from time course and cancer experiments. Our results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and euclidean distance. Given that different measures stood out for time course and cancer data evaluations, their choice should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
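To make the choice of proximity measure concrete, the sketch below computes the three commonly employed measures named in the abstract (Pearson, Spearman and Euclidean distance) for two short expression profiles. The profiles and function names are made up for illustration and are not taken from the paper; Spearman is computed here as Pearson correlation on average ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two toy profiles with the same shape but a tenfold difference in scale.
gene_a = [1.0, 2.0, 4.0, 8.0]
gene_b = [10.0, 20.0, 40.0, 80.0]

print("Pearson:  ", pearson(gene_a, gene_b))
print("Spearman: ", spearman(gene_a, gene_b))
print("Euclidean:", euclidean(gene_a, gene_b))
```

The two correlation measures treat these profiles as identical in shape, while the Euclidean distance is large because of the scale difference. This is one reason the best measure can differ between scenarios.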
14
Accelerating Bayesian hierarchical clustering of time series data with a randomised algorithm. PLoS One 2013; 8:e59795. [PMID: 23565168 PMCID: PMC3614914 DOI: 10.1371/journal.pone.0059795] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 09/25/2012] [Accepted: 02/19/2013] [Indexed: 11/19/2022]
Abstract
We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make practical the analysis of larger genomic data sets, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is available as part of the R package BHC, which is available for download from Bioconductor (version 2.10 and above) via http://bioconductor.org/packages/2.10/bioc/html/BHC.html. We have also made available a set of R scripts which can be used to reproduce the analyses carried out in this paper. These are available from https://sites.google.com/site/randomisedbhc/.
15
Classification of time series gene expression in clinical studies via integration of biological network. PLoS One 2013; 8:e58383. [PMID: 23516469 PMCID: PMC3596388 DOI: 10.1371/journal.pone.0058383] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 09/18/2012] [Accepted: 02/04/2013] [Indexed: 12/24/2022]
Abstract
The increasing availability of time series expression datasets, although promising, raises a number of new computational challenges. Accordingly, the development of suitable classification methods to make reliable and sound predictions is becoming a pressing issue. We propose, here, a new method to classify time series gene expression via integration of biological networks. We evaluated our approach on 2 different datasets and showed that the use of a hidden Markov model/Gaussian mixture models hybrid explores the time-dependence of the expression data, thereby leading to better prediction results. We demonstrated that the biclustering procedure identifies function-related genes as a whole, giving rise to high accordance in prognosis prediction across independent time series datasets. In addition, we showed that integration of biological networks into our method significantly improves prediction performance. Moreover, we compared our approach with several state-of-the-art algorithms and found that our method outperformed previous approaches with regard to various criteria. Finally, our approach achieved better prediction results on early-stage data, implying the potential of our method for practical prediction.
16
DREM 2.0: Improved reconstruction of dynamic regulatory networks from time-series expression data. BMC Systems Biology 2012; 6:104. [PMID: 22897824 PMCID: PMC3464930 DOI: 10.1186/1752-0509-6-104] [Citation(s) in RCA: 90] [Impact Index Per Article: 7.5] [Received: 02/28/2012] [Accepted: 07/18/2012] [Indexed: 12/28/2022]
Abstract
Background Modeling dynamic regulatory networks is a major challenge since much of the protein-DNA interaction data available is static. The Dynamic Regulatory Events Miner (DREM) uses a Hidden Markov Model-based approach to integrate this static interaction data with time series gene expression, leading to models that can determine when transcription factors (TFs) activate genes and what genes they regulate. DREM has been used successfully in diverse areas of biological research. However, several issues were not addressed by the original version. Results DREM 2.0 is a comprehensive software package for reconstructing dynamic regulatory networks that supports interactive graphical or batch mode. Version 2.0 introduces a set of new features that are unique in comparison with other software. First, we provide static interaction data for additional species. Second, DREM 2.0 now accepts continuous binding values and we added a new method to utilize TF expression levels when searching for dynamic models. Third, we added support for discriminative motif discovery, which is particularly powerful for species with limited experimental interaction data. Finally, we improved the visualization to support the new features. Combined, these changes improve the ability of DREM 2.0 to accurately recover dynamic regulatory networks and make it much easier to use for analyzing such networks in several species with varying degrees of interaction information. Conclusions DREM 2.0 provides a unique framework for constructing and visualizing dynamic regulatory networks. DREM 2.0 can be downloaded from: www.sb.cs.cmu.edu/drem.
17
Fast MCMC sampling for hidden Markov Models to determine copy number variations. BMC Bioinformatics 2011; 12:428. [PMID: 22047014 PMCID: PMC3371636 DOI: 10.1186/1471-2105-12-428] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 02/17/2011] [Accepted: 11/02/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND Hidden Markov Models (HMM) are often used for analyzing Comparative Genomic Hybridization (CGH) data to identify chromosomal aberrations or copy number variations by segmenting observation sequences. For efficiency reasons the parameters of a HMM are often estimated with maximum likelihood and a segmentation is obtained with the Viterbi algorithm. This introduces considerable uncertainty in the segmentation, which can be avoided with Bayesian approaches integrating out parameters using Markov Chain Monte Carlo (MCMC) sampling. While the advantages of Bayesian approaches have been clearly demonstrated, the likelihood based approaches are still preferred in practice for their lower running times; datasets coming from high-density arrays and next generation sequencing amplify these problems. RESULTS We propose an approximate sampling technique, inspired by compression of discrete sequences in HMM computations and by kd-trees to leverage spatial relations between data points in typical data sets, to speed up the MCMC sampling. CONCLUSIONS We test our approximate sampling method on simulated and biological ArrayCGH datasets and high-density SNP arrays, and demonstrate a speed-up of 10 to 60 respectively 90 while achieving competitive results with the state-of-the art Bayesian approaches. AVAILABILITY An implementation of our method will be made available as part of the open source GHMM library from http://ghmm.org.
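For readers unfamiliar with the likelihood-based baseline mentioned in the background, the sketch below shows a Viterbi segmentation with fixed parameters on a toy two-state HMM. It is not the paper's approximate MCMC sampler. The state names, the discrete "lo"/"hi" observations and all probabilities are invented for illustration; real ArrayCGH data would typically use continuous emissions.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely hidden-state path for a discrete-emission HMM (log space)."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            score[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state to recover the segmentation.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

# Hypothetical toy model: segments tend to persist (0.9 self-transition).
states = ["normal", "gain"]
log_start = {s: math.log(0.5) for s in states}
log_trans = log_table({"normal": {"normal": 0.9, "gain": 0.1},
                       "gain": {"normal": 0.1, "gain": 0.9}})
log_emit = log_table({"normal": {"lo": 0.8, "hi": 0.2},
                      "gain": {"lo": 0.2, "hi": 0.8}})

observations = ["lo", "lo", "lo", "hi", "hi", "hi"]
segmentation = viterbi(observations, states, log_start, log_trans, log_emit)
print(segmentation)
```

Viterbi returns a single best path for one fixed parameter setting, so it carries no measure of uncertainty. The Bayesian approaches discussed in the abstract address this by integrating over parameters.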
18
Detection and interpretation of metabolite-transcript coresponses using combined profiling data. Bioinformatics 2011; 27:i357-65. [PMID: 21685093 PMCID: PMC3117345 DOI: 10.1093/bioinformatics/btr231] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Indexed: 01/29/2023]
Abstract
Motivation: Studying the interplay between gene expression and metabolite levels can yield important information on the physiology of stress responses and adaptation strategies. Performing transcriptomics and metabolomics in parallel during time-series experiments represents a systematic way to gain such information. Several combined profiling datasets have been added to the public domain and they form a valuable resource for hypothesis generating studies. Unfortunately, detecting coresponses between transcript levels and metabolite abundances is non-trivial: they cannot be assumed to overlap directly with underlying biochemical pathways and they may be subject to time delays and obscured by considerable noise. Results: Our aim was to predict pathway comemberships between metabolites and genes based on their coresponses to applied stress. We found that in the presence of strong noise and time-shifted responses, a hidden Markov model-based similarity outperforms the simpler Pearson correlation but performs comparably or worse in their absence. Therefore, we propose a supervised method that applies pathway information to summarize similarity statistics to a consensus statistic that is more informative than any of the single measures. Using four combined profiling datasets, we show that comembership between metabolites and genes can be predicted for numerous KEGG pathways; this opens opportunities for the detection of transcriptionally regulated pathways and novel metabolically related genes. Availability: A command-line software tool is available at http://www.cin.ufpe.br/~igcf/Metabolites. Contact: henning@psc.riken.jp; igcf@cin.ufpe.br
19
Bayesian hierarchical clustering for microarray time series data with replicates and outlier measurements. BMC Bioinformatics 2011; 12:399. [PMID: 21995452 PMCID: PMC3228548 DOI: 10.1186/1471-2105-12-399] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Received: 05/12/2011] [Accepted: 10/13/2011] [Indexed: 11/29/2022]
Abstract
Background Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites. Time series experiments have become increasingly common, necessitating the development of novel analysis tools that capture the resulting data structure. Outlier measurements at one or more time points present a significant challenge, while potentially valuable replicate information is often ignored by existing techniques. Results We present a generative model-based Bayesian hierarchical clustering algorithm for microarray time series that employs Gaussian process regression to capture the structure of the data. By using a mixture model likelihood, our method permits a small proportion of the data to be modelled as outlier measurements, and adopts an empirical Bayes approach which uses replicate observations to inform a prior distribution of the noise variance. The method automatically learns the optimum number of clusters and can incorporate non-uniformly sampled time points. Using a wide variety of experimental data sets, we show that our algorithm consistently yields higher quality and more biologically meaningful clusters than current state-of-the-art methodologies. We highlight the importance of modelling outlier values by demonstrating that noisy genes can be grouped with other genes of similar biological function. We demonstrate the importance of including replicate information, which we find enables the discrimination of additional distinct expression profiles. Conclusions By incorporating outlier measurements and replicate values, this clustering algorithm for time series microarray data provides a step towards a better treatment of the noise inherent in measurements from high-throughput genomic technologies. Timeseries BHC is available as part of the R package 'BHC' (version 1.5), which is available for download from Bioconductor (version 2.9 and above) via http://www.bioconductor.org/packages/release/bioc/html/BHC.html?pagewanted=all.
|
20
|
A growth curve model with fractional polynomials for analysing incomplete time-course data in microarray gene expression studies. Adv Bioinformatics 2011; 2011:261514. [PMID: 21966290 PMCID: PMC3182337 DOI: 10.1155/2011/261514] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 04/11/2011] [Revised: 06/07/2011] [Accepted: 08/02/2011] [Indexed: 11/20/2022] Open
Abstract
Identifying the various gene expression response patterns is a challenging issue in expression microarray time-course experiments. Due to heterogeneity in the regulatory reaction among thousands of genes tested, it is impossible to manually characterize a parametric form for each time-course pattern gene by gene. We introduce a growth curve model with fractional polynomials to automatically capture the various time-dependent expression patterns while efficiently handling missing values due to incomplete observations. For each gene, our procedure compares the performance of fractional polynomial models with power terms from a set of fixed values that offer a wide range of curve shapes and suggests a best-fitting model. After a limited simulation study, the model was applied to our human in vivo irritated epidermis data with missing observations to investigate time-dependent transcriptional responses to a chemical irritant. Our method was able to identify the various nonlinear time-course expression trajectories. The integration of growth curves with fractional polynomials provides a flexible way to model different time-course patterns together with model selection and significant gene identification strategies that can be applied in microarray-based time-course gene expression experiments with missing observations.
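The per-gene model comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' growth curve model: it fits a first-degree fractional polynomial by ordinary least squares for one gene at a time, and the power set `POWERS` is the conventional fractional polynomial set, which is an assumption here because the abstract only mentions "a set of fixed values". The function names are hypothetical.

```python
import math

# Conventional fractional polynomial power set (an assumption; the abstract
# only says "a set of fixed values"). Power 0 is read as log(t).
POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_transform(t, p):
    """Fractional polynomial term: t**p, with p == 0 meaning log(t). Needs t > 0."""
    return math.log(t) if p == 0 else t ** p


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1, rss)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss


def best_fp1(times, values):
    """Try every power and keep the fit with the smallest residual sum of squares.

    Missing observations (None) are skipped, so incomplete series still get a fit.
    Returns (power, intercept, slope, rss).
    """
    pairs = [(t, v) for t, v in zip(times, values) if v is not None]
    t_obs = [t for t, _ in pairs]
    v_obs = [v for _, v in pairs]
    best = None
    for p in POWERS:
        x = [fp_transform(t, p) for t in t_obs]
        b0, b1, rss = fit_line(x, v_obs)
        if best is None or rss < best[3]:
            best = (p, b0, b1, rss)
    return best
```

Because every candidate model here has the same number of parameters, comparing residual sums of squares is enough; a comparison across models of different degree would need a penalized criterion instead.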
|
21
|
pGQL: A probabilistic graphical query language for gene expression time courses. BioData Min 2011; 4:9. [PMID: 21501515 PMCID: PMC3096586 DOI: 10.1186/1756-0381-4-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/12/2010] [Accepted: 04/18/2011] [Indexed: 11/24/2022] Open
Abstract
Background Timeboxes are graphical user interface widgets that were proposed to specify queries on time course data. As queries can be very easily defined, an exploratory analysis of time course data is greatly facilitated. While timeboxes are effective, they have no provisions for dealing with noisy data or data with fluctuations along the time axis, which are very common in many applications. In particular, this is true for the analysis of gene expression time courses, which are mostly derived from noisy microarray measurements at few unevenly sampled time points. From a data mining point of view the robust handling of data through a sound statistical model is of great importance. Results We propose probabilistic timeboxes, which correspond to a specific class of Hidden Markov Models (HMMs), an established method in data mining. Since HMMs are a particular class of probabilistic graphical models, we call our method Probabilistic Graphical Query Language. Its implementation was realized in the free software package pGQL. We evaluate its effectiveness in exploratory analysis on a yeast sporulation data set. Conclusions We introduce a new approach to define dynamic, statistical queries on time course data. It supports an interactive exploration of reasonably large amounts of data and enables users without expert knowledge to specify fairly complex statistical models with ease. The expressivity of our approach is by its statistical nature greater and more robust with respect to amplitude and frequency fluctuation than the prior, deterministic timeboxes.
|
22
|
Classifying short gene expression time-courses with Bayesian estimation of piecewise constant functions. Bioinformatics 2011; 27:946-52. [PMID: 21266444 DOI: 10.1093/bioinformatics/btr037] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
MOTIVATION Analyzing short time-courses is a frequent and relevant problem in molecular biology, as, for example, 90% of gene expression time-course experiments span at most nine time-points. The biological or clinical questions addressed are elucidating gene regulation by identification of co-expressed genes, predicting response to treatment in clinical, trial-like settings or classifying novel toxic compounds based on similarity of gene expression time-courses to those of known toxic compounds. The latter problem is characterized by irregular and infrequent sample times and a total lack of prior assumptions about the incoming query, which stands in stark contrast to clinical settings and requires implicitly performing a local, gapped alignment of time series. The current state-of-the-art method (SCOW) uses a variant of dynamic time warping and models time series as higher order polynomials (splines). RESULTS We suggest modeling time-courses that monitor response to toxins by piecewise constant functions, which are modeled as left-right Hidden Markov Models. A Bayesian approach to parameter estimation and inference helps to cope with the short, but highly multivariate time-courses. We improve prediction accuracy by 7% and 4%, respectively, when classifying toxicology and stress response data. We also reduce running times by at least a factor of 140; note that reasonable running times are crucial when classifying response to toxins. In conclusion, we have demonstrated that appropriate reduction of model complexity can result in substantial improvements both in classification performance and running time. AVAILABILITY A Python package implementing the methods described is freely available under the GPL from http://bioinformatics.rutgers.edu/Software/MVQueries/.
|
23
|
Module discovery by exhaustive search for densely connected, co-expressed regions in biomolecular interaction networks. PLoS One 2010; 5:e13348. [PMID: 21049092 PMCID: PMC2963598 DOI: 10.1371/journal.pone.0013348] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 09/28/2009] [Accepted: 06/06/2010] [Indexed: 01/09/2023] Open
Abstract
Background Computational prediction of functionally related groups of genes (functional modules) from large-scale data is an important issue in computational biology. Gene expression experiments and interaction networks are well studied large-scale data sources, available for many not yet exhaustively annotated organisms. It is well established that, when these two data sources are analyzed jointly, modules are often reflected by highly interconnected (dense) regions in the interaction networks whose participating genes are co-expressed. However, the tractability of the problem had remained unclear and methods by which to exhaustively search for such constellations had not been presented. Methodology/Principal Findings We provide an algorithmic framework, referred to as Densely Connected Biclustering (DECOB), by which the aforementioned search problem becomes tractable. To benchmark the predictive power inherent to the approach, we computed all co-expressed, dense regions in physical protein and genetic interaction networks from human and yeast. An automated filtering procedure reduces our output, resulting in smaller collections of modules, comparable to state-of-the-art approaches. Our results performed favorably in a fair benchmarking competition which adheres to standard criteria. We demonstrate the usefulness of an exhaustive module search by using the unreduced output to more quickly perform GO term related function prediction tasks. We point out the advantages of our exhaustive output by predicting functional relationships using two examples. Conclusion/Significance We demonstrate that the computation of all densely connected and co-expressed regions in interaction networks is an approach to module discovery of considerable value.
Beyond confirming the well settled hypothesis that such co-expressed, densely connected interaction network regions reflect functional modules, we open up novel computational ways to comprehensively analyze the modular organization of an organism based on prevalent and largely available large-scale datasets. Availability Software and data sets are available at http://www.sfu.ca/~ester/software/DECOB.zip.
|
24
|
A temporal precedence based clustering method for gene expression microarray data. BMC Bioinformatics 2010; 11:68. [PMID: 20113513 PMCID: PMC2841598 DOI: 10.1186/1471-2105-11-68] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 08/12/2009] [Accepted: 01/30/2010] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Time-course microarray experiments can produce useful data which can help in understanding the underlying dynamics of the system. Clustering is an important stage in microarray data analysis where the data is grouped together according to certain characteristics. The majority of clustering techniques are based on distance or visual similarity measures which may not be suitable for clustering of temporal microarray data where the sequential nature of time is important. We present a Granger causality based technique to cluster temporal microarray gene expression data, which measures the interdependence between two time-series by statistically testing if one time-series can be used for forecasting the other time-series or not. RESULTS A gene-association matrix is constructed by testing temporal relationships between pairs of genes using the Granger causality test. The association matrix is further analyzed using a graph-theoretic technique to detect highly connected components representing interesting biological modules. We test our approach on synthesized datasets and real biological datasets obtained for Arabidopsis thaliana. We show the effectiveness of our approach by analyzing the results using the existing biological literature. We also report interesting structural properties of the association network commonly desired in any biological system. CONCLUSIONS Our experiments on synthesized and real microarray datasets show that our approach produces encouraging results. The method is simple in implementation and is statistically traceable at each step. The method can produce sets of functionally related genes which can be further used for reverse-engineering of gene circuits.
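The pipeline in this abstract (pairwise Granger tests, an association matrix, then a graph-theoretic grouping) might be sketched as follows. This is an illustration under stated assumptions, not the published method: it uses a single lag, compares raw F statistics against a hypothetical `threshold` instead of computing p-values, and uses plain connected components as a stand-in for the "highly connected components" of the paper. All function names are hypothetical.

```python
def residuals(x, y):
    """Residuals of the least-squares fit y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [(yi - my) - b1 * (xi - mx) for xi, yi in zip(x, y)]


def granger_f(cause, effect):
    """Lag-1 F statistic for 'the past of cause helps forecast effect'."""
    y_now, y_prev, x_prev = effect[1:], effect[:-1], cause[:-1]
    r_y = residuals(y_prev, y_now)   # restricted model: effect's own past only
    r_x = residuals(y_prev, x_prev)  # part of cause's past not explained by it
    rss_restricted = sum(r * r for r in r_y)
    sxx = sum(r * r for r in r_x)
    sxy = sum(a * b for a, b in zip(r_x, r_y))
    # Adding the cause's past lowers the RSS by sxy**2 / sxx (partial regression).
    rss_full = rss_restricted - sxy * sxy / sxx
    dof = len(y_now) - 3  # intercept plus two slopes in the full model
    return (rss_restricted - rss_full) / (rss_full / dof)


def connected_components(n, edges):
    """Group node indices 0..n-1 into connected components of an undirected graph."""
    neighbours = {i: set() for i in range(n)}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(comp))
    return components


def granger_modules(series, threshold=10.0):
    """Link genes i and j when either direction exceeds the (hypothetical) threshold."""
    n = len(series)
    edges = [(i, j) for i in range(n) for j in range(n)
             if i != j and granger_f(series[i], series[j]) > threshold]
    return connected_components(n, edges)
```

Note that `granger_f` is asymmetric by design: a large value for (x, y) together with a small value for (y, x) is what expresses temporal precedence, which a symmetric distance measure cannot capture.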
|
25
|
PyMix--the python mixture package--a tool for clustering of heterogeneous biological data. BMC Bioinformatics 2010; 11:9. [PMID: 20053276 PMCID: PMC2823712 DOI: 10.1186/1471-2105-11-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 08/03/2009] [Accepted: 01/06/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Cluster analysis is an important technique for the exploratory analysis of biological data. Such data is often high-dimensional, inherently noisy and contains outliers. This makes clustering challenging. Mixtures are versatile and powerful statistical models which perform robustly for clustering in the presence of noise and have been successfully applied in a wide range of applications. RESULTS PyMix - the Python mixture package implements algorithms and data structures for clustering with basic and advanced mixture models. The advanced models include context-specific independence mixtures, mixtures of dependence trees and semi-supervised learning. PyMix is licenced under the GNU General Public licence (GPL). PyMix has been successfully used for the analysis of biological sequence, complex disease and gene expression data. CONCLUSIONS PyMix is a useful tool for cluster analysis of biological data. Due to the general nature of the framework, PyMix can be applied to a wide range of applications and data sets.
|
26
|
Partially-supervised protein subclass discovery with simultaneous annotation of functional residues. BMC Struct Biol 2009; 9:68. [PMID: 19857261 PMCID: PMC2777906 DOI: 10.1186/1472-6807-9-68] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 04/03/2009] [Accepted: 10/26/2009] [Indexed: 03/20/2023]
Abstract
BACKGROUND The study of functional subfamilies of protein domain families and the identification of the residues which determine substrate specificity is an important question in the analysis of protein domains. One way to address this question is the use of clustering methods for protein sequence data and approaches to predict functional residues based on such clusterings. The locations of putative functional residues in known protein structures provide insights into how different substrate specificities are reflected on the protein structure level. RESULTS We have developed an extension of the context-specific independence mixture model clustering framework which allows for the integration of experimental data. As these are usually known only for a few proteins, our algorithm implements a partially-supervised learning approach. We discover domain subfamilies and predict functional residues for four protein domain families: phosphatases, pyridoxal dependent decarboxylases, WW and SH3 domains to demonstrate the usefulness of our approach. CONCLUSION The partially-supervised clustering revealed biologically meaningful subfamilies even for highly heterogeneous domains and the predicted functional residues provide insights into the basis of the different substrate specificities.
|
27
|
Modified variational Bayes EM estimation of hidden Markov tree model of cell lineages. Bioinformatics 2009; 25:2824-30. [DOI: 10.1093/bioinformatics/btp456] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022] Open
|
28
|
Abstract
Motivation: Personalized medicine based on molecular aspects of diseases, such as gene expression profiling, has become increasingly popular. However, one faces multiple challenges when analyzing clinical gene expression data; most of the well-known theoretical issues such as high dimension of feature spaces versus few examples, noise and missing data apply. Special care is needed when designing classification procedures that support personalized diagnosis and choice of treatment. Here, we particularly focus on classification of interferon-β (IFNβ) treatment response in Multiple Sclerosis (MS) patients, which has attracted substantial attention in the recent past. Half of the patients remain unaffected by IFNβ treatment, which is still the standard. For them the treatment should be ceased in a timely manner to mitigate the side effects. Results: We propose constrained estimation of mixtures of hidden Markov models as a methodology to classify patient response to IFNβ treatment. The advantages of our approach are that it takes the temporal nature of the data into account and that it is robust with respect to noise, missing data and mislabeled samples. Moreover, mixture estimation enables exploration of the presence of response sub-groups of patients on the transcriptional level. We clearly outperformed all prior approaches in terms of prediction accuracy, raising it, for the first time, above 90%. Additionally, we were able to identify potentially mislabeled samples and to sub-divide the good responders into two sub-groups that exhibited different transcriptional response programs. This is supported by recent findings on MS pathology and therefore may raise interesting clinical follow-up questions. Availability: The method is implemented in the GQL framework and is available at http://www.ghmm.org/gql. Datasets are available at http://www.cin.ufpe.br/~igcf/MSConst Contact: igcf@cin.ufpe.br Supplementary information: Supplementary data are available at Bioinformatics online.
|
29
|
Dynamic-model-based method for selecting significantly expressed genes from time-course expression profiles. IEEE Trans Inf Technol Biomed 2009; 14:16-22. [PMID: 19527959 DOI: 10.1109/titb.2009.2025125] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/05/2022]
Abstract
This paper proposes a dynamic-model-based method for selecting significantly expressed (SE) genes from their time-course expression profiles. A gene is considered to be SE if its time-course expression profile is more likely time-dependent than random. The proposed method describes a time-dependent gene expression profile by a nonzero-order autoregressive (AR) model, and a time-independent gene expression profile by a zero-order AR model. Akaike information criterion (AIC) is used to compare the models and subsequently determine whether a time-course gene expression profile is time-independent or time-dependent. The performance of the proposed method is investigated on both a synthetic dataset and a real-life biological dataset in terms of the false discovery rate (FDR) and the false nondiscovery rate (FNR). The results show that the proposed method is valid for selecting SE genes from their time-course expression profiles.
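The model comparison in this abstract can be made concrete with a small sketch. It assumes a first-order model as the only nonzero-order candidate, least-squares fitting, and the Gaussian form of the criterion, AIC = n ln(RSS/n) + 2k; the paper may differ on each point, and the function names are hypothetical.

```python
import math


def rss_constant(y):
    """Residual sum of squares of the zero-order model y_t = c + e_t."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def rss_ar1(y_prev, y_now):
    """Residual sum of squares of the first-order model y_t = c + a * y_{t-1} + e_t."""
    n = len(y_now)
    mx, my = sum(y_prev) / n, sum(y_now) / n
    sxx = sum((x - mx) ** 2 for x in y_prev)
    sxy = sum((x - mx) * (y - my) for x, y in zip(y_prev, y_now))
    slope = sxy / sxx
    return sum(((y - my) - slope * (x - mx)) ** 2 for x, y in zip(y_prev, y_now))


def aic(rss, n, k):
    """Gaussian AIC up to an additive constant; k counts the mean parameters."""
    return n * math.log(rss / n) + 2 * k


def is_time_dependent(profile):
    """True when the AR(1) model has the lower AIC on the same n - 1 targets."""
    y_now, y_prev = profile[1:], profile[:-1]
    n = len(y_now)
    aic_zero = aic(rss_constant(y_now), n, 1)
    aic_one = aic(rss_ar1(y_prev, y_now), n, 2)
    return aic_one < aic_zero
```

Both models are scored on the same targets (the profile without its first point), since comparing AIC values computed on different numbers of observations would not be meaningful. A profile that either model fits exactly would need separate handling, because the logarithm of a zero residual sum is undefined.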
|
30
|
An unsupervised conditional random fields approach for clustering gene expression time series. Bioinformatics 2008; 24:2467-73. [PMID: 18718949 DOI: 10.1093/bioinformatics/btn375] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022]
Abstract
MOTIVATION There is a growing interest in extracting statistical patterns from gene expression time-series data, in which a key challenge is the development of stable and accurate probabilistic models. Currently popular models, however, would be computationally prohibitive unless some independence assumptions are made to describe large-scale data. We propose an unsupervised conditional random fields (CRF) model to overcome this problem by progressively infusing information into the labelling process through a small variable voting pool. RESULTS An unsupervised CRF model is proposed for efficient analysis of gene expression time series and is successfully applied to gene class discovery and class prediction. The proposed model treats each time series as a random field and assigns an optimal cluster label to each time series, so as to partition the time series into clusters without a priori knowledge about the number of clusters and the initial centroids. Another advantage of the proposed method is the relaxation of independence assumptions.
|
31
|
Abstract
Motivation: Classification of tissues using static gene-expression data has received considerable attention. Recently, a growing number of expression datasets are measured as a time series. Methods that are specifically designed for this temporal data can both utilize its unique features (temporal evolution of profiles) and address its unique challenges (different response rates of patients in the same class). Results: We present a method that utilizes hidden Markov models (HMMs) for the classification task. We use HMMs with fewer states than time points, leading to an alignment of the different patient response rates. To focus on the differences between the two classes we develop a discriminative HMM classifier. Unlike the traditional generative HMM, the discriminative HMM can use examples from both classes when learning the model for a specific class. We have tested our method on both simulated and real time series expression data. As we show, our method improves upon prior methods and can suggest markers for specific disease and response stages that are not found when using traditional classifiers. Availability: Matlab implementation is available from http://www.cs.cmu.edu/~thlin/tram/ Contact: zivbj@cs.cmu.edu
|
32
|
Partial mixture model for tight clustering of gene expression time-course. BMC Bioinformatics 2008; 9:287. [PMID: 18564420 PMCID: PMC2492882 DOI: 10.1186/1471-2105-9-287] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 09/28/2007] [Accepted: 06/18/2008] [Indexed: 11/29/2022] Open
Abstract
Background Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored. Results In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms. Conclusion For the first time, the partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm proves the suitability of combining the partial mixture model with the minimum distance estimator in this field. We show that tight clustering is not only capable of generating a more profound understanding of the dataset under study, in accordance with established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidence that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion.
|
33
|
Semi-supervised learning for the identification of syn-expressed genes from fused microarray and in situ image data. BMC Bioinformatics 2008; 8 Suppl 10:S3. [PMID: 18269697 PMCID: PMC2230504 DOI: 10.1186/1471-2105-8-s10-s3] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Gene expression measurements during the development of the fly Drosophila melanogaster are routinely used to find functional modules of temporally co-expressed genes. Complementary large data sets of in situ RNA hybridization images for different stages of the fly embryo elucidate the spatial expression patterns. RESULTS Using a semi-supervised approach, constrained clustering with mixture models, we can find clusters of genes exhibiting spatio-temporal similarities in expression, or syn-expression. The temporal gene expression measurements are taken as primary data for which pairwise constraints are computed in an automated fashion from raw in situ images without the need for manual annotation. We investigate the influence of these pairwise constraints in the clustering and discuss the biological relevance of our results. CONCLUSION Spatial information contributes to a detailed, biologically meaningful analysis of temporal gene expression data. Semi-supervised learning provides a flexible, robust and efficient framework for integrating data sources of differing quality and abundance.
|
34
|
Novel algorithm for coexpression detection in time-varying microarray data sets. IEEE/ACM Trans Comput Biol Bioinform 2008; 5:120-135. [PMID: 18245881 DOI: 10.1109/tcbb.2007.1052] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 05/25/2023]
Abstract
When analyzing the results of microarray experiments, biologists generally use unsupervised categorization tools. However, such tools regard each time point as an independent dimension and utilize the Euclidean distance to compute the similarities between expressions. Furthermore, some of these methods require the number of clusters to be determined in advance, which is clearly impossible in the case of a new dataset. Therefore, this study proposes a novel scheme, designated as the Variation-based Coexpression Detection (VCD) algorithm, to analyze the trends of expressions based on their variation over time. The proposed algorithm has two advantages. First, it is unnecessary to determine the number of clusters in advance since the algorithm automatically detects those genes whose profiles are grouped together and creates patterns for these groups. Second, the algorithm features a new measurement criterion for calculating the degree of change of the expressions between adjacent time points and evaluating their trend similarities. Three real-world microarray datasets are employed to evaluate the performance of the proposed algorithm.
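The two ideas named in this abstract, scoring profiles by their change between adjacent time points and grouping without a preset number of clusters, can be illustrated as follows. This is not the VCD algorithm, whose measurement criterion is not given in the abstract: the cosine similarity of first differences, the greedy grouping rule and the `min_similarity` cutoff are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def adjacent_changes(profile):
    """Change in expression between each pair of adjacent time points."""
    return [later - earlier for earlier, later in zip(profile, profile[1:])]


def trend_similarity(p, q):
    """Cosine similarity between the adjacent-point changes of two profiles."""
    dp, dq = adjacent_changes(p), adjacent_changes(q)
    dot = sum(a * b for a, b in zip(dp, dq))
    norm = math.sqrt(sum(a * a for a in dp)) * math.sqrt(sum(b * b for b in dq))
    return dot / norm if norm else 0.0


def group_by_trend(profiles, min_similarity=0.9):
    """Greedy grouping on trend similarity.

    A profile joins the first group whose seed (first member) it resembles;
    otherwise it seeds a new group, so the number of groups is not fixed in advance.
    """
    groups = []  # each group is a list of profile indices, seed first
    for index, profile in enumerate(profiles):
        for group in groups:
            if trend_similarity(profiles[group[0]], profile) >= min_similarity:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups
```

For example, the profiles `[0, 1, 2, 1]` and `[5, 6, 7, 6]` are far apart in Euclidean distance but have identical adjacent changes, so they fall into the same group, whereas `[3, 2, 1, 2]` moves in the opposite direction and seeds its own group.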
|
35
|
Abstract
Monitoring the change in expression patterns over time provides the distinct possibility of unraveling the mechanistic drivers characterizing cellular responses. Gene arrays measuring the level of mRNA expression of thousands of genes simultaneously provide a method of high-throughput data collection necessary for obtaining the scope of data required for understanding the complexities of living organisms. Unraveling the coherent complex structures of transcriptional dynamics is the goal of a large family of computational methods aiming at upgrading the information content of time-course gene expression data. In this review, we summarize the qualitative characteristics of these approaches, discuss the main challenges that this type of complex data present, and, finally, explore the opportunities in the context of developing mechanistic models of cellular response.
|
36
|
Abstract
BACKGROUND The regulatory processes that govern cell proliferation and differentiation are central to developmental biology. Particularly well studied in this respect is the lymphoid system due to its importance for basic biology and for clinical applications. Gene expression measured in lymphoid cells in several distinguishable developmental stages helps in the elucidation of underlying molecular processes, which change gradually over time and lock cells in either the B cell, T cell or Natural Killer cell lineages. Large-scale analysis of these gene expression trees requires computational support for tasks ranging from visualization, querying, and finding clusters of similar genes, to answering detailed questions about the functional roles of individual genes. RESULTS We present the first statistical framework designed to analyze gene expression data as it is collected in the course of lymphoid development through clusters of co-expressed genes and additional heterogeneous data. We introduce dependence trees for continuous variates, which model the inherent dependencies during the differentiation process naturally as gene expression trees. Several trees are combined in a mixture model to allow inference of potentially overlapping clusters of co-expressed genes. Additionally, we predict microRNA targets. CONCLUSION Computational results for several data sets from the lymphoid system demonstrate the relevance of our framework. We recover well-known biological facts and identify promising novel regulatory elements of genes and their functional assignments. The implementation of our method (licensed under the GPL) is available at http://algorithmics.molgen.mpg.de/Supplements/ExpLym/.
|
37
|
Abstract
MOTIVATION Kinetochores are multiprotein complexes which mediate chromosome attachment to microtubules (MTs) of the mitotic spindle. They regulate MT dynamics during chromosome segregation. Our goal is to identify groups of kinetochore proteins with similar effects on MT dynamics, revealing pathways through which kinetochore proteins transform chemical and mechanical input signals into cues of MT regulation. RESULTS We have developed a hierarchical, agglomerative clustering algorithm that groups Saccharomyces cerevisiae strains based on MT-mediated chromosome dynamics measured by high-resolution live cell microscopy. Clustering is based on parameters of autoregressive moving average (ARMA) models of the probed dynamics. We have found that the regulation of wildtype MT dynamics varies with cell cycle and temperature, but not with the chromosome an MT is attached to. By clustering the dynamics of mutants, we discovered that the three genes IPL1, DAM1 and KIP3 co-regulate MT dynamics. Our study establishes the clustering of chromosome and MT dynamics by ARMA descriptors as a sensitive framework for the systematic identification of kinetochore protein subcomplexes and pathways for the regulation of MT dynamics. AVAILABILITY The clustering code, written in Matlab, can be downloaded from http://lccb.scripps.edu. ('download' hyperlink at bottom of website). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
38
|
Abstract
Background Human endogenous retroviruses (HERVs) are surviving traces of ancient retrovirus infections and now reside within the human DNA. Recently HERV expression has been detected in both normal tissues and diseased patients. However, the activities (expression levels) of individual HERV sequences are mostly unknown. Results We introduce a generative mixture model, based on Hidden Markov Models, for estimating the activities of the individual HERV sequences from EST (expressed sequence tag) databases. We use the model to estimate the relative activities of 181 HERVs. We also empirically justify a faster heuristic method for HERV activity estimation and use it to estimate the activities of 2450 HERVs. The majority of the HERV activities were previously unknown. Conclusion (i) Our methods estimate activity accurately based on experiments on simulated data. (ii) Our estimate on real data shows that 7% of the HERVs are active. The active ones are spread unevenly into HERV groups and relatively uniformly in terms of estimated age. HERVs with the retroviral env gene are more often active than HERVs without env. Few of the active HERVs have open reading frames for retroviral proteins.
|
39
|
Bioinformatics analysis of the early inflammatory response in a rat thermal injury model. BMC Bioinformatics 2007; 8:10. [PMID: 17214898 PMCID: PMC1797813 DOI: 10.1186/1471-2105-8-10] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Received: 07/08/2006] [Accepted: 01/10/2007] [Indexed: 12/25/2022] Open
Abstract
Background Thermal injury is among the most severe forms of trauma and its effects are both local and systemic. Response to thermal injury includes cellular protection mechanisms, inflammation, hypermetabolism, prolonged catabolism, organ dysfunction and immuno-suppression. It has been hypothesized that gene expression patterns in the liver will change with severe burns, thus reflecting the role the liver plays in the response to burn injury. Characterizing the molecular fingerprint (i.e., expression profile) of the inflammatory response resulting from burns may help elucidate the activated mechanisms and suggest new therapeutic interventions. In this paper we propose a novel integrated framework for analyzing time-series transcriptional data, with emphasis on the burn-induced response within the context of the rat animal model. Our analysis robustly identifies critical expression motifs indicative of the dynamic evolution of the inflammatory response, and we further propose a putative reconstruction of the associated transcription factor activities. Results Implementation of our algorithm on data obtained from an animal (rat) burn injury study identified 281 genes corresponding to 4 unique profiles. Enrichment evaluation upon both gene ontologies and transcription factors verifies the inflammation-specific character of the selections and the rationalization of the burn-induced inflammatory response. Conducting the transcription network reconstruction and analysis, we have identified transcription factors, including AHR, Octamer Binding Proteins, Kruppel-like Factors, and cell cycle regulators, as being highly important to an organism's response to burn injury. These transcription factors are notable due to their roles in pathways that play a part in the gross physiological response to burn, such as changes in the immune response and inflammation.
Conclusion Our results indicate that our novel selection/classification algorithm has been successful in selecting genes that play an important role in thermal injury. Additionally, we have demonstrated the value of an integrative approach in identifying possible points of intervention, namely the activation of certain transcription factors that govern the organism's response.
|
40
|
Exploiting binary abstractions in deciphering gene interactions. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2006; 2006:5858-5863. [PMID: 17947172 DOI: 10.1109/iembs.2006.260194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
Abstract
We consider computationally reconstructing gene regulatory networks on top of the binary abstraction of gene expression state information. Unlike previous Boolean network approaches, the proposed method does not handle noisy gene expression values directly. Instead, two-valued "hidden state" information is derived from gene expression profiles using a robust statistical technique, and a gene interaction network is inferred from this hidden state information. In particular, we exploit Espresso, a well-known 2-level Boolean logic optimizer, in order to determine the core network structure. The resulting gene interaction networks can be viewed as dynamic Bayesian networks, which have key advantages over more conventional Bayesian networks in terms of biological phenomena that can be represented. We tested the proposed method with a time-course gene expression data set from microarray experiments on the anti-cancer drugs doxorubicin and paclitaxel. A gene interaction network was produced by our method, and the identified genes were validated with a public annotation database. The experimental studies we conducted suggest that the proposed method, inspired by engineering systems, can be a very effective tool to decipher complex gene interactions in living systems.
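The binary-abstraction step can be sketched in Python under stated assumptions. The abstract does not name the statistical technique used to derive hidden states, so this sketch substitutes a simple midrange threshold. It then tabulates how regulator states at time t map to a target gene's state at t+1, which is the kind of truth table a two-level logic minimizer such as Espresso takes as input. The gene names and expression values are hypothetical, and Espresso itself is not called.

```python
def binarize(profile):
    """Map an expression profile to 0/1 states.

    Assumption: a midrange threshold stands in for the unspecified
    robust statistical technique mentioned in the abstract.
    """
    threshold = (min(profile) + max(profile)) / 2
    return [1 if value > threshold else 0 for value in profile]

def transition_table(states, regulators, target):
    """Collect observations: regulator states at t -> target state at t+1."""
    table = {}
    n_steps = len(states[target])
    for t in range(n_steps - 1):
        key = tuple(states[g][t] for g in regulators)
        table.setdefault(key, set()).add(states[target][t + 1])
    return table

# Hypothetical time-course expression values for three genes.
expression = {
    "gene_x": [0.1, 0.9, 0.8, 0.2, 0.1, 0.7],
    "gene_y": [0.2, 0.1, 0.9, 0.9, 0.1, 0.2],
    "gene_z": [0.1, 0.2, 0.9, 0.1, 0.2, 0.1],
}
states = {gene: binarize(profile) for gene, profile in expression.items()}
table = transition_table(states, ["gene_x", "gene_y"], "gene_z")

# Input combinations that always switch the target on (the "on-set"
# a logic minimizer would reduce to a compact Boolean rule).
on_set = sorted(key for key, outputs in table.items() if outputs == {1})
print(states)
print(on_set)
```

In this toy data the only on-set row is gene_x = 1, gene_y = 0, which corresponds to the rule "gene_x AND NOT gene_y". Rows whose output set contains both 0 and 1 would indicate observations that no deterministic Boolean rule over the chosen regulators can explain.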
|