1
|
Tripathi YM, Chatla SB, Chang YCI, Huang LS, Shieh GS. A nonlinear correlation measure with applications to gene expression data. PLoS One 2022; 17:e0270270. [PMID: 35727808 PMCID: PMC9212159 DOI: 10.1371/journal.pone.0270270] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2021] [Accepted: 06/07/2022] [Indexed: 11/19/2022]
Abstract
Nonlinear correlation exists in many types of biomedical data. Several types of pairwise gene expression in humans and other organisms show nonlinear correlation across time, e.g., genes involved in human T helper 17 (Th17) cell differentiation, which motivated this study. The proposed procedure, called Kernelized correlation (Kc), first transforms nonlinear data on the plane via a function (kernel, usually nonlinear) to a high-dimensional (Hilbert) space. Next, we plug the transformed data into a classical correlation coefficient, e.g., Pearson’s correlation coefficient (r), to yield a nonlinear correlation measure. The algorithm to compute Kc is developed and the R code is provided online. In three simulated nonlinear cases, when noise in the data is moderate, Kc with the RBF kernel (Kc-RBF) outperforms Pearson’s r and the well-known distance correlation (dCor). However, when noise in the data is low, Pearson’s r and dCor perform slightly better than Kc-RBF in Cases 1 and 3 and equivalently to it in Case 2; Kendall’s tau performs worse than the aforementioned measures in all cases. In Application 1, to discover genes involved in early Th17 cell differentiation, Kc is shown to detect the nonlinear correlations of four genes with IL17A (a known marker gene), while dCor detects nonlinear correlations of two pairs, and DESeq fails in all these pairs. Next, Kc outperforms Pearson’s r and dCor in estimating the nonlinear correlation of negatively correlated gene pairs in yeast cell cycle regulation. In conclusion, Kc is a simple and competent procedure to measure pairwise nonlinear correlations.
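The idea of mapping data through a kernel and then applying a correlation-type statistic can be illustrated with a small Python sketch. This is not the authors' Kc algorithm (their R code is the reference); it swaps in centered kernel alignment with an RBF kernel as a stand-in. The function names, the bandwidth `sigma=1.0`, and the toy data are assumptions made only for illustration.

```python
import numpy as np

def rbf_gram(v, sigma=1.0):
    """RBF (Gaussian) Gram matrix of a 1-D sample; sigma is an assumed bandwidth."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    sq_dist = (v - v.T) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def centered(K):
    """Double-center a Gram matrix (centering in the kernel feature space)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_alignment(x, y, sigma=1.0):
    """Centered kernel alignment: a correlation-like score in [0, 1].

    Stand-in for a kernelized correlation, NOT the published Kc statistic.
    """
    Kx = centered(rbf_gram(x, sigma))
    Ky = centered(rbf_gram(y, sigma))
    return float(np.sum(Kx * Ky) / np.sqrt(np.sum(Kx * Kx) * np.sum(Ky * Ky)))

# Toy example: y depends on x, but symmetrically, so Pearson's r is near zero.
x = np.linspace(-1.0, 1.0, 51)
y = x ** 2
pearson_r = float(np.corrcoef(x, y)[0, 1])
score = kernel_alignment(x, y)
```

On this toy parabola Pearson's r vanishes by symmetry while the kernel-based score stays positive, which is the kind of dependence the abstract describes as nonlinear correlation.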
Affiliation(s)
- Yogesh M. Tripathi
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
  - Indian Institute of Technology Patna, Bihta, India
- Suneel Babu Chatla
  - Institute of Statistics, National Tsing Hua University, Hsinchu, Taiwan
  - Department of Mathematical Sciences, University of Texas at El Paso, El Paso, Texas, United States of America
- Li-Shan Huang
  - Institute of Statistics, National Tsing Hua University, Hsinchu, Taiwan
  - * E-mail: (LSH); (GSS)
- Grace S. Shieh
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
  - Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
  - Genome and Systems Biology Degree Program, Academia Sinica, National Taiwan University, Taipei, Taiwan
  - Data Science Degree Program, Academia Sinica, National Taiwan University, Taipei, Taiwan
  - * E-mail: (LSH); (GSS)
|
2
|
Elhabashy H, Merino F, Alva V, Kohlbacher O, Lupas AN. Exploring protein-protein interactions at the proteome level. Structure 2022; 30:462-475. [DOI: 10.1016/j.str.2022.02.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/09/2021] [Revised: 10/26/2021] [Accepted: 02/02/2022] [Indexed: 02/08/2023]
|
3
|
Hu J, Qin H, Fan X. Can ODE gene regulatory models neglect time lag or measurement scaling? Bioinformatics 2020; 36:4058-4064. [PMID: 32324854 DOI: 10.1093/bioinformatics/btaa268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2019] [Revised: 04/14/2020] [Accepted: 04/16/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Many ordinary differential equation (ODE) models have been introduced to replace linear regression models for inferring gene regulatory relationships from time-course gene expression data. But, since the observed data are usually not direct measurements of the gene products or there is an unknown time lag in gene regulation, it is problematic to directly apply traditional ODE models or linear regression models. RESULTS We introduce a lagged ODE model to infer lagged gene regulatory relationships from time-course measurements, which are modeled as linear transformation of the gene products. A time-course microarray dataset from a yeast cell-cycle study is used for simulation assessment of the methods and real data analysis. The results show that our method, by considering both time lag and measurement scaling, performs much better than other linear and ODE models. It indicates the necessity of explicitly modeling the time lag and measurement scaling in ODE gene regulatory models. AVAILABILITY AND IMPLEMENTATION R code is available at https://www.sta.cuhk.edu.hk/xfan/share/lagODE.zip.
Affiliation(s)
- Jie Hu
  - Department of Probability and Statistics, School of Mathematical Science, Xiamen University, Xiamen, Fujian, China
- Huihui Qin
  - Department of Applied Mathematics, Hong Kong Polytechnic University, Hong Kong SAR, China
- Xiaodan Fan
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
|
4
|
Phoemphon S, So-In C, Niyato D(T). A hybrid model using fuzzy logic and an extreme learning machine with vector particle swarm optimization for wireless sensor network localization. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2018.01.004] [Citation(s) in RCA: 57] [Impact Index Per Article: 8.1] [Indexed: 11/24/2022]
|
5
|
Detection of Somatic Mutations in Exome Sequencing of Tumor-only Samples. Sci Rep 2017; 7:15959. [PMID: 29162841 PMCID: PMC5698426 DOI: 10.1038/s41598-017-14896-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 02/22/2017] [Accepted: 09/22/2017] [Indexed: 12/27/2022]
Abstract
Due to the lack of normal samples in clinical diagnosis and to reduce costs, detection of small-scale mutations from tumor-only samples is required but remains relatively unexplored. We developed an algorithm (GATKcan) augmenting GATK with two statistics and machine learning to detect mutations in cancer. The averaged performance of GATKcan in ten experiments outperformed GATK in detecting mutations in 231 tumors randomly sampled from 241 TCGA endometrial tumors (EC). In external validations, GATKcan outperformed GATK in TCGA breast cancer (BC), ovarian cancer (OC) and melanoma tumors, in terms of Matthews correlation coefficient (MCC) and precision, where MCC takes both sensitivity and specificity into account. Further, GATKcan reduced the high fractions of false positives detected by GATK. In mutation detection of somatic variants, classified commonly by VarScan 2 and MuTect from the called variants in BC, OC and melanoma, and ranked by adjusted MCC (adjusted precision), GATKcan ranked first, followed by MuTect, VarScan 2 and GATK. Importantly, GATKcan enables detection of mutations when alternate alleles exist in normal samples. These results suggest that GATKcan trained on one cancer is able to detect mutations in future patients with the same type of cancer and is likely applicable to other cancers with similar mutations.
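For readers unfamiliar with the two evaluation metrics named above, the following Python sketch computes MCC and precision from confusion-matrix counts using their standard definitions. The "adjusted" variants mentioned in the abstract are not defined there and are not reproduced; the function names and the example counts are purely illustrative assumptions.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC from confusion counts; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom

def precision(tp, fp):
    """Fraction of called variants that are true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical counts for a variant caller (not taken from the paper).
example_mcc = matthews_corrcoef(tp=90, tn=85, fp=15, fn=10)
example_precision = precision(tp=90, fp=15)
```

Because MCC uses all four cells of the confusion matrix, it penalizes a caller that achieves high sensitivity only by emitting many false positives, which precision alone would also flag but sensitivity would not.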
|
6
|
He Z, Zhang J, Yuan X, Liu Z, Liu B, Tuo S, Liu Y. Network based stratification of major cancers by integrating somatic mutation and gene expression data. PLoS One 2017; 12:e0177662. [PMID: 28520777 PMCID: PMC5433734 DOI: 10.1371/journal.pone.0177662] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 12/13/2016] [Accepted: 05/01/2017] [Indexed: 11/20/2022]
Abstract
The stratification of cancer into subtypes that are significantly associated with clinical outcomes is beneficial for targeted prognosis and treatment. In this study, we integrated somatic mutation and gene expression data to identify clusters of patients. In contrast to previous studies, we constructed cancer-type-specific significant co-expression networks (SCNs) rather than using a fixed gene network across all cancers, as in the network-based stratification (NBS) method, which ignores cancer heterogeneity. For each type of cancer, the gene expression data were used to construct the SCN, while the somatic mutation data were mapped onto the network, propagated, and used for further clustering. For the clustering, we adopted an improved network-regularized non-negative matrix factorization (netNMF), called netNMF_HC, for a more precise classification. We applied our method to various datasets, including ovarian cancer (OV), lung adenocarcinoma (LUAD) and uterine corpus endometrial carcinoma (UCEC) cohorts derived from The Cancer Genome Atlas (TCGA) project. Based on the results, we evaluated the performance of our method in identifying survival-relevant subtypes and further compared it to the NBS method, which adopts a priori networks and the netNMF algorithm. The proposed algorithm outperformed the NBS method in identifying informative cancer subtypes that were significantly associated with clinical outcomes in most cancer types we studied. In particular, our method identified survival-associated UCEC subtypes that were not identified by the NBS method. Our analysis indicated that valid patient subtyping could be achieved for individual cancers using mutation data with cancer-type-specific SCNs and netNMF_HC, because of specific cancer co-expression patterns and more precise clustering.
Affiliation(s)
- Zongzhen He
  - School of Computer Science and Technology, Xidian University, Xi’an, PR China
- Junying Zhang
  - School of Computer Science and Technology, Xidian University, Xi’an, PR China
  - * E-mail:
- Xiguo Yuan
  - School of Computer Science and Technology, Xidian University, Xi’an, PR China
- Zhaowen Liu
  - School of Computer Science and Technology, Xidian University, Xi’an, PR China
- Baobao Liu
  - School of Computer Science and Technology, Xidian University, Xi’an, PR China
- Shouheng Tuo
  - School of Computer Science and Technology, Xidian University, Xi’an, PR China
- Yajun Liu
  - School of Computer Science and Technology, Xidian University, Xi’an, PR China
|
7
|
Bao W, Greenwold MJ, Sawyer RH. Using scale and feather traits for module construction provides a functional approach to chicken epidermal development. Funct Integr Genomics 2017; 17:641-651. [PMID: 28477104 DOI: 10.1007/s10142-017-0561-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/26/2017] [Revised: 04/16/2017] [Accepted: 04/19/2017] [Indexed: 10/19/2022]
Abstract
Gene co-expression network analysis has been a research method widely used in systematically exploring gene function and interaction. Using the Weighted Gene Co-expression Network Analysis (WGCNA) approach to construct a gene co-expression network using data from a customized 44K microarray transcriptome of chicken epidermal embryogenesis, we have identified two distinct modules that are highly correlated with scale or feather development traits. Signaling pathways related to feather development were enriched in the traditional KEGG pathway analysis and functional terms relating specifically to embryonic epidermal development were also enriched in the Gene Ontology analysis. Significant enrichment annotations were discovered from customized enrichment tools such as Modular Single-Set Enrichment Test (MSET) and Medical Subject Headings (MeSH). Hub genes in both trait-correlated modules showed strong specific functional enrichment toward epidermal development. Also, regulatory elements, such as transcription factors and miRNAs, were targeted in the significant enrichment result. This work highlights the advantage of this methodology for functional prediction of genes not previously associated with scale- and feather trait-related modules.
Affiliation(s)
- Weier Bao
  - Department of Biological Sciences, University of South Carolina, Columbia, SC, 29208, USA
- Matthew J Greenwold
  - Department of Biological Sciences, University of South Carolina, Columbia, SC, 29208, USA
- Roger H Sawyer
  - Department of Biological Sciences, University of South Carolina, Columbia, SC, 29208, USA
|
8
|
Mabbott NA, Baillie JK, Brown H, Freeman TC, Hume DA. An expression atlas of human primary cells: inference of gene function from coexpression networks. BMC Genomics 2013; 14:632. [PMID: 24053356 PMCID: PMC3849585 DOI: 10.1186/1471-2164-14-632] [Citation(s) in RCA: 308] [Impact Index Per Article: 25.7] [Received: 03/19/2013] [Accepted: 06/25/2013] [Indexed: 01/17/2023]
Abstract
BACKGROUND The specialisation of mammalian cells in time and space requires genes associated with specific pathways and functions to be co-ordinately expressed. Here we have combined a large number of publicly available microarray datasets derived from human primary cells and analysed large correlation graphs of these data. RESULTS Using the network analysis tool BioLayout Express3D we identify robust co-associations of genes expressed in a wide variety of cell lineages. We discuss the biological significance of a number of these associations, in particular the coexpression of key transcription factors with the genes that they are likely to control. CONCLUSIONS We consider the regulation of genes in human primary cells and specifically in the human mononuclear phagocyte system. Of particular note is the fact that these data do not support the identity of putative markers of antigen-presenting dendritic cells, nor the classification of M1 and M2 activation states, a current subject of debate within the immunological field. We have provided this data resource on the BioGPS web site (http://biogps.org/dataset/2429/primary-cell-atlas/) and on macrophages.com (http://www.macrophages.com/hu-cell-atlas).
Affiliation(s)
- Neil A Mabbott
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Edinburgh EH25 9RG, UK
- J Kenneth Baillie
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Edinburgh EH25 9RG, UK
- Helen Brown
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Edinburgh EH25 9RG, UK
- Tom C Freeman
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Edinburgh EH25 9RG, UK
- David A Hume
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Edinburgh EH25 9RG, UK
|
9
|
Doig TN, Hume DA, Theocharidis T, Goodlad JR, Gregory CD, Freeman TC. Coexpression analysis of large cancer datasets provides insight into the cellular phenotypes of the tumour microenvironment. BMC Genomics 2013; 14:469. [PMID: 23845084 PMCID: PMC3721986 DOI: 10.1186/1471-2164-14-469] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 01/02/2013] [Accepted: 06/25/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND Biopsies taken from individual tumours exhibit extensive differences in their cellular composition due to the inherent heterogeneity of cancers and vagaries of sample collection. As a result genes expressed in specific cell types, or associated with certain biological processes are detected at widely variable levels across samples in transcriptomic analyses. This heterogeneity also means that the level of expression of genes expressed specifically in a given cell type or process, will vary in line with the number of those cells within samples or activity of the pathway, and will therefore be correlated in their expression. RESULTS Using a novel 3D network-based approach we have analysed six large human cancer microarray datasets derived from more than 1,000 individuals. Based upon this analysis, and without needing to isolate the individual cells, we have defined a broad spectrum of cell-type and pathway-specific gene signatures present in cancer expression data which were also found to be largely conserved in a number of independent datasets. CONCLUSIONS The conserved signature of the tumour-associated macrophage is shown to be largely-independent of tumour cell type. All stromal cell signatures have some degree of correlation with each other, since they must all be inversely correlated with the tumour component. However, viewed in the context of established tumours, the interactions between stromal components appear to be multifactorial given the level of one component e.g. vasculature, does not correlate tightly with another, such as the macrophage.
Affiliation(s)
- Tamasin N Doig
  - Centre for Inflammation Research, University of Edinburgh, The Queen’s Medical Research Institute, Edinburgh, UK
|
10
|
Kim Y, Han S, Choi S, Hwang D. Inference of dynamic networks using time-course data. Brief Bioinform 2013; 15:212-28. [PMID: 23698724 DOI: 10.1093/bib/bbt028] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Indexed: 11/12/2022]
Abstract
Cells execute their functions through dynamic operations of biological networks. Dynamic networks delineate the operation of biological networks in terms of temporal changes of abundances or activities of nodes (proteins and RNAs), as well as formation of new edges and disappearance of existing edges over time. Global genomic and proteomic technologies can be used to decode dynamic networks. However, using these experimental methods, it is still challenging to identify temporal transition of nodes and edges. Thus, several computational methods for estimating dynamic topological and functional characteristics of networks have been introduced. In this review, we summarize concepts and applications of these computational methods for inferring dynamic networks and further summarize methods for estimating spatial transition of biological networks.
Affiliation(s)
- Yongsoo Kim
  - POSTECH, Pohang, 790-784, Republic of Korea. Tel.: 82-54-279-2393; Fax: 82-54-279-8409
|
11
|
Jiang CR, Hung YC, Chen CM, Shieh GS. Inferring Genetic Interactions via a Data-Driven Second Order Model. Front Genet 2012; 3:71. [PMID: 22563331 PMCID: PMC3342528 DOI: 10.3389/fgene.2012.00071] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/17/2011] [Accepted: 04/12/2012] [Indexed: 11/13/2022]
Abstract
Genetic/transcriptional regulatory interactions are shown to predict partial components of signaling pathways, which have been recognized as vital to complex human diseases. Both activator (A) and repressor (R) are known to coregulate their common target gene (T). Xu et al. (2002) proposed to model this coregulation by a fixed second order response surface (called the RS algorithm), in which T is a function of A, R, and AR. Unfortunately, the RS algorithm did not result in a sufficient number of genetic interactions (GIs) when it was applied to a group of 51 yeast genes in a pilot study. Thus, we propose a data-driven second order model (DDSOM), an approximation to the non-linear transcriptional interactions, to infer genetic and transcriptional regulatory interactions. For each triplet of genes of interest (A, R, and T), we regress the expression of T at time t + 1 on the expression of A, R, and AR at time t. Next, these well-fitted regression models (viewed as points in R^3) are collected, and the center of these points is used to identify triples of genes having the A-R-T relationship or GIs. The DDSOM and RS algorithms are first compared on inferring transcriptional compensation interactions of a group of yeast genes in DNA synthesis and DNA repair using microarray gene expression data; the DDSOM algorithm results in a higher modified true positive rate (about 75%) than that of the RS algorithm, checked against quantitative RT-polymerase chain reaction results. These validated GIs are reported, among which some coincide with certain interactions in DNA repair and genome instability pathways in yeast. This suggests that the DDSOM algorithm has potential to predict pathway components. Further, both algorithms are applied to predict transcriptional regulatory interactions of 63 yeast genes. Checked against the known transcriptional regulatory interactions queried from TRANSFAC, the proposed algorithm also performs better than the RS algorithm.
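The regression step described above (T at time t + 1 regressed on A, R, and AR at time t) can be sketched with ordinary least squares in Python. This is only an illustration of that single step, not the DDSOM algorithm itself: the intercept term, the function name `fit_second_order`, and the noise-free synthetic series are assumptions, and the later step of collecting fitted models and locating their center is not shown.

```python
import numpy as np

def fit_second_order(a, r, t):
    """Least-squares fit of T(t+1) = b0 + b1*A(t) + b2*R(t) + b3*A(t)*R(t).

    a, r, t are equal-length expression time series; the intercept b0 is an
    assumption of this sketch.
    """
    a, r, t = (np.asarray(v, dtype=float) for v in (a, r, t))
    a_t, r_t = a[:-1], r[:-1]   # predictors at time t
    target = t[1:]              # response at time t + 1
    X = np.column_stack([np.ones_like(a_t), a_t, r_t, a_t * r_t])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef                 # [b0, b1, b2, b3]

# Synthetic, noise-free series with known coefficients (illustrative only).
rng = np.random.default_rng(0)
n = 20
a = rng.uniform(0.0, 2.0, n)
r = rng.uniform(0.0, 2.0, n)
t = np.empty(n)
t[0] = 0.0
t[1:] = 0.5 + 1.0 * a[:-1] - 0.8 * r[:-1] + 0.3 * a[:-1] * r[:-1]

coef = fit_second_order(a, r, t)
```

In this toy setup the positive coefficient on A and the negative coefficient on R mirror the activator and repressor roles; the three slope coefficients are what the abstract refers to as a point in R^3.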
Affiliation(s)
- Ci-Ren Jiang
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
|
12
|
|
13
|
Langfelder P, Luo R, Oldham MC, Horvath S. Is my network module preserved and reproducible? PLoS Comput Biol 2011; 7:e1001057. [PMID: 21283776 PMCID: PMC3024255 DOI: 10.1371/journal.pcbi.1001057] [Citation(s) in RCA: 685] [Impact Index Per Article: 48.9] [Received: 12/18/2009] [Accepted: 12/13/2010] [Indexed: 01/23/2023]
Abstract
In many applications, one is interested in determining which of the properties of a network module change across conditions. For example, to validate the existence of a module, it is desirable to show that it is reproducible (or preserved) in an independent test network. Here we study several types of network preservation statistics that do not require a module assignment in the test network. We distinguish network preservation statistics by the type of the underlying network. Some preservation statistics are defined for a general network (defined by an adjacency matrix) while others are only defined for a correlation network (constructed on the basis of pairwise correlations between numeric variables). Our applications show that the correlation structure facilitates the definition of particularly powerful module preservation statistics. We illustrate that evaluating module preservation is in general different from evaluating cluster preservation. We find that it is advantageous to aggregate multiple preservation statistics into summary preservation statistics. We illustrate the use of these methods in six gene co-expression network applications including 1) preservation of cholesterol biosynthesis pathway in mouse tissues, 2) comparison of human and chimpanzee brain networks, 3) preservation of selected KEGG pathways between human and chimpanzee brain networks, 4) sex differences in human cortical networks, 5) sex differences in mouse liver networks. While we find no evidence for sex specific modules in human cortical networks, we find that several human cortical modules are less preserved in chimpanzees. In particular, apoptosis genes are differentially co-expressed between humans and chimpanzees. Our simulation studies and applications show that module preservation statistics are useful for studying differences between the modular structure of networks. 
Data, R software and accompanying tutorials can be downloaded from the following webpage: http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/ModulePreservation. In network applications, one is often interested in studying whether modules are preserved across multiple networks. For example, to determine whether a pathway of genes is perturbed in a certain condition, one can study whether its connectivity pattern is no longer preserved. Non-preserved modules can either be biologically uninteresting (e.g., reflecting data outliers) or interesting (e.g., reflecting sex specific modules). An intuitive approach for studying module preservation is to cross-tabulate module membership. But this approach often cannot address questions about the preservation of connectivity patterns between nodes. Thus, cross-tabulation based approaches often fail to recognize that important aspects of a network module are preserved. Cross-tabulation methods make it difficult to argue that a module is not preserved. The weak statement (“the reference module does not overlap with any of the identified test set modules”) is less relevant in practice than the strong statement (“the module cannot be found in the test network irrespective of the parameter settings of the module detection procedure”). Module preservation statistics have important applications, e.g. we show that the wiring of apoptosis genes in a human cortical network differs from that in chimpanzees.
Affiliation(s)
- Peter Langfelder
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
- Rui Luo
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
- Michael C. Oldham
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
- Steve Horvath
  - Departments of Human Genetics and Biostatistics, University of California, Los Angeles, Los Angeles, California, United States of America
  - * E-mail:
|
14
|
Abstract
Over the past 20 years, Omics technologies emerged as the consensual denomination of holistic molecular profiling. These techniques enable parallel measurements of biological -omes, or "all constituents considered collectively", and utilize the latest advancements in transcriptomics, proteomics, metabolomics, imaging, and bioinformatics. The technological accomplishments in increasing the sensitivity and throughput of the analytical devices, the standardization of the protocols and the widespread availability of reagents made the capturing of static molecular portraits of biological systems a routine task. The next generation of time course molecular profiling already allows for extensive molecular snapshots to be taken along the trajectory of time evolution of the investigated biological systems. Such datasets provide the basis for application of the inverse scientific approach. It consists in the inference of scientific hypotheses and theories about the structure and dynamics of the investigated biological system without any a priori knowledge, solely relying on data analysis to unveil the underlying patterns. However, most temporal Omics data still contain a limited number of time points, taken over arbitrary time intervals, through measurements on biological processes shifted in time. The analysis of the resulting short and noisy time series data sets is a challenge. Traditional statistical methods for the study of static Omics datasets are of limited relevance and new methods are required. This chapter discusses such algorithms which enable the application of the inverse analysis approach to short Omics time series.
|
15
|
Zoppoli P, Morganella S, Ceccarelli M. TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics 2010; 11:154. [PMID: 20338053 PMCID: PMC2862045 DOI: 10.1186/1471-2105-11-154] [Citation(s) in RCA: 139] [Impact Index Per Article: 9.3] [Received: 07/27/2009] [Accepted: 03/25/2010] [Indexed: 11/20/2022]
Abstract
BACKGROUND One of the main aims of molecular biology is to gain knowledge about how molecular components interact with each other and to understand the regulation of gene function. Using microarray technology, it is possible to measure thousands of genes in a single analysis step, obtaining a picture of the cell’s gene expression. Several methods have been developed to infer gene networks from steady-state data, but much less literature exists on time-course data, so the development of algorithms to infer gene networks from time-series measurements is a current challenge in bioinformatics research. In order to detect dependencies between genes at different time delays, we propose an approach to infer gene regulatory networks from time-series measurements starting from a well-known algorithm based on information theory. RESULTS In this paper we show how the ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) algorithm can be used for gene regulatory network inference in the case of time-course expression profiles. The resulting method is called TimeDelay-ARACNE. It extracts dependencies between two genes at different time delays, providing a measure of these dependencies in terms of mutual information. The basic idea of the proposed algorithm is to detect time-delayed dependencies between the expression profiles by assuming a stationary Markov Random Field as the underlying probabilistic model. Less informative dependencies are filtered out using an automatically calculated threshold, retaining the most reliable connections. TimeDelay-ARACNE can infer small local networks of time-regulated gene-gene interactions, detecting their direction and discovering cyclic interactions even when only a medium-small number of measurements are available. We test the algorithm both on synthetic networks and on microarray expression profiles. Microarray measurements concern the S. cerevisiae cell cycle, E. coli SOS pathways and a recently developed network for in vivo assessment of reverse engineering algorithms. Our results are compared with ARACNE itself and with those of two previously published algorithms, Dynamic Bayesian Networks and systems of ODEs, showing that TimeDelay-ARACNE has good accuracy, recall and F-score for the network reconstruction task. CONCLUSIONS Here we report the adaptation of the ARACNE algorithm to infer gene regulatory networks from time-course data, so that the resulting network is represented as a directed graph. The proposed algorithm is expected to be useful in the reconstruction of small biological directed networks from time-course data.
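The core idea of scoring a gene pair by mutual information at several time delays can be sketched as follows in Python. This is a simplified illustration, not the TimeDelay-ARACNE implementation: it uses a plain histogram estimator of mutual information, omits the thresholding and pruning steps, and the function names, bin count, and synthetic series are assumptions.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based mutual information estimate in nats (assumed estimator)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def delayed_mi(x, y, max_delay):
    """Mutual information between x(t) and y(t + d) for d = 0..max_delay."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return {d: mutual_information(x[: len(x) - d], y[d:])
            for d in range(max_delay + 1)}

# Synthetic pair: y repeats x two steps later (wrap-around at the start).
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.roll(x, 2)

scores = delayed_mi(x, y, max_delay=4)
best_delay = max(scores, key=scores.get)
```

Because the dependence is scored at a positive delay from x to y, the delay with the highest score also suggests a direction for the edge, which is how a time-delayed measure can yield a directed graph.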
Affiliation(s)
- Pietro Zoppoli
  - Department of Biological and Environmental Studies, University of Sannio, Benevento, I-82100, Italy
  - Biogem s c a r l, Institute for Genetic Research "Gaetano Salvatore", Ariano Irpino (Avellino), I-83031, Italy
- Sandro Morganella
  - Department of Biological and Environmental Studies, University of Sannio, Benevento, I-82100, Italy
  - Biogem s c a r l, Institute for Genetic Research "Gaetano Salvatore", Ariano Irpino (Avellino), I-83031, Italy
- Michele Ceccarelli
  - Department of Biological and Environmental Studies, University of Sannio, Benevento, I-82100, Italy
  - Biogem s c a r l, Institute for Genetic Research "Gaetano Salvatore", Ariano Irpino (Avellino), I-83031, Italy
|
16
|
Chen CM, Lee C, Chuang CL, Wang CC, Shieh GS. Inferring genetic interactions via a nonlinear model and an optimization algorithm. BMC Syst Biol 2010; 4:16. [PMID: 20184777 PMCID: PMC2848194 DOI: 10.1186/1752-0509-4-16] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 07/01/2009] [Accepted: 02/26/2010] [Indexed: 11/10/2022]
Abstract
Background Biochemical pathways are gradually becoming recognized as central to complex human diseases and recently genetic/transcriptional interactions have been shown to be able to predict partial pathways. With the abundant information made available by microarray gene expression data (MGED), nonlinear modeling of these interactions is now feasible. Two of the latest advances in nonlinear modeling used sigmoid models to depict transcriptional interaction of a transcription factor (TF) for a target gene, but do not model cooperative or competitive interactions of several TFs for a target. Results An S-shape model and an optimization algorithm (GASA) were developed to infer genetic interactions/transcriptional regulation of several genes simultaneously using MGED. GASA consists of a genetic algorithm (GA) and a simulated annealing (SA) algorithm, which is enhanced by a steepest gradient descent algorithm to avoid being trapped in local minimum. Using simulated data with various degrees of noise, we studied how GASA with two model selection criteria and two search spaces performed. Furthermore, GASA was shown to outperform network component analysis, the time series network inference algorithm (TSNI), GA with regular GA (GAGA) and GA with regular SA. Two applications are demonstrated. First, GASA is applied to infer a subnetwork of human T-cell apoptosis. Several of the predicted interactions are supported by the literature. Second, GASA was applied to infer the transcriptional factors of 34 cell cycle regulated targets in S. cerevisiae, and GASA performed better than one of the latest advances in nonlinear modeling, GAGA and TSNI. Moreover, GASA is able to predict multiple transcription factors for certain targets, and these results coincide with experiments confirmed data in YEASTRACT. Conclusions GASA is shown to infer both genetic interactions and transcriptional regulatory interactions well. 
In particular, GASA seems able to characterize the nonlinear mechanism of transcriptional regulatory interactions (TIs) in yeast, and may be applied to infer TIs in other organisms. The predicted genetic interactions of a subnetwork of human T-cell apoptosis coincide with existing partial pathways, suggesting the potential of GASA on inferring biochemical pathways.
Affiliation(s)
- Chung-Ming Chen
- Institute of Statistical Science, Academia Sinica, No 128, Sec 2, Academia Road, Taipei 115, Taiwan
|
17
|
Chuang CL, Wu JH, Cheng CS, Shieh GS. WebPARE: web-computing for inferring genetic or transcriptional interactions. Bioinformatics 2010; 26:582-4. [PMID: 20007742 PMCID: PMC2820674 DOI: 10.1093/bioinformatics/btp684] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022] Open
Abstract
Summary: Inferring genetic or transcriptional interactions, when done successfully, may provide insights into biological processes or biochemical pathways of interest. Unfortunately, most computational algorithms require a certain level of programming expertise. To provide a simple web interface for users to infer interactions from time course gene expression data, we present WebPARE, which is based on the pattern recognition algorithm (PARE). For expression data, in which each type of interaction (e.g. activator target) and the corresponding paired gene expression pattern are significantly associated, PARE uses a non-linear score to classify gene pairs of interest into a few subclasses of various time lags. In each subclass, PARE learns the parameters in the decision score using known interactions from biological experiments or published literature. Subsequently, the trained algorithm predicts interactions of a similar nature. Previously, PARE was shown to infer two sets of interactions in yeast successfully. Moreover, several predicted genetic interactions coincided with existing pathways; this indicates the potential of PARE in predicting partial pathway components. Given a list of gene pairs or genes of interest and expression data, WebPARE invokes PARE and outputs predicted interactions and their networks in directed graphs. Availability: A web-computing service WebPARE is publicly available at: http://www.stat.sinica.edu.tw/WebPARE Contact: gshieh@stat.sinica.edu.tw Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Cheng-Long Chuang
- Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan and Institute of Biomedical Engineering, National Taiwan University, Taipei 106, Taiwan
|
18
|
Morganella S, Zoppoli P, Ceccarelli M. IRIS: a method for reverse engineering of regulatory relations in gene networks. BMC Bioinformatics 2009; 10:444. [PMID: 20030818 PMCID: PMC2813854 DOI: 10.1186/1471-2105-10-444] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 12/22/2008] [Accepted: 12/23/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The ultimate aim of systems biology is to understand and describe how molecular components interact to manifest collective behaviour that is the sum of the single parts. Building a network of molecular interactions is the basic step in modelling a complex entity such as the cell. Even if gene-gene interactions only partially describe real networks because of post-transcriptional modifications and protein regulation, using microarray technology it is possible to combine measurements for thousands of genes into a single analysis step that provides a picture of the cell's gene expression. Several databases provide information about known molecular interactions and various methods have been developed to infer gene networks from expression data. However, network topology alone is not enough to perform simulations and predictions of how a molecular system will respond to perturbations. Rules for interactions among the single parts are needed for a complete definition of the network behaviour. Another interesting question is how to integrate information carried by the network topology, which can be derived from the literature, with large-scale experimental data. RESULTS Here we propose an algorithm, called inference of regulatory interaction schema (IRIS), that uses an iterative approach to map gene expression profile values (both steady-state and time-course) into discrete states and a simple probabilistic method to infer the regulatory functions of the network. These interaction rules are integrated into a factor graph model. We test IRIS on two synthetic networks to determine its accuracy and compare it to other methods. We also apply IRIS to gene expression microarray data for the Saccharomyces cerevisiae cell cycle and for human B-cells and compare the results to literature findings. CONCLUSIONS IRIS is a rapid and efficient tool for the inference of regulatory relations in gene networks. 
A topological description of the network and a matrix of gene expression profiles are required as input to the algorithm. IRIS maps gene expression data onto discrete values and then computes regulatory functions as conditional probability tables. The suitability of the method is demonstrated for synthetic data and microarray data. The resulting network can also be embedded in a factor graph model.
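The two steps summarized above, mapping expression values onto discrete states and estimating a regulatory function as a conditional probability table, can be sketched as follows. This is only an illustration under stated assumptions: it replaces the iterative discretization of IRIS with a plain median threshold, and the names `discretize` and `conditional_probability_table` and the toy profiles are hypothetical.

```python
from collections import Counter, defaultdict
import numpy as np

def discretize(profile):
    """Map an expression profile to 0/1 states by thresholding at its median.

    Assumption: a median split stands in for the iterative discretization
    described in the abstract.
    """
    profile = np.asarray(profile, dtype=float)
    return (profile > np.median(profile)).astype(int).tolist()

def conditional_probability_table(regulators, target):
    """Estimate P(target state | joint regulator state) from discrete samples."""
    counts = defaultdict(Counter)
    for parent_state, child_state in zip(zip(*regulators), target):
        counts[parent_state][child_state] += 1
    table = {}
    for parent_state, child_counts in counts.items():
        total = sum(child_counts.values())
        table[parent_state] = {s: child_counts[s] / total for s in (0, 1)}
    return table

# Toy profiles (hypothetical data): the target follows the activator.
activator = discretize([0.1, 0.9, 0.2, 0.8, 0.15, 0.95])
repressor = discretize([0.7, 0.1, 0.9, 0.2, 0.8, 0.3])
target = discretize([0.2, 1.0, 0.1, 0.9, 0.3, 1.1])
cpt = conditional_probability_table([activator, repressor], target)
```

Each key of `cpt` is a tuple of regulator states taken from the given network topology, and each value holds the estimated probabilities of the target being off (0) or on (1). Regulator states never observed in the data are simply absent from the table.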
Affiliation(s)
- Sandro Morganella
- Department of Biological and Environmental Sciences, University of Sannio, Benevento, Italy.
|
19
|
Chuang CL, Hung K, Chen CM, Shieh GS. Uncovering transcriptional interactions via an adaptive fuzzy logic approach. BMC Bioinformatics 2009; 10:400. [PMID: 19961622 PMCID: PMC2797023 DOI: 10.1186/1471-2105-10-400] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 07/18/2009] [Accepted: 12/06/2009] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND To date, only a limited number of transcriptional regulatory interactions have been uncovered. In a pilot study integrating sequence data with microarray data, a position weight matrix (PWM) performed poorly in inferring transcriptional interactions (TIs), which represent physical interactions between transcription factors (TF) and upstream sequences of target genes. Inferring a TI means that the promoter sequence of a target is inferred to match the consensus sequence motifs of a potential TF, and their interaction type such as AT or RT is also predicted. Thus, a robust PWM (rPWM) was developed to search for consensus sequence motifs. In addition to rPWM, one feature extracted from ChIP-chip data was incorporated to identify potential TIs under specific conditions. An interaction type classifier was assembled to predict activation/repression of potential TIs using microarray data. This approach, combining an adaptive (learning) fuzzy inference system and an interaction type classifier to predict transcriptional regulatory networks, was named AdaFuzzy. RESULTS AdaFuzzy was applied to predict TIs using real genomics data from Saccharomyces cerevisiae. Following one of the latest advances in predicting TIs, constrained probabilistic sparse matrix factorization (cPSMF), and using 19 transcription factors (TFs), we compared AdaFuzzy to four well-known approaches using over-representation analysis and gene set enrichment analysis. AdaFuzzy outperformed these four algorithms. Furthermore, AdaFuzzy was shown to perform comparably to 'ChIP-experimental method' in inferring TIs identified by two sets of large scale ChIP-chip data, respectively. AdaFuzzy was also able to classify all predicted TIs into one or more of the four promoter architectures. The results coincided with known promoter architectures in yeast and provided insights into transcriptional regulatory mechanisms. 
CONCLUSION AdaFuzzy successfully integrates multiple types of data (sequence, ChIP, and microarray) to predict transcriptional regulatory networks. The validated success in the prediction results implies that AdaFuzzy can be applied to uncover TIs in yeast.
Affiliation(s)
- Cheng-Long Chuang
- Institute of Biomedical Engineering, National Taiwan University, Taipei, Taiwan
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- Kenneth Hung
- Institute of Biomedical Engineering, National Taiwan University, Taipei, Taiwan
- Chung-Ming Chen
- Institute of Biomedical Engineering, National Taiwan University, Taipei, Taiwan
- Grace S Shieh
- Institute of Biomedical Engineering, National Taiwan University, Taipei, Taiwan
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
|
20
|
Zeng T, Li J. Maximization of negative correlations in time-course gene expression data for enhancing understanding of molecular pathways. Nucleic Acids Res 2009; 38:e1. [PMID: 19854949 PMCID: PMC2800212 DOI: 10.1093/nar/gkp822] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 01/13/2023] Open
Abstract
Positive correlation can be diversely instantiated as shifting, scaling or geometric pattern, and it has been extensively explored for time-course gene expression data and pathway analysis. Recently, biological studies emerge a trend focusing on the notion of negative correlations such as opposite expression patterns, complementary patterns and self-negative regulation of transcription factors (TFs). These biological ideas and primitive observations motivate us to formulate and investigate the problem of maximizing negative correlations. The objective is to discover all maximal negative correlations of statistical and biological significance from time-course gene expression data for enhancing our understanding of molecular pathways. Given a gene expression matrix, a maximal negative correlation is defined as an activation–inhibition two-way expression pattern (AIE pattern). We propose a parameter-free algorithm to enumerate the complete set of AIE patterns from a data set. This algorithm can identify significant negative correlations that cannot be identified by the traditional clustering/biclustering methods. To demonstrate the biological usefulness of AIE patterns in the analysis of molecular pathways, we conducted deep case studies for AIE patterns identified from Yeast cell cycle data sets. In particular, in the analysis of the Lysine biosynthesis pathway, new regulation modules and pathway components were inferred according to a significant negative correlation which is likely caused by a co-regulation of the TFs at the higher layer of the biological network. We conjecture that maximal negative correlations between genes are actually a common characteristic in molecular pathways, which can provide insights into the cell stress response study, drug response evaluation, etc.
Affiliation(s)
- Tao Zeng
- School of Computer Engineering & Bioinformatics Research Center, Nanyang Technological University, Singapore
|
21
|
Chuang CL, Chen CM, Wong WS, Tsai KN, Chan EC, Jiang JA. A robust correlation estimator and nonlinear recurrent model to infer genetic interactions in Saccharomyces cerevisiae and pathways of pulmonary disease in Homo sapiens. Biosystems 2009; 98:160-75. [PMID: 19527770 DOI: 10.1016/j.biosystems.2009.05.013] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/30/2008] [Revised: 05/08/2009] [Accepted: 05/28/2009] [Indexed: 12/13/2022]
Abstract
In order to identify genes involved in complex diseases, it is crucial to study the genetic interactions at the systems biology level. By utilizing modern high throughput microarray technology, it has become feasible to obtain gene expressions data and turn it into knowledge that explains the regulatory behavior of genes. In this study, an unsupervised nonlinear model was proposed to infer gene regulatory networks on a genome-wide scale. The proposed model consists of two components, a robust correlation estimator and a nonlinear recurrent model. The robust correlation estimator was used to initialize the parameters of the nonlinear recurrent curve-fitting model. Then the initialized model was used to fit the microarray data. The model was used to simulate the underlying nonlinear regulatory mechanisms in biological organisms. The proposed algorithm was applied to infer the regulatory mechanisms of the general network in Saccharomyces cerevisiae and the pulmonary disease pathways in Homo sapiens. The proposed algorithm requires no prior biological knowledge to predict linkages between genes. The prediction results were checked against true positive links obtained from the YEASTRACT database, the TRANSFAC database, and the KEGG database. By checking the results with known interactions, we showed that the proposed algorithm could determine some meaningful pathways, many of which are supported by the existing literature.
Affiliation(s)
- Cheng-Long Chuang
- Institute of Biomedical Engineering, National Taiwan University, Taiwan
|
22
|
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008; 9:559. [PMID: 19114008 PMCID: PMC2631488 DOI: 10.1186/1471-2105-9-559] [Citation(s) in RCA: 15862] [Impact Index Per Article: 933.1] [Received: 07/24/2008] [Accepted: 12/29/2008] [Indexed: 12/11/2022] Open
Abstract
Background Correlation networks are increasingly being used in bioinformatics applications. For example, weighted gene co-expression network analysis is a systems biology method for describing the correlation patterns among genes across microarray samples. Weighted correlation network analysis (WGCNA) can be used for finding clusters (modules) of highly correlated genes, for summarizing such clusters using the module eigengene or an intramodular hub gene, for relating modules to one another and to external sample traits (using eigengene network methodology), and for calculating module membership measures. Correlation networks facilitate network based gene screening methods that can be used to identify candidate biomarkers or therapeutic targets. These methods have been successfully applied in various biological contexts, e.g. cancer, mouse genetics, yeast genetics, and analysis of brain imaging data. While parts of the correlation network methodology have been described in separate publications, there is a need to provide a user-friendly, comprehensive, and consistent software implementation and an accompanying tutorial. Results The WGCNA R software package is a comprehensive collection of R functions for performing various aspects of weighted correlation network analysis. The package includes functions for network construction, module detection, gene selection, calculations of topological properties, data simulation, visualization, and interfacing with external software. Along with the R package we also present R software tutorials. While the methods development was motivated by gene expression data, the underlying data mining approach can be applied to a variety of different settings. Conclusion The WGCNA package provides R functions for weighted correlation network analysis, e.g. co-expression network analysis of gene expression data. The R package along with its source code and additional material are freely available at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA.
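Two notions named in this abstract, a weighted correlation network and a module eigengene, can be illustrated with a short NumPy sketch. It does not use or reproduce the WGCNA R package: the function names `soft_threshold_adjacency` and `module_eigengene` are hypothetical, the exponent `beta=6` is an illustrative choice, and the sketch assumes the common definitions of an unsigned adjacency (absolute correlation raised to a power) and of the eigengene (first principal component of the standardized module expression).

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned weighted adjacency |cor(gene_i, gene_j)|**beta.

    expr has shape (samples, genes); beta is an illustrative exponent.
    """
    corr = np.corrcoef(expr, rowvar=False)
    adjacency = np.abs(corr) ** beta
    np.fill_diagonal(adjacency, 0.0)      # ignore self-connections
    return adjacency

def module_eigengene(expr_module):
    """First principal component (per-sample scores) of a module's genes."""
    z = (expr_module - expr_module.mean(axis=0)) / expr_module.std(axis=0)
    u, _, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, 0]

# Toy data (hypothetical): four genes share one driver, a fifth is noise.
rng = np.random.default_rng(1)
driver = rng.normal(size=30)
module = np.column_stack([driver + 0.1 * rng.normal(size=30) for _ in range(4)])
noise_gene = rng.normal(size=(30, 1))
expr = np.hstack([module, noise_gene])

adjacency = soft_threshold_adjacency(expr)
eigengene = module_eigengene(module)
```

Raising the correlation to a power keeps strong co-expression links while shrinking weak ones toward zero, and the eigengene gives a single profile that summarizes the module. The sign of a principal component is arbitrary, so the eigengene may be anti-correlated with the driver in this toy example.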
Affiliation(s)
- Peter Langfelder
- Department of Human Genetics and Department of Biostatistics, University of California, Los Angeles, CA 90095, USA.
|
23
|
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008. [PMID: 19114008 DOI: 10.1186/1471-2105-9-559] [Citation(s) in RCA: 136] [Impact Index Per Article: 8.0] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Correlation networks are increasingly being used in bioinformatics applications. For example, weighted gene co-expression network analysis is a systems biology method for describing the correlation patterns among genes across microarray samples. Weighted correlation network analysis (WGCNA) can be used for finding clusters (modules) of highly correlated genes, for summarizing such clusters using the module eigengene or an intramodular hub gene, for relating modules to one another and to external sample traits (using eigengene network methodology), and for calculating module membership measures. Correlation networks facilitate network based gene screening methods that can be used to identify candidate biomarkers or therapeutic targets. These methods have been successfully applied in various biological contexts, e.g. cancer, mouse genetics, yeast genetics, and analysis of brain imaging data. While parts of the correlation network methodology have been described in separate publications, there is a need to provide a user-friendly, comprehensive, and consistent software implementation and an accompanying tutorial. RESULTS The WGCNA R software package is a comprehensive collection of R functions for performing various aspects of weighted correlation network analysis. The package includes functions for network construction, module detection, gene selection, calculations of topological properties, data simulation, visualization, and interfacing with external software. Along with the R package we also present R software tutorials. While the methods development was motivated by gene expression data, the underlying data mining approach can be applied to a variety of different settings. CONCLUSION The WGCNA package provides R functions for weighted correlation network analysis, e.g. co-expression network analysis of gene expression data. 
The R package along with its source code and additional material are freely available at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA.
Affiliation(s)
- Peter Langfelder
- Department of Human Genetics and Department of Biostatistics, University of California, Los Angeles, CA 90095, USA.
|