1
|
Lastra-Díaz JJ, Lara-Clares A, Garcia-Serrano A. HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey. BMC Bioinformatics 2022; 23:23. [PMID: 34991460 PMCID: PMC8734250 DOI: 10.1186/s12859-021-04539-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2020] [Accepted: 12/15/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Ontology-based semantic similarity measures based on SNOMED-CT, MeSH, and Gene Ontology are being extensively used in many applications in biomedical text mining and genomics respectively, which has encouraged the development of semantic measures libraries based on the aforementioned ontologies. However, current state-of-the-art semantic measures libraries have some performance and scalability drawbacks derived from their ontology representations based on relational databases, or naive in-memory graph representations. Likewise, a recent reproducible survey on word similarity shows that one hybrid IC-based measure which integrates a shortest-path computation sets the state of the art in the family of ontology-based semantic measures. However, the lack of an efficient shortest-path algorithm for their real-time computation prevents both their practical use in any application and the use of any other path-based semantic similarity measure. RESULTS To bridge the two aforementioned gaps, this work introduces for the first time an updated version of the HESML Java software library especially designed for the biomedical domain, which implements the most efficient and scalable ontology representation reported in the literature, together with a new method for the approximation of the Dijkstra's algorithm for taxonomies, called Ancestors-based Shortest-Path Length (AncSPL), which allows the real-time computation of any path-based semantic similarity measure. CONCLUSIONS We introduce a set of reproducible benchmarks showing that HESML outperforms by several orders of magnitude the current state-of-the-art libraries in the three aforementioned biomedical ontologies, as well as the real-time performance and approximation quality of the new AncSPL shortest-path algorithm. Likewise, we show that AncSPL linearly scales regarding the dimension of the common ancestor subgraph regardless of the ontology size. Path-based measures based on the new AncSPL algorithm are up to six orders of magnitude faster than their exact implementation in large ontologies like SNOMED-CT and GO. Finally, we provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
Collapse
Affiliation(s)
- Juan J. Lastra-Díaz
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
| | - Alicia Lara-Clares
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
| | - Ana Garcia-Serrano
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
| |
Collapse
|
2
|
Acharya S, Saha S, Pradhan P. Multi-Factored Gene-Gene Proximity Measures Exploiting Biological Knowledge Extracted from Gene Ontology: Application in Gene Clustering. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:207-219. [PMID: 29994130 DOI: 10.1109/tcbb.2018.2849362] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
To describe the cellular functions of proteins and genes, a potential dynamic vocabulary is Gene Ontology (GO), which comprises of three sub-ontologies namely, Biological-process, Cellular-component, and Molecular-function. It has several applications in the field of bioinformatics like annotating/measuring gene-gene or protein-protein semantic similarity, identifying genes/proteins by their GO annotations for disease gene and target discovery, etc. To determine semantic similarity between genes, several semantic measures have been proposed in literature, which involve information content of GO-terms, GO tree structure, or the combination of both. But, most of the existing semantic similarity measures do not consider different topological and information theoretic aspects of GO-terms collectively. Inspired by this fact, in this article, we have first proposed three novel semantic similarity/distance measures for genes covering different aspects of GO-tree. These are further implanted in the frameworks of well-known multi-objective and single-objective based clustering algorithms to determine functionally similar genes. For comparative analysis, 10 popular existing GO based semantic similarity/distance measures and tools are also considered. Experimental results on Mouse genome, Yeast, and Human genome datasets evidently demonstrate the supremacy of multi-objective clustering algorithms in association with proposed multi-factored similarity/distance measures. Clustering outcomes are further validated by conducting some biological/statistical significance tests. Supplementary information is available at https://www.iitp.ac.in/sriparna/journals.html.
Collapse
|
3
|
Acharya S, Saha S, Pradhan P. Novel symmetry-based gene-gene dissimilarity measures utilizing Gene Ontology: Application in gene clustering. Gene 2018; 679:341-351. [PMID: 30184472 DOI: 10.1016/j.gene.2018.08.062] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2018] [Revised: 08/21/2018] [Accepted: 08/21/2018] [Indexed: 11/25/2022]
Abstract
In recent years DNA microarray technology, leading to the generation of high-volume biological data, has gained significant attention. To analyze this high volume gene-expression data, one such powerful tool is Clustering. For any clustering algorithm, its efficiency majorly depends upon the underlying similarity/dissimilarity measure. During the analysis of such data often there is a need to further explore the similarity of genes not only with respect to their expression values but also with respect to their functional annotations, which can be obtained from Gene Ontology (GO) databases. In the existing literature, several novel clustering and bi-clustering approaches were proposed to identify co-regulated genes from gene-expression datasets. Identifying co-regulated genes from gene expression data misses some important biological information about functionalities of genes, which is necessary to identify semantically related genes. In this paper, we have proposed sixteen different semantic gene-gene dissimilarity measures utilizing biological information of genes retrieved from a global biological database namely Gene Ontology (GO). Four proximity measures, viz. Euclidean, Cosine, point symmetry and line symmetry are utilized along with four different representations of gene-GO-term annotation vectors to develop total sixteen gene-gene dissimilarity measures. In order to illustrate the profitability of developed dissimilarity measures, some multi-objective as well as single-objective clustering algorithms are applied utilizing proposed measures to identify functionally similar genes from Mouse genome and Yeast datasets. Furthermore, we have compared the performance of our proposed sixteen dissimilarity measures with three existing state-of-the-art semantic similarity and distance measures.
Collapse
Affiliation(s)
- Sudipta Acharya
- Department of Computer Science and Engineering, IIT Patna, India.
| | - Sriparna Saha
- Department of Computer Science and Engineering, IIT Patna, India
| | - Prasanna Pradhan
- Department of Computer Applications, Sikkim Manipal Institute of Technology, India
| |
Collapse
|
4
|
Guo Y, Alexander K, Clark AG, Grimson A, Yu H. Integrated network analysis reveals distinct regulatory roles of transcription factors and microRNAs. RNA (NEW YORK, N.Y.) 2016; 22:1663-1672. [PMID: 27604961 PMCID: PMC5066619 DOI: 10.1261/rna.048025.114] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/08/2014] [Accepted: 07/25/2016] [Indexed: 06/06/2023]
Abstract
Analysis of transcription regulatory networks has revealed many principal features that govern gene expression regulation. MicroRNAs (miRNAs) have emerged as another major class of gene regulators that influence gene expression post-transcriptionally, but there remains a need to assess quantitatively their global roles in gene regulation. Here, we have constructed an integrated gene regulatory network comprised of transcription factors (TFs), miRNAs, and their target genes and analyzed the effect of regulation on target mRNA expression, target protein expression, protein-protein interaction, and disease association. We found that while target genes regulated by the same TFs tend to be co-expressed, co-regulation by miRNAs does not lead to co-expression assessed at either mRNA or protein levels. Analysis of interacting protein pairs in the regulatory network revealed that compared to genes co-regulated by miRNAs, a higher fraction of genes co-regulated by TFs encode proteins in the same complex. Although these results suggest that genes co-regulated by TFs are more functionally related than those co-regulated by miRNAs, genes that share either TF or miRNA regulators are more likely to cause the same disease. Further analysis on the interplay between TFs and miRNAs suggests that TFs tend to regulate intramodule/pathway clusters, while miRNAs tend to regulate intermodule/pathway clusters. These results demonstrate that although TFs and miRNAs both regulate gene expression, they occupy distinct niches in the overall regulatory network within the cell.
Collapse
Affiliation(s)
- Yu Guo
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York 14853, USA
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA
| | - Katherine Alexander
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA
| | - Andrew G Clark
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA
| | - Andrew Grimson
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA
| | - Haiyuan Yu
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York 14853, USA
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA
| |
Collapse
|
5
|
Lee WP, Lin CH. Combining Expression Data and Knowledge Ontology for Gene Clustering and Network Reconstruction. Cognit Comput 2015. [DOI: 10.1007/s12559-015-9349-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
6
|
Meng J, Li R, Luan Y. Classification by integrating plant stress response gene expression data with biological knowledge. Math Biosci 2015; 266:65-72. [PMID: 26092610 DOI: 10.1016/j.mbs.2015.06.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2015] [Revised: 05/03/2015] [Accepted: 06/05/2015] [Indexed: 12/01/2022]
Abstract
Classification of microarray data has always been a challenging task because of the enormous number of genes. In this study, a clustering method by integrating plant stress response gene expression data with biological knowledge is presented. Clustering is one of the promising tools for attribute reduction, but gene clusters are biologically uninformative. So we integrated biological knowledge into genomic analysis to help to improve the interpretation of the results. Biological similarity based on gene ontology (GO) semantic similarity was combined with gene expression data to find out biologically meaningful clusters. Affinity propagation clustering algorithm was chosen to analyze the impact of the biological similarity on the results. Based on clustering result, neighborhood rough set was used to select representative genes for each cluster. The prediction accuracy of classifiers built on reduced gene subsets indicated that our approach outperformed other classical methods. The information fusion was proven to be effective through quantitative analysis, as it could select gene subsets with high biological significance and select significant genes.
Collapse
Affiliation(s)
- Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China..
| | - Rui Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China..
| | - Yushi Luan
- School of Life Science and Biotechnology, Dalian University of Technology, Dalian, Liaoning 116023, China..
| |
Collapse
|
7
|
Zhang SB, Lai JH. Semantic similarity measurement between gene ontology terms based on exclusively inherited shared information. Gene 2015; 558:108-17. [DOI: 10.1016/j.gene.2014.12.062] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2014] [Revised: 12/15/2014] [Accepted: 12/24/2014] [Indexed: 11/25/2022]
|
8
|
Peng J, Uygun S, Kim T, Wang Y, Rhee SY, Chen J. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks. BMC Bioinformatics 2015; 16:44. [PMID: 25886899 PMCID: PMC4339680 DOI: 10.1186/s12859-015-0474-7] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2014] [Accepted: 01/26/2015] [Indexed: 01/18/2023] Open
Abstract
Background Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstrate that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited. Supplementary information and software are available at http://www.msu.edu/~jinchen/NETSIM. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0474-7) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Jiajie Peng
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. .,Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, MI, 48824, USA.
| | - Sahra Uygun
- Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, MI, 48824, USA. .,Genetics Program, Michigan State University, East Lansing, MI, 48824, USA.
| | - Taehyong Kim
- Department of Plant Biology, Carnegie Institution for Science, 260 Panama St, Stanford, CA, 94305, USA.
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
| | - Seung Y Rhee
- Department of Plant Biology, Carnegie Institution for Science, 260 Panama St, Stanford, CA, 94305, USA.
| | - Jin Chen
- Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, MI, 48824, USA. .,Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA.
| |
Collapse
|
9
|
Peng J, Li H, Jiang Q, Wang Y, Chen J. An integrative approach for measuring semantic similarities using gene ontology. BMC SYSTEMS BIOLOGY 2014; 8 Suppl 5:S8. [PMID: 25559943 PMCID: PMC4305987 DOI: 10.1186/1752-0509-8-s5-s8] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/04/2022]
Abstract
Background Gene Ontology (GO) provides rich information and a convenient way to study gene functional similarity, which has been successfully used in various applications. However, the existing GO based similarity measurements have limited functions for only a subset of GO information is considered in each measure. An appropriate integration of the existing measures to take into account more information in GO is demanding. Results We propose a novel integrative measure called InteGO2 to automatically select appropriate seed measures and then to integrate them using a metaheuristic search method. The experiment results show that InteGO2 significantly improves the performance of gene similarity in human, Arabidopsis and yeast on both molecular function and biological process GO categories. Conclusions InteGO2 computes gene-to-gene similarities more accurately than tested existing measures and has high robustness. The supplementary document and software are available at http://mlg.hit.edu.cn:8082/.
Collapse
|
10
|
OrthoClust: an orthology-based network framework for clustering data across multiple species. Genome Biol 2014; 15:R100. [PMID: 25249401 PMCID: PMC4289247 DOI: 10.1186/gb-2014-15-8-r100] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2014] [Accepted: 06/26/2014] [Indexed: 01/28/2023] Open
Abstract
Increasingly, high-dimensional genomics data are becoming available for many organisms.Here, we develop OrthoClust for simultaneously clustering data across multiple species. OrthoClust is a computational framework that integrates the co-association networks of individual species by utilizing the orthology relationships of genes between species. It outputs optimized modules that are fundamentally cross-species, which can either be conserved or species-specific. We demonstrate the application of OrthoClust using the RNA-Seq expression profiles of Caenorhabditis elegans and Drosophila melanogaster from the modENCODE consortium. A potential application of cross-species modules is to infer putative analogous functions of uncharacterized elements like non-coding RNAs based on guilt-by-association.
Collapse
|
11
|
Abstract
Background In Gene Ontology, the "Molecular Function" (MF) categorization is a widely used knowledge framework for gene function comparison and prediction. Its structure and annotation provide a convenient way to compare gene functional similarities at the molecular level. The existing gene similarity measures, however, solely rely on one or few aspects of MF without utilizing all the rich information available including structure, annotation, common terms, lowest common parents. Results We introduce a rank-based gene semantic similarity measure called InteGO by synergistically integrating the state-of-the-art gene-to-gene similarity measures. By integrating three GO based seed measures, InteGO significantly improves the performance by about two-fold in all the three species studied (yeast, Arabidopsis and human). Conclusions InteGO is a systematic and novel method to study gene functional associations. The software and description are available at http://www.msu.edu/~jinchen/InteGO.
Collapse
|
12
|
CoCiter: an efficient tool to infer gene function by assessing the significance of literature co-citation. PLoS One 2013; 8:e74074. [PMID: 24086311 PMCID: PMC3781068 DOI: 10.1371/journal.pone.0074074] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2012] [Accepted: 07/30/2013] [Indexed: 01/17/2023] Open
Abstract
A routine approach to inferring functions for a gene set is by using function enrichment analysis based on GO, KEGG or other curated terms and pathways. However, such analysis requires the existence of overlapping genes between the query gene set and those annotated by GO/KEGG. Furthermore, GO/KEGG databases only maintain a very restricted vocabulary. Here, we have developed a tool called "CoCiter" based on literature co-citations to address the limitations in conventional function enrichment analysis. Co-citation analysis is widely used in ranking articles and predicting protein-protein interactions (PPIs). Our algorithm can further assess the co-citation significance of a gene set with any other user-defined gene sets, or with free terms. We show that compared with the traditional approaches, CoCiter is a more accurate and flexible function enrichment analysis method. CoCiter is freely available at www.picb.ac.cn/hanlab/cociter/.
Collapse
|
13
|
Guo Y, Wei X, Das J, Grimson A, Lipkin S, Clark A, Yu H. Dissecting disease inheritance modes in a three-dimensional protein network challenges the "guilt-by-association" principle. Am J Hum Genet 2013; 93:78-89. [PMID: 23791107 DOI: 10.1016/j.ajhg.2013.05.022] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2012] [Revised: 05/02/2013] [Accepted: 05/23/2013] [Indexed: 10/26/2022] Open
Abstract
To better understand different molecular mechanisms by which mutations lead to various human diseases, we classified 82,833 disease-associated mutations according to their inheritance modes (recessive versus dominant) and molecular types (in-frame [missense point mutations and in-frame indels] versus truncating [nonsense mutations and frameshift indels]) and systematically examined the effects of different classes of disease mutations in a three-dimensional protein interactome network with the atomic-resolution interface resolved for each interaction. We found that although recessive mutations affecting the interaction interface of two interacting proteins tend to cause the same disease, this widely accepted "guilt-by-association" principle does not apply to dominant mutations. Furthermore, recessive truncating mutations in regions encoding the same interface are much more likely to cause the same disease, even for interfaces close to the N terminus of the protein. Conversely, dominant truncating mutations tend to be enriched in regions encoding areas between interfaces. These results suggest that a significant fraction of truncating mutations can generate functional protein products. For example, TRIM27, a known cancer-associated protein, interacts with three proteins (MID2, TRIM42, and SIRPA) through two different interfaces. A dominant truncating mutation (c.1024delT [p.Tyr342Thrfs*30]) associated with ovarian carcinoma is located between the regions encoding the two interfaces; the altered protein retains its interaction with MID2 and TRIM42 through the first interface but loses its interaction with SIRPA through the second interface. Our findings will help clarify the molecular mechanisms of thousands of disease-associated genes and their tens of thousands of mutations, especially for those carrying truncating mutations, often erroneously considered "knockout" alleles.
Collapse
|
14
|
Teng Z, Guo M, Liu X, Dai Q, Wang C, Xuan P. Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics 2013; 29:1424-32. [PMID: 23572412 DOI: 10.1093/bioinformatics/btt160] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
|
15
|
Das J, Mohammed J, Yu H. Genome-scale analysis of interaction dynamics reveals organization of biological networks. ACTA ACUST UNITED AC 2012; 28:1873-8. [PMID: 22576179 DOI: 10.1093/bioinformatics/bts283] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022]
Abstract
Analyzing large-scale interaction networks has generated numerous insights in systems biology. However, such studies have primarily been focused on highly co-expressed, stable interactions. Most transient interactions that carry out equally important functions, especially in signal transduction pathways, are yet to be elucidated and are often wrongly discarded as false positives. Here, we revisit a previously described Smith-Waterman-like dynamic programming algorithm and use it to distinguish stable and transient interactions on a genomic scale in human and yeast. We find that in biological networks, transient interactions are key links topologically connecting tightly regulated functional modules formed by stable interactions and are essential to maintaining the integrity of cellular networks. We also perform a systematic analysis of interaction dynamics across different technologies and find that high-throughput yeast two-hybrid is the only available technology for detecting transient interactions on a large scale.
Collapse
Affiliation(s)
- Jishnu Das
- Department of Biological Statistics and Computational Biology, Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
| | | | | |
Collapse
|
16
|
Yang H, Nepusz T, Paccanaro A. Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. ACTA ACUST UNITED AC 2012; 28:1383-9. [PMID: 22522134 DOI: 10.1093/bioinformatics/bts129] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
MOTIVATION Several measures have been recently proposed for quantifying the functional similarity between gene products according to well-structured controlled vocabularies where biological terms are organized in a tree or in a directed acyclic graph (DAG) structure. However, existing semantic similarity measures ignore two important facts. First, when calculating the similarity between two terms, they disregard the descendants of these terms. While this makes no difference when the ontology is a tree, we shall show that it has important consequences when the ontology is a DAG-this is the case, for example, with the Gene Ontology (GO). Second, existing similarity measures do not model the inherent uncertainty which comes from the fact that our current knowledge of the gene annotation and of the ontology structure is incomplete. Here, we propose a novel approach based on downward random walks that can be used to improve any of the existing similarity measures to exhibit these two properties. The approach is computationally efficient-random walks do not need to be simulated as we provide formulas to calculate their stationary distributions. RESULTS To show that our approach can potentially improve any semantic similarity measure, we test it on six different semantic similarity measures: three commonly used measures by Resnik (1999), Lin (1998), and Jiang and Conrath (1997); and three recently proposed measures: simUI, simGIC by Pesquita et al. (2008); GraSM by Couto et al. (2007); and Couto and Silva (2011). We applied these improved measures to the GO annotations of the yeast Saccharomyces cerevisiae, and tested how they correlate with sequence similarity, mRNA co-expression and protein-protein interaction data. Our results consistently show that the use of downward random walks leads to more reliable similarity measures.
Collapse
Affiliation(s)
- Haixuan Yang
- Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham, TW20 0EX, UK
| | | | | |
Collapse
|
17
|
Systems analysis of inflammatory bowel disease based on comprehensive gene information. BMC MEDICAL GENETICS 2012; 13:25. [PMID: 22480395 PMCID: PMC3368714 DOI: 10.1186/1471-2350-13-25] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/19/2011] [Accepted: 04/05/2012] [Indexed: 12/19/2022]
Abstract
Background The rise of systems biology and availability of highly curated gene and molecular information resources has promoted a comprehensive approach to study disease as the cumulative deleterious function of a collection of individual genes and networks of molecules acting in concert. These "human disease networks" (HDN) have revealed novel candidate genes and pharmaceutical targets for many diseases and identified fundamental HDN features conserved across diseases. A network-based analysis is particularly vital for a study on polygenic diseases where many interactions between molecules should be simultaneously examined and elucidated. We employ a new knowledge driven HDN gene and molecular database systems approach to analyze Inflammatory Bowel Disease (IBD), whose pathogenesis remains largely unknown. Methods and Results Based on drug indications for IBD, we determined sibling diseases of mild and severe states of IBD. Approximately 1,000 genes associated with the sibling diseases were retrieved from four databases. After ranking the genes by the frequency of records in the databases, we obtained 250 and 253 genes highly associated with the mild and severe IBD states, respectively. We then calculated functional similarities of these genes with known drug targets and examined and presented their interactions as PPI networks. Conclusions The results demonstrate that this knowledge-based systems approach, predicated on functionally similar genes important to sibling diseases is an effective method to identify important components of the IBD human disease network. Our approach elucidates a previously unknown biological distinction between mild and severe IBD states.
Collapse
|
18
|
Wang X, Wei X, Thijssen B, Das J, Lipkin SM, Yu H. Three-dimensional reconstruction of protein networks provides insight into human genetic disease. Nat Biotechnol 2012; 30:159-64. [PMID: 22252508 DOI: 10.1038/nbt.2106] [Citation(s) in RCA: 290] [Impact Index Per Article: 22.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2011] [Accepted: 12/19/2011] [Indexed: 01/13/2023]
Abstract
To better understand the molecular mechanisms and genetic basis of human disease, we systematically examine relationships between 3,949 genes, 62,663 mutations and 3,453 associated disorders by generating a three-dimensional, structurally resolved human interactome. This network consists of 4,222 high-quality binary protein-protein interactions with their atomic-resolution interfaces. We find that in-frame mutations (missense point mutations and in-frame insertions and deletions) are enriched on the interaction interfaces of proteins associated with the corresponding disorders, and that the disease specificity for different mutations of the same gene can be explained by their location within an interface. We also predict 292 candidate genes for 694 unknown disease-to-gene associations with proposed molecular mechanism hypotheses. This work indicates that knowledge of how in-frame disease mutations alter specific interactions is critical to understanding pathogenesis. Structurally resolved interaction networks should be valuable tools for interpreting the wealth of data being generated by large-scale structural genomics and disease association studies.
Collapse
Affiliation(s)
- Xiujuan Wang
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, USA
| | | | | | | | | | | |
Collapse
|
19
|
Schulz MH, Köhler S, Bauer S, Robinson PN. Exact score distribution computation for ontological similarity searches. BMC Bioinformatics 2011; 12:441. [PMID: 22078312 PMCID: PMC3240574 DOI: 10.1186/1471-2105-12-441] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2011] [Accepted: 11/12/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Semantic similarity searches in ontologies are an important component of many bioinformatic algorithms, e.g., finding functionally related proteins with the Gene Ontology or phenotypically similar diseases with the Human Phenotype Ontology (HPO). We have recently shown that the performance of semantic similarity searches can be improved by ranking results according to the probability of obtaining a given score at random rather than by the scores themselves. However, to date, there are no algorithms for computing the exact distribution of semantic similarity scores, which is necessary for computing the exact P-value of a given score. RESULTS In this paper we consider the exact computation of score distributions for similarity searches in ontologies, and introduce a simple null hypothesis which can be used to compute a P-value for the statistical significance of similarity scores. We concentrate on measures based on Resnik's definition of ontological similarity. A new algorithm is proposed that collapses subgraphs of the ontology graph and thereby allows fast score distribution computation. The new algorithm is several orders of magnitude faster than the naive approach, as we demonstrate by computing score distributions for similarity searches in the HPO. It is shown that exact P-value calculation improves clinical diagnosis using the HPO compared to approaches based on sampling. CONCLUSIONS The new algorithm enables for the first time exact P-value calculation via exact score distribution computation for ontology similarity searches. The approach is applicable to any ontology for which the annotation-propagation rule holds and can improve any bioinformatic method that makes only use of the raw similarity scores. The algorithm was implemented in Java, supports any ontology in OBO format, and is available for non-commercial and academic usage under: https://compbio.charite.de/svn/hpo/trunk/src/tools/significance/
Collapse
Affiliation(s)
- Marcel H Schulz
- Max Planck Institute for Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany
- Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, 15213 Pennsylvania, USA
| | - Sebastian Köhler
- Institute for Medical Genetics and Human Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Berlin-Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
| | - Sebastian Bauer
- Institute for Medical Genetics and Human Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
| | - Peter N Robinson
- Max Planck Institute for Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany
- Institute for Medical Genetics and Human Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Berlin-Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
| |
Collapse
|
20
|
Lysenko A, Defoin-Platel M, Hassani-Pak K, Taubert J, Hodgman C, Rawlings CJ, Saqi M. Assessing the functional coherence of modules found in multiple-evidence networks from Arabidopsis. BMC Bioinformatics 2011; 12:203. [PMID: 21612636 PMCID: PMC3118170 DOI: 10.1186/1471-2105-12-203] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2010] [Accepted: 05/25/2011] [Indexed: 12/18/2022] Open
Abstract
Background Combining multiple evidence-types from different information sources has the potential to reveal new relationships in biological systems. The integrated information can be represented as a relationship network, and clustering the network can suggest possible functional modules. The value of such modules for gaining insight into the underlying biological processes depends on their functional coherence. The challenges that we wish to address are to define and quantify the functional coherence of modules in relationship networks, so that they can be used to infer function of as yet unannotated proteins, to discover previously unknown roles of proteins in diseases as well as for better understanding of the regulation and interrelationship between different elements of complex biological systems. Results We have defined the functional coherence of modules with respect to the Gene Ontology (GO) by considering two complementary aspects: (i) the fragmentation of the GO functional categories into the different modules and (ii) the most representative functions of the modules. We have proposed a set of metrics to evaluate these two aspects and demonstrated their utility in Arabidopsis thaliana. We selected 2355 proteins for which experimentally established protein-protein interaction (PPI) data were available. From these we have constructed five relationship networks, four based on single types of data: PPI, co-expression, co-occurrence of protein names in scientific literature abstracts and sequence similarity and a fifth one combining these four evidence types. The ability of these networks to suggest biologically meaningful grouping of proteins was explored by applying Markov clustering and then by measuring the functional coherence of the clusters. Conclusions Relationship networks integrating multiple evidence-types are biologically informative and allow more proteins to be assigned to a putative functional module. Using additional evidence types concentrates the functional annotations in a smaller number of modules without unduly compromising their consistency. These results indicate that integration of more data sources improves the ability to uncover functional association between proteins, both by allowing more proteins to be linked and producing a network where modular structure more closely reflects the hierarchy in the gene ontology.
Collapse
Affiliation(s)
- Artem Lysenko
- Centre for Mathematical and Computational Biology, Rothamsted Research, Harpenden, Herts, AL5, 2JQ, UK.
| | | | | | | | | | | | | |
Collapse
|
21
|
microRNA-122 as a regulator of mitochondrial metabolic gene network in hepatocellular carcinoma. Mol Syst Biol 2011; 6:402. [PMID: 20739924 PMCID: PMC2950084 DOI: 10.1038/msb.2010.58] [Citation(s) in RCA: 162] [Impact Index Per Article: 11.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2009] [Accepted: 06/29/2010] [Indexed: 02/06/2023] Open
Abstract
A moderate loss of miR-122 function correlates with up-regulation of seed-matched genes and down-regulation of mitochondrially localized genes in both human hepatocellular carcinoma and in normal mice treated with anti-miR-122 antagomir. Putative direct targets up-regulated with loss of miR-122 and secondary targets down-regulated with loss of miR-122 are conserved between human beings and mice and are rapidly regulated in vitro in response to miR-122 over- and under-expression. Loss of miR-122 secondary target expression in either tumorous or adjacent non-tumorous tissue predicts poor survival of heptatocellular carcinoma patients.
Hepatocellular carcinoma (HCC) is one of the most aggressive human malignancies, common in Asia, Africa, and in areas with endemic infections of hepatitis-B or -C viruses (HBV or HCV) (But et al, 2008). Globally, the 5-year survival rate of HCC is <5% and about 600 000 HCC patients die each year. The high mortality associated with this disease is mainly attributed to the failure to diagnose HCC patients at an early stage and a lack of effective therapies for patients with advanced stage HCC. Understanding the relationships between phenotypic and molecular changes in HCC is, therefore, of paramount importance for the development of improved HCC diagnosis and treatment methods. In this study, we examined mRNA and microRNA (miRNA)-expression profiles of tumor and adjacent non-tumor liver tissue from HCC patients. The patient population was selected from a region of endemic HBV infection, and HBV infection appears to contribute to the etiology of HCC in these patients. A total of 96 HCC patients were included in the study, of which about 88% tested positive for HBV antigen; patients testing positive for HCV antigen were excluded. Among the 220 miRNAs profiled, miR-122 was the most highly expressed miRNA in liver, and its expression was decreased almost two-fold in HCC tissue relative to adjacent non-tumor tissue, confirming earlier observations (Lagos-Quintana et al, 2002; Kutay et al, 2006; Budhu et al, 2008). Over 1000 transcripts were correlated and over 1000 transcripts were anti-correlated with miR-122 expression. Consistent with the idea that transcripts anti-correlated with miR-122 are potential miR-122 targets, the most highly anti-correlated transcripts were highly enriched for the presence of the miR-122 central seed hexamer, CACTCC, in the 3′UTR. Although the complete set of negatively correlated genes was enriched for cell-cycle genes, the subset of seed-matched genes had no significant KEGG Pathway annotation, suggesting that miR-122 is unlikely to directly regulate the cell cycle in these patients. In contrast, transcripts positively correlated with miR-122 were not enriched for 3′UTR seed matches to miR-122. Interestingly, these 1042 transcripts were enriched for genes coding for mitochondrially localized proteins and for metabolic functions. To analyze the impact of loss of miR-122 in vivo, silencing of miR-122 was performed by antisense inhibition (anti-miR-122) in wild-type mice (Figure 3). As with the genes negatively correlated with miR-122 in HCC patients, no significant biological annotation was associated with the seed-matched genes up-regulated by anti-miR-122 in mouse livers. The most significantly enriched biological annotation for anti-miR-122 down-regulated genes, as for positively correlated genes in HCC, was mitochondrial localization; the down-regulated mitochondrial genes were enriched for metabolic functions. Putative direct and downstream targets with orthologs on both the human and mouse microarrays showed significant overlap for regulations in the same direction. These overlaps defined sets of putative miR-122 primary and secondary targets. The results were further extended in the analysis of a separate dataset from 180 HCC, 40 cirrhotic, and 6 normal liver tissue samples (Figure 4), showing anti-correlation of proposed primary and secondary targets in non-healthy tissues. To validate the direct correlation between miR-122 and some of the primary and secondary targets, we determined the expression of putative targets after transfection of miR-122 mimetic into PLC/PRF/5 HCC cells, including the putative direct targets SMARCD1 and MAP3K3 (MEKK3), a target described in the literature, CAT-1 (SLC7A1), and three putative secondary targets, PPARGC1A (PGC-1α) and succinate dehydrogenase subunits A and B. As expected, the putative direct targets showed reduced expression, whereas the putative secondary target genes showed increased expression in cells over-expressing miR-122 (Figure 4). Functional classification of genes using the total ancestry method (Yu et al, 2007) identified PPARGC1A (PGC-1α) as the most connected secondary target. PPARGC1A has been proposed to function as a master regulator of mitochondrial biogenesis (Ventura-Clapier et al, 2008), suggesting that loss of PPARGC1A expression may contribute to the loss of mitochondrial gene expression correlated with loss of miR-122 expression. To further validate the link of miR-122 and PGC-1α protein, we transfected PLC/PRF/5 cells with miR-122-expression vector, and observed an increase in PGC-1α protein levels. Importantly, transfection of both miR-122 mimetic and miR-122-expression vector significantly reduced the lactate content of PLC/PRF/5 cells, whereas anti-miR-122 treatment increased lactate production. Together, the data support the function of miR-122 in mitochondrial metabolic functions. Patient survival was not directly associated with miR-122-expression levels. However, miR-122 secondary targets were expressed at significantly higher levels in both tumor and adjacent non-tumor tissues among survivors as compared with deceased patients, providing supporting evidence for the potential relevance of loss of miR-122 function in HCC patient morbidity and mortality. Overall, our findings reveal potentially new biological functions for miR-122 in liver physiology. We observed decreased expression of miR-122, a liver-specific miRNA, in HBV-associated HCC, and loss of miR-122 seemed to correlate with the decrease of mitochondrion-related metabolic pathway gene expression in HCC and in non-tumor liver tissues, a result that is consistent with the outcome of treatment of mice with anti-miR-122 and is of prognostic significance for HCC patients. Further investigation will be conducted to dissect the regulatory function of miR-122 on mitochondrial metabolism in HCC and to test whether increasing miR-122 expression can improve mitochondrial function in liver and perhaps in liver tumor tissues. Moreover, these results support the idea that primary targets of a given miRNA may be distributed over a variety of functional categories while resulting in a coordinated secondary response, potentially through synergistic action (Linsley et al, 2007). Tumorigenesis involves multistep genetic alterations. To elucidate the microRNA (miRNA)–gene interaction network in carcinogenesis, we examined their genome-wide expression profiles in 96 pairs of tumor/non-tumor tissues from hepatocellular carcinoma (HCC). Comprehensive analysis of the coordinate expression of miRNAs and mRNAs reveals that miR-122 is under-expressed in HCC and that increased expression of miR-122 seed-matched genes leads to a loss of mitochondrial metabolic function. Furthermore, the miR-122 secondary targets, which decrease in expression, are good prognostic markers for HCC. Transcriptome profiling data from additional 180 HCC and 40 liver cirrhotic patients in the same cohort were used to confirm the anti-correlation of miR-122 primary and secondary target gene sets. The HCC findings can be recapitulated in mouse liver by silencing miR-122 with antagomir treatment followed by gene-expression microarray analysis. In vitro miR-122 data further provided a direct link between induction of miR-122-controlled genes and impairment of mitochondrial metabolism. In conclusion, miR-122 regulates mitochondrial metabolism and its loss may be detrimental to sustaining critical liver function and contribute to morbidity and mortality of liver cancer patients.
Collapse
|
22
|
Kyogoku R, Fujimoto R, Ozaki T, Ohkawa T. A method for supporting retrieval of articles on protein structure analysis considering users' intention. BMC Bioinformatics 2011; 12 Suppl 1:S42. [PMID: 21342574 PMCID: PMC3044299 DOI: 10.1186/1471-2105-12-s1-s42] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background In recent years, information about protein structure and function is described in a large amount of articles. However, a naive full-text search by specific keywords often fails to find desired articles, because the articles involve the ambiguous and complicated concepts that cannot be described with uniform representation. For retrieving articles on protein structure and function, it is important to consider the relevance between structural and/or functional concepts by identifying the user’s intention. Results We introduce a scheme of evaluating relevance between articles based on various biological databases and ontologies on structures and functions of proteins. The relevance, which is defined as a path length between concepts on hierarchies, is modified adaptively based on additional articles as a query in order to reflect the user’s intention. Also we implemented the retrieval system, in which the user can input some articles as a query and the related articles are retrieved and displayed on the 2D map. Conclusions The effectiveness of the proposed system was confirmed experimentally by having shown that the users can obtain easily highly related articles which reflect their intention.
Collapse
Affiliation(s)
- Riku Kyogoku
- Graduate School of System Informatics, Kobe University, Rokkodai, Nada, Kobe 657-8501, Japan.
| | | | | | | |
Collapse
|
23
|
Richards AJ, Muller B, Shotwell M, Cowart LA, Rohrer B, Lu X. Assessing the functional coherence of gene sets with metrics based on the Gene Ontology graph. ACTA ACUST UNITED AC 2010; 26:i79-87. [PMID: 20529941 PMCID: PMC2881388 DOI: 10.1093/bioinformatics/btq203] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION The results of initial analyses for many high-throughput technologies commonly take the form of gene or protein sets, and one of the ensuing tasks is to evaluate the functional coherence of these sets. The study of gene set function most commonly makes use of controlled vocabulary in the form of ontology annotations. For a given gene set, the statistical significance of observing these annotations or 'enrichment' may be tested using a number of methods. Instead of testing for significance of individual terms, this study is concerned with the task of assessing the global functional coherence of gene sets, for which novel metrics and statistical methods have been devised. RESULTS The metrics of this study are based on the topological properties of graphs comprised of genes and their Gene Ontology annotations. A novel aspect of these methods is that both the enrichment of annotations and the relationships among annotations are considered when determining the significance of functional coherence. We applied our methods to perform analyses on an existing database and on microarray experimental results. Here, we demonstrated that our approach is highly discriminative in terms of differentiating coherent gene sets from random ones and that it provides biologically sensible evaluations in microarray analysis. We further used examples to show the utility of graph visualization as a tool for studying the functional coherence of gene sets. AVAILABILITY The implementation is provided as a freely accessible web application at: http://projects.dbbe.musc.edu/gosteiner. Additionally, the source code written in the Python programming language, is available under the General Public License of the Free Software Foundation. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Adam J Richards
- Department of Biochemistry and Molecular Biology, Medical University of South Carolina, Charleston, SC 29425, USA
| | | | | | | | | | | |
Collapse
|
24
|
Davis MJ, Sehgal MSB, Ragan MA. Automatic, context-specific generation of Gene Ontology slims. BMC Bioinformatics 2010; 11:498. [PMID: 20929524 PMCID: PMC3098080 DOI: 10.1186/1471-2105-11-498] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2010] [Accepted: 10/07/2010] [Indexed: 11/10/2022] Open
Abstract
Background The use of ontologies to control vocabulary and structure annotation has added value to genome-scale data, and contributed to the capture and re-use of knowledge across research domains. Gene Ontology (GO) is widely used to capture detailed expert knowledge in genomic-scale datasets and as a consequence has grown to contain many terms, making it unwieldy for many applications. To increase its ease of manipulation and efficiency of use, subsets called GO slims are often created by collapsing terms upward into more general, high-level terms relevant to a particular context. Creation of a GO slim currently requires manipulation and editing of GO by an expert (or community) familiar with both the ontology and the biological context. Decisions about which terms to include are necessarily subjective, and the creation process itself and subsequent curation are time-consuming and largely manual. Results Here we present an objective framework for generating customised ontology slims for specific annotated datasets, exploiting information latent in the structure of the ontology graph and in the annotation data. This framework combines ontology engineering approaches, and a data-driven algorithm that draws on graph and information theory. We illustrate this method by application to GO, generating GO slims at different information thresholds, characterising their depth of semantics and demonstrating the resulting gains in statistical power. Conclusions Our GO slim creation pipeline is available for use in conjunction with any GO-annotated dataset, and creates dataset-specific, objectively defined slims. This method is fast and scalable for application to other biomedical ontologies.
Collapse
Affiliation(s)
- Melissa J Davis
- The University of Queensland, Brisbane, QLD 4072, Australia.
| | | | | |
Collapse
|
25
|
|
26
|
Voevodski K, Teng SH, Xia Y. Spectral affinity in protein networks. BMC SYSTEMS BIOLOGY 2009; 3:112. [PMID: 19943959 PMCID: PMC2797010 DOI: 10.1186/1752-0509-3-112] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/12/2009] [Accepted: 11/29/2009] [Indexed: 01/15/2023]
Abstract
Background Protein-protein interaction (PPI) networks enable us to better understand the functional organization of the proteome. We can learn a lot about a particular protein by querying its neighborhood in a PPI network to find proteins with similar function. A spectral approach that considers random walks between nodes of interest is particularly useful in evaluating closeness in PPI networks. Spectral measures of closeness are more robust to noise in the data and are more precise than simpler methods based on edge density and shortest path length. Results We develop a novel affinity measure for pairs of proteins in PPI networks, which uses personalized PageRank, a random walk based method used in context-sensitive search on the Web. Our measure of closeness, which we call PageRank Affinity, is proportional to the number of times the smaller-degree protein is visited in a random walk that restarts at the larger-degree protein. PageRank considers paths of all lengths in a network, therefore PageRank Affinity is a precise measure that is robust to noise in the data. PageRank Affinity is also provably related to cluster co-membership, making it a meaningful measure. In our experiments on protein networks we find that our measure is better at predicting co-complex membership and finding functionally related proteins than other commonly used measures of closeness. Moreover, our experiments indicate that PageRank Affinity is very resilient to noise in the network. In addition, based on our method we build a tool that quickly finds nodes closest to a queried protein in any protein network, and easily scales to much larger biological networks. Conclusion We define a meaningful way to assess the closeness of two proteins in a PPI network, and show that our closeness measure is more biologically significant than other commonly used methods. We also develop a tool, accessible at http://xialab.bu.edu/resources/pnns, that allows the user to quickly find nodes closest to a queried vertex in any protein network available from BioGRID or specified by the user.
Collapse
|
27
|
Köhler S, Schulz MH, Krawitz P, Bauer S, Dölken S, Ott CE, Mundlos C, Horn D, Mundlos S, Robinson PN. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet 2009; 85:457-64. [PMID: 19800049 PMCID: PMC2756558 DOI: 10.1016/j.ajhg.2009.09.003] [Citation(s) in RCA: 355] [Impact Index Per Article: 22.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2009] [Revised: 08/04/2009] [Accepted: 09/01/2009] [Indexed: 10/20/2022] Open
Abstract
The differential diagnostic process attempts to identify candidate diseases that best explain a set of clinical features. This process can be complicated by the fact that the features can have varying degrees of specificity, as well as by the presence of features unrelated to the disease itself. Depending on the experience of the physician and the availability of laboratory tests, clinical abnormalities may be described in greater or lesser detail. We have adapted semantic similarity metrics to measure phenotypic similarity between queries and hereditary diseases annotated with the use of the Human Phenotype Ontology (HPO) and have developed a statistical model to assign p values to the resulting similarity scores, which can be used to rank the candidate diseases. We show that our approach outperforms simpler term-matching approaches that do not take the semantic interrelationships between terms into account. The advantage of our approach was greater for queries containing phenotypic noise or imprecise clinical descriptions. The semantic network defined by the HPO can be used to refine the differential diagnosis by suggesting clinical features that, if present, best differentiate among the candidate diagnoses. Thus, semantic similarity searches in ontologies represent a useful way of harnessing the semantic structure of human phenotypic abnormalities to help with the differential diagnosis. We have implemented our methods in a freely available web application for the field of human Mendelian disorders.
Collapse
Affiliation(s)
- Sebastian Köhler
- Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Berlin-Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
| | - Marcel H. Schulz
- Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
- International Max Planck Research School for Computational Biology and Scientific Computing, 14195 Berlin, Germany
| | - Peter Krawitz
- Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Berlin-Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
| | - Sebastian Bauer
- Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
| | - Sandra Dölken
- Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
| | - Claus E. Ott
- Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
| | - Christine Mundlos
- Allianz Chronischer Seltener Erkrankungen (ACHSE), 14050 Berlin, Germany
| | - Denise Horn
- Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
| | - Stefan Mundlos
- Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Berlin-Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
| | - Peter N. Robinson
- Institute for Medical Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Berlin-Brandenburg Center for Regenerative Therapies (BCRT), Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
| |
Collapse
|
28
|
Voevodski K, Teng SH, Xia Y. Finding local communities in protein networks. BMC Bioinformatics 2009; 10:297. [PMID: 19765306 PMCID: PMC2755114 DOI: 10.1186/1471-2105-10-297] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2009] [Accepted: 09/18/2009] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Protein-protein interactions (PPIs) play fundamental roles in nearly all biological processes, and provide major insights into the inner workings of cells. A vast amount of PPI data for various organisms is available from BioGRID and other sources. The identification of communities in PPI networks is of great interest because they often reveal previously unknown functional ties between proteins. A large number of global clustering algorithms have been applied to protein networks, where the entire network is partitioned into clusters. Here we take a different approach by looking for local communities in PPI networks. RESULTS We develop a tool, named Local Protein Community Finder, which quickly finds a community close to a queried protein in any network available from BioGRID or specified by the user. Our tool uses two new local clustering algorithms Nibble and PageRank-Nibble, which look for a good cluster among the most popular destinations of a short random walk from the queried vertex. The quality of a cluster is determined by proportion of outgoing edges, known as conductance, which is a relative measure particularly useful in undersampled networks. We show that the two local clustering algorithms find communities that not only form excellent clusters, but are also likely to be biologically relevant functional components. We compare the performance of Nibble and PageRank-Nibble to other popular and effective graph partitioning algorithms, and show that they find better clusters in the graph. Moreover, Nibble and PageRank-Nibble find communities that are more functionally coherent. CONCLUSION The Local Protein Community Finder, accessible at http://xialab.bu.edu/resources/lpcf, allows the user to quickly find a high-quality community close to a queried protein in any network available from BioGRID or specified by the user. We show that the communities found by our tool form good clusters and are functionally coherent, making our application useful for biologists who wish to investigate functional modules that a particular protein is a part of.
Collapse
Affiliation(s)
| | - Shang-Hua Teng
- Department of Computer Science, Boston University, Boston, MA 02215, USA
- Microsoft Research New England, Cambridge, MA 02142, USA
| | - Yu Xia
- Bioinformatics Graduate Program, Boston University, Boston, MA 02215, USA
- Department of Chemistry, Boston University, Boston, MA 02215, USA
| |
Collapse
|
29
|
Chagoyen M, Carazo JM, Pascual-Montano A. Assessment of protein set coherence using functional annotations. BMC Bioinformatics 2008; 9:444. [PMID: 18937846 PMCID: PMC2588600 DOI: 10.1186/1471-2105-9-444] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2008] [Accepted: 10/20/2008] [Indexed: 11/23/2022] Open
Abstract
Background Analysis of large-scale experimental datasets frequently produces one or more sets of proteins that are subsequently mined for functional interpretation and validation. To this end, a number of computational methods have been devised that rely on the analysis of functional annotations. Although current methods provide valuable information (e.g. significantly enriched annotations, pairwise functional similarities), they do not specifically measure the degree of homogeneity of a protein set. Results In this work we present a method that scores the degree of functional homogeneity, or coherence, of a set of proteins on the basis of the global similarity of their functional annotations. The method uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set. As such, it can be used as a first step in the validation of sets expected to be homogeneous prior to further functional interpretation. Conclusion We evaluate our method by analysing known biologically relevant sets as well as random ones. The known relevant sets comprise macromolecular complexes, cellular components and pathways described for Saccharomyces cerevisiae, which are mostly significantly coherent. Finally, we illustrate the usefulness of our approach for validating 'functional modules' obtained from computational analysis of protein-protein interaction networks. Matlab code and supplementary data are available at
Collapse
|
30
|
Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, Hao T, Rual JF, Dricot A, Vazquez A, Murray RR, Simon C, Tardivo L, Tam S, Svrzikapa N, Fan C, de Smet AS, Motyl A, Hudson ME, Park J, Xin X, Cusick ME, Moore T, Boone C, Snyder M, Roth FP, Barabási AL, Tavernier J, Hill DE, Vidal M. High-quality binary protein interaction map of the yeast interactome network. Science 2008; 322:104-10. [PMID: 18719252 DOI: 10.1126/science.1158684] [Citation(s) in RCA: 977] [Impact Index Per Article: 57.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
Current yeast interactome network maps contain several hundred molecular complexes with limited and somewhat controversial representation of direct binary interactions. We carried out a comparative quality assessment of current yeast interactome data sets, demonstrating that high-throughput yeast two-hybrid (Y2H) screening provides high-quality binary interaction information. Because a large fraction of the yeast binary interactome remains to be mapped, we developed an empirically controlled mapping framework to produce a "second-generation" high-quality, high-throughput Y2H data set covering approximately 20% of all yeast binary interactions. Both Y2H and affinity purification followed by mass spectrometry (AP/MS) data are of equally high quality but of a fundamentally different and complementary nature, resulting in networks with different topological and biological properties. Compared to co-complex interactome models, this binary map is enriched for transient signaling interactions and intercomplex connections with a highly significant clustering between essential proteins. Rather than correlating with essentiality, protein connectivity correlates with genetic pleiotropy.
Collapse
Affiliation(s)
- Haiyuan Yu
- Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, MA 02115, USA
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
31
|
Tetko IV, Rodchenkov IV, Walter MC, Rattei T, Mewes HW. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. ACTA ACUST UNITED AC 2008; 24:621-8. [PMID: 18174184 DOI: 10.1093/bioinformatics/btm633] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Accurate automatic assignment of protein functions remains a challenge for genome annotation. We have developed and compared the automatic annotation of four bacterial genomes employing a 5-fold cross-validation procedure and several machine learning methods. RESULTS The analyzed genomes were manually annotated with FunCat categories in MIPS providing a gold standard. Features describing a pair of sequences rather than each sequence alone were used. The descriptors were derived from sequence alignment scores, InterPro domains, synteny information, sequence length and calculated protein properties. Following training we scored all pairs from the validation sets, selected a pair with the highest predicted score and annotated the target protein with functional categories of the prototype protein. The data integration using machine-learning methods provided significantly higher annotation accuracy compared to the use of individual descriptors alone. The neural network approach showed the best performance. The descriptors derived from the InterPro domains and sequence similarity provided the highest contribution to the method performance. The predicted annotation scores allow differentiation of reliable versus non-reliable annotations. The developed approach was applied to annotate the protein sequences from 180 complete bacterial genomes. AVAILABILITY The FUNcat Annotation Tool (FUNAT) is available on-line as Web Services at http://mips.gsf.de/proj/funat.
Collapse
Affiliation(s)
- Igor V Tetko
- Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Institute of Bioinformatics and Systems Biology, Neuherberg, Germany.
| | | | | | | | | |
Collapse
|
32
|
Abstract
Functional similarity based on Gene Ontology (GO) annotation is used in diverse applications like gene clustering, gene expression data analysis, protein interaction prediction and evaluation. However, there exists no comprehensive resource of functional similarity values although such a database would facilitate the use of functional similarity measures in different applications. Here, we describe FunSimMat (Functional Similarity Matrix, http://funsimmat.bioinf.mpi-inf.mpg.de/), a large new database that provides several different semantic similarity measures for GO terms. It offers various precomputed functional similarity values for proteins contained in UniProtKB and for protein families in Pfam and SMART. The web interface allows users to efficiently perform both semantic similarity searches with GO terms and functional similarity searches with proteins or protein families. All results can be downloaded in tab-delimited files for use with other tools. An additional XML–RPC interface gives automatic online access to FunSimMat for programs and remote services.
Collapse
Affiliation(s)
- Andreas Schlicker
- Max Planck Institute for Informatics, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany.
| | | |
Collapse
|