1
Ayllón-Benítez A, Mougin F, Allali J, Thiébaut R, Thébault P. A new method for evaluating the impacts of semantic similarity measures on the annotation of gene sets. PLoS One 2018; 13:e0208037. [PMID: 30481204 PMCID: PMC6258551 DOI: 10.1371/journal.pone.0208037] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 07/18/2018] [Accepted: 11/09/2018] [Indexed: 01/01/2023]
Abstract
MOTIVATION The recent revolution in new sequencing technologies, as a part of the continuous process of adopting new innovative protocols, has strongly impacted the interpretation of relations between phenotype and genotype. Thus, understanding the resulting gene sets has become a bottleneck that needs to be addressed. Automatic methods have been proposed to facilitate the interpretation of gene sets. While statistical functional enrichment analyses are currently well known, they tend to focus on well-known genes and to ignore new information from less-studied genes. To address such issues, applying semantic similarity measures is logical if the knowledge source used to annotate the gene sets is hierarchically structured. In this work, we propose a new method for analyzing the impact of different semantic similarity measures on gene set annotations. RESULTS We evaluated the impact of each measure by taking into consideration the following two features, which correspond to relevant criteria for a "good" synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced and the representative terms must be retained while annotating the gene set, and (ii) the number of genes described by the selected terms should be as large as possible. Thus, we analyzed nine semantic similarity measures to identify the best possible compromise between both features while maintaining a sufficient level of detail. Using Gene Ontology to annotate the gene sets, we obtained better results with node-based measures that use the terms' characteristics than with measures based on edges that link the terms. The annotation of the gene sets achieved with the node-based measures did not exhibit major differences regardless of the characteristics of terms used.
Affiliation(s)
- Aarón Ayllón-Benítez
- Univ. Bordeaux, Inserm UMR 1219, Bordeaux Population Health Research Center, team ERIAS, Bordeaux, France
- Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
- * E-mail: (AA); (PT)
| | - Fleur Mougin
- Univ. Bordeaux, Inserm UMR 1219, Bordeaux Population Health Research Center, team ERIAS, Bordeaux, France
- Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
| | - Julien Allali
- Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
| | - Rodolphe Thiébaut
- Univ. Bordeaux, Inserm UMR 1219, INRIA SISTM, Bordeaux, France
- CHU de Bordeaux, Pole de sante publique, Service d’information medicale, Bordeaux, France
- Vaccine Research Institute, Creteil, France
| | - Patricia Thébault
- Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
- * E-mail: (AA); (PT)
2
Ballouz S, Pavlidis P, Gillis J. Using predictive specificity to determine when gene set analysis is biologically meaningful. Nucleic Acids Res 2018; 45:e20. [PMID: 28204549 PMCID: PMC5389513 DOI: 10.1093/nar/gkw957] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 01/29/2016] [Revised: 10/04/2016] [Accepted: 10/10/2016] [Indexed: 11/14/2022]
Abstract
Gene set analysis, which translates gene lists into enriched functions, is among the most common bioinformatic methods. Yet few would advocate taking the results at face value. Not only is there no agreement on the algorithms themselves, there is no agreement on how to benchmark them. In this paper, we evaluate the robustness and uniqueness of enrichment results as a means of assessing methods even where correctness is unknown. We show that heavily annotated (‘multifunctional’) genes are likely to appear in genomics study results and drive the generation of biologically non-specific enrichment results as well as highly fragile significances. By providing a means of determining where enrichment analyses report non-specific and non-robust findings, we are able to assess where we can be confident in their use. We find significant progress in recent bias correction methods for enrichment and provide our own software implementation. Our approach can be readily adapted to any pre-existing package.
Affiliation(s)
- Sara Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
| | - Paul Pavlidis
- Department of Psychiatry and Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
3
Lu S, Cai C, Yan G, Zhou Z, Wan Y, Chen V, Chen L, Cooper GF, Obeid LM, Hannun YA, Lee AV, Lu X. Signal-Oriented Pathway Analyses Reveal a Signaling Complex as a Synthetic Lethal Target for p53 Mutations. Cancer Res 2016; 76:6785-6794. [PMID: 27758891 DOI: 10.1158/0008-5472.can-16-1740] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/27/2016] [Revised: 08/31/2016] [Accepted: 09/18/2016] [Indexed: 11/16/2022]
Abstract
Defining processes that are synthetic lethal with p53 mutations in cancer cells may reveal possible therapeutic strategies. In this study, we report the development of a signal-oriented computational framework for cancer pathway discovery in this context. We applied our bipartite graph-based functional module discovery algorithm to identify transcriptomic modules abnormally expressed in multiple tumors, such that the genes in a module were likely regulated by a common, perturbed signal. For each transcriptomic module, we applied our weighted k-path merge algorithm to search for a set of somatic genome alterations (SGA) that likely perturbed the signal, that is, the candidate members of the pathway that regulate the transcriptomic module. Computational evaluations indicated that the pathways identified by our methods were perturbed by SGA. In particular, our analyses revealed that SGA affecting TP53, PTK2, YWHAZ, and MED1 perturbed a set of signals that promote cell proliferation, anchor-free colony formation, and epithelial-mesenchymal transition (EMT). These proteins formed a signaling complex that mediates these oncogenic processes in a coordinated fashion. Disruption of this signaling complex by knocking down PTK2, YWHAZ, or MED1 attenuated and reversed oncogenic phenotypes caused by mutant p53 in a synthetic lethal manner. This signal-oriented framework for searching pathways and therapeutic targets is applicable to all cancer types, thus potentially impacting precision medicine in cancer.
Affiliation(s)
- Songjian Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania; Center for Causal Discovery, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Chunhui Cai
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania; Center for Causal Discovery, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Gonghong Yan
- University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania; Department of Pharmacology and Chemical Biology, University of Pittsburgh, Pittsburgh, Pennsylvania; Magee-Womens Research Institute, Pittsburgh, Pennsylvania
| | - Zhuan Zhou
- University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania; Department of Cell Biology, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Yong Wan
- University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania; Department of Cell Biology, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Vicky Chen
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania; Center for Causal Discovery, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Lujia Chen
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania; Center for Causal Discovery, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Gregory F Cooper
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania; Center for Causal Discovery, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Lina M Obeid
- Department of Medicine, the State University of New York at Stony Brook, Stony Brook, New York
| | - Yusuf A Hannun
- Department of Medicine, the State University of New York at Stony Brook, Stony Brook, New York
| | - Adrian V Lee
- Center for Causal Discovery, University of Pittsburgh, Pittsburgh, Pennsylvania; University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania; Department of Pharmacology and Chemical Biology, University of Pittsburgh, Pittsburgh, Pennsylvania; Magee-Womens Research Institute, Pittsburgh, Pennsylvania
| | - Xinghua Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania; Center for Causal Discovery, University of Pittsburgh, Pittsburgh, Pennsylvania
4
Pesaranghader A, Matwin S, Sokolova M, Beiko RG. simDEF: definition-based semantic similarity measure of gene ontology terms for functional similarity analysis of genes. Bioinformatics 2015; 32:1380-7. [PMID: 26708333 DOI: 10.1093/bioinformatics/btv755] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 09/07/2015] [Accepted: 12/21/2015] [Indexed: 12/19/2022]
Abstract
MOTIVATION Measures of protein functional similarity are essential tools for function prediction, evaluation of protein-protein interactions (PPIs) and other applications. Several existing methods perform comparisons between proteins based on the semantic similarity of their GO terms; however, these measures are highly sensitive to modifications in the topological structure of GO, tend to be focused on specific analytical tasks and concentrate on the GO terms themselves rather than considering their textual definitions. RESULTS We introduce simDEF, an efficient method for measuring semantic similarity of GO terms using their GO definitions, which is based on the Gloss Vector measure commonly used in natural language processing. The simDEF approach builds optimized definition vectors for all relevant GO terms, and expresses the similarity of a pair of proteins as the cosine of the angle between their definition vectors. Relative to existing similarity measures, when validated on a yeast reference database, simDEF improves correlation with sequence homology by up to 50%, shows a correlation improvement >4% with gene expression in the biological process hierarchy of GO and increases PPI predictability by > 2.5% in F1 score for molecular function hierarchy. AVAILABILITY AND IMPLEMENTATION Datasets, results and source code are available at http://kiwi.cs.dal.ca/Software/simDEF CONTACT: ahmad.pgh@dal.ca or beiko@cs.dal.ca SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
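The cosine comparison mentioned in the abstract above can be illustrated with a minimal Python sketch. It shows only the cosine of the angle between two numeric vectors; the function name and the toy vectors are assumptions made for illustration, and the sketch does not reproduce how simDEF builds or optimizes its definition vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; treat its similarity as 0.0.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "definition vectors" (not taken from the paper).
vec_a = [1.0, 0.0, 2.0]
vec_b = [2.0, 0.0, 4.0]
print(cosine_similarity(vec_a, vec_b))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, which is why the measure is insensitive to vector length.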
Affiliation(s)
- Ahmad Pesaranghader
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada, Institute for Big Data Analytics, Halifax, NS B3H 4R2, Canada
| | - Stan Matwin
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada, Institute for Big Data Analytics, Halifax, NS B3H 4R2, Canada, Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
| | - Marina Sokolova
- Institute for Big Data Analytics, Halifax, NS B3H 4R2, Canada, Faculty of Medicine and Faculty of Engineering, University of Ottawa, Ottawa, ON K1H 8M5, Canada
| | - Robert G Beiko
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada
5
Lu S, Lu KN, Cheng SY, Hu B, Ma X, Nystrom N, Lu X. Identifying Driver Genomic Alterations in Cancers by Searching Minimum-Weight, Mutually Exclusive Sets. PLoS Comput Biol 2015; 11:e1004257. [PMID: 26317392 PMCID: PMC4552843 DOI: 10.1371/journal.pcbi.1004257] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 03/05/2014] [Accepted: 03/24/2015] [Indexed: 02/07/2023]
Abstract
An important goal of cancer genomic research is to identify the driving pathways underlying disease mechanisms and the heterogeneity of cancers. It is well known that somatic genome alterations (SGAs) affecting the genes that encode the proteins within a common signaling pathway exhibit mutual exclusivity, in which these SGAs usually do not co-occur in a tumor. With some success, this characteristic has been utilized as an objective function to guide the search for driver mutations within a pathway. However, mutual exclusivity alone is not sufficient to indicate that genes affected by such SGAs are in common pathways. Here, we propose a novel, signal-oriented framework for identifying driver SGAs. First, we identify the perturbed cellular signals by mining the gene expression data. Next, we search for a set of SGA events that carries strong information with respect to such perturbed signals while exhibiting mutual exclusivity. Finally, we design and implement an efficient exact algorithm to solve an NP-hard problem encountered in our approach. We apply this framework to the ovarian and glioblastoma tumor data available at the TCGA database, and perform systematic evaluations. Our results indicate that the signal-oriented approach enhances the ability to find informative sets of driver SGAs that likely constitute signaling pathways. An important goal of studying cancer genomics is to identify critical pathways that, when perturbed by somatic genomic alterations (SGAs) such as somatic mutations, copy number alterations and epigenomic alterations, cause cancers and underlie different clinical phenotypes. In this study, we present a framework for discovering perturbed signaling pathways in cancers by integrating genome alteration data and transcriptomic data from the Cancer Genome Atlas (TCGA) project. Since gene expression in a cell is regulated by cellular signaling systems, we used transcriptomic changes to reveal perturbed cellular signals in each tumor. 
We then combined the genomic alteration data to search for SGA events across multiple tumors that affected a common signal, thus identifying the candidate members of cancer pathways. Our results demonstrate the advantage of the signal-oriented pathway approach over previous methods.
Affiliation(s)
- Songjian Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- * E-mail:
| | - Kevin N. Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Shi-Yuan Cheng
- Department of Neurology, Northwestern Brain Tumor Institute, Center for Genetic Medicine, The Robert H. Lurie Comprehensive Cancer Center, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
| | - Bo Hu
- Department of Neurology, Northwestern Brain Tumor Institute, Center for Genetic Medicine, The Robert H. Lurie Comprehensive Cancer Center, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
| | - Xiaojun Ma
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Nicholas Nystrom
- Pittsburgh Supercomputing Center, Pittsburgh, Pennsylvania, United States of America
| | - Xinghua Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
6
Bettembourg C, Diot C, Dameron O. Optimal Threshold Determination for Interpreting Semantic Similarity and Particularity: Application to the Comparison of Gene Sets and Metabolic Pathways Using GO and ChEBI. PLoS One 2015; 10:e0133579. [PMID: 26230274 PMCID: PMC4521860 DOI: 10.1371/journal.pone.0133579] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 03/21/2014] [Accepted: 06/30/2015] [Indexed: 11/18/2022]
Abstract
BACKGROUND The analysis of gene annotations referencing back to Gene Ontology plays an important role in the interpretation of high-throughput experiments results. This analysis typically involves semantic similarity and particularity measures that quantify the importance of the Gene Ontology annotations. However, there is currently no sound method supporting the interpretation of the similarity and particularity values in order to determine whether two genes are similar or whether one gene has some significant particular function. Interpretation is frequently based either on an implicit threshold, or an arbitrary one (typically 0.5). Here we investigate a method for determining thresholds supporting the interpretation of the results of a semantic comparison. RESULTS We propose a method for determining the optimal similarity threshold by minimizing the proportions of false-positive and false-negative similarity matches. We compared the distributions of the similarity values of pairs of similar genes and pairs of non-similar genes. These comparisons were performed separately for all three branches of the Gene Ontology. In all situations, we found overlap between the similar and the non-similar distributions, indicating that some similar genes had a similarity value lower than the similarity value of some non-similar genes. We then extend this method to the semantic particularity measure and to a similarity measure applied to the ChEBI ontology. Thresholds were evaluated over the whole HomoloGene database. For each group of homologous genes, we computed all the similarity and particularity values between pairs of genes. Finally, we focused on the PPAR multigene family to show that the similarity and particularity patterns obtained with our thresholds were better at discriminating orthologs and paralogs than those obtained using default thresholds. CONCLUSION We developed a method for determining optimal semantic similarity and particularity thresholds. 
We applied this method on the GO and ChEBI ontologies. Qualitative analysis using the thresholds on the PPAR multigene family yielded biologically-relevant patterns.
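The threshold search described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the objective is taken to be the plain sum of the false-negative and false-positive proportions, candidate thresholds are the observed scores, and the function name and toy scores are hypothetical. The paper's actual procedure may differ.

```python
def optimal_threshold(similar_scores, non_similar_scores):
    """Return (threshold, error) minimizing FN proportion + FP proportion.

    A pair is called "similar" when its score is >= threshold.
    Assumption: both error proportions are weighted equally.
    """
    candidates = sorted(set(similar_scores) | set(non_similar_scores))
    best_t, best_err = None, float("inf")
    for t in candidates:
        # Similar pairs scored below the threshold are false negatives.
        fn = sum(s < t for s in similar_scores) / len(similar_scores)
        # Non-similar pairs scored at or above the threshold are false positives.
        fp = sum(s >= t for s in non_similar_scores) / len(non_similar_scores)
        if fn + fp < best_err:
            best_t, best_err = t, fn + fp
    return best_t, best_err


# Hypothetical overlapping score distributions (not data from the paper).
similar = [0.9, 0.8, 0.7, 0.4]
non_similar = [0.1, 0.2, 0.3, 0.5]
print(optimal_threshold(similar, non_similar))
```

Because the two toy distributions overlap, no threshold reaches zero error, which mirrors the overlap the abstract reports between similar and non-similar gene pairs.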
Affiliation(s)
- Charles Bettembourg
- Université de Rennes 1, Rennes, France
- INRA, UMR1348 PEGASE, Saint-Gilles, France
- Agrocampus OUEST, UMR1348 PEGASE, Rennes, France
- IRISA, Campus de Beaulieu, Rennes, France
- INRIA, Rennes, France
- * E-mail:
| | - Christian Diot
- INRA, UMR1348 PEGASE, Saint-Gilles, France
- Agrocampus OUEST, UMR1348 PEGASE, Rennes, France
| | - Olivier Dameron
- Université de Rennes 1, Rennes, France
- IRISA, Campus de Beaulieu, Rennes, France
- INRIA, Rennes, France
7
Fan M, Low HS, Wenk MR, Wong L. A semi-automated methodology for finding lipid-related GO terms. Database: The Journal of Biological Databases and Curation 2014; 2014:bau089. [PMID: 25209026 PMCID: PMC4160098 DOI: 10.1093/database/bau089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Motivation: Although semantic similarity in Gene Ontology (GO) and other approaches may be used to find similar GO terms, there is not yet a method to systematically find a class of GO terms sharing a common property with high accuracy (e.g. involving human curation). Results: We have developed a methodology to address this issue and applied it to identify lipid-related GO terms, owing to the important and varied roles of lipids in many biological processes. Our methodology finds lipid-related GO terms in a semi-automated manner, requiring only moderate manual curation. We first obtain a list of lipid-related gold-standard GO terms by keyword search and manual curation. Then, based on the hypothesis that co-annotated GO terms share similar properties, we develop a machine learning method that expands the list of lipid-related terms from the gold standard. Those terms predicted most likely to be lipid related are examined by a human curator following specific curation rules to confirm the class labels. The structure of GO is also exploited to help reduce the curation effort. The prediction and curation cycle is repeated until no further lipid-related term is found. Our approach has covered a high proportion, if not all, of lipid-related terms with relatively high efficiency. Database URL: http://compbio.ddns.comp.nus.edu.sg/∼lipidgo
Affiliation(s)
- Mengyuan Fan
- Department of Computer Science, National University of Singapore, Singapore 117417, NUS Graduate School of Integrative Science and Engineering, National University of Singapore, Singapore 117456, National Research Foundation, Singapore 138602, Department of Biochemistry, National University of Singapore, Singapore 117599 and Department of Pathology, National University of Singapore, Singapore 119074
| | - Hong Sang Low
- Department of Computer Science, National University of Singapore, Singapore 117417, NUS Graduate School of Integrative Science and Engineering, National University of Singapore, Singapore 117456, National Research Foundation, Singapore 138602, Department of Biochemistry, National University of Singapore, Singapore 117599 and Department of Pathology, National University of Singapore, Singapore 119074
| | - Markus R Wenk
- Department of Computer Science, National University of Singapore, Singapore 117417, NUS Graduate School of Integrative Science and Engineering, National University of Singapore, Singapore 117456, National Research Foundation, Singapore 138602, Department of Biochemistry, National University of Singapore, Singapore 117599 and Department of Pathology, National University of Singapore, Singapore 119074
| | - Limsoon Wong
- Department of Computer Science, National University of Singapore, Singapore 117417, NUS Graduate School of Integrative Science and Engineering, National University of Singapore, Singapore 117456, National Research Foundation, Singapore 138602, Department of Biochemistry, National University of Singapore, Singapore 117599 and Department of Pathology, National University of Singapore, Singapore 119074
8
Semantic particularity measure for functional characterization of gene sets using gene ontology. PLoS One 2014; 9:e86525. [PMID: 24489737 PMCID: PMC3904913 DOI: 10.1371/journal.pone.0086525] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 07/24/2013] [Accepted: 12/11/2013] [Indexed: 11/19/2022]
Abstract
BACKGROUND Genetic and genomic data analyses are outputting large sets of genes. Functional comparison of these gene sets is a key part of the analysis, as it identifies their shared functions, and the functions that distinguish each set. The Gene Ontology (GO) initiative provides a unified reference for analyzing genes' molecular functions, biological processes and cellular components. Numerous semantic similarity measures have been developed to systematically quantify the weight of the GO terms shared by two genes. We studied how gene set comparisons can be improved by considering gene set particularity in addition to gene set similarity. RESULTS We propose a new approach to compute gene set particularities based on the information conveyed by GO terms. A GO term's informativeness can be computed using either its information content based on the term frequency in a corpus, or a function of the term's distance to the root. We defined the semantic particularity of a set of GO terms Sg1 compared to another set of GO terms Sg2. We combined our particularity measure with a similarity measure to compare gene sets. We demonstrated that the combination of semantic similarity and semantic particularity measures was able to identify genes with particular functions from among similar genes. This differentiation was not recognized using only a semantic similarity measure. CONCLUSION Semantic particularity should be used in conjunction with semantic similarity to perform functional analysis of GO-annotated gene sets. The principle is generalizable to other ontologies.
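The corpus-based informativeness mentioned in the abstract above is commonly computed as the negative logarithm of a term's annotation frequency. The Python sketch below assumes that common formulation, assumes the counts already include annotations to descendant terms so that the root has the largest count, and uses hypothetical term names. It does not implement the paper's particularity measure itself.

```python
import math


def information_content(annotation_counts, root):
    """Map each term to IC(t) = -log(count(t) / count(root)).

    Assumption: counts are cumulative (a term's count includes its
    descendants), so the root has probability 1 and IC 0.
    """
    total = annotation_counts[root]
    return {
        term: -math.log(count / total)
        for term, count in annotation_counts.items()
        if count > 0  # unannotated terms have undefined (infinite) IC
    }


# Hypothetical cumulative annotation counts (not real GO statistics).
counts = {"root": 100, "general_term": 50, "specific_term": 10, "unused_term": 0}
ic = information_content(counts, "root")
for term, value in ic.items():
    print(term, round(value, 3))
```

Rarely used, specific terms receive a higher score than frequent, general ones, which is the property the abstract relies on when weighting GO terms.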
9
Abstract
BACKGROUND The Gene Ontology (GO) is an ontology representing molecular biology concepts related to genes and their products. Current annotations from the GO Consortium tend to be highly specific, and contemporary genome-scale studies often return a long list of genes of potential interest, such as genes in a cancer tumor that are differentially expressed relative to normal tissue. It is therefore a challenging task to reveal, at a conceptual level, the major functional themes in which genes are involved. Presently, there is a need for tools capable of revealing such themes through mining and representing semantic information in an objective and quantitative manner. METHODS In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We cast the task as follows: given a list of genes, identify non-disjoint, functionally coherent subsets, such that the functions of the genes in a subset are summarized by an informative GO term that accurately captures the semantic information of the original annotations. RESULTS We evaluated different metrics for assessing information loss when merging GO terms, and different statistical schemes to assess the functional coherence of a set of genes. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph. CONCLUSIONS Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.
10
Measuring the evolution of ontology complexity: the gene ontology case study. PLoS One 2013; 8:e75993. [PMID: 24146805 PMCID: PMC3795689 DOI: 10.1371/journal.pone.0075993] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 06/27/2013] [Accepted: 08/20/2013] [Indexed: 01/09/2023]
Abstract
Ontologies support automatic sharing, combination and analysis of life sciences data. They undergo regular curation and enrichment. We studied the impact of an ontology's evolution on its structural complexity. As a case study we used the sixty monthly releases between January 2008 and December 2012 of the Gene Ontology and its three independent branches, i.e. biological processes (BP), cellular components (CC) and molecular functions (MF). For each case, we measured complexity by computing metrics related to the size, the node connectivity and the hierarchical structure. The number of classes and relations increased monotonically for each branch, with different growth rates. BP and CC had similar connectivity, superior to that of MF. Connectivity increased monotonically for BP, decreased for CC and remained stable for MF, with a marked increase for the three branches in November and December 2012. Hierarchy-related measures showed that CC and MF had similar proportions of leaves, average depths and average heights. BP had a lower proportion of leaves, and a higher average depth and average height. For BP and MF, the late 2012 increase of connectivity resulted in an increase of the average depth and average height and a decrease of the proportion of leaves, indicating that a major enrichment effort of the intermediate-level hierarchy occurred. The variation of the number of classes and relations in an ontology does not provide enough information about the evolution of its complexity. However, connectivity and hierarchy-related metrics revealed different patterns of values as well as of evolution for the three branches of the Gene Ontology. CC was similar to BP in terms of connectivity, and similar to MF in terms of hierarchy. Overall, BP complexity increased, CC was refined with the addition of leaves providing a finer level of annotations but decreasing slightly its complexity, and MF complexity remained stable.
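Two of the hierarchy-related metrics named in the abstract above, the proportion of leaves and the average depth, can be illustrated on a toy directed acyclic graph. The Python sketch below is an assumption-laden illustration: depth is taken as the shortest distance from the root, the graph is a hypothetical parent-to-children mapping, and the paper's exact metric definitions may differ.

```python
from collections import deque


def leaf_proportion(children):
    """Fraction of nodes that have no children."""
    nodes = set(children) | {c for kids in children.values() for c in kids}
    leaves = [n for n in nodes if not children.get(n)]
    return len(leaves) / len(nodes)


def average_depth(children, root):
    """Mean shortest-path depth of all nodes reachable from the root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:  # first visit gives the shortest depth
                depth[child] = depth[node] + 1
                queue.append(child)
    return sum(depth.values()) / len(depth)


# Hypothetical toy ontology: "c" has two parents, as GO terms may.
toy = {"root": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
print(leaf_proportion(toy), average_depth(toy, "root"))
```

Computing such metrics on each release in turn would give the kind of time series the abstract compares across the BP, CC and MF branches.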
11
Lu S, Jin B, Cowart LA, Lu X. From data towards knowledge: revealing the architecture of signaling systems by unifying knowledge mining and data mining of systematic perturbation data. PLoS One 2013; 8:e61134. [PMID: 23637789 PMCID: PMC3634064 DOI: 10.1371/journal.pone.0061134] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 10/06/2012] [Accepted: 03/05/2013] [Indexed: 11/18/2022]
Abstract
Genetic and pharmacological perturbation experiments, such as deleting a gene and monitoring gene expression responses, are powerful tools for studying cellular signal transduction pathways. However, it remains a challenge to automatically derive knowledge of a cellular signaling system at a conceptual level from systematic perturbation-response data. In this study, we explored a framework that unifies knowledge mining and data mining towards the goal. The framework consists of the following automated processes: 1) applying an ontology-driven knowledge mining approach to identify functional modules among the genes responding to a perturbation in order to reveal potential signals affected by the perturbation; 2) applying a graph-based data mining approach to search for perturbations that affect a common signal; and 3) revealing the architecture of a signaling system by organizing signaling units into a hierarchy based on their relationships. Applying this framework to a compendium of yeast perturbation-response data, we have successfully recovered many well-known signal transduction pathways; in addition, our analysis has led to many new hypotheses regarding the yeast signal transduction system; finally, our analysis automatically organized perturbed genes as a graph reflecting the architecture of the yeast signaling system. Importantly, this framework transformed molecular findings from a gene level to a conceptual level, which can be readily translated into computable knowledge in the form of rules regarding the yeast signaling system, such as "if genes involved in the MAPK signaling are perturbed, genes involved in pheromone responses will be differentially expressed."
Affiliation(s)
- Songjian Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Bo Jin
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- L. Ashley Cowart
- Department of Biochemistry and Molecular Biology, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Xinghua Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
|
12
|
Škunca N, Altenhoff A, Dessimoz C. Quality of computationally inferred gene ontology annotations. PLoS Comput Biol 2012; 8:e1002533. [PMID: 22693439 PMCID: PMC3364937 DOI: 10.1371/journal.pcbi.1002533] [Citation(s) in RCA: 97] [Impact Index Per Article: 7.5] [Received: 10/17/2011] [Accepted: 04/01/2012] [Indexed: 01/10/2023] Open
Abstract
Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon, an important outcome given that >98% of all annotations are inferred without direct curation.

In the UniProt Gene Ontology Annotation database, the largest repository of functional annotations, over 98% of all function annotations are inferred in silico, without curator oversight. Yet these “electronic GO annotations” are generally perceived as unreliable; they are disregarded in many studies. In this article, we introduce novel methodology to systematically evaluate the quality of electronic annotations. We then provide the first comprehensive assessment of the reliability of electronic GO annotations. Overall, we found that electronic annotations are more reliable than generally believed, to an extent that they are competitive with annotations inferred by curators when they use evidence other than experiments from primary literature. But we also report significant variations among inference methods, types of annotations, and organisms. This work provides guidance for Gene Ontology users and lays the foundations for improving computational approaches to GO function inference.
Affiliation(s)
- Nives Škunca
- Ruđer Bošković Institute, Division of Electronics, Zagreb, Croatia
- ETH Zurich, Computer Science, Zurich, Switzerland
- Adrian Altenhoff
- ETH Zurich, Computer Science, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- Christophe Dessimoz
- ETH Zurich, Computer Science, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- EMBL-European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
|
13
|
Lu S, Lu X. Integrating genome and functional genomics data to reveal perturbed signaling pathways in ovarian cancers. AMIA Jt Summits Transl Sci Proc 2012; 2012:72-78. [PMID: 22779056 PMCID: PMC3392049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Cancers are genetic diseases, driven by somatic mutations that perturb cellular signaling systems. In this study, we aim to reveal the signal transduction pathways that are perturbed by mutations in ovarian cancer. Our approach searches for genetic mutations that lead to a common cellular response, e.g., differential expression of a set of functionally related genes. To this end, we first developed a knowledge mining approach to identify functional expression modules; we then developed a graph-based data mining approach to identify mutations that are highly related to the functional modules, as a means to reconstitute signal pathways. Our results indicate that unification of knowledge mining with data mining significantly enhances identification of potential signaling pathways in ovarian cancers.
Affiliation(s)
- Songjian Lu
- Dept. Biomedical Informatics, Univ. Pittsburgh, PA 15232
|