1
|
Abstract
Gene Ontology-based semantic similarity (SS) allows the comparison of GO terms, or of entities annotated with GO terms, by leveraging the ontology structure and properties and annotation corpora. In the last decade the number and diversity of SS measures based on GO has grown considerably, and their applications range from functional coherence evaluation to protein interaction prediction and disease gene prioritization. Understanding how SS measures work, what issues can affect their performance, and how they compare to each other in different evaluation settings is crucial to gain a comprehensive view of this area and to choose the most appropriate approaches for a given application. In this chapter, we provide a guide to understanding and selecting SS measures for biomedical researchers. We present a straightforward categorization of SS measures and describe the main strategies they employ. We discuss the intrinsic and external issues that affect their performance, and how these can be addressed. We summarize comparative assessment studies, highlighting the top measures in different settings, and compare different implementation strategies and their use. Finally, we discuss some of the extant challenges and opportunities, namely the increased semantic complexity of GO and the need for fast and efficient computation, pointing the way towards the future generation of SS measures.
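As a rough illustration of how an SS measure can combine the ontology structure with an annotation corpus, the sketch below computes Resnik similarity (the information content of the most informative common ancestor) on a toy ontology. The term names, the `PARENTS` graph, and the `ANNOTATIONS` corpus are hypothetical and are not taken from GO or from the chapter; this is a minimal sketch of one well-known node-based measure, not the chapter's own implementation.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents (is_a edges only).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "ion_binding": {"binding"},
    "metal_binding": {"ion_binding"},
    "kinase": {"catalysis"},
}

# Hypothetical annotation corpus: gene -> directly annotated terms.
ANNOTATIONS = {
    "g1": {"metal_binding"},
    "g2": {"ion_binding"},
    "g3": {"kinase"},
    "g4": {"metal_binding", "kinase"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t), with p(t) the fraction of genes annotated to t
    directly or through a descendant. Assumes every term has at least one
    annotated gene, which holds for this toy corpus."""
    annotated = sum(
        1
        for terms in ANNOTATIONS.values()
        if any(term in ancestors(t) for t in terms)
    )
    return -math.log(annotated / len(ANNOTATIONS))


def resnik(term_a, term_b):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)


if __name__ == "__main__":
    print(resnik("metal_binding", "ion_binding"))
    print(resnik("metal_binding", "kinase"))
```

In this toy corpus, two terms that share only the root score zero, while terms under the same branch score according to how rarely their shared ancestor is annotated.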
Affiliation(s)
- Catia Pesquita
- LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Edifício C6, Piso 3, Campo Grande, 1749-016, Lisbon, Portugal.
|
2
|
Bastos HP, Sousa L, Clarke LA, Couto FM. Functional coherence metrics in protein families. J Biomed Semantics 2016; 7:41. [PMID: 27338101 PMCID: PMC4917928 DOI: 10.1186/s13326-016-0076-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2015] [Accepted: 05/17/2016] [Indexed: 12/03/2022]
Abstract
Background Biological sequences, such as proteins, have been provided with annotations that assign functional information. These functional annotations are associations of proteins (or other biological sequences) with descriptors characterizing their biological roles. However, not all proteins are fully (or even at all) annotated. This annotation incompleteness limits our ability to make sound assertions about the functional coherence within sets of proteins. Annotation incompleteness is problematic when measuring the semantic functional similarity of biological sequences, since the annotations can only capture a limited amount of all the semantic aspects the sequences may encompass. Methods Instead of relying solely on single (reductive) metrics, this work proposes a comprehensive approach for assessing functional coherence within protein sets. The approach entails using visualization and term enrichment techniques anchored in specific domain knowledge, such as a protein family. For that purpose we evaluate two novel functional coherence metrics, mUI and mGIC, that combine aspects of semantic similarity measures and term enrichment. Results These metrics were used to effectively capture and measure the local similarity cores within protein sets. Hence, these metrics coupled with visualization tools allow an improved grasp on three important functional annotation aspects: completeness, agreement and coherence. Conclusions Measuring the functional similarity between proteins based on their annotations is a non-trivial task. Several metrics exist, but due both to characteristics intrinsic to the nature of graphs and to extrinsic factors related to the process of annotation, each measure can only capture certain functional annotation aspects of proteins. Hence, when trying to measure the functional coherence of a set of proteins, a single metric is too reductive. Therefore, it is valuable to be aware of how each employed similarity metric works and what similarity aspects it can best capture. Here we test the behaviour and resilience of some similarity metrics. Electronic supplementary material The online version of this article (doi:10.1186/s13326-016-0076-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Hugo P Bastos
- LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
- Lisete Sousa
- CEAUL, Departamento de Estatística e Investigação Operacional, Faculdade de Ciências, Universidade de Lisboa, Lisboa, 1749-016, Portugal
- Luka A Clarke
- BioISI - Biosystems & Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, Lisboa, 1749-016, Portugal
- Francisco M Couto
- LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal.
|
3
|
Tsoi LC, Elder JT, Abecasis GR. Graphical algorithm for integration of genetic and biological data: proof of principle using psoriasis as a model. Bioinformatics 2014; 31:1243-9. [PMID: 25480373 DOI: 10.1093/bioinformatics/btu799] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 08/25/2014] [Accepted: 11/26/2014] [Indexed: 01/17/2023]
Abstract
MOTIVATION Pathway analysis to reveal biological mechanisms for results from genetic association studies has great potential to better understand complex traits with major human disease impact. However, current approaches have not been optimized to maximize statistical power to identify enriched functions/pathways, especially when the genetic data derive from studies using platforms (e.g. Immunochip and Metabochip) customized to have pre-selected markers from previously identified top-rank loci. We present here a novel approach, called Minimum distance-based Enrichment Analysis for Genetic Association (MEAGA), with the potential to address both of these important concerns. RESULTS MEAGA performs enrichment analysis using graphical algorithms to identify sub-graphs among genes and measure their closeness in an interaction database. It also incorporates a statistic summarizing the numbers and total distances of the sub-graphs, depicting the overlap between observed genetic signals and defined function/pathway gene-sets. MEAGA uses a sampling technique to approximate empirical and multiple testing-corrected P-values. We show in simulation studies that MEAGA is more powerful than count-based strategies in identifying disease-associated functions/pathways, and that the increase in power is influenced by the shortest distances among associated genes in the interactome. We applied MEAGA to the results of a meta-analysis of psoriasis using Immunochip datasets, and showed that associated genes are significantly enriched in immune-related functions and closer to each other in the protein-protein interaction network. AVAILABILITY AND IMPLEMENTATION http://genome.sph.umich.edu/wiki/MEAGA CONTACT: tsoi.teen@gmail.com or goncalo@umich.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
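The notion of genes being "closer" in an interaction network can be made concrete with shortest-path distances. The sketch below computes the mean pairwise shortest-path distance among a gene set in an unweighted, undirected graph using breadth-first search. The edge list, gene names, and the choice of a simple mean are illustrative assumptions; this is not the MEAGA statistic, whose exact definition the abstract does not give.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected interaction network (gene names are made up).
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "F")]


def build_graph(edges):
    """Build an adjacency map from an undirected edge list."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_distance(graph, source, target):
    """Number of edges on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def mean_pairwise_distance(graph, genes):
    """Mean shortest-path distance over all reachable gene pairs."""
    distances = [
        shortest_distance(graph, a, b)
        for a, b in combinations(sorted(genes), 2)
    ]
    reachable = [d for d in distances if d is not None]
    return sum(reachable) / len(reachable) if reachable else None


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(mean_pairwise_distance(graph, {"A", "B", "C"}))
    print(mean_pairwise_distance(graph, {"A", "E"}))
```

A smaller mean distance for an observed gene set than for randomly sampled sets of the same size would be one simple way to express the closeness the abstract refers to.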
Affiliation(s)
- Lam C Tsoi
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, USA, Department of Dermatology, University of Michigan, Ann Arbor, MI, USA, and Ann Arbor Veterans Affairs Hospital, Ann Arbor, MI, USA
- James T Elder
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, USA, Department of Dermatology, University of Michigan, Ann Arbor, MI, USA, and Ann Arbor Veterans Affairs Hospital, Ann Arbor, MI, USA
- Goncalo R Abecasis
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, USA, Department of Dermatology, University of Michigan, Ann Arbor, MI, USA, and Ann Arbor Veterans Affairs Hospital, Ann Arbor, MI, USA
|
4
|
Fu LM, Fu KA. Analysis of Parkinson's disease pathophysiology using an integrated genomics-bioinformatics approach. Pathophysiology 2014; 22:15-29. [PMID: 25466606 DOI: 10.1016/j.pathophys.2014.10.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 04/29/2014] [Revised: 10/11/2014] [Accepted: 10/20/2014] [Indexed: 11/28/2022]
Abstract
The pathogenesis and pathophysiology of a disease determine how it should be diagnosed and treated. Yet, understanding the cause and mechanisms of progression often requires intensive human efforts, especially for diseases with complex etiology. The latest genomic technology coupled with advanced, large-scale data analysis in the field known as bioinformatics has promised a high-throughput approach that can quickly identify disease-affected genes and pathways by examining tissue samples collected from patients and control subjects. Furthermore, significant biological themes indicative of genomic events can be recognized on the basis of affected genes. However, given identified biological themes, it is not clear how to organize genomic events to arrive at a coherent pathophysiological explanation about the disease. To address this important issue, we have developed an innovative method named "Expression Data Up-Stream Analysis" (EDUSA) that can perform a bioinformatics analysis to identify and rank upstream processes effectively. We applied it to Parkinson's disease (PD) using a genomic data set available at a public data repository known as Gene Expression Omnibus (GEO). In this study, disease-affected genes were identified using GEO2R software, and disease-pertinent processes were identified using EASE software. Then the EDUSA program was used to determine the upstream versus downstream hierarchy of the processes. The results confirmed the current misfolded protein theory about the pathogenesis of PD, and provided new insights as well. Particularly, our program discovered that RNA (ribonucleic acid) metabolism pathology was a potential cause of PD, which in fact, is an emerging theory of neurodegenerative disorders. In addition, it was found that the dysfunction of the transport system seemed to occur in the early phase of neurodegeneration, whereas mitochondrial dysfunction appeared at a later stage. 
Using this methodology, we have demonstrated how to determine the stages of disease development with single-point data collection.
Affiliation(s)
- Li M Fu
- Biomedical Engineering Department, AHMC Healthcare, Los Angeles, CA, USA.
- Katherine A Fu
- Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
|
5
|
Abstract
BACKGROUND The Gene Ontology (GO) is an ontology representing molecular biology concepts related to genes and their products. Current annotations from the GO Consortium tend to be highly specific, and contemporary genome-scale studies often return a long list of genes of potential interest, such as genes in a tumor that are expressed differently from those found in normal tissue. It is therefore a challenging task to reveal, at a conceptual level, the major functional themes in which genes are involved. Presently, there is a need for tools capable of revealing such themes through mining and representing semantic information in an objective and quantitative manner. METHODS In this study, we utilized the hierarchical organization of the GO to derive a more abstract representation of the major biological processes of a list of genes based on their annotations. We cast the task as follows: given a list of genes, identify non-disjoint, functionally coherent subsets, such that the functions of the genes in a subset are summarized by an informative GO term that accurately captures the semantic information of the original annotations. RESULTS We evaluated different metrics for assessing information loss when merging GO terms, and different statistical schemes to assess the functional coherence of a set of genes. We found that the best discriminative power was achieved by using a combination of the information-content-based measure as the information-loss metric, and the graph-based statistics derived from a Steiner tree connecting genes in an augmented GO graph. CONCLUSIONS Our methods provide an objective and quantitative approach to capturing the major directions of gene functions in a context-specific fashion.
|
6
|
Valsesia A, Macé A, Jacquemont S, Beckmann JS, Kutalik Z. The Growing Importance of CNVs: New Insights for Detection and Clinical Interpretation. Front Genet 2013; 4:92. [PMID: 23750167 PMCID: PMC3667386 DOI: 10.3389/fgene.2013.00092] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Received: 02/10/2013] [Accepted: 05/04/2013] [Indexed: 02/03/2023]
Abstract
Differences between genomes can be due to single nucleotide variants, translocations, inversions, and copy number variants (CNVs, gain or loss of DNA). The latter can range from sub-microscopic events to complete chromosomal aneuploidies. Small CNVs are often benign but those larger than 500 kb are strongly associated with morbid consequences such as developmental disorders and cancer. Detecting CNVs within and between populations is essential to better understand the plasticity of our genome and to elucidate its possible contribution to disease. Hence there is a need for better-tailored and more robust tools for the detection and genome-wide analyses of CNVs. While a link between a given CNV and a disease may have often been established, the relative CNV contribution to disease progression and impact on drug response is not necessarily understood. In this review we discuss the progress, challenges, and limitations that occur at different stages of CNV analysis from the detection (using DNA microarrays and next-generation sequencing) and identification of recurrent CNVs to the association with phenotypes. We emphasize the importance of germline CNVs and propose strategies to aid clinicians to better interpret structural variations and assess their clinical implications.
Affiliation(s)
- Armand Valsesia
- Genetics Core, Nestlé Institute of Health Sciences Lausanne, Switzerland
|
7
|
Lu S, Jin B, Cowart LA, Lu X. From data towards knowledge: revealing the architecture of signaling systems by unifying knowledge mining and data mining of systematic perturbation data. PLoS One 2013; 8:e61134. [PMID: 23637789 PMCID: PMC3634064 DOI: 10.1371/journal.pone.0061134] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 10/06/2012] [Accepted: 03/05/2013] [Indexed: 11/18/2022]
Abstract
Genetic and pharmacological perturbation experiments, such as deleting a gene and monitoring gene expression responses, are powerful tools for studying cellular signal transduction pathways. However, it remains a challenge to automatically derive knowledge of a cellular signaling system at a conceptual level from systematic perturbation-response data. In this study, we explored a framework that unifies knowledge mining and data mining towards the goal. The framework consists of the following automated processes: 1) applying an ontology-driven knowledge mining approach to identify functional modules among the genes responding to a perturbation in order to reveal potential signals affected by the perturbation; 2) applying a graph-based data mining approach to search for perturbations that affect a common signal; and 3) revealing the architecture of a signaling system by organizing signaling units into a hierarchy based on their relationships. Applying this framework to a compendium of yeast perturbation-response data, we have successfully recovered many well-known signal transduction pathways; in addition, our analysis has led to many new hypotheses regarding the yeast signal transduction system; finally, our analysis automatically organized perturbed genes as a graph reflecting the architecture of the yeast signaling system. Importantly, this framework transformed molecular findings from a gene level to a conceptual level, which can be readily translated into computable knowledge in the form of rules regarding the yeast signaling system, such as "if genes involved in the MAPK signaling are perturbed, genes involved in pheromone responses will be differentially expressed."
Affiliation(s)
- Songjian Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Bo Jin
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- L. Ashley Cowart
- Department of Biochemistry and Molecular Biology, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Xinghua Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
|
8
|
Revealing functionally coherent subsets using a spectral clustering and an information integration approach. BMC Systems Biology 2012; 6 Suppl 3:S7. [PMID: 23282411 PMCID: PMC3542577 DOI: 10.1186/1752-0509-6-s3-s7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/19/2022]
Abstract
Background Contemporary high-throughput analyses often produce lengthy lists of genes or proteins. It is desirable to divide the genes into functionally coherent subsets for further investigation, by integrating heterogeneous information regarding the genes. Here we report a principled approach for managing and integrating multiple data sources within the framework of graph-spectrum analysis in order to identify coherent gene subsets. Results We investigated several approaches to integrate information derived from different sources that reflect distinct aspects of gene functional relationships including: functional annotations of genes in the form of the Gene Ontology, co-mentioning of genes in the literature, and shared transcription factor binding sites among genes. Given a list of genes, we construct a graph containing the genes in each information space; then the graphs were kernel transformed so they could be integrated; finally functionally coherent subsets were identified using a spectral clustering algorithm. In a series of simulation experiments, known functionally coherent gene sets were mixed and recovered using our approach. Conclusions The results indicate that spectral clustering approaches are capable of recovering coherent gene modules even under noisy conditions, and that information integration serves to further enhance this capability. When applied to a real-world data set, our methods revealed biologically sensible modules, and highlighted the importance of information integration. The implementation of the statistical model is provided under the GNU general public license, as an installable Python module, at: http://code.google.com/p/spectralmix.
|
9
|
Xu L, Cheng C, George EO, Homayouni R. Literature aided determination of data quality and statistical significance threshold for gene expression studies. BMC Genomics 2012; 13 Suppl 8:S23. [PMID: 23282414 PMCID: PMC3535704 DOI: 10.1186/1471-2164-13-s8-s23] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/11/2022]
Abstract
Background Gene expression data are noisy due to technical and biological variability. Consequently, analysis of gene expression data is complex. Different statistical methods produce distinct sets of genes. In addition, selection of expression p-value (EPv) threshold is somewhat arbitrary. In this study, we aimed to develop novel literature based approaches to integrate functional information in analysis of gene expression data. Methods Functional relationships between genes were derived by Latent Semantic Indexing (LSI) of Medline abstracts and used to calculate the function cohesion of gene sets. In this study, literature cohesion was applied in two ways. First, Literature-Based Functional Significance (LBFS) method was developed to calculate a p-value for the cohesion of differentially expressed genes (DEGs) in order to objectively evaluate the overall biological significance of the gene expression experiments. Second, Literature Aided Statistical Significance Threshold (LASST) was developed to determine the appropriate expression p-value threshold for a given experiment. Results We tested our methods on three different publicly available datasets. LBFS analysis demonstrated that only two experiments were significantly cohesive. For each experiment, we also compared the LBFS values of DEGs generated by four different statistical methods. We found that some statistical tests produced more functionally cohesive gene sets than others. However, no statistical test was consistently better for all experiments. This reemphasizes that a statistical test must be carefully selected for each expression study. Moreover, LASST analysis demonstrated that the expression p-value thresholds for some experiments were considerably lower (p < 0.02 and 0.01), suggesting that the arbitrary p-values and false discovery rate thresholds that are commonly used in expression studies may not be biologically sound. 
Conclusions We have developed robust and objective literature-based methods to evaluate the biological support for gene expression experiments and to determine the appropriate statistical significance threshold. These methods will assist investigators to more efficiently extract biologically meaningful insights from high throughput gene expression experiments.
Affiliation(s)
- Lijing Xu
- Bioinformatics Program, Memphis, TN 38152, USA
|
10
|
Acharya L, Judeh T, Duan Z, Rabbat M, Zhu D. GSGS: a computational approach to reconstruct signaling pathway structures from gene sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 9:438-450. [PMID: 22025758 DOI: 10.1109/tcbb.2011.143] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 05/31/2023]
Abstract
Reconstruction of signaling pathway structures is essential to decipher complex regulatory relationships in living cells. Existing approaches often rely on unrealistic biological assumptions and do not explicitly consider signal transduction mechanisms. Signal transduction events refer to linear cascades of reactions from cell surface to nucleus and characterize a signaling pathway. We propose a novel approach, Gene Set Gibbs Sampling, to reverse engineer signaling pathway structures from gene sets related to pathways. We hypothesize that signaling pathways are structurally an ensemble of overlapping linear signal transduction events which we encode as Information Flows (IFs). We infer signaling pathway structures from gene sets, referred to as Information Flow Gene Sets (IFGSs), corresponding to these events. Thus, an IFGS only reflects which genes appear in the underlying IF but not their ordering. GSGS offers a Gibbs sampling procedure to reconstruct the underlying signaling pathway structure by sequentially inferring IFs from the overlapping IFGSs related to the pathway. In the proof-of-concept studies, our approach is shown to outperform existing network inference approaches using data generated from benchmark networks in DREAM. We perform a sensitivity analysis to assess the robustness of our approach. Finally, we implement GSGS to reconstruct signaling mechanisms in breast cancer cells.
|
11
|
Díaz-Díaz N, Aguilar-Ruiz JS. GO-based functional dissimilarity of gene sets. BMC Bioinformatics 2011; 12:360. [PMID: 21884611 PMCID: PMC3248071 DOI: 10.1186/1471-2105-12-360] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 11/26/2010] [Accepted: 09/01/2011] [Indexed: 01/23/2023]
Abstract
BACKGROUND The Gene Ontology (GO) provides a controlled vocabulary for describing the functions of genes and can be used to evaluate the functional coherence of gene sets. Many functional coherence measures consider each pair of gene functions in a set and produce an output based on all pairwise distances. A single gene can encode multiple proteins that may differ in function. For each functionality, other proteins that exhibit the same activity may also participate. Therefore, identifying the most common function for all of the genes involved in a biological process is important in evaluating the functional similarity of groups of genes, and a quantification of functional coherence can help to clarify the role of a group of genes working together. RESULTS To implement this approach to functional assessment, we present GFD (GO-based Functional Dissimilarity), a novel dissimilarity measure for evaluating groups of genes based on the most relevant functions of the whole set. The measure assigns a numerical value to the gene set for each of the three GO sub-ontologies. CONCLUSIONS Results show that GFD performs robustly when applied to gene sets of known functionality (extracted from KEGG). It performs particularly well on randomly generated gene sets. An ROC analysis reveals that the performance of GFD in evaluating the functional dissimilarity of gene sets is very satisfactory. A comparative analysis against other functional measures, such as GS2 and those presented by Resnik and Wang, also demonstrates the robustness of GFD.
|
12
|
Moutselos K, Maglogiannis I, Chatziioannou A. GOrevenge: a novel generic reverse engineering method for the identification of critical molecular players, through the use of ontologies. IEEE Trans Biomed Eng 2011; 58:3522-7. [PMID: 21846603 DOI: 10.1109/tbme.2011.2164794] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Indexed: 11/09/2022]
Abstract
The ever-increasing use of ontologies in modern biological analysis and interpretation facilitates the understanding of the cellular procedures, their hierarchical organization, and their potential interactions at a systems level. Currently, the gene ontology serves as a paradigm, where through the annotation of whole genomes of certain organisms, gene subsets, selected either from high-throughput experiments or for their established pivotal role in the probed disease, can act as a starting point for the exploration of their underlying functional interconnections. This may also aid the elucidation of hidden regulatory mechanisms among genes. Reverse engineering the functional relevance of genes to specific cellular pathways and vice versa, through the exploitation of the inner structure of the ontological vocabularies, may help impart insight regarding the identification and prioritization of the critical role of specific genes. The proposed graph-theoretical method is showcased in a pancreatic cancer and a T-cell acute lymphoblastic leukemia gene set, incorporating edge and Resnik semantic similarity metrics, and systematically evaluated regarding its performance.
Affiliation(s)
- Konstantinos Moutselos
- Department of Computer Science and Biomedical Informatics, University of Central Greece, Lamia 35100, Greece.
|
13
|
Lysenko A, Defoin-Platel M, Hassani-Pak K, Taubert J, Hodgman C, Rawlings CJ, Saqi M. Assessing the functional coherence of modules found in multiple-evidence networks from Arabidopsis. BMC Bioinformatics 2011; 12:203. [PMID: 21612636 PMCID: PMC3118170 DOI: 10.1186/1471-2105-12-203] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 08/20/2010] [Accepted: 05/25/2011] [Indexed: 12/18/2022]
Abstract
Background Combining multiple evidence-types from different information sources has the potential to reveal new relationships in biological systems. The integrated information can be represented as a relationship network, and clustering the network can suggest possible functional modules. The value of such modules for gaining insight into the underlying biological processes depends on their functional coherence. The challenges that we wish to address are to define and quantify the functional coherence of modules in relationship networks, so that they can be used to infer function of as yet unannotated proteins, to discover previously unknown roles of proteins in diseases as well as for better understanding of the regulation and interrelationship between different elements of complex biological systems. Results We have defined the functional coherence of modules with respect to the Gene Ontology (GO) by considering two complementary aspects: (i) the fragmentation of the GO functional categories into the different modules and (ii) the most representative functions of the modules. We have proposed a set of metrics to evaluate these two aspects and demonstrated their utility in Arabidopsis thaliana. We selected 2355 proteins for which experimentally established protein-protein interaction (PPI) data were available. From these we have constructed five relationship networks, four based on single types of data: PPI, co-expression, co-occurrence of protein names in scientific literature abstracts and sequence similarity and a fifth one combining these four evidence types. The ability of these networks to suggest biologically meaningful grouping of proteins was explored by applying Markov clustering and then by measuring the functional coherence of the clusters. Conclusions Relationship networks integrating multiple evidence-types are biologically informative and allow more proteins to be assigned to a putative functional module. 
Using additional evidence types concentrates the functional annotations in a smaller number of modules without unduly compromising their consistency. These results indicate that integration of more data sources improves the ability to uncover functional association between proteins, both by allowing more proteins to be linked and producing a network where modular structure more closely reflects the hierarchy in the gene ontology.
Affiliation(s)
- Artem Lysenko
- Centre for Mathematical and Computational Biology, Rothamsted Research, Harpenden, Herts, AL5 2JQ, UK.
|
14
|
Defoin-Platel M, Hassani-Pak K, Rawlings C. Gaining confidence in cross-species annotation transfer: from simple molecular function to complex phenotypic traits. Aspects of Applied Biology 2011; 107:79-87. [PMID: 22319070 PMCID: PMC3272443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
Cross-species annotation transfer is a widely used approach for transferring information about simple molecular functions or pathways from one protein in one species to its ortholog in another species. In crop species, the phenotypic traits of interest, such as grain yield, are very complex and are often related to multiple biological processes and systems. It is still unclear to what extent the high level annotations describing phenotypic traits can also be reliably transferred across species. In this work, we have developed a procedure to measure precisely the transferability of these functional annotations from one species to another and demonstrate its application to Arabidopsis and several crop species. This comparative analysis is a step towards assigning higher level biological function to genes and gene networks as part of the wider genotype to phenotype challenge.
|
15
|
Zeeberg BR, Liu H, Kahn AB, Ehler M, Rajapakse VN, Bonner RF, Brown JD, Brooks BP, Larionov VL, Reinhold W, Weinstein JN, Pommier YG. RedundancyMiner: De-replication of redundant GO categories in microarray and proteomics analysis. BMC Bioinformatics 2011; 12:52. [PMID: 21310028 PMCID: PMC3223614 DOI: 10.1186/1471-2105-12-52] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 06/15/2010] [Accepted: 02/10/2011] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND The Gene Ontology (GO) Consortium organizes genes into hierarchical categories based on biological process, molecular function and subcellular localization. Tools such as GoMiner can leverage GO to perform ontological analysis of microarray and proteomics studies, typically generating a list of significant functional categories. Two or more of the categories are often redundant, in the sense that identical or nearly identical sets of genes map to the categories. The redundancy might typically inflate the report of significant categories three-fold, create an illusion of an overly long list of significant categories, and obscure the relevant biological interpretation. RESULTS We now introduce a new resource, RedundancyMiner, that de-replicates the redundant and nearly redundant GO categories that had been determined by first running GoMiner. The main algorithm of RedundancyMiner, MultiClust, performs a novel form of cluster analysis in which a GO category might belong to several category clusters. Each category cluster follows a "complete linkage" paradigm. The metric is a similarity measure that captures the overlap in gene mapping between pairs of categories. CONCLUSIONS RedundancyMiner effectively eliminated redundancies from a set of GO categories. For illustration, we have applied it to the clarification of the results arising from two current studies: (1) assessment of the gene expression profiles obtained by laser capture microdissection (LCM) of serial cryosections of the retina at the site of final optic fissure closure in mouse embryos at specific embryonic stages, and (2) analysis of a conceptual data set obtained by examining a list of genes deemed to be "kinetochore" genes.
Affiliation(s)
- Barry R Zeeberg
- Laboratory of Molecular Pharmacology, Center for Cancer Research, National Cancer Institute, NIH, Room 5068, Building 37, 37 Convent Drive, Bethesda, MD 20892, USA.
|
16
|
Jin B, Lu X. Identifying informative subsets of the Gene Ontology with information bottleneck methods. Bioinformatics 2010; 26:2445-51. [PMID: 20702400 DOI: 10.1093/bioinformatics/btq449] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022]
Abstract
MOTIVATION The Gene Ontology (GO) is a controlled vocabulary designed to represent the biological concepts pertaining to gene products. This study investigates methods for identifying informative subsets of GO terms in an automatic and objective fashion. This task in turn requires addressing the following issues: how to represent the semantic context of GO terms, what metrics are suitable for measuring the semantic differences between terms, and how to identify an informative subset that retains as much as possible of the original semantic information of GO. RESULTS We represented the semantic context of a GO term using the word-usage profile associated with the term, which enables one to measure the semantic differences between terms based on the differences in their semantic contexts. We further employed information bottleneck methods to automatically identify subsets of GO terms that retain as much as possible of the semantic information in an annotation database. The automatically retrieved informative subsets align well with an expert-picked GO slim subset, cover important concepts and proteins, and enhance literature-based GO annotation. AVAILABILITY http://carcweb.musc.edu/TextminingProjects/.
Affiliation(s)
- Bo Jin
- Department of Biochemistry and Molecular Biology, Medical University of South Carolina, 174 Ashley Ave, Charleston, SC 29425, USA
|