1
|
Gong F, Cao D, Sun X, Li Z, Qu C, Fan Y, Cao Z, Zhao K, Zhao K, Qiu D, Li Z, Ren R, Ma X, Zhang X, Yin D. Homologous mapping yielded a comprehensive predicted protein-protein interaction network for peanut (Arachis hypogaea L.). BMC Plant Biology 2024; 24:873. [PMID: 39304811 DOI: 10.1186/s12870-024-05580-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 09/09/2024] [Indexed: 09/22/2024]
Abstract
BACKGROUND Protein-protein interactions are the primary means through which proteins carry out their functions. These interactions thus have crucial roles in life activities. The wide availability of fully sequenced animal and plant genomes has facilitated establishment of relatively complete global protein interaction networks for some model species. The genomes of cultivated and wild peanut (Arachis hypogaea L.) have also been sequenced, but the functions of most of the encoded proteins remain unclear. RESULTS We here used homologous mapping of validated protein interaction data from model species to generate complete peanut protein interaction networks for A. hypogaea cv. 'Tifrunner' (282,619 pairs), A. hypogaea cv. 'Shitouqi' (256,441 pairs), A. monticola (440,470 pairs), A. duranensis (136,363 pairs), and A. ipaensis (172,813 pairs). A detailed analysis was conducted for a putative disease-resistance subnetwork in the Tifrunner network to identify candidate genes and validate functional interactions. The network suggested that DX2UEH and its interacting partners may participate in peanut resistance to bacterial wilt; this was preliminarily validated with overexpression experiments in peanut. CONCLUSION Our results provide valuable new information for future analyses of gene and protein functions and regulatory networks in peanut.
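The homologous mapping described in this abstract can be illustrated with a minimal sketch. It assumes a precomputed homolog table from model-species proteins to target-species proteins; the function name, the identifiers, and the choice to skip self-pairs are hypothetical and are not taken from the paper.

```python
from itertools import product


def map_interactions(model_ppis, homologs):
    """Transfer model-species interactions to a target species via homology.

    model_ppis: iterable of (protein_a, protein_b) pairs from a model species.
    homologs:   dict mapping a model protein to its target-species homologs.
    Returns a set of unordered target-species pairs (stored as sorted tuples).
    """
    predicted = set()
    for a, b in model_ppis:
        # Every combination of a homolog of `a` with a homolog of `b`
        # becomes a candidate interaction in the target species.
        for x, y in product(homologs.get(a, ()), homologs.get(b, ())):
            if x != y:  # assumption: self-interactions are not reported
                predicted.add(tuple(sorted((x, y))))
    return predicted


# Hypothetical toy data (identifiers are made up for illustration).
model_ppis = [("AT1", "AT2"), ("AT2", "AT3")]
homologs = {"AT1": ["Ah01"], "AT2": ["Ah02", "Ah03"], "AT3": ["Ah03"]}
network = map_interactions(model_ppis, homologs)
print(sorted(network))
```

Storing each pair as a sorted tuple keeps the network undirected, so the same interaction reached through two different model pairs is counted once.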
Affiliation(s)
- Fangping Gong
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Di Cao
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Xiaojian Sun
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Zhuo Li
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Chengxin Qu
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Yi Fan
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Zenghui Cao
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Kai Zhao
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Kunkun Zhao
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Ding Qiu
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Zhongfeng Li
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Rui Ren
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Xingli Ma
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Xingguo Zhang
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
- Dongmei Yin
  - College of Agronomy, Henan Agricultural University, Zhengzhou, 450000, People's Republic of China
|
2
|
Xian L, Wang Y. Advances in Computational Methods for Protein–Protein Interaction Prediction. Electronics 2024; 13:1059. [DOI: 10.3390/electronics13061059] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 01/03/2025]
Abstract
Protein–protein interactions (PPIs) are pivotal in various physiological processes inside biological entities. Accurate identification of PPIs holds paramount significance for comprehending biological processes, deciphering disease mechanisms, and advancing medical research. Given the costly and labor-intensive nature of experimental approaches, a multitude of computational methods have been devised to enable swift and large-scale PPI prediction. This review offers a thorough examination of recent strides in computational methodologies for PPI prediction, with a particular focus on the utilization of deep learning techniques within this domain. Alongside a systematic classification and discussion of relevant databases, feature extraction strategies, and prominent computational approaches, we conclude with a thorough analysis of current challenges and prospects for the future of this field.
Affiliation(s)
- Lei Xian
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
- Yansu Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
|
3
|
Kartheeswaran KP, Rayan AXA, Varrieth GT. Enhanced disease-disease association with information enriched disease representation. Mathematical Biosciences and Engineering: MBE 2023; 20:8892-8932. [PMID: 37161227 DOI: 10.3934/mbe.2023391] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/11/2023]
Abstract
OBJECTIVE Quantification of disease-disease association (DDA) enables the understanding of disease relationships for discovering disease progression and finding comorbidity. For effective DDA strength calculation, the main challenge is to integrate various biomedical aspects of DDA to obtain an information-rich disease representation. MATERIALS AND METHODS An enhanced and integrated DDA framework is developed that integrates enriched literature-based with concept-based DDA representation. The literature component of the proposed framework uses PubMed abstracts and consists of an improved neural network model that classifies DDAs for an enhanced literature-based DDA representation. Similarly, an ontology-based joint multi-source association embedding model is proposed in the ontology component using Disease Ontology (DO), UMLS, insurance claims, clinical notes, etc. RESULTS AND DISCUSSION The obtained information-rich disease representation is evaluated on different aspects of DDA datasets such as Gene, Variant, Gene Ontology (GO) and a human-rated benchmark dataset. The DDA scores calculated using the proposed method achieved a high correlation, mainly in the gene-based dataset. The quantified scores also showed a better correlation of 0.821 when evaluated on 213 human-rated disease pairs. In addition, the generated disease representation is shown to have a substantial effect on the correlation of DDA scores for different categories of disease pairs. CONCLUSION The enhanced context and semantic DDA framework provides an enriched disease representation, resulting in highly correlated results with different DDA datasets. We have also presented the biological interpretation of disease pairs. The developed framework can also be used for deriving the strength of other biomedical associations.
|
4
|
Paul M, Anand A. A New Family of Similarity Measures for Scoring Confidence of Protein Interactions Using Gene Ontology. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:19-30. [PMID: 34029194 DOI: 10.1109/tcbb.2021.3083150] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/12/2023]
Abstract
The large-scale protein-protein interaction (PPI) data has the potential to play a significant role in the endeavor of understanding cellular processes. However, the presence of a considerable fraction of false positives is a bottleneck in realizing this potential. There have been continuous efforts to utilize complementary resources for scoring confidence of PPIs in a manner that false positive interactions get a low confidence score. Gene Ontology (GO), a taxonomy of biological terms to represent the properties of gene products and their relations, has been widely used for this purpose. We utilize GO to introduce a new set of specificity measures: Relative Depth Specificity (RDS), Relative Node-based Specificity (RNS), and Relative Edge-based Specificity (RES), leading to a new family of similarity measures. We use these similarity measures to obtain a confidence score for each PPI. We evaluate the new measures using four different benchmarks. We show that all the three measures are quite effective. Notably, RNS and RES more effectively distinguish true PPIs from false positives than the existing alternatives. RES also shows a robust set-discriminating power and can be useful for protein functional clustering as well.
|
5
|
Shu L, Zhou C, Yuan X, Zhang J, Deng L. MSCFS: inferring circRNA functional similarity based on multiple data sources. BMC Bioinformatics 2021; 22:371. [PMID: 34271851 PMCID: PMC8285884 DOI: 10.1186/s12859-021-04287-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/20/2021] [Accepted: 07/06/2021] [Indexed: 12/13/2022]
Abstract
Background More and more evidence shows that circRNA plays an important role in various biological processes and human health. Therefore, inferring the circRNA’s potential functions and obtaining circRNA functional similarity has become more and more significant. However, there is no effective approach to explore the functional similarity of circRNAs. Methods In this paper, we propose a new approach, called MSCFS, to calculate the functional similarity of circRNA by integrating multiple data sources. We combine circRNA-disease association, circRNA-gene-Gene Ontology association, and circRNA sequence information to explore the functional similarity of circRNA. Firstly, we employ different learning representation methods from three data sources to establish three circRNA functional similarity networks. Then we integrate the three networks to obtain the final circRNA functional similarity. Results We utilize circRNA–miRNA association similarity and circRNA co-expression similarity to evaluate the performance of MSCFS. The results show a positive correlation with miRNA association (R = 0.213) and circRNA co-expression similarity (R = 0.8991). Finally, we construct a circRNA functional similarity network and perform case analysis. The result shows our method can be applied to infer new potential functions of circRNA and other associations. Conclusions MSCFS combines multiple data sources related to circRNA functions. Correlation analysis and case analyses prove that MSCFS is a useful method to explore circRNA functional similarity.
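The two steps named in this abstract, integrating several similarity networks and then checking the result against a reference similarity by correlation, can be sketched as follows. The abstract does not say how the three networks are combined, so the unweighted mean used here is an assumption, and all names and numbers are hypothetical toy values.

```python
from math import sqrt


def integrate(networks):
    """Average similarity scores over the pairs shared by all networks.

    Assumption: an unweighted mean; the abstract does not state the
    integration rule actually used by MSCFS.
    """
    shared = set.intersection(*(set(n) for n in networks))
    return {p: sum(n[p] for n in networks) / len(networks) for p in shared}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical pairwise similarities from three data sources.
pairs = [("c1", "c2"), ("c1", "c3"), ("c2", "c3")]
disease_sim = dict(zip(pairs, [0.9, 0.2, 0.4]))
go_sim = dict(zip(pairs, [0.7, 0.1, 0.3]))
seq_sim = dict(zip(pairs, [0.8, 0.3, 0.5]))
integrated = integrate([disease_sim, go_sim, seq_sim])

# Hypothetical reference similarity (e.g. co-expression) for evaluation.
reference = dict(zip(pairs, [0.85, 0.15, 0.45]))
r = pearson([integrated[p] for p in pairs], [reference[p] for p in pairs])
print(round(r, 3))
```

Restricting the integration to pairs present in every network avoids averaging over missing values; a real pipeline would need an explicit policy for pairs covered by only some sources.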
Affiliation(s)
- Liang Shu
  - School of Computer Science and Engineering, Central South University, Lushangnan Road, Changsha, China
- Cheng Zhou
  - School of Computer Science and Engineering, Central South University, Lushangnan Road, Changsha, China
- Xinxu Yuan
  - Department of Chemical and Life Science Engineering, Virginia Commonwealth University, Richmond, VA, 23284, USA
- Jingpu Zhang
  - School of Computer and Data Science, Henan University of Urban Construction, Longxiang Road, Pingdingshan, 467000, China
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Lushangnan Road, Changsha, China
|
6
|
Wang Q, Liu Z, Yan B, Chou WC, Ettwiller L, Ma Q, Liu B. A novel computational framework for genome-scale alternative transcription units prediction. Brief Bioinform 2021; 22:6265223. [PMID: 33957668 DOI: 10.1093/bib/bbab162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2021] [Revised: 03/18/2021] [Accepted: 04/07/2021] [Indexed: 11/12/2022]
Abstract
Alternative transcription units (ATUs) are dynamically encoded under different conditions and display overlapping patterns (sharing one or more genes) under a specific condition in bacterial genomes. Genome-scale identification of ATUs is essential for studying the emergence of human diseases caused by bacterial organisms. However, it is unrealistic to identify all ATUs using experimental techniques because of the complexity and dynamic nature of ATUs. Here, we present the first-of-its-kind computational framework, named SeqATU, for genome-scale ATU prediction based on next-generation RNA-Seq data. The framework utilizes a convex quadratic programming model to seek an optimum expression combination of all of the to-be-identified ATUs. The predicted ATUs in Escherichia coli reached a precision of 0.77/0.74 and a recall of 0.75/0.76 in the two RNA-Sequencing datasets compared with the benchmarked ATUs from third-generation RNA-Seq data. In addition, the proportion of 5'- or 3'-end genes of the predicted ATUs, having documented transcription factor binding sites and transcription termination sites, was three times greater than that of no 5'- or 3'-end genes. We further evaluated the predicted ATUs by Gene Ontology and Kyoto Encyclopedia of Genes and Genomes functional enrichment analyses. The results suggested that gene pairs frequently encoded in the same ATUs are more functionally related than those that can belong to two distinct ATUs. Overall, these results demonstrated the high reliability of predicted ATUs. We expect that the new insights derived by SeqATU will not only improve the understanding of the transcription mechanism of bacteria but also guide the reconstruction of a genome-scale transcriptional regulatory network.
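The precision and recall figures quoted in this abstract compare predicted ATUs with a benchmark set. A minimal sketch of that comparison is shown here, assuming each ATU is represented as a tuple of gene names and that a prediction counts as correct only on an exact match; the matching criterion and the toy gene names are assumptions, not the paper's evaluation protocol.

```python
def precision_recall(predicted, benchmark):
    """Precision and recall of predicted units against a benchmark set.

    Each unit is a hashable object, here a tuple of gene names.
    Assumption: a prediction is a true positive only on an exact match.
    """
    predicted, benchmark = set(predicted), set(benchmark)
    true_positives = len(predicted & benchmark)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(benchmark) if benchmark else 0.0
    return precision, recall


# Hypothetical toy data: overlapping units may share genes.
predicted_atus = [("a", "b"), ("b", "c"), ("a", "b", "c"), ("d",)]
benchmark_atus = [("a", "b"), ("a", "b", "c"), ("c", "d")]
precision, recall = precision_recall(predicted_atus, benchmark_atus)
print(precision, recall)
```

Because ATUs can overlap, treating each unit as a whole tuple rather than scoring individual genes keeps ("a", "b") and ("a", "b", "c") distinct.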
Affiliation(s)
- Qi Wang
  - School of Mathematics, Shandong University, Jinan 250200, China
- Zhaoqian Liu
  - School of Mathematics, Shandong University, Jinan 250200, China; Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Bo Yan
  - New England Biolabs Inc., Ipswich, MA 01938, USA
- Wen-Chi Chou
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan 250200, China
|
7
|
Nguyen QH, Le DH. Similarity Calculation, Enrichment Analysis, and Ontology Visualization of Biomedical Ontologies using UFO. Curr Protoc 2021; 1:e115. [PMID: 33900688 DOI: 10.1002/cpz1.115] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/27/2023]
Abstract
The rapid growth of biomedical ontologies observed in recent years has been reported to be useful in various applications. In this article, we propose two main-function protocols (term-related and entity-related) covering the three most common ontology analyses: similarity calculation, enrichment analysis, and ontology visualization, each of which can be done by separate methods. Many previously developed tools run on different platforms and implement only a limited number of similarity calculation and enrichment analysis methods for a specific type of biomedical ontology, although the methods apply to any type. Moreover, each method has distinct advantages depending on the application; thus, the more methods a tool offers, the better the decisions users can make. The protocol here implements all the analyses above using an advanced and popular tool called UFO. UFO is a Cytoscape app that unifies most of the semantic similarity measures for between-term and between-entity similarity calculation for biomedical ontologies in OBO format, which can calculate the similarity between two sets of entities and weigh imported entity networks, as well as generate functional similarity networks. The complete protocol can be performed in 30 min and is designed for use by biologists with no prior bioinformatics training. © 2021 Wiley Periodicals LLC. Basic Protocol: Running UFO using a list of input Gene Ontology, Disease Ontology, or Human Phenotype Ontology data.
Affiliation(s)
- Quang-Huy Nguyen
  - Department of Computational Biomedicine, Vingroup Big Data Institute, Hanoi, Vietnam
- Duc-Hau Le
  - Department of Computational Biomedicine, Vingroup Big Data Institute, Hanoi, Vietnam; School of Computer Science and Engineering, Thuyloi University, Hanoi, Vietnam
|
8
|
Le DH. UFO: A tool for unifying biomedical ontology-based semantic similarity calculation, enrichment analysis and visualization. PLoS One 2020; 15:e0235670. [PMID: 32645039 PMCID: PMC7347127 DOI: 10.1371/journal.pone.0235670] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/17/2019] [Accepted: 06/22/2020] [Indexed: 02/06/2023]
Abstract
Background Biomedical ontologies have been growing quickly and have proven to be useful in many biomedical applications. Important applications of those data include estimating the functional similarity between ontology terms and between annotated biomedical entities, and analyzing enrichment for a set of biomedical entities. Many semantic similarity calculation and enrichment analysis methods have been proposed for such applications. Also, a number of tools implementing the methods have been developed on different platforms. However, these tools have implemented only a small number of the semantic similarity calculation and enrichment analysis methods for a certain type of biomedical ontology. Note that the methods can be applied to all types of biomedical ontologies. More importantly, each method can be dominant in different applications; thus, users have more choice when more methods are implemented in a tool. Also, more functions would facilitate their work with ontologies. Results In this study, we developed a Cytoscape app, named UFO, which unifies most of the semantic similarity measures for between-term and between-entity similarity calculation for all types of biomedical ontologies in OBO format. Based on the similarity calculation, UFO can calculate the similarity between two sets of entities and weigh imported entity networks as well as generate functional similarity networks. Besides, it can perform enrichment analysis of a set of entities by different methods. Moreover, UFO can visualize structural relationships between ontology terms, annotating relationships between entities and terms, and functional similarity between entities. Finally, we demonstrated the ability of UFO through some case studies on finding the best semantic similarity measures for assessing the similarity between human disease phenotypes, constructing biomedical entity functional similarity networks for predicting disease-associated biomarkers, and performing enrichment analysis on a set of similar phenotypes. Conclusions Taken together, UFO is expected to be a tool where biomedical ontologies can be exploited for various biomedical applications. Availability UFO is distributed as a Cytoscape app, and can be downloaded freely at Cytoscape App (http://apps.cytoscape.org/apps/ufo) for non-commercial use.
Affiliation(s)
- Duc-Hau Le
  - Department of Computational Biomedicine, Vingroup Big Data Institute, Hanoi, Vietnam
  - School of Computer Science and Engineering, Thuyloi University, Hanoi, Vietnam
|
9
|
Cao H, Ma Q, Chen X, Xu Y. DOOR: a prokaryotic operon database for genome analyses and functional inference. Brief Bioinform 2020; 20:1568-1577. [PMID: 28968679 DOI: 10.1093/bib/bbx088] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 05/02/2017] [Revised: 06/13/2017] [Indexed: 11/14/2022]
Abstract
The rapid accumulation of fully sequenced prokaryotic genomes provides unprecedented information for biological studies of bacterial and archaeal organisms in a systematic manner. Operons are the basic functional units for conducting such studies. Here, we review an operon database DOOR (the Database of prOkaryotic OpeRons) that we have previously developed and continue to update. Currently, the database contains 6 975 454 computationally predicted operons in 2072 complete genomes. In addition, the database also contains the following information: (i) transcriptional units for 24 genomes derived using publicly available transcriptomic data; (ii) orthologous gene mapping across genomes; (iii) 6408 cis-regulatory motifs for transcriptional factors of some operons for 203 genomes; (iv) 3 456 718 Rho-independent terminators for 2072 genomes; as well as (v) a suite of tools in support of applications of the predicted operons. In this review, we will explain how such data are computationally derived and demonstrate how they can be used to derive a wide range of higher-level information needed for systems biology studies to tackle complex and fundamental biology questions.
|
10
|
Yang Y, Fu X, Qu W, Xiao Y, Shen HB. MiRGOFS: a GO-based functional similarity measurement for miRNAs, with applications to the prediction of miRNA subcellular localization and miRNA-disease association. Bioinformatics 2019; 34:3547-3556. [PMID: 29718114 DOI: 10.1093/bioinformatics/bty343] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Received: 11/05/2017] [Accepted: 04/26/2018] [Indexed: 01/22/2023]
Abstract
Motivation Benefiting from high-throughput experimental technologies, whole-genome analysis of microRNAs (miRNAs) has become increasingly common to uncover important regulatory roles of miRNAs and identify miRNA biomarkers for disease diagnosis. As complementary information to the high-throughput experimental data, domain knowledge like the Gene Ontology and KEGG pathway is usually used to guide gene function analysis. However, functional annotation for miRNAs is scarce in the public databases. Until now, only a few methods have been proposed for measuring the functional similarity between miRNAs based on public annotation data, and these methods cover a very limited number of miRNAs, so they are not applicable to large-scale miRNA analysis. Results In this paper, we propose a new method to measure the functional similarity for miRNAs, called miRGOFS, which has two notable features: (i) it adopts a new GO semantic similarity metric which considers both common ancestors and descendants of GO terms; (ii) it computes similarity between GO sets in an asymmetric manner, and weights each GO term by its statistical significance. The miRGOFS-based predictor achieves an F1 of 61.2% on a benchmark dataset of miRNA localization, and AUC values of 87.7 and 81.1% on two benchmark sets of miRNA-disease association, respectively. Compared with the existing functional similarity measurements of miRNAs, miRGOFS has the advantages of higher accuracy and larger coverage of human miRNAs (over 1000 miRNAs). Availability and implementation http://www.csbio.sjtu.edu.cn/bioinf/MiRGOFS/. Supplementary information Supplementary data are available at Bioinformatics online.
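The second feature in this abstract, an asymmetric comparison of GO term sets with per-term weights, can be illustrated with a generic weighted best-match average. This is a sketch of the general idea only: the abstract does not give the miRGOFS formula, so the scoring rule, the term identifiers, the weights, and the term-level similarity table below are all hypothetical.

```python
def make_term_sim(table):
    """Build a symmetric term-similarity lookup from a sparse table."""
    def term_sim(a, b):
        if a == b:
            return 1.0
        return table.get((a, b), table.get((b, a), 0.0))
    return term_sim


def directed_set_similarity(source, target, term_sim, weight):
    """Weighted best-match average from `source` terms to `target` terms.

    Each source term is matched to its most similar target term and
    contributes in proportion to its weight, so swapping the arguments
    can change the score (the measure is asymmetric).
    """
    total = sum(weight[t] for t in source)
    if total == 0 or not target:
        return 0.0
    score = sum(weight[t] * max(term_sim(t, u) for u in target) for t in source)
    return score / total


# Hypothetical terms, weights (standing in for statistical significance),
# and term-level similarities.
term_sim = make_term_sim({("GO:A", "GO:B"): 0.4})
weight = {"GO:A": 2.0, "GO:B": 1.0}
set1, set2 = ["GO:A", "GO:B"], ["GO:A"]
forward = directed_set_similarity(set1, set2, term_sim, weight)
backward = directed_set_similarity(set2, set1, term_sim, weight)
print(forward, backward)
```

In this toy case the smaller set is fully covered by the larger one, so the score from the smaller set to the larger is higher than the score in the opposite direction.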
Affiliation(s)
- Yang Yang
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, China
- Xiaofeng Fu
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Wenhao Qu
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Yiqun Xiao
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Hong-Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
|
11
|
Cheng L, Zhao H, Wang P, Zhou W, Luo M, Li T, Han J, Liu S, Jiang Q. Computational Methods for Identifying Similar Diseases. Molecular Therapy. Nucleic Acids 2019; 18:590-604. [PMID: 31678735 PMCID: PMC6838934 DOI: 10.1016/j.omtn.2019.09.019] [Citation(s) in RCA: 80] [Impact Index Per Article: 13.3] [Received: 04/25/2019] [Revised: 09/11/2019] [Accepted: 09/12/2019] [Indexed: 02/01/2023]
Abstract
Although our knowledge of human diseases has increased dramatically, the molecular basis, phenotypic traits, and therapeutic targets of most diseases still remain unclear. An increasing number of studies have observed that similar diseases often are caused by similar molecules, can be diagnosed by similar markers or phenotypes, or can be cured by similar drugs. Thus, the identification of diseases similar to known ones has attracted considerable attention worldwide. To this end, the associations between diseases at the molecular, phenotypic, and taxonomic levels were used to measure the pairwise similarity in diseases. The corresponding performance assessment strategies for these methods involving the terms “category-based,” “simulated-patient-based,” and “benchmark-data-based” were thus further emphasized. Then, frequently used methods were evaluated using a benchmark-data-based strategy. To facilitate the assessment of disease similarity scores, researchers have designed dozens of tools that implement these methods for calculating disease similarity. Currently, disease similarity has been advantageous in predicting noncoding RNA (ncRNA) function and therapeutic drugs for diseases. In this article, we review disease similarity methods, evaluation strategies, tools, and their applications in the biomedical community. We further evaluate the performance of these methods and discuss the current limitations and future trends for calculating disease similarity.
Affiliation(s)
- Liang Cheng
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Hengqiang Zhao
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Pingping Wang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Wenyang Zhou
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Meng Luo
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Tianxin Li
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Junwei Han
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Shulin Liu
  - Systemomics Center, College of Pharmacy, and Genomics Research Center (State-Province Key Laboratories of Biomedicine-Pharmaceutics of China), Harbin Medical University, Harbin, Heilongjiang, China; Department of Microbiology, Immunology and Infectious Diseases, University of Calgary, Calgary, AB, Canada
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
|
12
|
GPS: Identification of disease genes by rank aggregation of multi-genomic scoring schemes. Genomics 2019; 111:612-618. [PMID: 29604342 DOI: 10.1016/j.ygeno.2018.03.017] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 02/20/2018] [Revised: 03/16/2018] [Accepted: 03/21/2018] [Indexed: 12/19/2022]
|
13
|
Chen KH, Wang TF, Hu YJ. Protein-protein interaction prediction using a hybrid feature representation and a stacked generalization scheme. BMC Bioinformatics 2019; 20:308. [PMID: 31182027 PMCID: PMC6558856 DOI: 10.1186/s12859-019-2907-1] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 03/10/2019] [Accepted: 05/17/2019] [Indexed: 12/11/2022]
Abstract
BACKGROUND Although various machine learning-based predictors have been developed for estimating protein-protein interactions, their performances vary with dataset and species, and are affected by two primary aspects: choice of learning algorithm, and the representation of protein pairs. To improve the performance of predicting protein-protein interactions, we exploit the synergy of multiple learning algorithms, and utilize the expressiveness of different protein-pair features. RESULTS We developed a stacked generalization scheme that integrates five learning algorithms. We also designed three types of protein-pair features based on the physicochemical properties of amino acids, gene ontology annotations, and interaction network topologies. When tested on 19 published datasets collected from eight species, the proposed approach achieved a significantly higher or comparable overall performance, compared with seven competitive predictors. CONCLUSION We introduced an ensemble learning approach for PPI prediction that integrated multiple learning algorithms and different protein-pair representations. The extensive comparisons with other state-of-the-art prediction tools demonstrated the feasibility and superiority of the proposed method.
Affiliation(s)
- Kuan-Hsi Chen
  - College of Computer Science, National Chiao Tung University, Hsinchu, 300, Taiwan
- Tsai-Feng Wang
  - Institute of Data Science and Engineering, National Chiao Tung University, Hsinchu, 300, Taiwan
- Yuh-Jyh Hu
  - Institute of Biomedical Engineering, College of Computer Science, National Chiao Tung University, Hsinchu, 300, Taiwan
|
14
|
Fredrich B, Schmöhl M, Junge O, Gundlach S, Ellinghaus D, Pfeufer A, Bettecken T, Siddiqui R, Franke A, Wienker TF, Hoeppner MP, Krawczak M. VarWatch-A stand-alone software tool for variant matching. PLoS One 2019; 14:e0215618. [PMID: 31022234 PMCID: PMC6483337 DOI: 10.1371/journal.pone.0215618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2019] [Accepted: 04/04/2019] [Indexed: 11/19/2022]
Abstract
Massively parallel DNA sequencing of clinical samples holds great promise for the gene-based diagnosis of human inherited diseases because it allows rapid detection of putatively causative mutations at genome-wide level. Without additional evidence complementing their initial bioinformatics evaluation, however, the clinical relevance of such candidate genetic variants often remains unclear. In consequence, dedicated 'matching' services have been established in recent years that aim at the discovery of other, comparable case reports to facilitate individual diagnoses. However, legal concerns have been raised about the global sharing of genetic data, particularly in Europe where the recently enacted General Data Protection Regulation EU-2016/679 classifies genetic data as highly sensitive. Hence, unrestricted sharing of genetic data from clinical cases on platforms outside the national jurisdiction increasingly may be perceived as problematic. To allow collaborative data producers, particularly large consortia of diagnostic laboratories, to acknowledge these concerns while still practicing efficient case matching internally, novel tools are required. To this end, we developed VarWatch, an easy-to-deploy and highly scalable case matching software that provides users with comprehensive programmatic tools and a user-friendly interface to fulfil said purpose.
Affiliation(s)
- Broder Fredrich
  - Institute of Clinical Molecular Biology, Kiel University, University Hospital Schleswig-Holstein, Kiel, Germany
- Marcus Schmöhl
  - Institute of Clinical Molecular Biology, Kiel University, University Hospital Schleswig-Holstein, Kiel, Germany
- Olaf Junge
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel, Germany
- Sven Gundlach
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel, Germany
- David Ellinghaus
  - Institute of Clinical Molecular Biology, Kiel University, University Hospital Schleswig-Holstein, Kiel, Germany
- Arne Pfeufer
  - Humangenetische Praxis PD Dr. Pfeufer, München, Germany
  - MVZ für Molekulardiagnostik GmbH, München, Germany
  - Myriad GmbH, Martinsried, Germany
- Roman Siddiqui
  - TMF – Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V., Berlin, Germany
- Andre Franke
  - Institute of Clinical Molecular Biology, Kiel University, University Hospital Schleswig-Holstein, Kiel, Germany
- Marc P. Hoeppner
  - Institute of Clinical Molecular Biology, Kiel University, University Hospital Schleswig-Holstein, Kiel, Germany
- Michael Krawczak
  - Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel, Germany
|
15
|
Das B, Patil AR, Mitra P. A network-based zoning for parallel whole-cell simulation. Bioinformatics 2019; 35:88-94. [PMID: 29955764 DOI: 10.1093/bioinformatics/bty530] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/21/2017] [Accepted: 06/27/2018] [Indexed: 11/12/2022]
Abstract
Motivation In Computational Cell Biology, whole-cell modeling and simulation is an absolute requirement to analyze and explore the cell of an organism. Despite a few individual modeling efforts, the prime obstacle hindering its development and progress is its compute-intensive nature. Little knowledge is available on how to reduce the enormous computational overhead and which computational systems will be of use. Results In this article, we present a network-based zoning approach that could potentially be utilized in the parallelization of whole-cell simulations. Firstly, we construct the protein-protein interaction graph of the whole cell of an organism using experimental data from various sources. Based on protein interaction information, we predict protein locality and allocate confidence scores to the interactions accordingly. We then identify the modules of strictly localized interacting proteins by performing interaction graph clustering based on the confidence scores of the interactions. By applying this method to Escherichia coli K12, we identified 188 spatially localized clusters. After a thorough Gene Ontology-based analysis, we proved that the clusters are also in functional proximity. We then conducted Principal Coordinates Analysis to predict the spatial distribution of the clusters in the simulation space. Our automated computational techniques can partition the entire simulation space (cell) into simulation sub-cells. Each of these sub-cells can be simulated on separate computing units of High-Performance Computing (HPC) systems. We benchmarked our method using proteins. However, our method can easily be extended to add other cellular components such as DNA, RNA and metabolites. Availability and implementation . Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Barnali Das
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal, India
- Abhijeet Rajendra Patil
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal, India
- Pralay Mitra
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal, India
|
16
|
Zhang J, Jia K, Jia J, Qian Y. An improved approach to infer protein-protein interaction based on a hierarchical vector space model. BMC Bioinformatics 2018; 19:161. [PMID: 29699476 PMCID: PMC5921294 DOI: 10.1186/s12859-018-2152-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 02/15/2017] [Accepted: 04/09/2018] [Indexed: 02/06/2023]
Abstract
BACKGROUND Comparing and classifying functions of gene products are important in today's biomedical research. The semantic similarity derived from the Gene Ontology (GO) annotation has been regarded as one of the most widely used indicators for protein interaction. Among the various approaches proposed, those based on the vector space model are relatively simple, but their effectiveness is far from satisfactory. RESULTS We propose a Hierarchical Vector Space Model (HVSM) for computing semantic similarity between different genes or their products, which enhances the basic vector space model by introducing the relation between GO terms. Besides the directly annotated terms, HVSM also takes their ancestors and descendants related by "is_a" and "part_of" relations into account. Moreover, HVSM introduces the concept of a Certainty Factor to calibrate the semantic similarity based on the number of terms annotated to genes. To assess the performance of our method, we applied HVSM to Homo sapiens and Saccharomyces cerevisiae protein-protein interaction datasets. Compared with TCSS, Resnik, and other classic similarity measures, HVSM achieved significant improvement in distinguishing positive from negative protein interactions. We also tested its correlation with sequence, EC, and Pfam similarity using the online tool CESSM. CONCLUSIONS HVSM showed an improvement of up to 4% compared to TCSS, 8% compared to IntelliGO, 12% compared to basic VSM, 6% compared to Resnik, 8% compared to Lin, 11% compared to Jiang, 8% compared to Schlicker, and 11% compared to SimGIC using AUC scores. The CESSM test showed that HVSM was comparable to SimGIC, and superior to all other similarity measures in CESSM as well as TCSS. Supplementary information and the software are available at https://github.com/kejia1215/HVSM.
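As a rough illustration of the vector-space idea this abstract describes, the sketch below expands each gene's annotated terms with their ancestors and takes the cosine similarity of the resulting binary term vectors. It is not the authors' HVSM implementation: the toy ontology, the term IDs, and the function names are hypothetical, "is_a" and "part_of" edges are treated alike, and the Certainty Factor and descendant handling are left out.

```python
import math

# Hypothetical toy ontology: child term -> set of parent terms.
# "is_a" and "part_of" edges are not distinguished in this sketch.
PARENTS = {
    "T:root": set(),
    "T:a": {"T:root"},
    "T:b": {"T:root"},
    "T:a1": {"T:a"},
    "T:a2": {"T:a"},
}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def expand(annotations):
    """Union of each annotated term with its ancestors."""
    expanded = set()
    for term in annotations:
        expanded |= ancestors(term)
    return expanded

def cosine_similarity(terms_a, terms_b):
    """Cosine similarity of two binary term vectors after ancestor expansion."""
    a, b = expand(terms_a), expand(terms_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))
```

Two genes annotated with the sibling terms `T:a1` and `T:a2` share no direct annotation, so a plain vector space model would score them 0; after expansion they share `T:a` and `T:root`, which is the effect the hierarchical extension aims for.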
Affiliation(s)
- Jiongmin Zhang
  - Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
- Ke Jia
  - Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
- Jinmeng Jia
  - School of Life Science, East China Normal University, Dongchuan Road, Shanghai, 200241 China
- Ying Qian
  - Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
|
17
|
Peng J, Zhang X, Hui W, Lu J, Li Q, Liu S, Shang X. Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach. BMC Systems Biology 2018; 12:18. [PMID: 29560823 PMCID: PMC5861498 DOI: 10.1186/s12918-018-0539-0] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Indexed: 02/08/2023]
Abstract
BACKGROUND Gene Ontology (GO) is one of the most popular bioinformatics resources. In the past decade, Gene Ontology-based gene semantic similarity has been effectively used to model gene-to-gene interactions in multiple research areas. However, most existing semantic similarity approaches rely only on GO annotations and structure, or incorporate only local interactions in the co-functional network. This may lead to inaccurate GO-based similarity resulting from the incomplete GO topology structure and gene annotations. RESULTS We present NETSIM2, a new network-based method that allows researchers to measure GO-based gene functional similarities by considering the global structure of the co-functional network with a random walk with restart (RWR)-based method, and by selecting the significant term pairs to reduce noise. Evaluation tests based on Enzyme Commission (EC) number groups of yeast and Arabidopsis show that NETSIM2 can enhance the accuracy of Gene Ontology-based gene functional similarity. CONCLUSIONS Using NETSIM2 as an example, we found that the accuracy of semantic similarities can be significantly improved after effectively incorporating the global gene-to-gene interactions in the co-functional network, especially for species whose gene annotations in GO are far from complete.
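The random walk with restart (RWR) mentioned in this abstract can be sketched generically as the iteration p_next = r * p0 + (1 - r) * W * p, where p0 puts all probability on a seed gene, r is the restart probability, and W is the column-normalised adjacency matrix of the network. The code below is a minimal sketch of that standard iteration, not NETSIM2 itself; the function name, the default restart value of 0.3, and the dictionary-based graph format are assumptions made for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at `seed`.

    adj: dict mapping node -> dict of neighbour -> positive edge weight
         (every neighbour must also appear as a key of adj).
    """
    nodes = sorted(adj)
    p0 = {n: 1.0 if n == seed else 0.0 for n in nodes}
    p = dict(p0)
    # Total outgoing weight per node, used to normalise transition probabilities.
    out_weight = {n: sum(adj[n].values()) for n in nodes}

    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            if out_weight[u] == 0:
                # Isolated node: send its remaining mass back to the seed.
                nxt[seed] += (1 - restart) * p[u]
                continue
            for v, w in adj[u].items():
                nxt[v] += (1 - restart) * p[u] * w / out_weight[u]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p
```

On a small path graph A-B-C-D seeded at A, the returned probabilities sum to 1 and fall off with distance from the seed, which is why RWR scores are often read as a global, network-wide proximity measure rather than a purely local one.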
Affiliation(s)
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
  - Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi'an, China
  - Centre for Multidisciplinary Convergence Computing (CMCC), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Xuanshuo Zhang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Weiwei Hui
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Junya Lu
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Qianqian Li
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Shuhui Liu
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
  - Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi'an, China
|
18
|
Zhou H, Yang Y, Shen HB. Hum-mPLoc 3.0: prediction enhancement of human protein subcellular localization through modeling the hidden correlations of gene ontology and functional domain features. Bioinformatics 2017; 33:843-853. [PMID: 27993784 DOI: 10.1093/bioinformatics/btw723] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 07/28/2016] [Accepted: 11/17/2016] [Indexed: 11/13/2022]
Abstract
Motivation Protein subcellular localization prediction has been an important research topic in computational biology over the last decade. Various automatic methods have been proposed to predict locations for large scale protein datasets, where statistical machine learning algorithms are widely used for model construction. A key step in these predictors is encoding the amino acid sequences into feature vectors. Many studies have shown that features extracted from biological domains, such as gene ontology and functional domains, can be very useful for improving the prediction accuracy. However, domain knowledge usually results in redundant features and high-dimensional feature spaces, which may degrade the performance of machine learning models. Results In this paper, we propose a new amino acid sequence-based human protein subcellular location prediction approach Hum-mPLoc 3.0, which covers 12 human subcellular localizations. The sequences are represented by multi-view complementary features, i.e. context vocabulary annotation-based gene ontology (GO) terms, peptide-based functional domains, and residue-based statistical features. To systematically reflect the structural hierarchy of the domain knowledge bases, we propose a novel feature representation protocol denoted as HCM (Hidden Correlation Modeling), which will create more compact and discriminative feature vectors by modeling the hidden correlations between annotation terms. Experimental results on four benchmark datasets show that HCM improves prediction accuracy by 5-11% and F1 by 8-19% compared with conventional GO-based methods. A large-scale application of Hum-mPLoc 3.0 on the whole human proteome reveals protein co-localization preferences in the cell. Availability and Implementation www.csbio.sjtu.edu.cn/bioinf/Hum-mPLoc3/. Contacts hbshen@sjtu.edu.cn. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hang Zhou
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Ministry of Education of China, Shanghai, China
  - Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Yang Yang
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
  - Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, China
- Hong-Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Ministry of Education of China, Shanghai, China
  - Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
|
19
|
Kang H, Gong Y. Developing a similarity searching module for patient safety event reporting system using semantic similarity measures. BMC Med Inform Decis Mak 2017; 17:75. [PMID: 28699567 PMCID: PMC5506579 DOI: 10.1186/s12911-017-0467-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
Abstract
Background The most important knowledge in the field of patient safety concerns the prevention and reduction of patient safety events (PSE) during treatment and care. The similarities and patterns among the events may go unnoticed if they are not properly reported and analyzed. There is an urgent need to develop a PSE reporting system that can dynamically measure the similarities of the events and thus promote event analysis and learning. Methods In this study, three prevailing algorithms of semantic similarity were implemented to measure the similarities of the 366 PSE annotated by the taxonomy of The Agency for Healthcare Research and Quality (AHRQ). The performance of each algorithm was then evaluated by a group of domain experts based on a 4-point Likert scale. The consistency between the scales of the algorithms and experts was measured and compared with the scales randomly assigned. The similarity algorithms and scores, as a self-learning and self-updating module, were then integrated into the system. Results The results show that the similarity scores are more consistent with the experts’ review than randomly assigned scores. Moreover, incorporating the algorithms into our reporting system enables a mechanism to learn and update based upon PSE similarity. Conclusion Integrating semantic similarity algorithms into a PSE reporting system can help us learn from previous events and provide timely knowledge support to the reporters. With the knowledge base in the PSE domain, the new generation reporting system holds promise in educating healthcare providers and preventing the recurrence and serious consequences of PSE.
Affiliation(s)
- Hong Kang
  - School of Biomedical Informatics, the University of Texas Health Science Center at Houston, 7000 Fannin St., Houston, TX, 77030, USA
- Yang Gong
  - School of Biomedical Informatics, the University of Texas Health Science Center at Houston, 7000 Fannin St., Houston, TX, 77030, USA
|
20
|
Tian Z, Wang C, Guo M, Liu X, Teng Z. An improved method for functional similarity analysis of genes based on Gene Ontology. BMC Systems Biology 2016; 10:119. [PMID: 28155727 PMCID: PMC5259995 DOI: 10.1186/s12918-016-0359-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Indexed: 12/11/2022]
Abstract
Background Measures of gene functional similarity are essential tools for gene clustering, gene function prediction, evaluation of protein-protein interaction, disease gene prioritization and other applications. In recent years, many gene functional similarity methods have been proposed based on the semantic similarity of GO terms. However, these leading approaches may make error-prone judgments, especially when they measure the specificity of GO terms as well as the information content (IC) of a term set. Therefore, how to estimate gene functional similarity reliably is still a challenging problem. Results We propose WIS, an effective method to measure gene functional similarity. First, WIS computes the IC of a term by employing its depth, the number of its ancestors, and the topology of its descendants in the GO graph. Second, WIS calculates the IC of a term set by considering the weighted inherited semantics of terms. Finally, WIS estimates the gene functional similarity based on the IC overlap ratio of term sets. WIS is superior to some other representative measures in experiments on functional classification of genes in a biological pathway, collaborative evaluation of GO-based semantic similarity measures, protein-protein interaction prediction and correlation with gene expression. Further analysis suggests that WIS fully takes into account the specificity of terms and the weighted inherited semantics of terms between GO terms. Conclusions The proposed WIS method is an effective and reliable way to compare gene function. The web service of WIS is freely available at http://nclab.hit.edu.cn/WIS/. Electronic supplementary material The online version of this article (doi:10.1186/s12918-016-0359-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Zhen Tian
  - Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Chunyu Wang
  - Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Maozu Guo
  - Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Xiaoyan Liu
  - Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Zhixia Teng
  - Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
  - Department of Information Management and Information System, Northeast Forestry University, Harbin, 150001, People's Republic of China
|
21
|
Ben Aouicha M, Hadj Taieb MA, Ben Hamadou A. SISR: System for integrating semantic relatedness and similarity measures. Soft Comput 2016. [DOI: 10.1007/s00500-016-2438-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 10/20/2022]
|
22
|
Tian Z, Wang C, Guo M, Liu X, Teng Z. SGFSC: speeding the gene functional similarity calculation based on hash tables. BMC Bioinformatics 2016; 17:445. [PMID: 27814675 PMCID: PMC5096311 DOI: 10.1186/s12859-016-1294-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/04/2016] [Accepted: 10/19/2016] [Indexed: 12/23/2022]
Abstract
Background In recent years, many measures of gene functional similarity have been proposed and widely used in all kinds of essential research. These methods are mainly divided into two categories: pairwise approaches and group-wise approaches. However, a common problem with these methods is their time consumption, especially when measuring the gene functional similarities of a large number of gene pairs. The problem of computational efficiency for pairwise approaches is even more prominent because they are dependent on the combination of semantic similarity. Therefore, the efficient measurement of gene functional similarity remains a challenging problem. Results To speed current gene functional similarity calculation methods, a novel two-step computing strategy is proposed: (1) establish a hash table for each method to store essential information obtained from the Gene Ontology (GO) graph and (2) measure gene functional similarity based on the corresponding hash table. There is no need to traverse the GO graph repeatedly for each method with the help of the hash table. The analysis of time complexity shows that the computational efficiency of these methods is significantly improved. We also implement a novel Speeding Gene Functional Similarity Calculation tool, namely SGFSC, which is bundled with seven typical measures using our proposed strategy. Further experiments show the great advantage of SGFSC in measuring gene functional similarity on the whole genomic scale. Conclusions The proposed strategy is successful in speeding current gene functional similarity calculation methods. SGFSC is an efficient tool that is freely available at http://nclab.hit.edu.cn/SGFSC. The source code of SGFSC can be downloaded from http://pan.baidu.com/s/1dFFmvpZ.
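The two-step strategy in this abstract (build a lookup table from the GO graph once, then answer similarity queries from the table) can be sketched as follows. This is an illustration of the caching idea only, not SGFSC's code: the toy ontology, the function names, and the Jaccard-style term similarity are assumptions, and the Jaccard score is a stand-in rather than one of the seven measures bundled with the tool.

```python
def build_ancestor_table(parents):
    """Step 1: precompute term -> frozenset(term plus all its ancestors).

    parents: dict mapping each term to the set of its direct parents.
    Each term is resolved once; later lookups reuse the stored result.
    """
    table = {}

    def visit(term):
        if term not in table:
            collected = {term}
            for parent in parents[term]:
                collected |= visit(parent)
            table[term] = frozenset(collected)
        return table[term]

    for term in parents:
        visit(term)
    return table

def term_similarity(table, term_a, term_b):
    """Step 2: answer a query from the table without touching the graph."""
    a, b = table[term_a], table[term_b]
    return len(a & b) / len(a | b)
```

With the table in hand, each pairwise query costs only two dictionary lookups and a set intersection, so scoring many gene pairs no longer requires repeated graph traversals.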
Affiliation(s)
- Zhen Tian
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Maozu Guo
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Xiaoyan Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
- Zhixia Teng
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, People's Republic of China
  - Department of Information Management and Information System, Northeast Forestry University, Harbin, 150001, People's Republic of China
|
23
|
Luo J, Lin D, Cao B. A cell-core-attachment approach for identifying protein complexes in yeast protein-protein interaction network. Journal of Intelligent & Fuzzy Systems 2016. [DOI: 10.3233/jifs-169026] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 02/01/2023]
|
24
|
Liu B, Zhou C, Li G, Zhang H, Zeng E, Liu Q, Ma Q. Bacterial regulon modeling and prediction based on systematic cis regulatory motif analyses. Sci Rep 2016; 6:23030. [PMID: 26975728 PMCID: PMC4792141 DOI: 10.1038/srep23030] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 12/07/2015] [Accepted: 02/22/2016] [Indexed: 12/18/2022]
Abstract
Regulons are the basic units of the response system in a bacterial cell, and each consists of a set of transcriptionally co-regulated operons. Regulon elucidation is the basis for studying the bacterial global transcriptional regulation network. In this study, we designed a novel co-regulation score between a pair of operons based on accurate operon identification and cis regulatory motif analyses, which can capture their co-regulation relationship much better than other scores. Taking full advantage of this discovery, we developed a new computational framework and built a novel graph model for regulon prediction. This model integrates the motif comparison and clustering and makes the regulon prediction problem substantially more solvable and accurate. To evaluate our prediction, a regulon coverage score was designed based on the documented regulons and their overlap with our prediction; and a modified Fisher Exact test was implemented to measure how well our predictions match the co-expressed modules derived from E. coli microarray gene-expression datasets collected under 466 conditions. The results indicate that our program consistently performed better than others in terms of the prediction accuracy. This suggests that our algorithms substantially improve the state-of-the-art, leading to a computational capability to reliably predict regulons for any bacteria.
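The abstract evaluates predictions with a modified Fisher Exact test whose modification is not described here. As a point of reference only, the sketch below computes the standard one-sided Fisher exact (hypergeometric tail) p-value for the overlap between a predicted regulon and a co-expression module. The function name and argument names are hypothetical, and this is the textbook test, not the authors' modified version.

```python
from math import comb

def hypergeom_tail(universe, module_size, predicted_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe, module_size, predicted_size).

    universe:       total number of operons considered
    module_size:    operons in the co-expressed module
    predicted_size: operons in the predicted regulon
    overlap:        operons found in both sets (must be >= 0)
    """
    upper = min(module_size, predicted_size)
    total = comb(universe, predicted_size)
    favourable = sum(
        comb(module_size, i) * comb(universe - module_size, predicted_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total
```

A small p-value indicates that the predicted regulon shares more operons with the module than would be expected if the same number of operons were drawn at random from the universe.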
Affiliation(s)
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, Shandong, China
- Chuan Zhou
  - School of Mathematics, Shandong University, Jinan, Shandong, China
- Guojun Li
  - School of Mathematics, Shandong University, Jinan, Shandong, China
- Hanyuan Zhang
  - Systems Biology and Biomedical Informatics (SBBI) Laboratory, University of Nebraska-Lincoln, Lincoln, NE 68588-0115, USA
- Erliang Zeng
  - Department of Biology, University of South Dakota, Vermillion, SD 57069, USA
  - Department of Computer Science, University of South Dakota, Vermillion, SD 57069, USA
  - BioSNTR, Brookings, SD, USA
- Qi Liu
  - Department of Bioinformatics, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Qin Ma
  - Department of Plant Science, South Dakota State University, Brookings, SD, 57006, USA
  - BioSNTR, Brookings, SD, USA
|
25
|
Yang Y, Xu Z, Song D. Missing value imputation for microRNA expression data by using a GO-based similarity measure. BMC Bioinformatics 2016; 17 Suppl 1:10. [PMID: 26818962 PMCID: PMC4895707 DOI: 10.1186/s12859-015-0853-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Indexed: 01/01/2023]
Abstract
BACKGROUND Missing values are commonly present in microarray data profiles. Instead of discarding genes or samples with incomplete expression levels, missing values need to be properly imputed for accurate data analysis. The imputation methods can be roughly categorized as expression level-based and domain knowledge-based. The first type of method relies only on expression data without the help of external data sources, while the second type incorporates available domain knowledge into expression data to improve imputation accuracy. In recent years, microRNA (miRNA) microarrays have been widely developed and used for identifying miRNA biomarkers in complex human disease studies. Similar to mRNA profiles, miRNA expression profiles with missing values can be treated with the existing imputation methods. However, the domain knowledge-based methods are hard to apply due to the lack of direct functional annotation for miRNAs. With the rapid accumulation of miRNA microarray data, there is a growing need to develop domain knowledge-based imputation algorithms specific to miRNA expression profiles to improve the quality of miRNA data analysis. RESULTS We connect miRNAs with domain knowledge of Gene Ontology (GO) via their target genes, and define miRNA functional similarity based on the semantic similarity of GO terms in GO graphs. A new measure combining miRNA functional similarity and expression similarity is used in the imputation of missing values. The new measure is tested on two miRNA microarray datasets from breast cancer research and achieves improved performance compared with the expression-based method on both datasets. CONCLUSIONS The experimental results demonstrate that biological domain knowledge can benefit the estimation of missing values in miRNA profiles as well as mRNA profiles. In particular, functional similarity defined by GO terms annotated for the target genes of miRNAs can be useful complementary information for the expression-based method to improve the imputation accuracy of miRNA array data. Our method and data are available to the public upon request.
Affiliation(s)
- Yang Yang
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Rd., Shanghai, 200240, China
  - Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China
- Zhuangdi Xu
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Rd., Shanghai, 200240, China
- Dandan Song
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
|
26
|
Mao X, Ma Q, Liu B, Chen X, Zhang H, Xu Y. Revisiting operons: an analysis of the landscape of transcriptional units in E. coli. BMC Bioinformatics 2015; 16:356. [PMID: 26538447 PMCID: PMC4634151 DOI: 10.1186/s12859-015-0805-8] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 07/22/2015] [Accepted: 10/29/2015] [Indexed: 11/21/2022]
Abstract
Background Bacterial operons are considerably more complex than previously thought. At the very least, their components are defined dynamically, not statically as previously assumed. Here we present a computational study of the landscape of the transcriptional units (TUs) of E. coli K12, revealed by the available genomic and transcriptomic data, providing new understanding of the complexity of TUs as a whole encoded in the genome of E. coli K12. Results and conclusion Our main findings include that (i) different TUs may overlap with each other by sharing common genes, giving rise to clusters of overlapped TUs (TUCs) along the genomic sequence; (ii) the intergenic regions in front of the first gene of each TU tend to have more conserved sequence motifs than those of the other genes inside the TU, suggesting that TUs each have their own promoters; (iii) the terminators associated with the 3’ ends of TUCs tend to be Rho-independent terminators, substantially more often than terminators of TUs that end inside a TUC; and (iv) the functional relatedness of adjacent gene pairs in individual TUs is higher than that in TUCs, suggesting that individual TUs are more basic functional units than TUCs. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0805-8) contains supplementary material, which is available to authorized users.
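Finding (i) describes TUs that share genes being grouped into clusters of overlapped TUs (TUCs). One simple way to picture that grouping is a union-find pass that merges any two TUs with a gene in common, sketched below. This is an assumed reading of the grouping step, not the authors' pipeline: the function name and input format are hypothetical, and a TU that overlaps nothing is returned as its own single-member group, which the paper may not count as a TUC.

```python
def cluster_overlapping_tus(tus):
    """Group transcriptional units that share at least one gene, transitively.

    tus: dict mapping TU identifier -> set of gene names.
    Returns a list of sets of TU identifiers.
    """
    parent = {tu: tu for tu in tus}

    def find(x):
        # Follow parent links to the representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_tu_for_gene = {}
    for tu, genes in tus.items():
        for gene in genes:
            if gene in first_tu_for_gene:
                # Gene already seen in another TU: merge the two groups.
                parent[find(tu)] = find(first_tu_for_gene[gene])
            else:
                first_tu_for_gene[gene] = tu

    clusters = {}
    for tu in tus:
        clusters.setdefault(find(tu), set()).add(tu)
    return list(clusters.values())
```

For example, with `TU1 = {geneA, geneB}`, `TU2 = {geneB, geneC}` and `TU3 = {geneD}`, the first two TUs end up in one group because they share `geneB`, while `TU3` stays on its own.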
Collapse
Affiliation(s)
- Xizeng Mao
- Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, Athens, USA.
- Present address: MD Anderson Cancer Center, Houston, TX, 77054, USA.
- Qin Ma
- Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, Athens, USA.
- BioEnergy Research Center (BESC), Athens, GA, USA.
- Present address: Department of Plant Science, South Dakota State University, Brookings, SD, 57006, USA.
- Present address: BioSNTR, Brookings, SD, USA.
- Bingqiang Liu
- School of Mathematics, Shandong University, Jinan, Shandong, China.
- Xin Chen
- Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, Athens, USA.
- College of Computer Sciences and Technology, Changchun, Jilin, China.
- Hanyuan Zhang
- Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, Athens, USA.
- Present address: Systems Biology and Biomedical Informatics (SBBI) Laboratory, University of Nebraska-Lincoln, 122B/122C Avery Hall, 1144 T St, Lincoln, NE, 68588-0115, USA.
- Ying Xu
- Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, Athens, USA.
- BioEnergy Research Center (BESC), Athens, GA, USA.
- College of Computer Sciences and Technology, Changchun, Jilin, China.
- School of Public Health, Jilin University, Changchun, Jilin, China.
27
Peng J, Li H, Jiang Q, Wang Y, Chen J. An integrative approach for measuring semantic similarities using gene ontology. BMC SYSTEMS BIOLOGY 2014; 8 Suppl 5:S8. [PMID: 25559943 PMCID: PMC4305987 DOI: 10.1186/1752-0509-8-s5-s8] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 12/04/2022]
Abstract
Background Gene Ontology (GO) provides rich information and a convenient way to study gene functional similarity, which has been successfully used in various applications. However, existing GO-based similarity measures are limited because each considers only a subset of the information in GO. An appropriate integration of the existing measures that takes more of the information in GO into account is needed. Results We propose a novel integrative measure called InteGO2 that automatically selects appropriate seed measures and then integrates them using a metaheuristic search method. The experimental results show that InteGO2 significantly improves the performance of gene similarity in human, Arabidopsis and yeast on both the molecular function and biological process GO categories. Conclusions InteGO2 computes gene-to-gene similarities more accurately than the existing measures tested and is highly robust. The supplementary document and software are available at http://mlg.hit.edu.cn:8082/.
28
Konopka BM, Golda T, Kotulska M. Evaluating the Significance of Protein Functional Similarity Based on Gene Ontology. J Comput Biol 2014; 21:809-22. [DOI: 10.1089/cmb.2014.0181] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/08/2023]
Affiliation(s)
- Bogumil M. Konopka
- Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wroclaw, Poland
- Tomasz Golda
- Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wroclaw, Poland
- Malgorzata Kotulska
- Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wroclaw, Poland
29
Sudhakar P, Reck M, Wang W, He FQ, Wagner-Döbler I, Dobler IW, Zeng AP. Construction and verification of the transcriptional regulatory response network of Streptococcus mutans upon treatment with the biofilm inhibitor carolacton. BMC Genomics 2014; 15:362. [PMID: 24884510 PMCID: PMC4048456 DOI: 10.1186/1471-2164-15-362] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 11/27/2013] [Accepted: 04/17/2014] [Indexed: 11/26/2022]
Abstract
Background Carolacton is a newly identified secondary metabolite causing altered cell morphology and death of Streptococcus mutans biofilm cells. To unravel key regulators mediating these effects, the transcriptional regulatory response network of S. mutans biofilms upon carolacton treatment was constructed and analyzed. A systems biological approach integrating time-resolved transcriptomic data, reverse engineering, transcription factor binding sites, and experimental validation was carried out. Results The co-expression response network constructed from transcriptomic data using the reverse engineering algorithm called the Trend Correlation method consisted of 8284 gene pairs. The regulatory response network inferred by superimposing transcription factor binding site information into the co-expression network comprised 329 putative transcriptional regulatory interactions and could be classified into 27 sub-networks each co-regulated by a transcription factor. These sub-networks were significantly enriched with genes sharing common functions. The regulatory response network displayed global hierarchy and network motifs as observed in model organisms. The sub-networks modulated by the pyrimidine biosynthesis regulator PyrR, the glutamine synthetase repressor GlnR, the cysteine metabolism regulator CysR, global regulators CcpA and CodY and the two component system response regulators VicR and MbrC among others could putatively be related to the physiological effect of carolacton. The predicted interactions from the regulatory network between MbrC, known to be involved in cell envelope stress response, and the murMN-SMU_718c genes encoding peptidoglycan biosynthetic enzymes were experimentally confirmed using Electro Mobility Shift Assays. Furthermore, gene deletion mutants of five predicted key regulators from the response networks were constructed and their sensitivities towards carolacton were investigated. 
Deletion of cysR, the node having the highest connectivity among the regulators chosen from the regulatory network, resulted in a mutant which was insensitive to carolacton thus demonstrating not only the essentiality of cysR for the response of S. mutans biofilms to carolacton but also the relevance of the predicted network. Conclusion The network approach used in this study revealed important regulators and interactions as part of the response mechanisms of S. mutans biofilm cells to carolacton. It also opens a door for further studies into novel drug targets against streptococci.
Affiliation(s)
- Irene W Dobler
- Institute of Bioprocess and Biosystems Engineering, Hamburg University of Technology, 21073 Hamburg, Germany.
30
Song X, Li L, Srimani PK, Yu PS, Wang JZ. Measure the Semantic Similarity of GO Terms Using Aggregate Information Content. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:468-476. [PMID: 26356015 DOI: 10.1109/tcbb.2013.176] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Indexed: 06/05/2023]
Abstract
The rapid development of gene ontology (GO) and huge amount of biomedical data annotated by GO terms necessitate computation of semantic similarity of GO terms and, in turn, measurement of functional similarity of genes based on their annotations. In this paper we propose a novel and efficient method to measure the semantic similarity of GO terms. The proposed method addresses the limitations in existing GO term similarity measurement techniques; it computes the semantic content of a GO term by considering the information content of all of its ancestor terms in the graph. The aggregate information content (AIC) of all ancestor terms of a GO term implicitly reflects the GO term's location in the GO graph and also represents how human beings use this GO term and all its ancestor terms to annotate genes. We show that semantic similarity of GO terms obtained by our method closely matches the human perception. Extensive experimental studies show that this novel method also outperforms all existing methods in terms of the correlation with gene expression data. We have developed web services for measuring semantic similarity of GO terms and functional similarity of genes using the proposed AIC method and other popular methods. These web services are available at http://bioinformatics.clemson.edu/G-SESAME.
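The idea of scoring a term by the information content of all of its ancestors can be illustrated with a small sketch. Everything below is an assumption made for illustration: the toy ontology, the annotation probabilities, and the ratio used as the similarity (shared aggregate information content over total aggregate information content). It is not the exact weighting published for the AIC method.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents (not real GO terms).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
}

# Hypothetical probability that a gene is annotated with each term or a descendant.
PROB = {"root": 1.0, "A": 0.5, "B": 0.5, "A1": 0.25, "A2": 0.25}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t); rarer terms carry more information."""
    return -math.log(PROB[term])


def aggregate_ic(terms):
    """Sum the information content over a set of terms."""
    return sum(information_content(t) for t in terms)


def similarity(a, b):
    """Assumed similarity: shared aggregate IC relative to total aggregate IC."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    total = aggregate_ic(anc_a) + aggregate_ic(anc_b)
    if total == 0:
        # Both terms are the root, which carries no information.
        return 1.0 if a == b else 0.0
    return 2 * aggregate_ic(anc_a & anc_b) / total
```

In this toy graph, sibling terms under the same parent share part of their ancestor information, while terms that meet only at the root share none.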
31
HybridGO-Loc: mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins. PLoS One 2014; 9:e89545. [PMID: 24647341 PMCID: PMC3960097 DOI: 10.1371/journal.pone.0089545] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 11/11/2013] [Accepted: 01/23/2014] [Indexed: 12/23/2022]
Abstract
Protein subcellular localization prediction, as an essential step to elucidate the functions in vivo of proteins and identify drug targets, has been extensively studied in previous decades. Instead of only determining subcellular localization of single-label proteins, recent studies have focused on predicting both single- and multi-location proteins. Computational methods based on Gene Ontology (GO) have been demonstrated to be superior to methods based on other features. However, existing GO-based methods focus on the occurrences of GO terms and disregard their relationships. This paper proposes a multi-label subcellular-localization predictor, namely HybridGO-Loc, that leverages not only the GO term occurrences but also the inter-term relationships. This is achieved by hybridizing the GO frequencies of occurrences and the semantic similarity between GO terms. Given a protein, a set of GO terms is retrieved by searching against the gene ontology database, using the accession numbers of homologous proteins obtained via BLAST search as the keys. The frequency of GO occurrences and semantic similarity (SS) between GO terms are used to formulate frequency vectors and semantic similarity vectors, respectively, which are subsequently hybridized to construct fusion vectors. An adaptive-decision based multi-label support vector machine (SVM) classifier is proposed to classify the fusion vectors. Experimental results based on recent benchmark datasets and a new dataset containing novel proteins show that the proposed hybrid-feature predictor significantly outperforms predictors based on individual GO features as well as other state-of-the-art predictors. For readers' convenience, the HybridGO-Loc server, which is for predicting virus or plant proteins, is available online at http://bioinfo.eie.polyu.edu.hk/HybridGoServer/.
32
Žitnik M, Zupan B. Matrix factorization-based data fusion for gene function prediction in baker's yeast and slime mold. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2014:400-411. [PMID: 24297565 PMCID: PMC3902649] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
The development of effective methods for the characterization of gene functions that are able to combine diverse data sources in a sound and easily-extendible way is an important goal in computational biology. We have previously developed a general matrix factorization-based data fusion approach for gene function prediction. In this manuscript, we show that this data fusion approach can be applied to gene function prediction and that it can fuse various heterogeneous data sources, such as gene expression profiles, known protein annotations, interaction and literature data. The fusion is achieved by simultaneous matrix tri-factorization that shares matrix factors between sources. We demonstrate the effectiveness of the approach by evaluating its performance on predicting ontological annotations in slime mold D. discoideum and on recognizing proteins of baker's yeast S. cerevisiae that participate in the ribosome or are located in the cell membrane. Our approach achieves predictive performance comparable to that of the state-of-the-art kernel-based data fusion, but requires fewer data preprocessing steps.
Affiliation(s)
- Marinka Žitnik
- Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000, Slovenia
- Blaž Zupan
- Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1000, Slovenia
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX-77030, USA
33
Eksi R, Li HD, Menon R, Wen Y, Omenn GS, Kretzler M, Guan Y. Systematically differentiating functions for alternatively spliced isoforms through integrating RNA-seq data. PLoS Comput Biol 2013; 9:e1003314. [PMID: 24244129 PMCID: PMC3820534 DOI: 10.1371/journal.pcbi.1003314] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Received: 02/18/2013] [Accepted: 09/19/2013] [Indexed: 12/13/2022]
Abstract
Integrating large-scale functional genomic data has significantly accelerated our understanding of gene functions. However, no algorithm has been developed to differentiate functions for isoforms of the same gene using high-throughput genomic data. This is because standard supervised learning requires ‘ground-truth’ functional annotations, which are lacking at the isoform level. To address this challenge, we developed a generic framework that interrogates public RNA-seq data at the transcript level to differentiate functions for alternatively spliced isoforms. For a specific function, our algorithm identifies the ‘responsible’ isoform(s) of a gene and generates classifying models at the isoform level instead of at the gene level. Through cross-validation, we demonstrated that our algorithm is effective in assigning functions to genes, especially the ones with multiple isoforms, and robust to gene expression levels and removal of homologous gene pairs. We identified genes in the mouse whose isoforms are predicted to have disparate functionalities and experimentally validated the ‘responsible’ isoforms using data from mammary tissue. With protein structure modeling and experimental evidence, we further validated the predicted isoform functional differences for the genes Cdkn2a and Anxa6. Our generic framework is the first to predict and differentiate functions for alternatively spliced isoforms, instead of genes, using genomic data. It is extendable to any base machine learner and other species with alternatively spliced isoforms, and shifts the current gene-centered function prediction to isoform-level predictions. In mammalian genomes, a single gene can be alternatively spliced into multiple isoforms which greatly increase the functional diversity of the genome. In the human, more than 95% of multi-exon genes undergo alternative splicing. 
It is hard to computationally differentiate the functions for the splice isoforms of the same gene, because they are almost always annotated with the same functions and share similar sequences. In this paper, we developed a generic framework to identify the ‘responsible’ isoform(s) for each function that the gene carries out, and therefore predict functional assignment on the isoform level instead of on the gene level. Within this generic framework, we implemented and evaluated several related algorithms for isoform function prediction. We tested these algorithms through both computational evaluation and experimental validation of the predicted ‘responsible’ isoform(s) and the predicted disparate functions of the isoforms of Cdkn2a and of Anxa6. Our algorithm represents the first effort to predict and differentiate isoforms through large-scale genomic data integration.
Affiliation(s)
- Ridvan Eksi
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Hong-Dong Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Rajasree Menon
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Yuchen Wen
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Gilbert S. Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- Matthias Kretzler
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, United States of America
34
Wu X, Pang E, Lin K, Pei ZM. Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method. PLoS One 2013; 8:e66745. [PMID: 23741529 PMCID: PMC3669204 DOI: 10.1371/journal.pone.0066745] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Received: 02/06/2013] [Accepted: 05/10/2013] [Indexed: 12/30/2022]
Abstract
BACKGROUND Explicit comparisons based on the semantic similarity of Gene Ontology terms provide a quantitative way to measure the functional similarity between gene products and are widely applied in large-scale genomic research via integration with other models. Previously, we presented an edge-based method, Relative Specificity Similarity (RSS), which takes the global position of relevant terms into account. However, edge-based semantic similarity metrics are sensitive to the intrinsic structure of GO and simply consider terms at the same level in the ontology to be equally specific nodes, revealing the weaknesses that could be complemented using information content (IC). RESULTS AND CONCLUSIONS Here, we used the IC-based nodes to improve RSS and proposed a new method, Hybrid Relative Specificity Similarity (HRSS). HRSS outperformed other methods in distinguishing true protein-protein interactions from false. HRSS values were divided into four different levels of confidence for protein interactions. In addition, HRSS was statistically the best at obtaining the highest average functional similarity among human-mouse orthologs. Both HRSS and the groupwise measure, simGIC, are superior in correlation with sequence and Pfam similarities. Because different measures are best suited for different circumstances, we compared two pairwise strategies, the maximum and the best-match average, in the evaluation. The former was more effective at inferring physical protein-protein interactions, and the latter at estimating the functional conservation of orthologs and analyzing the CESSM datasets. In conclusion, HRSS can be applied to different biological problems by quantifying the functional similarity between gene products. The algorithm HRSS was implemented in the C programming language, which is freely available from http://cmb.bnu.edu.cn/hrss.
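The two pairwise strategies compared in this abstract, the maximum and the best-match average, can be sketched as follows. The sketch assumes a precomputed matrix of term-to-term similarities between the GO terms of two gene products; the matrix values are made up, and the best-match average shown here (the mean of the row-wise and column-wise averages of best matches) is one common definition rather than necessarily the exact variant used in the paper.

```python
def max_strategy(sim):
    """Gene-level similarity as the single best term-to-term similarity."""
    return max(max(row) for row in sim)


def best_match_average(sim):
    """Average each term's best match in both directions, then average the two."""
    row_best = [max(row) for row in sim]        # best match for each term of gene 1
    col_best = [max(col) for col in zip(*sim)]  # best match for each term of gene 2
    row_mean = sum(row_best) / len(row_best)
    col_mean = sum(col_best) / len(col_best)
    return (row_mean + col_mean) / 2


# Hypothetical similarities: rows are terms of gene 1, columns are terms of gene 2.
example = [
    [1.0, 0.2],
    [0.3, 0.4],
    [0.1, 0.6],
]
```

The maximum rewards a single shared function, which fits the abstract's observation that it works better for inferring physical interactions, whereas the best-match average reflects how well the full annotation sets agree.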
Affiliation(s)
- Xiaomei Wu
- College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou, People's Republic of China.
35
Zhang J, Li L, Peng L, Sun Y, Li J. An efficient weighted graph strategy to identify differentiation associated genes in embryonic stem cells. PLoS One 2013; 8:e62716. [PMID: 23638139 PMCID: PMC3637163 DOI: 10.1371/journal.pone.0062716] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/27/2012] [Accepted: 03/25/2013] [Indexed: 11/18/2022]
Abstract
In the past few decades, embryonic stem cells (ESCs) have been of great interest as a model system for studying early developmental processes and because of their potential therapeutic applications in regenerative medicine. However, the underlying mechanisms of ESC differentiation remain unclear, which limits our exploration of the therapeutic potential of stem cells. Fortunately, the increasing quantity and diversity of biological datasets provide opportunities to investigate these mechanisms. However, taking advantage of diverse biological information to advance ESC research remains a challenge. Here, we propose a scalable, efficient and flexible function prediction framework that integrates diverse biological information using a simple weighted strategy, for uncovering the genetic determinants of mouse ESC differentiation. The advantage of this approach is that it can make predictions based on dynamic information fusion, owing to the simple weighted strategy. With this approach, we identified 30 genes that had been reported to be associated with differentiation of stem cells, which we regard as associated with differentiation or pluripotency in embryonic stem cells. We also predicted 70 genes as candidates for contributing to differentiation, which require further confirmation. As a whole, our results showed that this strategy could be applied as a useful tool for ESC research.
Affiliation(s)
- Jie Zhang
- Department of Prevention, Tongji University School of Medicine, Shanghai, China
- Li Li
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University School of Medicine, Shanghai, China
- Luying Peng
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University School of Medicine, Shanghai, China
- Yingxian Sun
- Department of Cardiology, The First Hospital of China Medical University, Shenyang, China
- Jue Li
- Department of Prevention, Tongji University School of Medicine, Shanghai, China
36
Pradhan MP, Nagulapalli K, Palakal MJ. Cliques for the identification of gene signatures for colorectal cancer across population. BMC SYSTEMS BIOLOGY 2012; 6 Suppl 3:S17. [PMID: 23282040 PMCID: PMC3524317 DOI: 10.1186/1752-0509-6-s3-s17] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Indexed: 12/14/2022]
Abstract
Background Colorectal cancer (CRC) is one of the most commonly diagnosed cancers worldwide. Studies have correlated risk of CRC development with dietary habits and environmental conditions. Gene signatures for any disease can identify the key biological processes, which is especially useful in studying cancer development. Such processes can be used to evaluate potential drug targets. Though recognition of CRC gene-signatures across populations is crucial to better understanding potential novel treatment options for CRC, it remains a challenging task. Results We developed a topological and biological feature-based network approach for identifying the gene signatures across populations. In this work, we propose a novel approach of using cliques to understand the variability within population. Cliques are more conserved and co-expressed, therefore allowing identification and comparison of cliques across a population which can help researchers study gene variations. Our study was based on four publicly available expression datasets belonging to four different populations across the world. We identified cliques of various sizes (0 to 7) across the four population networks. Cliques of size seven were further analyzed across populations for their commonality and uniqueness. Forty-nine common cliques of size seven were identified. These cliques were further analyzed based on their connectivity profiles. We found associations between the cliques and their connectivity profiles across networks. With these clique connectivity profiles (CCPs), we were able to identify the divergence among the populations, important biological processes (cell cycle, signal transduction, and cell differentiation), and related gene pathways. Therefore the genes identified in these cliques and their connectivity profiles can be defined as the gene-signatures across populations. In this work we demonstrate the power and effectiveness of cliques to study CRC across populations. 
Conclusions We developed a new approach where cliques and their connectivity profiles helped elucidate the variation and similarity in CRC gene profiles across four populations with unique dietary habits.
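Finding cliques in several networks and keeping those common to all of them can be sketched with a basic Bron-Kerbosch enumeration. This is an illustrative sketch only: the gene names and edges are hypothetical, the code enumerates maximal cliques of any size rather than cliques of a fixed size, and the abstract does not state which clique-finding algorithm the authors used.

```python
def build_adjacency(edges):
    """Turn an undirected edge list into a node -> neighbour-set mapping."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the Bron-Kerbosch algorithm (no pivoting)."""
    found = set()

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            found.add(frozenset(current))
            return
        for node in list(candidates):
            expand(current | {node}, candidates & adj[node], excluded & adj[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adj), set())
    return found


# Hypothetical co-expression networks for two populations.
network_1 = build_adjacency([("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g3", "g4")])
network_2 = build_adjacency([("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g4", "g5")])

# Cliques present in both networks.
common_cliques = maximal_cliques(network_1) & maximal_cliques(network_2)
```

Representing each clique as a `frozenset` makes the cross-population comparison a plain set intersection; with more than two networks the same intersection can be applied repeatedly.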
Affiliation(s)
- Meeta P Pradhan
- School of Informatics, Indiana University Purdue University Indianapolis, IN, USA
37
A sensitive method for computing GO-based functional similarities among genes with ‘shallow annotation’. Gene 2012; 509:131-5. [DOI: 10.1016/j.gene.2012.07.078] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/17/2012] [Accepted: 07/31/2012] [Indexed: 11/22/2022]
38
Lemay DG, Martin WF, Hinrichs AS, Rijnkels M, German JB, Korf I, Pollard KS. G-NEST: a gene neighborhood scoring tool to identify co-conserved, co-expressed genes. BMC Bioinformatics 2012; 13:253. [PMID: 23020263 PMCID: PMC3575404 DOI: 10.1186/1471-2105-13-253] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 03/31/2012] [Accepted: 09/23/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND In previous studies, gene neighborhoods (spatial clusters of co-expressed genes in the genome) have been defined using arbitrary rules such as requiring adjacency, a minimum number of genes, a fixed window size, or a minimum expression level. In the current study, we developed a Gene Neighborhood Scoring Tool (G-NEST) which combines genomic location, gene expression, and evolutionary sequence conservation data to score putative gene neighborhoods across all possible window sizes simultaneously. RESULTS Using G-NEST on atlases of mouse and human tissue expression data, we found that large neighborhoods of ten or more genes are extremely rare in mammalian genomes. When they do occur, neighborhoods are typically composed of families of related genes. Both the highest scoring and the largest neighborhoods in mammalian genomes are formed by tandem gene duplication. Mammalian gene neighborhoods contain highly and variably expressed genes. Co-localized noisy gene pairs exhibit lower evolutionary conservation of their adjacent genome locations, suggesting that their shared transcriptional background may be disadvantageous. Genes that are essential to mammalian survival and reproduction are less likely to occur in neighborhoods, although neighborhoods are enriched with genes that function in mitosis. We also found that gene orientation and protein-protein interactions are partially responsible for maintenance of gene neighborhoods. CONCLUSIONS Our experiments using G-NEST confirm that tandem gene duplication is the primary driver of non-random gene order in mammalian genomes. Non-essentiality, co-functionality, gene orientation, and protein-protein interactions are additional forces that maintain gene neighborhoods, especially those formed by tandem duplicates. We expect G-NEST to be useful for other applications such as the identification of core regulatory modules, common transcriptional backgrounds, and chromatin domains. The software is available at http://docpollard.org/software.html.
Affiliation(s)
- Danielle G Lemay
- Genome Center, University of California Davis, 451 Health Science Dr, Davis, CA, 95616, United States of America.
39
Mining functional gene modules linked with rheumatoid arthritis using a SNP-SNP network. GENOMICS PROTEOMICS & BIOINFORMATICS 2012; 10:23-34. [PMID: 22449398 PMCID: PMC5054489 DOI: 10.1016/s1672-0229(11)60030-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 03/11/2011] [Accepted: 08/31/2011] [Indexed: 11/21/2022]
Abstract
The identification of functional gene modules that are derived from integration of information from different types of networks is a powerful strategy for interpreting the etiology of complex diseases such as rheumatoid arthritis (RA). Genetic variants are known to increase the risk of developing RA. Here, a novel method, the construction of a genetic network, was used to mine functional gene modules linked with RA. A polymorphism interaction analysis (PIA) algorithm was used to obtain cooperating single nucleotide polymorphisms (SNPs) that contribute to RA disease. The acquired SNP pairs were used to construct a SNP-SNP network. Sub-networks defined by hub SNPs were then extracted and turned into gene modules by mapping SNPs to genes using dbSNP database. We performed Gene Ontology (GO) analysis on each gene module, and some GO terms enriched in the gene modules can be used to investigate clustered gene function for better understanding RA pathogenesis. This method was applied to the Genetic Analysis Workshop 15 (GAW 15) RA dataset. The results show that genes involved in functional gene modules, such as CD160 (rs744877) and RUNX1 (rs2051179), are especially relevant to RA, which is supported by previous reports. Furthermore, the 43 SNPs involved in the identified gene modules were found to be the best classifiers when used as variables for sample classification.
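The pipeline described here (SNP pairs, a SNP-SNP network, sub-networks around hub SNPs, then gene modules) can be sketched in a few lines. The sketch makes several assumptions that the abstract does not specify: a hub is taken to be any SNP with at least `min_degree` partners, a sub-network is the hub plus its direct neighbours, and the SNP identifiers and SNP-to-gene mapping are hypothetical stand-ins for PIA output and a dbSNP lookup.

```python
from collections import defaultdict


def hub_gene_modules(snp_pairs, snp_to_gene, min_degree):
    """Map each hub SNP to the set of genes in its immediate neighbourhood."""
    neighbours = defaultdict(set)
    for a, b in snp_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    modules = {}
    for snp, partners in neighbours.items():
        if len(partners) >= min_degree:  # assumed hub criterion
            members = {snp} | partners
            # SNPs with no gene mapping are skipped rather than guessed.
            modules[snp] = {snp_to_gene[s] for s in members if s in snp_to_gene}
    return modules


# Hypothetical interacting SNP pairs and SNP-to-gene mapping.
pairs = [("s1", "s2"), ("s1", "s3"), ("s1", "s4"), ("s2", "s3")]
mapping = {"s1": "G1", "s2": "G2", "s3": "G2"}

modules = hub_gene_modules(pairs, mapping, min_degree=3)
```

Each resulting gene set could then be passed to a GO enrichment analysis, which is the step the abstract applies to every module.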
40
Zhu P, Gu H, Jiao Y, Huang D, Chen M. Computational identification of protein-protein interactions in rice based on the predicted rice interactome network. GENOMICS PROTEOMICS & BIOINFORMATICS 2012; 9:128-37. [PMID: 22196356 PMCID: PMC5054448 DOI: 10.1016/s1672-0229(11)60016-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 02/23/2011] [Accepted: 07/04/2011] [Indexed: 01/29/2023]
Abstract
Plant protein-protein interaction networks have not been identified by large-scale experiments. In order to better understand the protein interactions in rice, the Predicted Rice Interactome Network (PRIN; http://bis.zju.edu.cn/prin/) presented 76,585 predicted interactions involving 5,049 rice proteins. After mapping genomic features of rice (GO annotation, subcellular localization prediction, and gene expression), we found that a well-annotated and biologically significant network is rich enough to capture many significant functional linkages within higher-order biological systems, such as pathways and biological processes. Furthermore, we took MADS-box domain-containing proteins and circadian rhythm signaling pathways as examples to demonstrate that functional protein complexes and biological pathways could be effectively expanded in our predicted network. The expanded molecular network in PRIN has considerably improved the capability of these analyses to integrate existing knowledge and provide novel insights into the function and coordination of genes and gene networks.
Affiliation(s)
- Pengcheng Zhu
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou 310058, China
|
41
|
Zhang S, Chang Z, Li Z, DuanMu H, Li Z, Li K, Liu Y, Qiu F, Xu Y. Calculating phenotypic similarity between genes using hierarchical structure data based on semantic similarity. Gene 2012; 497:58-65. [PMID: 22305981 DOI: 10.1016/j.gene.2012.01.014] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/19/2011] [Revised: 01/16/2012] [Accepted: 01/18/2012] [Indexed: 01/25/2023]
Abstract
Phenotypic similarity is correlated with a number of measures of gene function, such as relatedness at the level of direct protein-protein interaction. The phenotypic effect of a deleted or mutated gene, which is one part of gene annotation, has attracted broad attention. However, few measures exist for studying phenotypic similarity with data from the Human Phenotype Ontology (HPO) database, so more such measures should be developed and investigated. We used five semantic similarity-based measures (Jiang and Conrath, Lin, Schlicker, Yu and Wu) to calculate the human phenotypic similarity between genes (PSG) with data from the HPO database, and evaluated their accuracy with information on protein-protein interaction, protein complex, protein family, gene function or DNA sequence. Compared with randomly selected gene pairs, the results of these methods were statistically significant (all P<0.001). Furthermore, we assessed the performance of these five measures by receiver operating characteristic (ROC) curve analysis, and found that most of them performed better than the previous methods. This work showed that these semantic similarity-based measures for calculating PSG are effective for hierarchical structure data. Our study contributes to the development and optimization of novel algorithms for PSG calculation and provides researchers with more alternative methods as well as tools and directions for PSG study.
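Two of the measures named in this abstract, Lin and Jiang and Conrath, have standard information-content definitions at the level of ontology terms. The sketch below shows those textbook formulas on a tiny invented ontology. The term names, the parent map `PARENTS`, and the annotation probabilities `PROB` are hypothetical stand-ins for HPO data, and the step that combines term-level scores into a gene-level PSG score is not shown because the abstract does not specify it.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents (a DAG rooted at "root").
PARENTS = {
    "root": set(),
    "abnormal_heart": {"root"},
    "abnormal_valve": {"abnormal_heart"},
    "arrhythmia": {"abnormal_heart"},
    "abnormal_eye": {"root"},
}

# Hypothetical probabilities p(t): share of genes annotated to t or any descendant.
PROB = {
    "root": 1.0,
    "abnormal_heart": 0.4,
    "abnormal_valve": 0.1,
    "arrhythmia": 0.2,
    "abnormal_eye": 0.3,
}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: IC(t) = -log p(t), written as log(1/p)."""
    return math.log(1.0 / PROB[term])


def mica_ic(a, b):
    """IC of the most informative common ancestor of a and b."""
    return max(ic(t) for t in ancestors(a) & ancestors(b))


def lin(a, b):
    """Lin similarity: 2 * IC(MICA) / (IC(a) + IC(b))."""
    denom = ic(a) + ic(b)
    return 2 * mica_ic(a, b) / denom if denom > 0 else 1.0


def jiang_conrath_distance(a, b):
    """Jiang-Conrath distance: IC(a) + IC(b) - 2 * IC(MICA)."""
    return ic(a) + ic(b) - 2 * mica_ic(a, b)
```

Terms that share only the root score 0 under Lin, and a term compared with itself scores 1, which matches the usual reading of the measure.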
Affiliation(s)
- Shanzhen Zhang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, PR China
|
42
|
Judson RS, Mortensen HM, Shah I, Knudsen TB, Elloumi F. Using pathway modules as targets for assay development in xenobiotic screening. Mol Biosyst 2012; 8:531-42. [DOI: 10.1039/c1mb05303e] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 12/20/2022]
|
43
|
Wei P, Pan W. Bayesian Joint Modeling of Multiple Gene Networks and Diverse Genomic Data to Identify Target Genes of a Transcription Factor. Ann Appl Stat 2012; 6:334-355. [PMID: 22408712 PMCID: PMC3298193 DOI: 10.1214/11-aoas502] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 01/08/2023]
Abstract
We consider integrative modeling of multiple gene networks and diverse genomic data, including protein-DNA binding, gene expression and DNA sequence data, to accurately identify the regulatory target genes of a transcription factor (TF). Rather than treating all the genes equally and independently a priori in existing joint modeling approaches, we incorporate the biological prior knowledge that neighboring genes on a gene network tend to be (or not to be) regulated together by a TF. A key contribution of our work is that, to maximize the use of all existing biological knowledge, we allow incorporation of multiple gene networks into joint modeling of genomic data by introducing a mixture model based on the use of multiple Markov random fields (MRFs). Another important contribution of our work is to allow different genomic data to be correlated and to examine the validity and effect of the independence assumption as adopted in existing methods. Due to a fully Bayesian approach, inference about model parameters can be carried out based on MCMC samples. Application to an E. coli data set, together with simulation studies, demonstrates the utility and statistical efficiency gains with the proposed joint model.
Affiliation(s)
- Peng Wei
- Division of Biostatistics and Human Genetics Center, University of Texas School of Public Health, Houston, TX 77030, USA
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
|
44
|
Guzzi PH, Mina M, Guerra C, Cannataro M. Semantic similarity analysis of protein data: assessment with biological features and issues. Brief Bioinform 2011; 13:569-85. [PMID: 22138322 DOI: 10.1093/bib/bbr066] [Citation(s) in RCA: 113] [Impact Index Per Article: 8.1] [Indexed: 01/08/2023]
Abstract
The integration of proteomics data with biological knowledge is a recent trend in bioinformatics. A lot of biological information is available, spread across different sources and encoded in different ontologies (e.g. Gene Ontology). Annotating existing protein data with biological information may enable the use (and the development) of algorithms that use biological ontologies as a framework to mine annotated data. Recently, many methodologies and algorithms that use ontologies to extract knowledge from data, as well as to analyse ontologies themselves, have been proposed and applied to other fields. Conversely, the use of such annotations for the analysis of protein data is a relatively novel research area that is becoming increasingly central in research. Existing approaches range from the definition of the similarity among genes and proteins on the basis of the annotating terms, to the definition of novel algorithms that use such similarities for mining protein data on a proteome-wide scale. This work, after defining the main concepts of such analysis, presents a systematic discussion and comparison of the main approaches. Finally, remaining challenges, as well as possible future directions of research, are presented.
|
45
|
Oehmen CS, Straatsma TP, Anderson GA, Orr G, Webb-Robertson BJM, Taylor RC, Mooney RW, Baxter DJ, Jones DR, Dixon DA. New challenges facing integrative biological science in the post-genomic era. J BIOL SYST 2011. [DOI: 10.1142/s0218339006001805] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
The future of biology will be increasingly driven by the fundamental paradigm shift from hypothesis-driven research to data-driven discovery research employing the growing volume of biological data coupled to experimental testing of new discoveries. But hardware and software limitations in the current workflow infrastructure make it impossible or intractable to use real data from disparate sources for large-scale biological research. We identify key technological developments needed to enable this paradigm shift involving (1) the ability to store and manage extremely large datasets which are dispersed over a wide geographical area, (2) development of novel analysis and visualization tools which are capable of operating on enormous data resources without overwhelming researchers with unusable information, and (3) formalisms for integrating mathematical models of biosystems from the molecular level to the organism population level. This will require the development of algorithms and tools which efficiently utilize high-performance compute power and large storage infrastructures. The end result will be the ability of a researcher to integrate complex data from many different sources with simulations to analyze a given system at a wide range of temporal and spatial scales in a single conceptual model.
Affiliation(s)
- Galya Orr
- Pacific Northwest National Laboratory, Richland, WA 99352, USA
- Ryan W. Mooney
- Pacific Northwest National Laboratory, Richland, WA 99352, USA
- Doug J. Baxter
- Pacific Northwest National Laboratory, Richland, WA 99352, USA
- Donald R. Jones
- Pacific Northwest National Laboratory, Richland, WA 99352, USA
- David A. Dixon
- Department of Chemistry, University of Alabama, Tuscaloosa, AL 35487-0336, USA
|
46
|
Hendrix W, Rocha AM, Padmanabhan K, Choudhary A, Scott K, Mihelcic JR, Samatova NF. DENSE: efficient and prior knowledge-driven discovery of phenotype-associated protein functional modules. BMC SYSTEMS BIOLOGY 2011; 5:172. [PMID: 22024446 PMCID: PMC3231954 DOI: 10.1186/1752-0509-5-172] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 07/14/2011] [Accepted: 10/24/2011] [Indexed: 01/09/2023]
Abstract
Background Identifying cellular subsystems that are involved in the expression of a target phenotype has been a very active research area for the past several years. In this paper, cellular subsystem refers to a group of genes (or proteins) that interact and carry out a common function in the cell. Most studies identify genes associated with a phenotype on the basis of some statistical bias; others have extended these statistical methods to analyze functional modules and biological pathways for phenotype-relatedness. However, a biologist might often have a specific question in mind while performing such analysis, and most of the resulting subsystems obtained by the existing methods might be largely irrelevant to the question at hand. Arguably, it would be valuable to incorporate the biologist's knowledge about the phenotype into the algorithm. This way, it is anticipated that the resulting subsystems would not only be related to the target phenotype but also contain information that the biologist is likely to be interested in. Results In this paper we introduce a fast and theoretically guaranteed method called DENSE (Dense and ENriched Subgraph Enumeration) that can take as input a biologist's prior knowledge as a set of query proteins and identify all the dense functional modules in a biological network that contain some part of the query vertices. The density (in terms of the number of network edges) and the enrichment (the number of query proteins in the resulting functional module) can be manipulated via two parameters γ and μ, respectively. Conclusion This algorithm has been applied to the protein functional association network of Clostridium acetobutylicum ATCC 824, a hydrogen producing, acid-tolerant organism. The algorithm was able to verify relationships known to exist in literature and also some previously unknown relationships including those with regulatory and signaling functions. Additionally, we were also able to hypothesize that some uncharacterized proteins are likely associated with the target phenotype. The DENSE code can be downloaded from http://www.freescience.org/cs/DENSE/
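The two controls mentioned in this abstract, a density threshold γ and an enrichment threshold μ, can be illustrated with a simple check on a candidate module. This is not the DENSE enumeration algorithm and does not use the DENSE code. The definitions below (density as the fraction of possible edges present, enrichment as the fraction of module members that are query proteins) are assumptions chosen for illustration, and the paper's exact definitions may differ. All function names and the toy network are hypothetical.

```python
from itertools import combinations


def edge_density(module, edges):
    """Assumed density: present edges divided by possible edges within the module."""
    nodes = sorted(module)
    if len(nodes) < 2:
        return 0.0
    possible = len(nodes) * (len(nodes) - 1) // 2
    present = sum(
        1 for u, v in combinations(nodes, 2) if frozenset((u, v)) in edges
    )
    return present / possible


def enrichment(module, query):
    """Assumed enrichment: fraction of module members that are query proteins."""
    if not module:
        return 0.0
    return len(set(module) & set(query)) / len(module)


def passes_thresholds(module, edges, query, gamma, mu):
    """True if the module is at least gamma-dense and mu-enriched (as defined above)."""
    return edge_density(module, edges) >= gamma and enrichment(module, query) >= mu


# Toy undirected network stored as a set of frozenset pairs (invented data).
edges = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]}
query = {"A", "B"}
```

With these toy inputs, the triangle `{"A", "B", "C"}` is fully dense and two thirds of its members are query proteins, while adding `"D"` lowers both values.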
Affiliation(s)
- William Hendrix
- Department of Computer Science, North Carolina State University, Raleigh, 27695, USA
|
47
|
Díaz-Díaz N, Aguilar-Ruiz JS. GO-based functional dissimilarity of gene sets. BMC Bioinformatics 2011; 12:360. [PMID: 21884611 PMCID: PMC3248071 DOI: 10.1186/1471-2105-12-360] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 11/26/2010] [Accepted: 09/01/2011] [Indexed: 01/23/2023]
Abstract
BACKGROUND The Gene Ontology (GO) provides a controlled vocabulary for describing the functions of genes and can be used to evaluate the functional coherence of gene sets. Many functional coherence measures consider each pair of gene functions in a set and produce an output based on all pairwise distances. A single gene can encode multiple proteins that may differ in function. For each functionality, other proteins that exhibit the same activity may also participate. Therefore, identifying the most common function for all of the genes involved in a biological process is important in evaluating the functional similarity of groups of genes, and a quantification of functional coherence can help to clarify the role of a group of genes working together. RESULTS To implement this approach to functional assessment, we present GFD (GO-based Functional Dissimilarity), a novel dissimilarity measure for evaluating groups of genes based on the most relevant functions of the whole set. The measure assigns a numerical value to the gene set for each of the three GO sub-ontologies. CONCLUSIONS Results show that GFD performs robustly when applied to gene sets of known functionality (extracted from KEGG). It performs particularly well on randomly generated gene sets. An ROC analysis reveals that the performance of GFD in evaluating the functional dissimilarity of gene sets is very satisfactory. A comparative analysis against other functional measures, such as GS2 and those presented by Resnik and Wang, also demonstrates the robustness of GFD.
|
48
|
Gu H, Zhu P, Jiao Y, Meng Y, Chen M. PRIN: a predicted rice interactome network. BMC Bioinformatics 2011; 12:161. [PMID: 21575196 PMCID: PMC3118165 DOI: 10.1186/1471-2105-12-161] [Citation(s) in RCA: 131] [Impact Index Per Article: 9.4] [Received: 12/15/2010] [Accepted: 05/16/2011] [Indexed: 12/22/2022]
Abstract
Background Protein-protein interactions play a fundamental role in elucidating the molecular mechanisms of biomolecular function, signal transduction and metabolic pathways of living organisms. Although high-throughput technologies such as the yeast two-hybrid system and affinity purification followed by mass spectrometry are widely used in model organisms, progress in detecting protein-protein interactions in plants is rather slow. With this motivation, our work presents a computational approach to predict protein-protein interactions in Oryza sativa. Results To better understand the interactions of proteins in Oryza sativa, we have developed PRIN, a Predicted Rice Interactome Network. Protein-protein interaction data of PRIN are based on the interologs of six model organisms where large-scale protein-protein interaction experiments have been applied: yeast (Saccharomyces cerevisiae), worm (Caenorhabditis elegans), fruit fly (Drosophila melanogaster), human (Homo sapiens), Escherichia coli K12 and Arabidopsis thaliana. With certain quality controls, altogether we obtained 76,585 non-redundant rice protein interaction pairs among 5,049 rice proteins. Further analysis showed that the topological properties of the predicted rice protein interaction network are more similar to yeast than to the other 5 organisms. This may not be surprising as the interologs based on yeast contribute nearly 74% of total interactions. In addition, GO annotation, subcellular localization information and gene expression data are also mapped to our network for validation. Finally, a user-friendly web interface was developed to offer convenient database search and network visualization. Conclusions PRIN is the first well annotated protein interaction database for the important model plant Oryza sativa. It has greatly extended the currently available protein-protein interaction data of rice with a computational approach, which will certainly provide further insights into rice functional genomics and systems biology. PRIN is available online at http://bis.zju.edu.cn/prin/.
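The interolog idea behind this abstract is that if proteins A and B interact in a model organism, their orthologs in the target species are predicted to interact. A minimal sketch of that transfer step follows. It is not the PRIN pipeline: the function name `map_interologs`, the choice to drop self-interactions, and the toy identifiers are assumptions, and the quality controls the abstract mentions are not modelled.

```python
from itertools import product


def map_interologs(template_ppis, orthologs):
    """Transfer interactions from a model organism to a target species.

    template_ppis : iterable of (a, b) interacting pairs in the model organism
    orthologs     : dict mapping a model protein to a set of target orthologs
    Returns a set of sorted (x, y) tuples, so each predicted pair appears once.
    """
    predicted = set()
    for a, b in template_ppis:
        # Every ortholog of a is paired with every ortholog of b.
        for x, y in product(orthologs.get(a, ()), orthologs.get(b, ())):
            if x != y:  # assumption: self-interactions are not reported
                predicted.add(tuple(sorted((x, y))))
    return predicted


# Toy data with invented identifiers; YC has no known ortholog in the target.
yeast_ppis = [("YA", "YB"), ("YB", "YC"), ("YB", "YA")]
ortholog_map = {"YA": {"Os1"}, "YB": {"Os2", "Os3"}}
rice_ppis = map_interologs(yeast_ppis, ortholog_map)
```

Storing pairs as sorted tuples in a set is one simple way to keep the predicted list non-redundant when the same pair is reachable from several template interactions or from several model organisms.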
Affiliation(s)
- Haibin Gu
- Department of Bioinformatics, State Key Laboratory of Plant Physiology and Biochemistry, College of Life Sciences, Zhejiang University, Hangzhou 310058, China
|
49
|
Gómez A, Cedano J, Amela I, Planas A, Piñol J, Querol E. Gene ontology function prediction in mollicutes using protein-protein association networks. BMC SYSTEMS BIOLOGY 2011; 5:49. [PMID: 21486441 PMCID: PMC3086830 DOI: 10.1186/1752-0509-5-49] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 01/24/2011] [Accepted: 04/12/2011] [Indexed: 11/18/2022]
Abstract
Background Many complex systems can be represented and analysed as networks. The recent availability of large-scale datasets has made it possible to elucidate some of the organisational principles and rules that govern their function, robustness and evolution. However, one of the main limitations in using protein-protein interactions for function prediction is the availability of interaction data, especially for Mollicutes. If we could harness predicted interactions, such as those from Protein-Protein Association Networks (PPAN), combining several protein-protein network function-inference methods with semantic similarity calculations, the use of protein-protein interactions for functional inference in this species would become potentially more useful. Results In this work we show that using PPAN data combined with other approximations, such as functional module detection, orthology exploitation methods and Gene Ontology (GO)-based information measures, helps to predict protein function in Mycoplasma genitalium. Conclusions To our knowledge, the proposed method is the first that combines functional module detection among species, exploiting an orthology procedure and using information theory-based GO semantic similarity in PPAN of the Mycoplasma species. The results of an evaluation show a higher recall than previously reported methods that focused on only one organism network.
Affiliation(s)
- Antonio Gómez
- Institut de Biotecnologia i Biomedicina and Departament de Bioquímica i Biologia Molecular, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain
|
50
|
Chen Y, Mao F, Li G, Xu Y. Genome-wide discovery of missing genes in biological pathways of prokaryotes. BMC Bioinformatics 2011; 12 Suppl 1:S1. [PMID: 21342538 PMCID: PMC3044263 DOI: 10.1186/1471-2105-12-s1-s1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 12/16/2022]
Abstract
Affiliation(s)
- Yong Chen
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.
|