1
Chundru VK, Zhang Z, Walter K, Lindsay SJ, Danecek P, Eberhardt RY, Gardner EJ, Malawsky DS, Wigdor EM, Torene R, Retterer K, Wright CF, Ólafsdóttir H, Guillen Sacoto MJ, Ayaz A, Akbeyaz IH, Türkdoğan D, Al Balushi AI, Bertoli-Avella A, Bauer P, Szenker-Ravi E, Reversade B, McWalter K, Sheridan E, Firth HV, Hurles ME, Samocha KE, Ustach VD, Martin HC. Federated analysis of autosomal recessive coding variants in 29,745 developmental disorder patients from diverse populations. Nat Genet 2024; 56:2046-2053. [PMID: 39313616 PMCID: PMC11525179 DOI: 10.1038/s41588-024-01910-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Accepted: 08/14/2024] [Indexed: 09/25/2024]
Abstract
Autosomal recessive coding variants are well-known causes of rare disorders. We quantified the contribution of these variants to developmental disorders in a large, ancestrally diverse cohort comprising 29,745 trios, of whom 20.4% had genetically inferred non-European ancestries. The estimated fraction of patients attributable to exome-wide autosomal recessive coding variants ranged from ~2-19% across genetically inferred ancestry groups and was significantly correlated with average autozygosity. Established autosomal recessive developmental disorder-associated (ARDD) genes explained 84.0% of the total autosomal recessive coding burden, and 34.4% of the burden in these established genes was explained by variants not already reported as pathogenic in ClinVar. Statistical analyses identified two novel ARDD genes: KBTBD2 and ZDHHC16. This study expands our understanding of the genetic architecture of developmental disorders across diverse genetically inferred ancestry groups and suggests that improving strategies for interpreting missense variants in known ARDD genes may help diagnose more patients than discovering the remaining genes.
Affiliation(s)
- V Kartik Chundru
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - Department of Clinical and Biomedical Sciences, University of Exeter Medical School, Royal Devon and Exeter Hospital, Exeter, UK
- Zhancheng Zhang
  - GeneDx, Gaithersburg, MD, USA
  - Deka Biosciences, Germantown, MD, USA
- Klaudia Walter
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Sarah J Lindsay
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Petr Danecek
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Eugene J Gardner
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - MRC Epidemiology Unit, Cambridge, UK
- Emilie M Wigdor
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - Institute of Developmental and Regenerative Medicine, Department of Paediatrics, University of Oxford, Oxford, UK
- Rebecca Torene
  - GeneDx, Gaithersburg, MD, USA
  - Geisinger, Danville, PA, USA
- Kyle Retterer
  - GeneDx, Gaithersburg, MD, USA
  - Geisinger, Danville, PA, USA
- Caroline F Wright
  - Department of Clinical and Biomedical Sciences, University of Exeter Medical School, Royal Devon and Exeter Hospital, Exeter, UK
- Akif Ayaz
  - Istanbul Medipol University, Medical School, Department of Medical Genetics, Istanbul, Turkey
- Ismail Hakki Akbeyaz
  - Marmara University Medical Faculty, Pendik Training and Research Hospital, Department of Pediatric Neurology, Istanbul, Turkey
- Dilşad Türkdoğan
  - Marmara University Medical Faculty, Pendik Training and Research Hospital, Department of Pediatric Neurology, Istanbul, Turkey
- Peter Bauer
  - Medical Genetics, CENTOGENE GmbH, Rostock, Germany
  - Clinic of Internal Medicine, Department of Hematology, Oncology, and Palliative Medicine, University Medicine Rostock, Rostock, Germany
- Bruno Reversade
  - Laboratory of Human Genetics & Therapeutics, BESE, KAUST, Thuwal, Saudi Arabia
- Eamonn Sheridan
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - Leeds Institute of Medical Research, University of Leeds, St. James's University Hospital, Leeds, UK
  - Yorkshire Regional Genetics Service, Chapel Allerton Hospital, Leeds, UK
- Helen V Firth
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - Cambridge University Hospitals Foundation Trust, Addenbrooke's Hospital, Cambridge, UK
- Kaitlin E Samocha
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Hilary C Martin
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
2
Koutsandreas T, Felden B, Chevet E, Chatziioannou A. Protein homeostasis imprinting across evolution. NAR Genom Bioinform 2024; 6:lqae014. [PMID: 38486886 PMCID: PMC10939379 DOI: 10.1093/nargab/lqae014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 10/07/2023] [Accepted: 01/24/2024] [Indexed: 03/17/2024] Open
Abstract
Protein homeostasis (a.k.a. proteostasis) is associated with the primary functions of life, and therefore with evolution. However, it is unclear how cellular proteostasis machines have evolved to adjust protein biogenesis needs to environmental constraints. Herein, we describe a novel computational approach, based on semantic network analysis, to evaluate proteostasis plasticity during evolution. We show that the molecular components of the proteostasis network (PN) are reliable metrics to deconvolute the life forms into Archaea, Bacteria and Eukarya and to assess the evolution rates among species. Semantic graphs were used as new criteria to evaluate PN complexity in 93 Eukarya, 250 Bacteria and 62 Archaea, thus representing a novel strategy for taxonomic classification, which provided information about species divergence. Kingdom-specific PN components were identified, suggesting that PN complexity may correlate with evolution. We found that the gains that occurred throughout PN evolution revealed a dichotomy within both the PN conserved modules and within kingdom-specific modules. Additionally, many of these components contribute to the evolutionary imprinting of other conserved mechanisms. Finally, the current study suggests a new way to exploit the genomic annotation of biomedical ontologies, deriving new knowledge from the semantic comparison of different biological systems.
Affiliation(s)
- Thodoris Koutsandreas
  - Center of Systems Biology, Biomedical Research Foundation of the Academy of Athens, Athens, Greece
  - e-NIOS Applications PC, Kallithea-Athens, Greece
- Brice Felden
  - University of Rennes, INSERM U1230, Rennes, France
- Eric Chevet
  - INSERM U1242, University of Rennes, Rennes, France
  - Centre de Lutte Contre le Cancer Eugène Marquis, Rennes, France
- Aristotelis Chatziioannou
  - Center of Systems Biology, Biomedical Research Foundation of the Academy of Athens, Athens, Greece
  - e-NIOS Applications PC, Kallithea-Athens, Greece
3
Muniyappan S, Rayan AXA, Varrieth GT. DTiGNN: Learning drug-target embedding from a heterogeneous biological network based on a two-level attention-based graph neural network. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:9530-9571. [PMID: 37161255 DOI: 10.3934/mbe.2023419] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/11/2023]
Abstract
MOTIVATION: In vitro experiment-based drug-target interaction (DTI) exploration demands substantial human, financial and data resources. In silico approaches have been recommended for predicting DTIs to reduce time and cost. During the drug development process, one can analyze the therapeutic effect of a drug for a particular disease by identifying how the drug binds to the target for treating that disease. Hence, DTI plays a major role in drug discovery. Many computational methods have been developed for DTI prediction. However, the existing methods are limited in capturing the interactions via multiple semantics between drug and target nodes in a heterogeneous biological network (HBN).
METHODS: In this paper, we propose a DTiGNN framework for identifying unknown drug-target pairs. The DTiGNN first calculates the similarity between the drug and target from multiple perspectives. Then, the features of drugs and targets from each perspective are learned separately by using a novel method termed an information entropy-based random walk. Next, all of the learned features from different perspectives are integrated into a single drug and target similarity network by using a multi-view convolutional neural network. The HBN is constructed from the integrated similarity networks, drug interactions, drug-disease associations, protein interactions and protein-disease associations. Next, a novel embedding algorithm called a meta-graph guided graph neural network is used to learn the embedding of drugs and targets. Then, a convolutional neural network is employed to infer new DTIs after balancing the sample using oversampling techniques.
RESULTS: The DTiGNN is applied to various datasets, and the results show better performance in terms of the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR), with scores of 0.98 and 0.99, respectively. There are 23,739 newly predicted DTI pairs in total.
Affiliation(s)
- Saranya Muniyappan
  - Computer Science and Engineering, CEG Campus, Anna University, Tamil Nadu, India
4
Zhang J, Zhu M, Qian Y. protein2vec: Predicting Protein-Protein Interactions Based on LSTM. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1257-1266. [PMID: 32750870 DOI: 10.1109/tcbb.2020.3003941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
The semantic similarity of gene ontology (GO) terms is widely used to predict protein-protein interactions (PPIs). Traditional semantic similarity measures are based mainly on manually crafted features, which may ignore some important hidden information in the gene ontology. Moreover, those methods usually derive the similarity between proteins from the similarity between GO terms by simple statistical rules, such as MAX and BMA (best-match average), oversimplifying the possibly complex relationship between the proteins and the GO terms annotated to them. To overcome these two deficiencies, we propose a new method named protein2vec, which characterizes a protein with a vector based on the GO terms annotated to it and combines the information of both the GO and known PPIs. We first apply a network embedding algorithm to the GO network to generate a feature vector for each GO term. Then, a Long Short-Term Memory (LSTM) network encodes the feature vectors of the GO terms annotated to a protein into another vector (called the protein vector). Finally, two protein vectors are forwarded into a feedforward neural network to predict the interaction between the two corresponding proteins. The experimental results show that protein2vec outperforms almost all commonly used traditional semantic similarity methods.
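The MAX and BMA rules mentioned in the abstract can be illustrated with a minimal sketch. This is not code from the paper: the function names, the placeholder term labels (`GO:A`, `GO:B`, `GO:C`) and the toy term-level scores are all assumptions made for illustration, and BMA is shown in one common variant (the mean of the two directional best-match averages).

```python
from typing import Callable, Dict, Sequence, Tuple

TermSim = Callable[[str, str], float]


def max_similarity(terms_a: Sequence[str], terms_b: Sequence[str], sim: TermSim) -> float:
    """MAX rule: the single highest term-to-term similarity across both annotation sets."""
    return max(sim(a, b) for a in terms_a for b in terms_b)


def bma_similarity(terms_a: Sequence[str], terms_b: Sequence[str], sim: TermSim) -> float:
    """BMA rule (one common variant): average each term's best match in the
    other set, separately in each direction, then take the mean of the two averages."""
    best_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2


# Hypothetical term-level scores; the labels are placeholders, not real GO identifiers.
TOY_SCORES: Dict[Tuple[str, str], float] = {
    ("GO:A", "GO:C"): 0.8,
    ("GO:B", "GO:C"): 0.2,
}


def toy_sim(a: str, b: str) -> float:
    """Look up a symmetric toy similarity, defaulting to 0.0 for unknown pairs."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


protein_1 = ["GO:A", "GO:B"]  # annotations of a hypothetical protein
protein_2 = ["GO:C"]          # annotations of another hypothetical protein

max_score = max_similarity(protein_1, protein_2, toy_sim)
bma_score = bma_similarity(protein_1, protein_2, toy_sim)
print(max_score, bma_score)
```

Both rules reduce a whole matrix of term-level scores to one number per protein pair, which is the simplification the abstract contrasts with a learned protein vector.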
5
Mallick K, Mallik S, Bandyopadhyay S, Chakraborty S. A Novel Graph Topology-Based GO-Similarity Measure for Signature Detection From Multi-Omics Data and its Application to Other Problems. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:773-785. [PMID: 32866101 DOI: 10.1109/tcbb.2020.3020537] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 06/11/2023]
Abstract
Large-scale multi-omics data analysis and signature prediction have been a topic of interest in the last two decades. While various traditional clustering/correlation-based methods have been proposed, the overall prediction is not always satisfactory. To solve these challenges, we propose a new approach that leverages Gene Ontology (GO) similarity combined with multi-omics data. In this article, a new GO similarity measure, ModSchlicker, is proposed, and the effectiveness of the proposed measure along with other standardized measures is reviewed while using various graph topology-based Information Content (IC) values of GO terms. The proposed measure is deployed to PPI prediction. Furthermore, by involving GO similarity, we propose a new framework for stronger disease-based gene signature detection from multi-omics data. For the first objective, we predict interactions from various benchmark PPI datasets of yeast and human. For the latter, gene expression and methylation profiles are used to identify Differentially Expressed and Methylated (DEM) genes. Thereafter, the GO similarity score along with a statistical method is used to determine the potential gene signature. Interestingly, the proposed method produces better performance (0.9 avg. accuracy and 0.95 AUC) than other existing related methods during the classification of the participating features (genes) of the signature. Moreover, the proposed method is highly useful in other prediction/classification problems for any kind of large-scale omics data.
6
Tenekeci S, Isik Z. Integrative Biological Network Analysis to Identify Shared Genes in Metabolic Disorders. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:522-530. [PMID: 32396100 DOI: 10.1109/tcbb.2020.2993301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Identification of common molecular mechanisms in interrelated diseases is essential for better prognoses and targeted therapies. However, the complexity of metabolic pathways makes it difficult to discover common disease genes underlying metabolic disorders, and doing so requires more sophisticated bioinformatics models that combine different types of biological data and computational methods. Accordingly, we built an integrative network analysis model to identify shared disease genes in metabolic syndrome (MS), type 2 diabetes (T2D), and coronary artery disease (CAD). We constructed weighted gene co-expression networks by combining gene expression, protein-protein interaction, and gene ontology data from multiple sources. For 90 different configurations of disease networks, we detected the significant modules by using the MCL, SPICi, and Linkcomm graph clustering algorithms. We also performed a comparative evaluation of disease modules to determine the method providing the highest biological validity. By overlapping the disease modules, we identified 22 shared genes for MS-CAD and T2D-CAD. Moreover, 19 of these genes were directly or indirectly associated with relevant diseases in previous medical studies. This study not only demonstrates the performance of different biological data sources and computational methods in disease-gene discovery, but also offers potential insights into common genetic mechanisms of the metabolic disorders.
7
Pesaranghader A, Matwin S, Sokolova M, Grenier JC, Beiko RG, Hussin J. OUP accepted manuscript. Bioinformatics 2022; 38:3051-3061. [PMID: 35536192 PMCID: PMC9154256 DOI: 10.1093/bioinformatics/btac304] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/21/2021] [Revised: 02/12/2022] [Indexed: 11/24/2022] Open
Abstract
Motivation: There is a plethora of measures to evaluate functional similarity (FS) of genes based on their co-expression, protein–protein interactions and sequence similarity. These measures are typically derived from hand-engineered and application-specific metrics to quantify the degree of shared information between two genes using their Gene Ontology (GO) annotations.
Results: We introduce deepSimDEF, a deep learning method to automatically learn FS estimation of gene pairs given a set of genes and their GO annotations. deepSimDEF’s key novelty is its ability to learn low-dimensional embedding vector representations of GO terms and gene products and then calculate FS using these learned vectors. We show that deepSimDEF can predict the FS of new genes using their annotations: it outperformed all other FS measures by >5–10% on yeast and human reference datasets on protein–protein interactions, gene co-expression and sequence homology tasks. Thus, deepSimDEF offers a powerful and adaptable deep neural architecture that can benefit a wide range of problems in genomics and proteomics, and its architecture is flexible enough to support its extension to any organism.
Availability and implementation: Source code and data are available at https://github.com/ahmadpgh/deepSimDEF
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Stan Matwin
  - Faculty of Computer Science, Dalhousie University, Halifax B3H 4R2, Canada
  - Institute for Big Data Analytics, Dalhousie University, Halifax B3H 4R2, Canada
  - Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
- Marina Sokolova
  - Institute for Big Data Analytics, Dalhousie University, Halifax B3H 4R2, Canada
  - Faculty of Medicine and Faculty of Engineering, University of Ottawa, Ottawa K1H 8M5, Canada
- Robert G Beiko
  - Faculty of Computer Science, Dalhousie University, Halifax B3H 4R2, Canada
  - Institute for Big Data Analytics, Dalhousie University, Halifax B3H 4R2, Canada
8
Paul M, Anand A. A New Family of Similarity Measures for Scoring Confidence of Protein Interactions Using Gene Ontology. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:19-30. [PMID: 34029194 DOI: 10.1109/tcbb.2021.3083150] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/12/2023]
Abstract
Large-scale protein-protein interaction (PPI) data has the potential to play a significant role in understanding cellular processes. However, the presence of a considerable fraction of false positives is a bottleneck in realizing this potential. There have been continuous efforts to utilize complementary resources for scoring the confidence of PPIs such that false-positive interactions receive a low confidence score. Gene Ontology (GO), a taxonomy of biological terms to represent the properties of gene products and their relations, has been widely used for this purpose. We utilize GO to introduce a new set of specificity measures: Relative Depth Specificity (RDS), Relative Node-based Specificity (RNS), and Relative Edge-based Specificity (RES), leading to a new family of similarity measures. We use these similarity measures to obtain a confidence score for each PPI. We evaluate the new measures using four different benchmarks. We show that all three measures are quite effective. Notably, RNS and RES distinguish true PPIs from false positives more effectively than the existing alternatives. RES also shows a robust set-discriminating power and can be useful for protein functional clustering as well.
9
Liu J, Zhu H, Qiu J. Locally Adjust Networks Based on Connectivity and Semantic Similarities for Disease Module Detection. Front Genet 2021; 12:726596. [PMID: 34759955 PMCID: PMC8575408 DOI: 10.3389/fgene.2021.726596] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/17/2021] [Accepted: 09/22/2021] [Indexed: 11/13/2022] Open
Abstract
For studying the pathogenesis of complex diseases, it is important to identify the disease modules at the system level. Since protein-protein interaction (PPI) networks contain many missing and incorrect interactions, most existing methods leave many disease proteins isolated from disease modules. In this paper, we propose an effective disease module identification method, IDMCSS, in which the human PPI networks used are obtained by adding some potentially missing interactions to existing PPI networks and removing some potentially incorrect interactions. In IDMCSS, a network adjustment strategy is developed to add or remove links around disease proteins based on both topological and semantic information. Next, neighboring proteins of disease proteins are prioritized according to a suggested similarity between each of them and the disease proteins, and the proteins are added to a candidate disease protein set one by one, each time choosing the protein with the largest similarity to the disease proteins. The stopping criterion is set to the boundary of the disease proteins. Finally, the connected subnetwork having the largest number of disease proteins is selected as a disease module. Experimental results on asthma demonstrate the effectiveness of the method in comparison to existing algorithms for disease module identification. It is also shown that the proposed IDMCSS can obtain disease modules that contain crucial biological processes of asthma, and that 12 targets for drug intervention can be predicted.
Affiliation(s)
- Jia Liu
  - State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
- Huole Zhu
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei, China
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, Hefei, China
- Jianfeng Qiu
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei, China
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, Hefei, China
10
Kaplanis J, Samocha KE, Wiel L, Zhang Z, Arvai KJ, Eberhardt RY, Gallone G, Lelieveld SH, Martin HC, McRae JF, Short PJ, Torene RI, de Boer E, Danecek P, Gardner EJ, Huang N, Lord J, Martincorena I, Pfundt R, Reijnders MRF, Yeung A, Yntema HG, Vissers LELM, Juusola J, Wright CF, Brunner HG, Firth HV, FitzPatrick DR, Barrett JC, Hurles ME, Gilissen C, Retterer K. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 2020; 586:757-762. [PMID: 33057194 PMCID: PMC7116826 DOI: 10.1038/s41586-020-2832-5] [Citation(s) in RCA: 371] [Impact Index Per Article: 74.2] [Received: 10/08/2019] [Accepted: 07/17/2020] [Indexed: 01/28/2023]
Abstract
De novo mutations in protein-coding genes are a well-established cause of developmental disorders [1]. However, genes known to be associated with developmental disorders account for only a minority of the observed excess of such de novo mutations [1,2]. Here, to identify previously undescribed genes associated with developmental disorders, we integrate healthcare and research exome-sequence data from 31,058 parent-offspring trios of individuals with developmental disorders, and develop a simulation-based statistical test to identify gene-specific enrichment of de novo mutations. We identified 285 genes that were significantly associated with developmental disorders, including 28 that had not previously been robustly associated with developmental disorders. Although we detected more genes associated with developmental disorders, much of the excess of de novo mutations in protein-coding genes remains unaccounted for. Modelling suggests that more than 1,000 genes associated with developmental disorders have not yet been described, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of genes associated with developmental disorders.
Affiliation(s)
- Joanna Kaplanis
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Kaitlin E Samocha
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Laurens Wiel
  - Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
  - Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Ruth Y Eberhardt
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Giuseppe Gallone
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Stefan H Lelieveld
  - Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Hilary C Martin
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Jeremy F McRae
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Patrick J Short
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Elke de Boer
  - Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
- Petr Danecek
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Eugene J Gardner
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Ni Huang
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Jenny Lord
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - Human Development and Health, Faculty of Medicine, University of Southampton, Southampton, UK
- Iñigo Martincorena
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Rolph Pfundt
  - Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
- Margot R F Reijnders
  - Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
  - Department of Clinical Genetics, Maastricht University Medical Centre, Maastricht, The Netherlands
- Alison Yeung
  - Victorian Clinical Genetics Services, Melbourne, Victoria, Australia
  - Murdoch Children's Research Institute, Melbourne, Victoria, Australia
- Helger G Yntema
  - Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
- Lisenka E L M Vissers
  - Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
- Caroline F Wright
  - Institute of Biomedical and Clinical Science, University of Exeter Medical School, Royal Devon & Exeter Hospital, Exeter, UK
- Han G Brunner
  - Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
  - Department of Clinical Genetics, Maastricht University Medical Centre, Maastricht, The Netherlands
  - GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands
  - MHENS School for Mental Health and Neuroscience, Maastricht University Medical Centre, Maastricht, The Netherlands
- Helen V Firth
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - East Anglian Medical Genetics Service, Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
- David R FitzPatrick
  - MRC Human Genetics Unit, MRC IGMM, University of Edinburgh, Western General Hospital, Edinburgh, UK
- Jeffrey C Barrett
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Matthew E Hurles
  - Human Genetics Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Christian Gilissen
  - Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
11
Ikram N, Qadir MA, Afzal MT. SimExact – An Efficient Method to Compute Function Similarity Between Proteins Using Gene Ontology. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017092842] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/22/2022]
Abstract
Background: The rapidly growing protein and annotation databases necessitate the development of efficient tools to process this valuable information. Biologists frequently need to find proteins similar to a given protein, for which BLAST tools are commonly used. With the development of biomedical ontologies, e.g. Gene Ontology, methods were designed to measure function (semantic) similarity between two proteins. These methods work well on protein pairs, but are not suitable for protein query processing.
Objective: Our aim is to facilitate searching of similar proteins in an acceptable time.
Methods: A novel method SimExact for high speed searching of functionally similar proteins has been proposed.
Results: The experiments of this study show that SimExact gives correct results required for protein searching. A fully functional prototype of an online tool (www.datafurnish.com/protsem.php) has been provided that generates a ranked list of the proteins similar to a query protein, with a response time of less than 20 seconds in our setup. SimExact was used to search for protein pairs having high disparity between function similarity and sequence similarity.
Conclusion: SimExact makes such searches practical, which would not be possible in a reasonable time otherwise.
Affiliation(s)
- Najmul Ikram
  - COMSATS University Islamabad, Wah Campus, Islamabad, Pakistan
- Muhammad Abdul Qadir
  - Center for Distributed and Semantic Computing, Capital University of Science and Technology, Islamabad, Pakistan
- Muhammad Tanvir Afzal
  - Center for Distributed and Semantic Computing, Capital University of Science and Technology, Islamabad, Pakistan
12
Khorsand B, Savadi A, Zahiri J, Naghibzadeh M. Alpha influenza virus infiltration prediction using virus-human protein-protein interaction network. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2020; 17:3109-3129. [PMID: 32987519 DOI: 10.3934/mbe.2020176] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 06/11/2023]
Abstract
With more than ten million deaths, influenza virus is one of the deadliest in history. About half a million severe illnesses are reported annually as a consequence of influenza. Influenza is a parasite that needs the host cellular machinery to replicate its genome. To reach the host, viral proteins need to interact with host proteins. Therefore, identification of the host-virus protein interaction network (HVIN) is one of the crucial steps in treating viral diseases. Because experimental identification of the HVIN is expensive, time-consuming, and laborious, researchers turn to computational methods to obtain a better understanding of it. In this study, several features are extracted from physicochemical properties of amino acids and combined with different centralities of the human protein-protein interaction network (HPPIN) to predict protein-protein interactions between human proteins and Alphainfluenzavirus proteins (HI-PPIs). Ensemble learning methods were used to predict such PPIs. Our model reached 0.93 accuracy, 0.91 sensitivity, and 0.95 specificity. Moreover, a database of 694522 new PPIs was constructed from the model's predictions. Further analysis showed that HPPIN centralities, gene ontology semantic similarity, and the conjoint triad of virus proteins are the most important features for predicting HI-PPIs.
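The three reported figures are standard confusion-matrix summaries. The sketch below shows how they are computed; the counts are hypothetical (a balanced set of 200 pairs chosen only so that the values match the reported ones) and are not taken from the paper, whose class sizes the abstract does not give.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only (not the paper's data).
acc, sens, spec = confusion_metrics(tp=91, tn=95, fp=5, fn=9)
print(acc, sens, spec)
```

With equal numbers of positive and negative pairs, accuracy is the mean of sensitivity and specificity, which is consistent with 0.93 lying midway between 0.91 and 0.95.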
Affiliation(s)
- Babak Khorsand
- Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Abdorreza Savadi
- Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Javad Zahiri
- Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
| | - Mahmoud Naghibzadeh
- Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
| |
|
13
|
Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures. Biomed Res Int 2019; 2019:6750296. [PMID: 30809545 PMCID: PMC6369486 DOI: 10.1155/2019/6750296] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/12/2018] [Accepted: 01/13/2019] [Indexed: 11/30/2022]
Abstract
In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSMs). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need the ability to control the size of big data. We used parallel and distributed processing by splitting data into multiple partitions and applying SSMs to each partition; this approach helped manage big data scalability and computational problems. Our solution involves three steps: split gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] are enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time of calculating semantic similarity between gene pairs and improved performance of the three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with using split GO and data clustering algorithms to split input data based on their similarity, reduced the average time more than did the approach of equally dividing input data. Time reduction increased with increasing number of splits. Time reduction percentage was 24.1%, 39.2%, and 66.6% for Threaded SSDD; 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively; and 92.04% for Threaded Resnik in the case of four slaves.
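The partition-then-score idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: the ontology, the term probabilities, and the helper names (`PARENTS`, `PROB`, `threaded_similarity`) are hypothetical, and pairs are split round-robin rather than by the paper's split-GO and clustering steps. Only the Resnik measure is shown, defined as the information content of the most informative common ancestor.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Toy ontology: term -> set of parents (hypothetical, not real GO terms).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "B1": {"B"},
}
# Hypothetical annotation probabilities p(term).
PROB = {"root": 1.0, "A": 0.5, "B": 0.5, "A1": 0.2, "A2": 0.3, "B1": 0.5}


def information_content(term):
    """IC(term) = log(1 / p(term)), i.e. -log p(term)."""
    return math.log(1.0 / PROB[term])


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


def score_partition(pairs):
    """Score every term pair in one partition."""
    return [(a, b, resnik(a, b)) for a, b in pairs]


def threaded_similarity(pairs, n_partitions=2):
    """Split the pair list into partitions and score each in its own thread."""
    chunks = [pairs[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(score_partition, chunks)
    return [row for chunk in results for row in chunk]


pairs = [("A1", "A2"), ("A1", "B1"), ("A2", "A"), ("B1", "B")]
scores = {(a, b): s for a, b, s in threaded_similarity(pairs)}
print(scores)
```

In CPython, threads give little speed-up for pure-Python CPU-bound scoring because of the global interpreter lock; a process pool or a distributed scheduler would be the more usual choice, so the block illustrates only the partitioning pattern.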
|
14
|
Shen C, Ding Y, Tang J, Guo F. Multivariate Information Fusion With Fast Kernel Learning to Kernel Ridge Regression in Predicting LncRNA-Protein Interactions. Front Genet 2019; 9:716. [PMID: 30697228 PMCID: PMC6340980 DOI: 10.3389/fgene.2018.00716] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 09/22/2018] [Accepted: 12/21/2018] [Indexed: 12/31/2022] Open
Abstract
Long non-coding RNAs (lncRNAs) constitute a large class of transcribed RNA molecules. They are more than 200 nucleotides long and do not encode proteins. They play an important role in regulating gene expression by interacting with the homologous RNA-binding proteins. Because wet-lab experiments are laborious and time-consuming, computational approaches for predicting lncRNA-protein interactions (LPIs) deserve more attention. An in-depth review of state-of-the-art in silico investigations leads to the conclusion that there is still room for improving accuracy and speed. This paper proposes a novel method for identifying LPIs by employing Kernel Ridge Regression based on Fast Kernel Learning (LPI-FKLKRR). This approach uses four distinct similarity measures for the lncRNA space and the protein space, respectively. Notably, we extract Gene Ontology (GO) information for proteins in order to improve the quality of information in the protein space. The heterogeneous kernels are integrated by applying Fast Kernel Learning (FastKL) to optimize their weights. The final predicted associations are then obtained with Kernel Ridge Regression (KRR). Experimental results show that LPI-FKLKRR performs better than existing LPI prediction schemes. On a benchmark dataset, the best Area Under the Precision-Recall Curve (AUPR) of 0.6950 is obtained by our proposed model LPI-FKLKRR, which outperforms the integrated LPLNP (AUPR: 0.4584), RWR (AUPR: 0.2827), CF (AUPR: 0.2357), LPIHN (AUPR: 0.2299), and LPBNI (AUPR: 0.3302). Together with the experimental results of a case study on a novel dataset, this suggests that LPI-FKLKRR will be a useful tool for LPI prediction.
Affiliation(s)
- Cong Shen
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Yijie Ding
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China; Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States
| | - Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| |
|
15
|
Hadarovich A, Anishchenko I, Tuzikov AV, Kundrotas PJ, Vakser IA. Gene ontology improves template selection in comparative protein docking. Proteins 2018; 87:245-253. [PMID: 30520123 DOI: 10.1002/prot.25645] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/26/2018] [Revised: 10/21/2018] [Accepted: 11/29/2018] [Indexed: 02/06/2023]
Abstract
Structural characterization of protein-protein interactions is essential for our ability to study life processes at the molecular level. Computational modeling of protein complexes (protein docking) is important as the source of their structure and as a way to understand the principles of protein interaction. Rapidly evolving comparative docking approaches utilize target/template similarity metrics, which are often based on the protein structure. Although structural similarity generally yields good performance, other characteristics of the interacting proteins (eg, function, biological process, and localization) may improve the prediction quality, especially in the case of weak target/template structural similarity. For the ranking of a pool of models for each target, we tested scoring functions (GO-score) that quantify similarity of Gene Ontology (GO) terms assigned to target and template proteins in three ontology domains: biological process, molecular function, and cellular component. The scoring functions were tested in docking of bound, unbound, and modeled proteins. The results indicate that the combined structural and GO-terms functions improve the scoring, especially in the twilight zone of structural similarity, typical for protein models of limited accuracy.
Affiliation(s)
- Anna Hadarovich
- Computational Biology Program, The University of Kansas, Lawrence, Kansas; United Institute of Informatics Problems, National Academy of Sciences, Minsk, Belarus
| | - Ivan Anishchenko
- Computational Biology Program, The University of Kansas, Lawrence, Kansas
| | - Alexander V Tuzikov
- United Institute of Informatics Problems, National Academy of Sciences, Minsk, Belarus
| | - Petras J Kundrotas
- Computational Biology Program, The University of Kansas, Lawrence, Kansas
| | - Ilya A Vakser
- Computational Biology Program, The University of Kansas, Lawrence, Kansas; Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas
| |
|
16
|
GOGO: An improved algorithm to measure the semantic similarity between gene ontology terms. Sci Rep 2018; 8:15107. [PMID: 30305653 PMCID: PMC6180005 DOI: 10.1038/s41598-018-33219-y] [Citation(s) in RCA: 60] [Impact Index Per Article: 8.6] [Received: 10/31/2017] [Accepted: 09/24/2018] [Indexed: 01/29/2023] Open
Abstract
Measuring the semantic similarity between Gene Ontology (GO) terms is an essential step in functional bioinformatics research. We implemented software named GOGO for calculating the semantic similarity between GO terms. GOGO has the advantages of both information-content-based and hybrid methods, such as Resnik’s and Wang’s methods. Moreover, GOGO is relatively fast and does not need to calculate information content (IC) from a large gene annotation corpus but still has the advantage of using IC. This is achieved by considering the number of child nodes in the GO directed acyclic graphs when calculating the semantic contribution of an ancestor node to its descendant nodes. GOGO can calculate functional similarities between genes and then cluster genes based on their functional similarities. Evaluations performed on multiple pathways retrieved from the Saccharomyces Genome Database (SGD) show that GOGO can accurately and robustly cluster genes based on functional similarities. We release GOGO as a web server and also as a stand-alone tool, which allows convenient execution of the tool for a small number of GO terms or integration of the tool into bioinformatics pipelines for large-scale calculations. GOGO can be freely accessed or downloaded from http://dna.cs.miami.edu/GOGO/.
|
17
|
Ding Z, Kihara D. Computational Methods for Predicting Protein-Protein Interactions Using Various Protein Features. Curr Protoc Protein Sci 2018; 93:e62. [PMID: 29927082 PMCID: PMC6097941 DOI: 10.1002/cpps.62] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Indexed: 12/24/2022]
Abstract
Understanding protein-protein interactions (PPIs) in a cell is essential for learning protein functions, pathways, and mechanism of diseases. PPIs are also important targets for developing drugs. Experimental methods, both small-scale and large-scale, have identified PPIs in several model organisms. However, results cover only a part of PPIs of organisms; moreover, there are many organisms whose PPIs have not yet been investigated. To complement experimental methods, many computational methods have been developed that predict PPIs from various characteristics of proteins. Here we provide an overview of literature reports to classify computational PPI prediction methods that consider different features of proteins, including protein sequence, genomes, protein structure, function, PPI network topology, and those which integrate multiple methods. © 2018 by John Wiley & Sons, Inc.
Affiliation(s)
- Ziyun Ding
- Department of Biological Science, Purdue University, West Lafayette, IN, 47907 USA
| | - Daisuke Kihara
- Department of Biological Science, Purdue University, West Lafayette, IN, 47907 USA
- Department of Computer Science, Purdue University, West Lafayette, IN, 47907 USA
- Corresponding author: DK; Phone: 1-765-496-2284 (DK)
| |
|
18
|
Zhang J, Jia K, Jia J, Qian Y. An improved approach to infer protein-protein interaction based on a hierarchical vector space model. BMC Bioinformatics 2018; 19:161. [PMID: 29699476 PMCID: PMC5921294 DOI: 10.1186/s12859-018-2152-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 02/15/2017] [Accepted: 04/09/2018] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Comparing and classifying functions of gene products are important in today's biomedical research. The semantic similarity derived from the Gene Ontology (GO) annotation has been regarded as one of the most widely used indicators for protein interaction. Among the various approaches proposed, those based on the vector space model are relatively simple, but their effectiveness is far from satisfactory. RESULTS We propose a Hierarchical Vector Space Model (HVSM) for computing semantic similarity between different genes or their products, which enhances the basic vector space model by introducing the relation between GO terms. Besides the directly annotated terms, HVSM also takes their ancestors and descendants related by "is_a" and "part_of" relations into account. Moreover, HVSM introduces the concept of a Certainty Factor to calibrate the semantic similarity based on the number of terms annotated to genes. To assess the performance of our method, we applied HVSM to Homo sapiens and Saccharomyces cerevisiae protein-protein interaction datasets. Compared with TCSS, Resnik, and other classic similarity measures, HVSM achieved significant improvement in distinguishing positive from negative protein interactions. We also tested its correlation with sequence, EC, and Pfam similarity using the online tool CESSM. CONCLUSIONS HVSM showed an improvement of up to 4% compared to TCSS, 8% compared to IntelliGO, 12% compared to basic VSM, 6% compared to Resnik, 8% compared to Lin, 11% compared to Jiang, 8% compared to Schlicker, and 11% compared to SimGIC using AUC scores. The CESSM test showed that HVSM was comparable to SimGIC and superior to all other similarity measures in CESSM as well as TCSS. Supplementary information and the software are available at https://github.com/kejia1215/HVSM.
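To see why the basic vector space model benefits from the term hierarchy, the sketch below compares two genes as binary GO-term vectors, first with their direct annotations only and then after adding ancestor terms. It is a simplified illustration, not HVSM itself: the descendant terms and the Certainty Factor described in the abstract are omitted, and the ontology and gene annotations are hypothetical.

```python
import math

# Hypothetical mini-ontology (term -> parents reached by is_a / part_of).
PARENTS = {
    "t_root": [],
    "t_a": ["t_root"],
    "t_b": ["t_root"],
    "t_a1": ["t_a"],
    "t_a2": ["t_a"],
}


def with_ancestors(terms):
    """Expand a set of annotated terms with all of their ancestors."""
    expanded_terms = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in expanded_terms:
            expanded_terms.add(term)
            stack.extend(PARENTS[term])
    return expanded_terms


def cosine(terms_a, terms_b):
    """Cosine similarity of two binary term vectors, given as sets."""
    if not terms_a or not terms_b:
        return 0.0
    shared = len(terms_a & terms_b)
    return shared / math.sqrt(len(terms_a) * len(terms_b))


# Two hypothetical genes annotated with sibling terms.
gene1, gene2 = {"t_a1"}, {"t_a2"}
plain = cosine(gene1, gene2)
expanded = cosine(with_ancestors(gene1), with_ancestors(gene2))
print(plain, round(expanded, 3))
```

The two genes share no direct annotation, so the plain cosine is zero, while the ancestor-expanded vectors share two of three terms each and score 2/3.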
Affiliation(s)
- Jiongmin Zhang
- Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
| | - Ke Jia
- Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
| | - Jinmeng Jia
- School of life science, East China Normal University, Dongchuan Road, Shanghai, 200241 China
| | - Ying Qian
- Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
| |
|
19
|
Peng J, Zhang X, Hui W, Lu J, Li Q, Liu S, Shang X. Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach. BMC Syst Biol 2018; 12:18. [PMID: 29560823 PMCID: PMC5861498 DOI: 10.1186/s12918-018-0539-0] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Indexed: 02/08/2023]
Abstract
BACKGROUND Gene Ontology (GO) is one of the most popular bioinformatics resources. In the past decade, Gene Ontology-based gene semantic similarity has been effectively used to model gene-to-gene interactions in multiple research areas. However, most existing semantic similarity approaches rely only on GO annotations and structure, or incorporate only local interactions in the co-functional network. This may lead to inaccurate GO-based similarity resulting from the incomplete GO topology structure and gene annotations. RESULTS We present NETSIM2, a new network-based method that allows researchers to measure GO-based gene functional similarities by considering the global structure of the co-functional network with a random walk with restart (RWR)-based method, and by selecting the significant term pairs to reduce noise. Evaluation tests based on Enzyme Commission (EC) number-based groups of yeast and Arabidopsis show that NETSIM2 can enhance the accuracy of Gene Ontology-based gene functional similarity. CONCLUSIONS Using NETSIM2 as an example, we found that the accuracy of semantic similarities can be significantly improved after effectively incorporating the global gene-to-gene interactions in the co-functional network, especially for species whose gene annotations in GO are far from complete.
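A random walk with restart can be sketched in a few lines. The version below is a generic power-iteration form, not NETSIM2's implementation: the function name, the restart probability of 0.3, and the four-gene path network are assumptions made for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities for a walk that restarts at `seed`.

    adj maps each node to a dict of neighbour -> non-negative edge weight.
    """
    nodes = sorted(adj)
    # Normalise outgoing weights into transition probabilities.
    trans = {}
    for u in nodes:
        total = sum(adj[u].values())
        trans[u] = {v: w / total for v, w in adj[u].items()} if total else {}
    p = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to the seed.
        nxt = {n: (restart if n == seed else 0.0) for n in nodes}
        for u in nodes:
            for v, w in trans[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical co-functional network: a path g1 - g2 - g3 - g4.
network = {
    "g1": {"g2": 1.0},
    "g2": {"g1": 1.0, "g3": 1.0},
    "g3": {"g2": 1.0, "g4": 1.0},
    "g4": {"g3": 1.0},
}
rwr_scores = random_walk_with_restart(network, seed="g1")
print(rwr_scores)
```

Because the walker can reach every node, the scores reflect the global structure of the network rather than only direct neighbours; in this toy path they decrease with distance from the seed gene.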
Affiliation(s)
- Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi'an, China; Centre for Multidisciplinary Convergence Computing (CMCC), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Xuanshuo Zhang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Weiwei Hui
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Junya Lu
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Qianqian Li
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Shuhui Liu
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi'an, China
| |
|
20
|
Fusing multiple protein-protein similarity networks to effectively predict lncRNA-protein interactions. BMC Bioinformatics 2017; 18:420. [PMID: 29072138 PMCID: PMC5657051 DOI: 10.1186/s12859-017-1819-1] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Long non-coding RNA (lncRNA) plays important roles in many biological and pathological processes, including transcriptional regulation and gene regulation. As lncRNA interacts with multiple proteins, predicting lncRNA-protein interactions (lncRPIs) is an important way to study the functions of lncRNA. Up to now, there have been a few works that exploit protein-protein interactions (PPIs) to help the prediction of new lncRPIs. RESULTS In this paper, we propose to boost the prediction of lncRPIs by fusing multiple protein-protein similarity networks (PPSNs). Concretely, we first construct four PPSNs based on protein sequences, protein domains, protein GO terms and the STRING database respectively, then build a more informative PPSN by fusing these four constructed PPSNs. Finally, we predict new lncRPIs by a random walk method with the fused PPSN and known lncRPIs. Our experimental results show that the new approach outperforms the existing methods. CONCLUSION Fusing multiple protein-protein similarity networks can effectively boost the performance of predicting lncRPIs.
|
21
|
Lastra-Díaz JJ, García-Serrano A, Batet M, Fernández M, Chirigati F. HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset. Inf Syst 2017. [DOI: 10.1016/j.is.2017.02.002] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Indexed: 10/20/2022]
|
22
|
Exploring Approaches for Detecting Protein Functional Similarity within an Orthology-based Framework. Sci Rep 2017; 7:381. [PMID: 28336965 PMCID: PMC5428484 DOI: 10.1038/s41598-017-00465-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/08/2016] [Accepted: 02/28/2017] [Indexed: 11/21/2022] Open
Abstract
Protein functional similarity based on gene ontology (GO) annotations serves as a powerful tool when comparing proteins on a functional level in applications such as protein-protein interaction prediction, gene prioritization, and disease gene discovery. Functional similarity (FS) is usually quantified by combining the GO hierarchy with an annotation corpus that links genes and gene products to GO terms. One large group of algorithms involves calculation of GO term semantic similarity (SS) between all the terms annotating the two proteins, followed by a second step, described as “mixing strategy”, which involves combining the SS values to yield the final FS value. Due to the variability of protein annotation caused e.g. by annotation bias, this value cannot be reliably compared on an absolute scale. We therefore introduce a similarity z-score that takes into account the FS background distribution of each protein. For a selection of popular SS measures and mixing strategies we demonstrate moderate accuracy improvement when using z-scores in a benchmark that aims to separate orthologous cases from random gene pairs and discuss in this context the impact of annotation corpus choice. The approach has been implemented in Frela, a fast high-throughput public web server for protein FS calculation and interpretation.
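The z-score idea can be illustrated with a small sketch. It assumes a single background distribution of FS values for one protein and uses the population standard deviation; how the two proteins' backgrounds are combined for a pair is not stated in the abstract, so the function name and the numbers below are hypothetical.

```python
import statistics


def similarity_z_score(fs_value, background):
    """Standardise an FS value against a protein's background FS values."""
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)
    if sd == 0:
        raise ValueError("background distribution has zero variance")
    return (fs_value - mean) / sd


# Hypothetical FS values of one protein against a set of other proteins.
background = [0.1, 0.2, 0.3, 0.4, 0.5]
z = similarity_z_score(0.5, background)
print(round(z, 3))
```

Two proteins with the same raw FS value can thus receive different z-scores if one of them tends to score highly against everything, which is the annotation-bias effect the abstract describes.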
|
23
|
Shui Y, Cho YR. Alignment of PPI Networks Using Semantic Similarity for Conserved Protein Complex Prediction. IEEE Trans Nanobioscience 2017; 15:380-389. [PMID: 28113907 DOI: 10.1109/tnb.2016.2555802] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
Abstract
Network alignment is a computational technique to identify topological similarity of graph data by mapping link patterns. In bioinformatics, network alignment algorithms have been applied to protein-protein interaction (PPI) networks to discover evolutionarily conserved substructures at the system level. In particular, local network alignment of PPI networks searches for conserved functional components between species and predicts unknown protein complexes and signaling pathways. In this article, we present a novel approach of local network alignment by semantic mapping. While most previous methods find protein matches between species by sequence homology, our approach uses semantic similarity. Given Gene Ontology (GO) and its annotation data, we estimate functional closeness between two proteins by measuring their semantic similarity. We adopted a new semantic similarity measure, simVICD, which has the best performance for PPI validation and functional match. We tested alignment between the PPI networks of well-studied yeast protein complexes and the genome-wide PPI network of human in order to predict human protein complexes. The experimental results demonstrate that our approach has higher accuracy in protein complex prediction than graph clustering algorithms, and higher efficiency than previous network alignment algorithms.
|
24
|
Dai H, Liu Q, Liu B. Research Progress on Mechanism of Podocyte Depletion in Diabetic Nephropathy. J Diabetes Res 2017; 2017:2615286. [PMID: 28791309 PMCID: PMC5534294 DOI: 10.1155/2017/2615286] [Citation(s) in RCA: 189] [Impact Index Per Article: 23.6] [Received: 12/23/2016] [Revised: 02/05/2017] [Accepted: 03/05/2017] [Indexed: 12/13/2022] Open
Abstract
Diabetic nephropathy (DN) together with glomerular hyperfiltration has been implicated in the development of diabetic microangiopathy in the initial stage of diabetic diseases. Increased amounts of urinary protein in DN may be associated with functional and morphological alterations of podocytes, mainly including podocyte hypertrophy, epithelial-mesenchymal transdifferentiation (EMT), podocyte detachment, and podocyte apoptosis. Accumulating studies have revealed that disruption of multiple renal signaling pathways is critical in the progression of this pathological damage, including adenosine monophosphate-activated kinase (AMPK) signaling pathways, wnt/β-catenin signaling pathways, endoplasmic reticulum stress-related signaling pathways, the mammalian target of rapamycin (mTOR)/autophagy pathway, and Rho GTPases. In this review, we highlight new molecular insights underlying podocyte injury in the progression of DN, which offer new therapeutic targets to develop important renoprotective treatments for DN over the next decade.
Affiliation(s)
- Haoran Dai
- Department of Nephrology, Shunyi Branch, Beijing Hospital of Traditional Chinese Medicine, Station East 5, Shunyi District, Beijing 101300, China
| | - Qingquan Liu
- Department of Nephrology, Shunyi Branch, Beijing Hospital of Traditional Chinese Medicine, Station East 5, Shunyi District, Beijing 101300, China
- Beijing Hospital of Traditional Chinese Medicine Affiliated to Capital Medical University, 23 Meishuguanhou Street, Dongcheng District, Beijing 100010, China
- *Qingquan Liu: and
| | - Baoli Liu
- Department of Nephrology, Shunyi Branch, Beijing Hospital of Traditional Chinese Medicine, Station East 5, Shunyi District, Beijing 101300, China
- Beijing Hospital of Traditional Chinese Medicine Affiliated to Capital Medical University, 23 Meishuguanhou Street, Dongcheng District, Beijing 100010, China
- *Baoli Liu:
| |
|
25
|
Abstract
The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. Identifying the nearest well-annotated homologue of a protein of interest, predicting where misannotation has occurred, and knowing how confident you can be in the annotations assigned to those proteins are all critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene product, we focus here on describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.
Affiliation(s)
- Gemma L Holliday
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA.
| | - Rebecca Davidson
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA
| | - Eyal Akiva
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA
| | - Patricia C Babbitt
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA
| |
|
26
|
Abstract
Gene Ontology-based semantic similarity (SS) allows the comparison of GO terms or entities annotated with GO terms, by leveraging the ontology structure and properties and annotation corpora. In the last decade the number and diversity of SS measures based on GO has grown considerably, and their applications range from functional coherence evaluation to protein interaction prediction and disease gene prioritization. Understanding how SS measures work, what issues can affect their performance, and how they compare to each other in different evaluation settings is crucial to gain a comprehensive view of this area and choose the most appropriate approaches for a given application. In this chapter, we provide a guide to understanding and selecting SS measures for biomedical researchers. We present a straightforward categorization of SS measures and describe the main strategies they employ. We discuss the intrinsic and external issues that affect their performance, and how these can be addressed. We summarize comparative assessment studies, highlighting the top measures in different settings, and compare different implementation strategies and their use. Finally, we discuss some of the extant challenges and opportunities, namely the increased semantic complexity of GO and the need for fast and efficient computation, pointing the way towards the future generation of SS measures.
Affiliation(s)
- Catia Pesquita
- LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Edifício C6, Piso 3, Campo Grande, 1749-016, Lisbon, Portugal.
| |
|
27
|
Abstract
Background Regulation mechanisms between miRNAs and genes are complicated. To accomplish a biological function, a miRNA may regulate multiple target genes, and similarly a target gene may be regulated by multiple miRNAs. Wet-lab knowledge of co-regulating miRNAs is limited. This work introduces a computational method that groups miRNAs of similar functions to identify co-regulating miRNAs from a similarity matrix of miRNAs. Results We define a novel information content of gene ontology (GO) to measure similarity between two sets of GO graphs corresponding to the two sets of target genes of two miRNAs. This between-graph similarity is then transferred as a functional similarity between the two miRNAs. Our definition of the information content is based on the size of a GO term’s descendants, but adjusted by a weight derived from its depth level and the GO relationships on its path to the root node or to the most informative common ancestor (MICA). Further, a self-tuning technique and the eigenvalues of the normalized Laplacian matrix are applied to determine the optimal parameters for the spectral clustering of the similarity matrix of the miRNAs. Conclusions Experimental results demonstrate that our method has better clustering performance than the existing edge-based, node-based or hybrid methods. Our method has also demonstrated a novel usefulness for the function annotation of new miRNAs, as reported in the detailed case studies.
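A descendant-based information content can be sketched as below. This is one common intrinsic formulation, IC(t) = 1 - log(|descendants(t)| + 1) / log(N) with N the number of terms, shown without the depth and relationship weights that the abstract describes; it is not the authors' definition, and the toy ontology is hypothetical.

```python
import math

# Hypothetical toy ontology: term -> list of direct children.
CHILDREN = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": [],
    "a1": [],
    "a2": [],
}


def descendants(term):
    """Return all terms below `term` in the hierarchy."""
    seen = set()
    stack = list(CHILDREN[term])
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(CHILDREN[t])
    return seen


def intrinsic_ic(term):
    """Unweighted descendant-based IC: 0 for the root, 1 for a leaf."""
    n_terms = len(CHILDREN)
    return 1.0 - math.log(len(descendants(term)) + 1) / math.log(n_terms)


for name in ("root", "a", "a1"):
    print(name, round(intrinsic_ic(name), 3))
```

Terms with many descendants are general and carry little information, while leaf terms are the most specific; a depth-dependent weight such as the one described above would further adjust these values.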
|
28
|
Peng J, Li H, Liu Y, Juan L, Jiang Q, Wang Y, Chen J. InteGO2: a web tool for measuring and visualizing gene semantic similarities using Gene Ontology. BMC Genomics 2016; 17 Suppl 5:530. [PMID: 27586009 PMCID: PMC5009821 DOI: 10.1186/s12864-016-2828-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The Gene Ontology (GO) has been used in high-throughput omics research as a major bioinformatics resource. The hierarchical structure of GO provides users a convenient platform for biological information abstraction and hypothesis testing. Computational methods have been developed to identify functionally similar genes. However, none of the existing measurements take into account all the rich information in GO. Similarly, using these existing methods, web-based applications have been constructed to compute gene functional similarities, and to provide pure text-based outputs. Without a graphical visualization interface, the results are difficult to interpret. RESULTS We present InteGO2, a web tool that allows researchers to calculate the GO-based gene semantic similarities using seven widely used GO-based similarity measurements. Also, we provide an integrative measurement that synergistically integrates all the individual measurements to improve the overall performance. Using HTML5 and cytoscape.js, we provide a graphical interface in InteGO2 to visualize the resulting gene functional association networks. CONCLUSIONS InteGO2 is an easy-to-use HTML5 based web tool. With it, researchers can measure gene or gene product functional similarity conveniently, and visualize the network of functional interactions in a graphical interface. InteGO2 can be accessed via http://mlg.hit.edu.cn:8089/.
Affiliation(s)
- Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, 48824, MI, USA
| | - Hongxiang Li
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Yongzhuang Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Liran Juan
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
| | - Jin Chen
- Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, 48824, MI, USA; Department of Computer Science and Engineering, Michigan State University, East Lansing, 48824, MI, USA.
|
29
|
Bastos HP, Sousa L, Clarke LA, Couto FM. Functional coherence metrics in protein families. J Biomed Semantics 2016; 7:41. [PMID: 27338101 PMCID: PMC4917928 DOI: 10.1186/s13326-016-0076-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2015] [Accepted: 05/17/2016] [Indexed: 12/03/2022] Open
Abstract
Background Biological sequences, such as proteins, have been provided with annotations that assign functional information. These functional annotations are associations of proteins (or other biological sequences) with descriptors characterizing their biological roles. However, not all proteins are fully (or even at all) annotated. This annotation incompleteness limits our ability to make sound assertions about the functional coherence within sets of proteins. Annotation incompleteness is a problematic issue when measuring semantic functional similarity of biological sequences since they can only capture a limited amount of all the semantic aspects the sequences may encompass. Methods Instead of relying uniquely on single (reductive) metrics, this work proposes a comprehensive approach for assessing functional coherence within protein sets. The approach entails using visualization and term enrichment techniques anchored in specific domain knowledge, such as a protein family. For that purpose we evaluate two novel functional coherence metrics, mUI and mGIC that combine aspects of semantic similarity measures and term enrichment. Results These metrics were used to effectively capture and measure the local similarity cores within protein sets. Hence, these metrics coupled with visualization tools allow an improved grasp on three important functional annotation aspects: completeness, agreement and coherence. Conclusions Measuring the functional similarity between proteins based on their annotations is a non trivial task. Several metrics exist but due both to characteristics intrinsic to the nature of graphs and extrinsic natures related to the process of annotation each measure can only capture certain functional annotation aspects of proteins. Hence, when trying to measure the functional coherence of a set of proteins a single metric is too reductive. 
Therefore, it is valuable to be aware of how each employed similarity metric works and what similarity aspects it can best capture. Here we test the behaviour and resilience of some similarity metrics. Electronic supplementary material The online version of this article (doi:10.1186/s13326-016-0076-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Hugo P Bastos
- LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
| | - Lisete Sousa
- CEAUL, Departamento de Estatística e Investigação Operacional, Faculdade de Ciências, Universidade de Lisboa, Lisboa, 1749-016, Portugal
| | - Luka A Clarke
- BioISI - Biosystems & Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, Lisboa, 1749-016, Portugal
| | - Francisco M Couto
- LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal.
|
30
|
Zhang SB, Tang QR. Protein-protein interaction inference based on semantic similarity of Gene Ontology terms. J Theor Biol 2016; 401:30-7. [PMID: 27117309 DOI: 10.1016/j.jtbi.2016.04.020] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2015] [Revised: 03/14/2016] [Accepted: 04/16/2016] [Indexed: 11/29/2022]
Abstract
Identifying protein-protein interactions is important in molecular biology. Experimental methods for this task have their limitations, and computational approaches have attracted increasing attention from the biological community. The semantic similarity derived from Gene Ontology (GO) annotation has been regarded as one of the most powerful indicators of protein interaction. However, conventional methods based on GO similarity fail to take advantage of the specificity of GO terms in the ontology graph. We proposed a GO-based method to predict protein-protein interactions by integrating different kinds of similarity measures derived from the intrinsic structure of the GO graph. We extended five existing methods to derive semantic similarity measures from the descending part of two GO terms in the GO graph, then adopted a feature integration strategy that combines both the ascending and the descending similarity scores derived from the three sub-ontologies to construct various kinds of features characterizing each protein pair. Support vector machines (SVMs) were employed as discriminative classifiers, and five-fold cross-validation experiments were conducted on both human and yeast protein-protein interaction datasets to evaluate the performance of the different kinds of integrated features. The experimental results suggest that the feature combining information from both the ascending and the descending parts of the three ontologies performs best. Our method is appealing for effective prediction of protein-protein interactions.
Affiliation(s)
- Shu-Bo Zhang
- Department of Computer Science, Guangzhou Maritime Institute, Room 803, Building 88, Dashabei Road, Huangpu District, Guangzhou 510725, PR China.
| | - Qiang-Rong Tang
- Department of Shipping, Guangzhou Marine Institute, Room 205, Shipping Building, Hongshan No. 3 Road, Huangpu District, Guangzhou 510725, PR China.
|
31
|
Koorman T, Klompstra D, van der Voet M, Lemmens I, Ramalho JJ, Nieuwenhuize S, van den Heuvel S, Tavernier J, Nance J, Boxem M. A combined binary interaction and phenotypic map of C. elegans cell polarity proteins. Nat Cell Biol 2016; 18:337-46. [PMID: 26780296 PMCID: PMC4767559 DOI: 10.1038/ncb3300] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2015] [Accepted: 12/15/2015] [Indexed: 12/12/2022]
Abstract
The establishment of cell polarity is an essential process for the development of multicellular organisms and the functioning of cells and tissues. Here, we combine large-scale protein interaction mapping with systematic phenotypic profiling to study the network of physical interactions that underlies polarity establishment and maintenance in the nematode Caenorhabditis elegans. Using a fragment-based yeast two-hybrid strategy, we identified 439 interactions between 296 proteins, as well as the protein regions that mediate these interactions. Phenotypic profiling of the network resulted in the identification of 100 physically interacting protein pairs for which RNAi-mediated depletion caused a defect in the same polarity-related process. We demonstrate the predictive capabilities of the network by showing that the physical interaction between the RhoGAP PAC-1 and PAR-6 is required for radial polarization of the C. elegans embryo. Our network represents a valuable resource of candidate interactions that can be used to further our insight into cell polarization.
Affiliation(s)
- Thijs Koorman
- Division of Developmental Biology, Department of Biology, Faculty of Science, Utrecht University, 3584 CH, Utrecht, The Netherlands
| | - Diana Klompstra
- Helen L. and Martin S. Kimmel Center for Biology and Medicine at the Skirball Institute of Biomolecular Medicine, NYU School of Medicine, New York, New York 10016, USA
- Department of Cell Biology, NYU School of Medicine, New York, New York 10016, USA
| | - Monique van der Voet
- Division of Developmental Biology, Department of Biology, Faculty of Science, Utrecht University, 3584 CH, Utrecht, The Netherlands
| | - Irma Lemmens
- Department of Medical Protein Research, VIB, and Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, 9000 Ghent, Belgium
| | - João J. Ramalho
- Division of Developmental Biology, Department of Biology, Faculty of Science, Utrecht University, 3584 CH, Utrecht, The Netherlands
| | - Susan Nieuwenhuize
- Division of Developmental Biology, Department of Biology, Faculty of Science, Utrecht University, 3584 CH, Utrecht, The Netherlands
| | - Sander van den Heuvel
- Division of Developmental Biology, Department of Biology, Faculty of Science, Utrecht University, 3584 CH, Utrecht, The Netherlands
| | - Jan Tavernier
- Department of Medical Protein Research, VIB, and Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, 9000 Ghent, Belgium
| | - Jeremy Nance
- Helen L. and Martin S. Kimmel Center for Biology and Medicine at the Skirball Institute of Biomolecular Medicine, NYU School of Medicine, New York, New York 10016, USA
- Department of Cell Biology, NYU School of Medicine, New York, New York 10016, USA
| | - Mike Boxem
- Division of Developmental Biology, Department of Biology, Faculty of Science, Utrecht University, 3584 CH, Utrecht, The Netherlands
|
32
|
Pesaranghader A, Matwin S, Sokolova M, Beiko RG. simDEF: definition-based semantic similarity measure of gene ontology terms for functional similarity analysis of genes. Bioinformatics 2015; 32:1380-7. [PMID: 26708333 DOI: 10.1093/bioinformatics/btv755] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2015] [Accepted: 12/21/2015] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION Measures of protein functional similarity are essential tools for function prediction, evaluation of protein-protein interactions (PPIs) and other applications. Several existing methods perform comparisons between proteins based on the semantic similarity of their GO terms; however, these measures are highly sensitive to modifications in the topological structure of GO, tend to be focused on specific analytical tasks and concentrate on the GO terms themselves rather than considering their textual definitions. RESULTS We introduce simDEF, an efficient method for measuring semantic similarity of GO terms using their GO definitions, which is based on the Gloss Vector measure commonly used in natural language processing. The simDEF approach builds optimized definition vectors for all relevant GO terms, and expresses the similarity of a pair of proteins as the cosine of the angle between their definition vectors. Relative to existing similarity measures, when validated on a yeast reference database, simDEF improves correlation with sequence homology by up to 50%, shows a correlation improvement >4% with gene expression in the biological process hierarchy of GO and increases PPI predictability by > 2.5% in F1 score for molecular function hierarchy. AVAILABILITY AND IMPLEMENTATION Datasets, results and source code are available at http://kiwi.cs.dal.ca/Software/simDEF CONTACT: ahmad.pgh@dal.ca or beiko@cs.dal.ca SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
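The vector comparison described in this abstract can be illustrated with a minimal sketch. It is not the simDEF implementation: simDEF builds optimized definition vectors, whereas this sketch assumes plain word counts, and the function names and the shortened definitions are hypothetical.

```python
import math
from collections import Counter

def definition_vector(definition: str) -> Counter:
    """Word-count vector for a definition (a simple stand-in for optimized definition vectors)."""
    return Counter(definition.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty definition is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical, shortened definitions used only for illustration.
kinase = definition_vector("catalysis of the transfer of a phosphate group to a protein")
phosphatase = definition_vector("catalysis of the removal of a phosphate group from a protein")
transport = definition_vector("directed movement of ions across a membrane")

sim_related = cosine(kinase, phosphatase)
sim_unrelated = cosine(kinase, transport)
```

With raw counts, function words such as "of" and "a" still contribute to the score, which is one reason a practical measure would weight or filter the vector dimensions.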
Affiliation(s)
- Ahmad Pesaranghader
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada, Institute for Big Data Analytics, Halifax, NS B3H 4R2, Canada
| | - Stan Matwin
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada, Institute for Big Data Analytics, Halifax, NS B3H 4R2, Canada, Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
| | - Marina Sokolova
- Institute for Big Data Analytics, Halifax, NS B3H 4R2, Canada, Faculty of Medicine and Faculty of Engineering, University of Ottawa, Ottawa, ON K1H 8M5, Canada
| | - Robert G Beiko
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada
|
33
|
Liu B, Jin M, Zeng P. Prioritization of candidate disease genes by combining topological similarity and semantic similarity. J Biomed Inform 2015; 57:1-5. [DOI: 10.1016/j.jbi.2015.07.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2014] [Revised: 07/01/2015] [Accepted: 07/06/2015] [Indexed: 10/23/2022]
|
34
|
Song M, Jiang Z. Inferring Association between Compound and Pathway with an Improved Ensemble Learning Method. Mol Inform 2015; 34:753-60. [DOI: 10.1002/minf.201500033] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2015] [Accepted: 07/03/2015] [Indexed: 12/20/2022]
|
35
|
Bettembourg C, Diot C, Dameron O. Optimal Threshold Determination for Interpreting Semantic Similarity and Particularity: Application to the Comparison of Gene Sets and Metabolic Pathways Using GO and ChEBI. PLoS One 2015; 10:e0133579. [PMID: 26230274 PMCID: PMC4521860 DOI: 10.1371/journal.pone.0133579] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2014] [Accepted: 06/30/2015] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND The analysis of gene annotations referencing back to Gene Ontology plays an important role in the interpretation of high-throughput experiments results. This analysis typically involves semantic similarity and particularity measures that quantify the importance of the Gene Ontology annotations. However, there is currently no sound method supporting the interpretation of the similarity and particularity values in order to determine whether two genes are similar or whether one gene has some significant particular function. Interpretation is frequently based either on an implicit threshold, or an arbitrary one (typically 0.5). Here we investigate a method for determining thresholds supporting the interpretation of the results of a semantic comparison. RESULTS We propose a method for determining the optimal similarity threshold by minimizing the proportions of false-positive and false-negative similarity matches. We compared the distributions of the similarity values of pairs of similar genes and pairs of non-similar genes. These comparisons were performed separately for all three branches of the Gene Ontology. In all situations, we found overlap between the similar and the non-similar distributions, indicating that some similar genes had a similarity value lower than the similarity value of some non-similar genes. We then extend this method to the semantic particularity measure and to a similarity measure applied to the ChEBI ontology. Thresholds were evaluated over the whole HomoloGene database. For each group of homologous genes, we computed all the similarity and particularity values between pairs of genes. Finally, we focused on the PPAR multigene family to show that the similarity and particularity patterns obtained with our thresholds were better at discriminating orthologs and paralogs than those obtained using default thresholds. CONCLUSION We developed a method for determining optimal semantic similarity and particularity thresholds. 
We applied this method on the GO and ChEBI ontologies. Qualitative analysis using the thresholds on the PPAR multigene family yielded biologically-relevant patterns.
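The threshold search described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes the two error proportions are combined by a plain sum, and the function name and score lists are hypothetical.

```python
def optimal_threshold(similar_scores, non_similar_scores):
    """Scan observed scores and return the threshold minimizing FN rate + FP rate.

    A pair is called "similar" when its score is >= the threshold.
    Summing the two proportions is an assumption of this sketch.
    """
    candidates = sorted(set(similar_scores) | set(non_similar_scores))
    best_t, best_cost = None, float("inf")
    for t in candidates:
        # Similar pairs scored below the threshold are false negatives.
        fn = sum(s < t for s in similar_scores) / len(similar_scores)
        # Non-similar pairs scored at or above the threshold are false positives.
        fp = sum(s >= t for s in non_similar_scores) / len(non_similar_scores)
        if fn + fp < best_cost:
            best_t, best_cost = t, fn + fp
    return best_t, best_cost

# Hypothetical similarity values; the two distributions overlap (0.55 < 0.6).
similar = [0.55, 0.7, 0.8, 0.9, 0.95]
non_similar = [0.1, 0.2, 0.3, 0.6, 0.4]
threshold, cost = optimal_threshold(similar, non_similar)
```

Because the distributions overlap, no threshold reaches zero cost here, which mirrors the abstract's observation that some similar genes score lower than some non-similar ones.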
Affiliation(s)
- Charles Bettembourg
- Université de Rennes 1, Rennes, France
- INRA, UMR1348 PEGASE, Saint-Gilles, France
- Agrocampus OUEST, UMR1348 PEGASE, Rennes, France
- IRISA, Campus de Beaulieu, Rennes, France
- INRIA, Rennes, France
| | - Christian Diot
- INRA, UMR1348 PEGASE, Saint-Gilles, France
- Agrocampus OUEST, UMR1348 PEGASE, Rennes, France
| | - Olivier Dameron
- Université de Rennes 1, Rennes, France
- IRISA, Campus de Beaulieu, Rennes, France
- INRIA, Rennes, France
|
36
|
Scoring the correlation of genes by their shared properties using OScal, an improved overlap quantification model. Sci Rep 2015; 5:10583. [PMID: 26015386 PMCID: PMC4445036 DOI: 10.1038/srep10583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2015] [Accepted: 04/20/2015] [Indexed: 11/17/2022] Open
Abstract
Scoring the correlation between two genes by their shared properties is a common and basic task in biological study. A promising way to score this correlation is to quantify the overlap between the two sets of homogeneous properties of the two genes. However, no model has been established as the proper one. Here we focused on the quantification of overlap and proposed a more effective model after theoretically comparing 7 existing models. We defined three characteristic parameters (d, R, r) of an overlap, which highlight essential differences among the 7 models and group them into two classes. The pros and cons of the two classes of models were then fully examined through their solution spaces in the (d, R, r) coordinate system. Finally, we proposed a new model called OScal (Overlap Score calculator), a modification of the Poisson distribution model (one of the 7 models) that avoids its disadvantages. Tested in assessing gene relations using different data, OScal performs better than existing models. In addition, OScal is a basic mathematical model with very low computational cost and few restrictive conditions, so it can be used in a wide range of research areas to measure the overlap or similarity of two entities.
|
37
|
Zhang SB, Lai JH. Semantic similarity measurement between gene ontology terms based on exclusively inherited shared information. Gene 2015; 558:108-17. [DOI: 10.1016/j.gene.2014.12.062] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2014] [Revised: 12/15/2014] [Accepted: 12/24/2014] [Indexed: 11/25/2022]
|
38
|
Peng J, Uygun S, Kim T, Wang Y, Rhee SY, Chen J. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks. BMC Bioinformatics 2015; 16:44. [PMID: 25886899 PMCID: PMC4339680 DOI: 10.1186/s12859-015-0474-7] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2014] [Accepted: 01/26/2015] [Indexed: 01/18/2023] Open
Abstract
Background Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstrate that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited. Supplementary information and software are available at http://www.msu.edu/~jinchen/NETSIM. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0474-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jiajie Peng
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, MI, 48824, USA.
| | - Sahra Uygun
- Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, MI, 48824, USA; Genetics Program, Michigan State University, East Lansing, MI, 48824, USA.
| | - Taehyong Kim
- Department of Plant Biology, Carnegie Institution for Science, 260 Panama St, Stanford, CA, 94305, USA.
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
| | - Seung Y Rhee
- Department of Plant Biology, Carnegie Institution for Science, 260 Panama St, Stanford, CA, 94305, USA.
| | - Jin Chen
- Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, MI, 48824, USA; Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA.
|
39
|
Peng J, Li H, Jiang Q, Wang Y, Chen J. An integrative approach for measuring semantic similarities using gene ontology. BMC SYSTEMS BIOLOGY 2014; 8 Suppl 5:S8. [PMID: 25559943 PMCID: PMC4305987 DOI: 10.1186/1752-0509-8-s5-s8] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/04/2022]
Abstract
Background Gene Ontology (GO) provides rich information and a convenient way to study gene functional similarity, which has been successfully used in various applications. However, the existing GO-based similarity measurements are limited because each measure considers only a subset of GO information. An appropriate integration of the existing measures that takes into account more of the information in GO is needed. Results We propose a novel integrative measure called InteGO2 to automatically select appropriate seed measures and then to integrate them using a metaheuristic search method. The experimental results show that InteGO2 significantly improves the performance of gene similarity in human, Arabidopsis and yeast on both molecular function and biological process GO categories. Conclusions InteGO2 computes gene-to-gene similarities more accurately than the tested existing measures and has high robustness. The supplementary document and software are available at http://mlg.hit.edu.cn:8082/.
|
40
|
Na D, Son H, Gsponer J. Categorizer: a tool to categorize genes into user-defined biological groups based on semantic similarity. BMC Genomics 2014; 15:1091. [PMID: 25495442 PMCID: PMC4298957 DOI: 10.1186/1471-2164-15-1091] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2014] [Accepted: 12/04/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Communalities between large sets of genes obtained from high-throughput experiments are often identified by searching for enrichments of genes with the same Gene Ontology (GO) annotations. The GO analysis tools used for these enrichment analyses assume that GO terms are independent and the semantic distances between all parent-child terms are identical, which is not true in a biological sense. In addition these tools output lists of often redundant or too specific GO terms, which are difficult to interpret in the context of the biological question investigated by the user. Therefore, there is a demand for a robust and reliable method for gene categorization and enrichment analysis. RESULTS We have developed Categorizer, a tool that classifies genes into user-defined groups (categories) and calculates p-values for the enrichment of the categories. Categorizer identifies the biologically best-fit category for each gene by taking advantage of a specialized semantic similarity measure for GO terms. We demonstrate that Categorizer provides improved categorization and enrichment results of genetic modifiers of Huntington's disease compared to a classical GO Slim-based approach or categorizations using other semantic similarity measures. CONCLUSION Categorizer enables more accurate categorizations of genes than currently available methods. This new tool will help experimental and computational biologists analyzing genomic and proteomic data according to their specific needs in a more reliable manner.
Affiliation(s)
- Jörg Gsponer
- Department of Biochemistry and Molecular Biology, Centre for High-throughput Biology, University of British Columbia, 2125 East Mall, Vancouver, BC V6T 1Z4, Canada.
|
41
|
Information content-based Gene Ontology functional similarity measures: which one to use for a given biological data type? PLoS One 2014; 9:e113859. [PMID: 25474538 PMCID: PMC4256219 DOI: 10.1371/journal.pone.0113859] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2014] [Accepted: 10/31/2014] [Indexed: 12/23/2022] Open
Abstract
The current increase in Gene Ontology (GO) annotations of proteins in the existing genome databases and their use in different analyses have fostered the improvement of several biomedical and biological applications. To integrate this functional data into different analyses, several protein functional similarity measures based on GO term information content (IC) have been proposed and evaluated, especially in the context of annotation-based measures. In the case of topology-based measures, each approach was set with a specific functional similarity measure depending on its conception and applications for which it was designed. However, it is not clear whether a specific functional similarity measure associated with a given approach is the most appropriate, given a biological data set or an application, i.e., achieving the best performance compared to other functional similarity measures for the biological application under consideration. We show that, in general, a specific functional similarity measure often used with a given term IC or term semantic similarity approach is not always the best for different biological data and applications. We have conducted a performance evaluation of a number of different functional similarity measures using different types of biological data in order to infer the best functional similarity measure for each different term IC and semantic similarity approach. The comparisons of different protein functional similarity measures should help researchers choose the most appropriate measure for the biological application under consideration.
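The distinction this abstract draws between a term-level IC approach and the functional similarity measure layered on top of it can be made concrete with a small sketch. Everything here is an assumption chosen for illustration: the toy ontology and frequencies are hypothetical, Resnik's measure stands in for the term-level approach, and best-match average and maximum stand in for two functional similarity measures; the paper's own selection is not reproduced.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (not real GO).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
    "kinase": {"catalysis"},
}

# Hypothetical annotation frequencies p(t) for each term.
FREQ = {"root": 1.0, "binding": 0.5, "catalysis": 0.4,
        "dna_binding": 0.2, "rna_binding": 0.1, "kinase": 0.1}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def ic(term):
    """Information content: rarer terms are more informative."""
    return -math.log(FREQ[term])

def resnik(t1, t2):
    """Term similarity: IC of the most informative common ancestor."""
    return max(ic(t) for t in ancestors(t1) & ancestors(t2))

def best_match_average(terms_a, terms_b):
    """Average each term's best match in the other set, in both directions."""
    best_a = [max(resnik(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(resnik(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2

def maximum(terms_a, terms_b):
    """Score two term sets by their single best-matching term pair."""
    return max(resnik(a, b) for a in terms_a for b in terms_b)

gene_x = {"dna_binding", "kinase"}
gene_y = {"rna_binding", "kinase"}
bma_score = best_match_average(gene_x, gene_y)
max_score = maximum(gene_x, gene_y)
```

The same term-level scores yield different gene-level scores under the two combination rules, which is the kind of difference a comparative evaluation has to account for.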
|
42
|
Batet M, Harispe S, Ranwez S, Sánchez D, Ranwez V. An information theoretic approach to improve semantic similarity assessments across multiple ontologies. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.06.039] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]
|
43
|
Abstract
Background In Gene Ontology, the "Molecular Function" (MF) categorization is a widely used knowledge framework for gene function comparison and prediction. Its structure and annotation provide a convenient way to compare gene functional similarities at the molecular level. The existing gene similarity measures, however, solely rely on one or few aspects of MF without utilizing all the rich information available including structure, annotation, common terms, lowest common parents. Results We introduce a rank-based gene semantic similarity measure called InteGO by synergistically integrating the state-of-the-art gene-to-gene similarity measures. By integrating three GO based seed measures, InteGO significantly improves the performance by about two-fold in all the three species studied (yeast, Arabidopsis and human). Conclusions InteGO is a systematic and novel method to study gene functional associations. The software and description are available at http://www.msu.edu/~jinchen/InteGO.
|
44
|
Bandyopadhyay S, Mallick K. A New Path Based Hybrid Measure for Gene Ontology Similarity. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:116-127. [PMID: 26355512 DOI: 10.1109/tcbb.2013.149] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Gene Ontology (GO) consists of a controlled vocabulary of terms, annotating a gene or gene product, structured in a directed acyclic graph. In the graph, semantic relations connect the terms that represent the knowledge of functional description and cellular component information of gene products. GO similarity gives a numerical representation of the biological relationship among a set of genes, which can be used to infer various biological facts such as protein-protein interaction, structural similarity, gene clustering, etc. Here we introduce a new shortest-path-based hybrid measure of ontological similarity between two terms, which combines both the structure of the GO graph and the information content of the terms. The similarity between two terms t1 and t2, referred to as GOSim_PBHM(t1, t2), has two components: one obtained from the common ancestors of t1 and t2, and the other from their remaining ancestors. The proposed path-based hybrid measure does not suffer from the well-known shallow annotation problem. Its superiority with respect to some other popular measures is established for protein-protein interaction prediction, correlation with gene expression and functional classification of genes in a biological pathway. Finally, the proposed measure is used to compute the average GO similarity score among the genes that are experimentally validated targets of some microRNAs. Results demonstrate that the targets of a given miRNA have a high degree of similarity in the biological process category of GO.
|
45
|
Measuring the evolution of ontology complexity: the gene ontology case study. PLoS One 2013; 8:e75993. [PMID: 24146805 PMCID: PMC3795689 DOI: 10.1371/journal.pone.0075993] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 06/27/2013] [Accepted: 08/20/2013] [Indexed: 01/09/2023]
Abstract
Ontologies support automatic sharing, combination and analysis of life sciences data. They undergo regular curation and enrichment. We studied the impact of an ontology's evolution on its structural complexity. As a case study we used the sixty monthly releases between January 2008 and December 2012 of the Gene Ontology and its three independent branches, i.e. biological processes (BP), cellular components (CC) and molecular functions (MF). For each case, we measured complexity by computing metrics related to the size, the node connectivity and the hierarchical structure. The number of classes and relations increased monotonically for each branch, with different growth rates. BP and CC had similar connectivity, superior to that of MF. Connectivity increased monotonically for BP, decreased for CC and remained stable for MF, with a marked increase for the three branches in November and December 2012. Hierarchy-related measures showed that CC and MF had similar proportions of leaves, average depths and average heights. BP had a lower proportion of leaves, and a higher average depth and average height. For BP and MF, the late 2012 increase of connectivity resulted in an increase of the average depth and average height and a decrease of the proportion of leaves, indicating that a major enrichment effort of the intermediate-level hierarchy occurred. The variation of the number of classes and relations in an ontology does not provide enough information about the evolution of its complexity. However, connectivity and hierarchy-related metrics revealed different patterns of values as well as of evolution for the three branches of the Gene Ontology. CC was similar to BP in terms of connectivity, and similar to MF in terms of hierarchy. Overall, BP complexity increased, CC was refined with the addition of leaves providing a finer level of annotations but slightly decreasing its complexity, and MF complexity remained stable.
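The kinds of size, connectivity and hierarchy metrics mentioned in this abstract can be sketched on a toy directed acyclic graph as follows. The edge list, the term names, and the exact metric definitions (relations per class as a connectivity proxy, depth as the longest path to a root) are assumptions chosen for illustration and may differ from the definitions used in the study.

```python
from collections import defaultdict

# Hypothetical toy ontology as (child, parent) relations.
EDGES = [("b", "a"), ("c", "a"), ("d", "b"), ("d", "c"), ("e", "c")]


def complexity_metrics(edges):
    """Compute simple size, connectivity and hierarchy metrics for a DAG."""
    edge_set = set(edges)
    parents = defaultdict(set)
    children = defaultdict(set)
    nodes = set()
    for child, parent in edge_set:
        parents[child].add(parent)
        children[parent].add(child)
        nodes.update((child, parent))

    depth_cache = {}

    def depth(node):
        # Assumed definition: length of the longest path from node up to a root.
        if node not in depth_cache:
            node_parents = parents.get(node, ())
            depth_cache[node] = (
                0 if not node_parents else 1 + max(depth(p) for p in node_parents)
            )
        return depth_cache[node]

    # A leaf is a class with no children.
    leaves = [n for n in nodes if not children.get(n)]
    return {
        "classes": len(nodes),
        "relations": len(edge_set),
        "relations_per_class": len(edge_set) / len(nodes),
        "leaf_proportion": len(leaves) / len(nodes),
        "average_depth": sum(depth(n) for n in nodes) / len(nodes),
    }


metrics = complexity_metrics(EDGES)
print(metrics)
```

Running the same function on successive releases of an ontology would give a time series per metric, which is the sort of comparison the abstract describes across the BP, CC and MF branches.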
|