1
Mazandu GK, Hooper C, Opap K, Makinde F, Nembaware V, Thomford NE, Chimusa ER, Wonkam A, Mulder NJ. IHP-PING-generating integrated human protein-protein interaction networks on-the-fly. Brief Bioinform 2020; 22:5943797. [PMID: 33129201 DOI: 10.1093/bib/bbaa277] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/22/2020] [Revised: 09/12/2020] [Accepted: 09/21/2020] [Indexed: 01/04/2023]
Abstract
Advances in high-throughput sequencing technologies have resulted in an exponential growth of publicly accessible biological datasets. In the 'big data' driven 'post-genomic' context, much work is being done to explore human protein-protein interactions (PPIs) for systems-level analysis, to uncover useful signals, gain insights that advance current knowledge, and answer specific biological and health questions. These PPIs are experimentally determined or computationally predicted and stored in different online databases, some of which are updated regularly. As with many biological datasets, such regular updates continuously render older PPI datasets potentially outdated. Moreover, while many of these interactions are shared between these online resources, each resource includes its own identified PPIs, and none of these databases exhaustively contains all existing human PPI maps. In this context, it is essential to enable the integration of interaction datasets from different resources to generate a PPI map with increased coverage and confidence. To allow researchers to produce integrated human PPI datasets in real time, we introduce the integrated human protein-protein interaction network generator (IHP-PING) tool. IHP-PING is a flexible Python package that generates a human PPI network from freely available online resources. This tool extracts and integrates heterogeneous PPI datasets to generate a unified PPI network, which is stored locally for further applications.
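The integration idea in this abstract can be illustrated with a minimal sketch. It assumes each resource has already been reduced to a list of protein identifier pairs in one shared namespace. The function name `integrate_ppi`, the source names and the identifiers are hypothetical and are not taken from the IHP-PING package.

```python
from collections import defaultdict


def integrate_ppi(sources):
    """Merge undirected PPI edge lists and record which sources support each edge.

    `sources` maps a source name to an iterable of (protein_a, protein_b) pairs.
    Returns a dict mapping a sorted protein pair to the set of supporting sources.
    """
    merged = defaultdict(set)
    for source_name, edges in sources.items():
        for protein_a, protein_b in edges:
            # Sort the pair so (A, B) and (B, A) count as the same interaction.
            pair = tuple(sorted((protein_a, protein_b)))
            merged[pair].add(source_name)
    return dict(merged)


# Toy input: identifiers and source names are illustrative only.
toy_sources = {
    "source_1": [("P04637", "Q00987"), ("P38398", "P51587")],
    "source_2": [("Q00987", "P04637"), ("P04637", "P38398")],
}
network = integrate_ppi(toy_sources)

# Keep only interactions reported by at least two sources.
high_confidence = {pair for pair, support in network.items() if len(support) >= 2}
```

Counting the supporting sources per edge is one simple way to express "increased confidence"; the tool itself may score interactions differently.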
Affiliation(s)
- Gaston K Mazandu
  - Computational Biology Division, Department of Integrative Biomedical Sciences, IDM, CIDRI-Africa WT Centre, University of Cape Town, Health Sciences Campus, Anzio Rd, Observatory, 7925, South Africa
  - African Institute for Mathematical Sciences, 5-7 Melrose Road, Muizenberg, 7945, Cape Town, South Africa
  - Division of Human Genetics, Department of Pathology, University of Cape Town, Health Sciences Campus, Anzio Rd, Observatory, 7925, South Africa
- Christopher Hooper
  - Computational Biology Division, Department of Integrative Biomedical Sciences, IDM, CIDRI-Africa WT Centre, University of Cape Town, Health Sciences Campus, Anzio Rd, Observatory, 7925, South Africa
- Kenneth Opap
  - Computational Biology Division, Department of Integrative Biomedical Sciences, IDM, CIDRI-Africa WT Centre, University of Cape Town, Health Sciences Campus, Anzio Rd, Observatory, 7925, South Africa
- Funmilayo Makinde
  - Computational Biology Division, Department of Integrative Biomedical Sciences, IDM, CIDRI-Africa WT Centre, University of Cape Town, Health Sciences Campus, Anzio Rd, Observatory, 7925, South Africa
  - African Institute for Mathematical Sciences, 5-7 Melrose Road, Muizenberg, 7945, Cape Town, South Africa
- Victoria Nembaware
  - Division of Human Genetics, Department of Pathology, University of Cape Town, Health Sciences Campus, Anzio Rd, Observatory, 7925, South Africa
- Nicholas E Thomford
  - Division of Human Genetics, Department of Pathology, University of Cape Town, Health Sciences Campus, Anzio Rd, Observatory, 7925, South Africa
  - School of Medical Sciences, University of Cape Coast, PMB, Cape Coast, Ghana
- Emile R Chimusa
  - Division of Human Genetics, Department of Pathology, University of Cape Town, Health Sciences Campus, Anzio Rd, Observatory, 7925, South Africa
- Ambroise Wonkam
  - Division of Human Genetics, Department of Pathology, University of Cape Town, Health Sciences Campus, Anzio Rd, Observatory, 7925, South Africa
- Nicola J Mulder
  - Computational Biology Division, Department of Integrative Biomedical Sciences, IDM, CIDRI-Africa WT Centre, University of Cape Town, Health Sciences Campus, Anzio Rd, Observatory, 7925, South Africa
2
Abstract
In recent years several methods have been proposed to assign pairwise mechanism-based similarity scores to human diseases. Despite their differences in approach and performance, these methods work in a somewhat similar manner: first a set of biomolecules (genes, proteins, chemicals, etc.) is associated with each disease, and then a measure is defined to calculate the similarity between the sets assigned to a pair of diseases. Since the similarity score between two diseases is defined based on the underlying molecular processes, a high score may hint at a shared cause, and therefore a similar treatment, for both diseases. This is of great practical importance especially when a rare or newly-discovered disease, for which limited information is available, is found to be related to a disease with a known treatment. Thus, in this mini-review we briefly discuss the recently developed methods for computing mechanism-based disease-disease similarities.
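The two-step pattern described here (assign a set of biomolecules to each disease, then compare the sets) can be sketched with the simplest possible set measure. The Jaccard index below is only one illustrative choice and is not claimed to be the measure used by any of the reviewed methods; the disease and gene names are placeholders.

```python
def jaccard_similarity(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of biomolecule identifiers."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        # Two empty sets carry no evidence of similarity.
        return 0.0
    return len(set_a & set_b) / len(union)


# Step 1: associate a set of genes with each disease (placeholder data).
disease_genes = {
    "disease_x": {"GENE1", "GENE2", "GENE3"},
    "disease_y": {"GENE2", "GENE3", "GENE4"},
}

# Step 2: score the pair by comparing the two sets.
score = jaccard_similarity(disease_genes["disease_x"], disease_genes["disease_y"])
```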
Affiliation(s)
- Mehdi B Hamaneh
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Yi-Kuo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
3
Wang L, Himmelstein DS, Santaniello A, Parvin M, Baranzini SE. iCTNet2: integrating heterogeneous biological interactions to understand complex traits. F1000Res 2015; 4:485. [PMID: 26834985 PMCID: PMC4706053 DOI: 10.12688/f1000research.6836.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Accepted: 09/25/2015] [Indexed: 09/25/2024]
Abstract
iCTNet (integrated Complex Traits Networks) version 2 is a Cytoscape app and database that allows researchers to build heterogeneous networks by integrating a variety of biological interactions, thus offering a systems-level view of human complex traits. iCTNet2 is built from a variety of large-scale biological datasets, collected from public repositories to facilitate the building, visualization and analysis of heterogeneous biological networks in a comprehensive fashion via the Cytoscape platform. iCTNet2 is freely available at the Cytoscape app store.
Affiliation(s)
- Lili Wang
  - School of Computing, Queen’s University, Kingston, Ontario, K7L 3N6, Canada
- Daniel S. Himmelstein
  - Graduate Program in Biological and Medical Informatics, University of California, San Francisco, San Francisco, CA, 94143-0523, USA
- Adam Santaniello
  - Department of Neurology, University of California San Francisco, San Francisco, CA, 94158, USA
- Mousavi Parvin
  - School of Computing, Queen’s University, Kingston, Ontario, K7L 3N6, Canada
- Sergio E. Baranzini
  - Department of Neurology, University of California San Francisco, San Francisco, CA, 94158, USA
  - Graduate Program in Biological and Medical Informatics, University of California, San Francisco, San Francisco, CA, 94143-0523, USA
4
Wang L, Himmelstein DS, Santaniello A, Parvin M, Baranzini SE. iCTNet2: integrating heterogeneous biological interactions to understand complex traits. F1000Res 2015; 4:485. [PMID: 26834985 PMCID: PMC4706053 DOI: 10.12688/f1000research.6836.2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Accepted: 09/25/2015] [Indexed: 01/03/2023]
Abstract
iCTNet (integrated Complex Traits Networks) version 2 is a Cytoscape app and database that allows researchers to build heterogeneous networks by integrating a variety of biological interactions, thus offering a systems-level view of human complex traits. iCTNet2 is built from a variety of large-scale biological datasets, collected from public repositories to facilitate the building, visualization and analysis of heterogeneous biological networks in a comprehensive fashion via the Cytoscape platform. iCTNet2 is freely available at the Cytoscape app store.
Affiliation(s)
- Lili Wang
  - School of Computing, Queen's University, Kingston, Ontario, K7L 3N6, Canada
- Daniel S Himmelstein
  - Graduate Program in Biological and Medical Informatics, University of California, San Francisco, San Francisco, CA, 94143-0523, USA
- Adam Santaniello
  - Department of Neurology, University of California San Francisco, San Francisco, CA, 94158, USA
- Mousavi Parvin
  - School of Computing, Queen's University, Kingston, Ontario, K7L 3N6, Canada
- Sergio E Baranzini
  - Department of Neurology, University of California San Francisco, San Francisco, CA, 94158, USA
  - Graduate Program in Biological and Medical Informatics, University of California, San Francisco, San Francisco, CA, 94143-0523, USA
5
Himmelstein DS, Baranzini SE. Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes. PLoS Comput Biol 2015; 11:e1004259. [PMID: 26158728 PMCID: PMC4497619 DOI: 10.1371/journal.pcbi.1004259] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.9] [Received: 12/10/2014] [Accepted: 03/26/2015] [Indexed: 12/13/2022]
Abstract
The first decade of Genome Wide Association Studies (GWAS) has uncovered a wealth of disease-associated variants. Two important derivations will be the translation of this information into a multiscale understanding of pathogenic variants and leveraging existing data to increase the power of existing and future studies through prioritization. We explore edge prediction on heterogeneous networks—graphs with multiple node and edge types—for accomplishing both tasks. First we constructed a network with 18 node types—genes, diseases, tissues, pathophysiologies, and 14 MSigDB (molecular signatures database) collections—and 19 edge types from high-throughput publicly-available resources. From this network composed of 40,343 nodes and 1,608,168 edges, we extracted features that describe the topology between specific genes and diseases. Next, we trained a model from GWAS associations and predicted the probability of association between each protein-coding gene and each of 29 well-studied complex diseases. The model, which achieved 132-fold enrichment in precision at 10% recall, outperformed any individual domain, highlighting the benefit of integrative approaches. We identified pleiotropy, transcriptional signatures of perturbations, pathways, and protein interactions as influential mechanisms explaining pathogenesis. Our method successfully predicted the results (with AUROC = 0.79) from a withheld multiple sclerosis (MS) GWAS despite starting with only 13 previously associated genes. Finally, we combined our network predictions with statistical evidence of association to propose four novel MS genes, three of which (JAK2, REL, RUNX3) validated on the masked GWAS. Furthermore, our predictions provide biological support highlighting REL as the causal gene within its gene-rich locus. Users can browse all predictions online (http://het.io). 
Heterogeneous network edge prediction effectively prioritized genetic associations and provides a powerful new approach for data integration across multiple domains. For complex human diseases, identifying the genes harboring susceptibility variants has taken on medical importance. Disease-associated genes provide clues for elucidating disease etiology, predicting disease risk, and highlighting therapeutic targets. Here, we develop a method to predict whether a given gene and disease are associated. To capture the multitude of biological entities underlying pathogenesis, we constructed a heterogeneous network, containing multiple node and edge types. We built on a technique developed for social network analysis, which embraces disparate sources of data to make predictions from heterogeneous networks. Using the compendium of associations from genome-wide studies, we learned the influential mechanisms underlying pathogenesis. Our findings provide a novel perspective about the existence of pervasive pleiotropy across complex diseases. Furthermore, we suggest transcriptional signatures of perturbations are an underutilized resource amongst prioritization approaches. For multiple sclerosis, we demonstrated our ability to prioritize future studies and discover novel susceptibility genes. Researchers can use these predictions to increase the statistical power of their studies, to suggest the causal genes from a set of candidates, or to generate evidence-based experimental hypotheses.
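To make "features that describe the topology between specific genes and diseases" concrete, the sketch below counts walks between a gene and a disease that follow a fixed sequence of edge types. This is a simplified stand-in chosen for illustration; the abstract does not specify the feature definition, and the edge-type names, node names and function `count_typed_walks` are hypothetical.

```python
def count_typed_walks(edges, source, target, edge_types):
    """Count walks from `source` to `target` that follow `edge_types` in order.

    `edges` maps an edge type to an adjacency dict {node: set(neighbours)}.
    """
    # frontier[node] = number of walks from `source` that currently end at `node`.
    frontier = {source: 1}
    for edge_type in edge_types:
        next_frontier = {}
        adjacency = edges.get(edge_type, {})
        for node, walk_count in frontier.items():
            for neighbour in adjacency.get(node, ()):
                next_frontier[neighbour] = next_frontier.get(neighbour, 0) + walk_count
        frontier = next_frontier
    return frontier.get(target, 0)


# Toy heterogeneous network with two edge types (placeholder names).
toy_edges = {
    "interacts": {"GENE_A": {"GENE_B", "GENE_C"}},
    "associates": {"GENE_B": {"DISEASE_1"}, "GENE_C": {"DISEASE_1"}},
}

# Feature: number of gene -interacts-> gene -associates-> disease walks.
feature = count_typed_walks(toy_edges, "GENE_A", "DISEASE_1", ["interacts", "associates"])
```

One such count per edge-type sequence gives a feature vector for each gene-disease pair, which a classifier could then be trained on.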
Affiliation(s)
- Daniel S. Himmelstein
  - Biological & Medical Informatics, University of California, San Francisco, San Francisco, California, United States of America
- Sergio E. Baranzini
  - Biological & Medical Informatics, University of California, San Francisco, San Francisco, California, United States of America
  - Department of Neurology, University of California, San Francisco, San Francisco, California, United States of America
  - Institute for Human Genetics, University of California, San Francisco, San Francisco, California, United States of America
6
Hamaneh MB, Yu YK. DeCoaD: determining correlations among diseases using protein interaction networks. BMC Res Notes 2015; 8:226. [PMID: 26047952 PMCID: PMC4467632 DOI: 10.1186/s13104-015-1211-z] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 11/18/2014] [Accepted: 05/22/2015] [Indexed: 11/23/2022]
Abstract
Background: Disease–disease similarities can be investigated from multiple perspectives. Identifying similar diseases based on the underlying biomolecular interactions can be especially useful, because it may shed light on the common causes of the diseases and therefore may provide clues for possible treatments. Here we introduce DeCoaD, a web-based program that uses a novel method to assign pair-wise similarity scores, called correlations, to genetic diseases.
Findings: DeCoaD uses a random walk to model the flow of information in a network within which nodes are either diseases or proteins and links signify either protein–protein interactions or disease–protein associations. For each protein node, the total number of visits by the random walker is called the weight of that node. Using a disease as both the starting and the terminating points of the random walks, a corresponding vector, whose elements are the weights associated with the proteins, can be constructed. The similarity between two diseases is defined as the cosine of the angle between their associated vectors. For a user-specified disease, DeCoaD outputs a list of similar diseases (with their corresponding correlations), and a graphical representation of the disease families that they belong to. Based on a probabilistic clustering algorithm, DeCoaD also outputs the clusters that the disease of interest is a member of, and the corresponding probabilities. The program also provides an interface to run enrichment analysis for the given disease or for any of the clusters that contains it.
Conclusions: DeCoaD uses a novel algorithm to suggest non-trivial similarities between diseases with known gene associations, and also clusters the diseases based on their similarity scores. DeCoaD is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/mn/DeCoaD/.
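The notion of a protein's weight as the expected number of visits by a random walker that starts and ends at a disease can be sketched as follows. This is a simplified illustration under stated assumptions, not DeCoaD's implementation: the network contains a single disease node, the walker chooses uniformly among neighbours, and the function name `expected_visits` is hypothetical.

```python
def expected_visits(ppi_edges, disease_proteins, tol=1e-12, max_steps=100000):
    """Expected number of visits to each protein for a walk that leaves the
    disease node, moves uniformly at random, and stops on returning to it."""
    neighbours = {}
    for a, b in ppi_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seeds = set(disease_proteins)
    for protein in seeds:
        neighbours.setdefault(protein, set())

    # The walker leaves the disease uniformly towards its associated proteins.
    mass = {protein: 1.0 / len(seeds) for protein in seeds}
    visits = {protein: 0.0 for protein in neighbours}

    for _ in range(max_steps):
        if sum(mass.values()) < tol:
            break
        next_mass = {}
        for protein, probability in mass.items():
            visits[protein] += probability
            # Disease-associated proteins have one extra link, back to the disease.
            degree = len(neighbours[protein]) + (1 if protein in seeds else 0)
            if degree == 0:
                continue
            share = probability / degree
            for neighbour in neighbours[protein]:
                next_mass[neighbour] = next_mass.get(neighbour, 0.0) + share
            # The share that returns to the disease terminates the walk.
        mass = next_mass
    return visits


# Toy chain A - B - C, with the disease linked to protein A only.
weights = expected_visits([("A", "B"), ("B", "C")], ["A"])
```

For this toy chain the visit equations a = 1 + b/2, b = a/2 + c, c = b/2 give weights of 2, 2 and 1 for A, B and C.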
Affiliation(s)
- Mehdi B Hamaneh
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA.
- Yi-Kuo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA.
7
Hamaneh MB, Yu YK. Relating diseases by integrating gene associations and information flow through protein interaction network. PLoS One 2014; 9:e110936. [PMID: 25360770 PMCID: PMC4216010 DOI: 10.1371/journal.pone.0110936] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 04/08/2014] [Accepted: 09/27/2014] [Indexed: 12/31/2022]
Abstract
Identifying similar diseases could potentially provide deeper understanding of their underlying causes, and may even hint at possible treatments. For this purpose, it is necessary to have a similarity measure that reflects the underpinning molecular interactions and biological pathways. We have thus devised a network-based measure that can partially fulfill this goal. Our method assigns weights to all proteins (and consequently their encoding genes) by using information flow from a disease to the protein interaction network and back. Similarity between two diseases is then defined as the cosine of the angle between their corresponding weight vectors. The proposed method also provides a way to suggest disease-pathway associations by using the weights assigned to the genes to perform enrichment analysis for each disease. By calculating pairwise similarities between 2534 diseases, we show that our disease similarity measure is strongly correlated with the probability of finding the diseases in the same disease family and, more importantly, sharing biological pathways. We have also compared our results to those of MimMiner, a text-mining method that assigns pairwise similarity scores to diseases. We find the results of the two methods to be complementary. It is also shown that clustering diseases based on their similarities and performing enrichment analysis for the cluster centers significantly increases the term association rate, suggesting that the cluster centers are better representatives for biological pathways than the diseases themselves. This lends support to the view that our similarity measure is a good indicator of relatedness of biological processes involved in causing the diseases. Although not needed for understanding this paper, the raw results are available for download for further study at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbpmn/DiseaseRelations/.
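The similarity definition in this abstract, the cosine of the angle between two protein weight vectors, can be written down directly. The sketch below stores each vector as a dictionary keyed by protein; the weights shown are made-up numbers for illustration, not results from the paper.

```python
import math


def cosine_similarity(weights_u, weights_v):
    """Cosine of the angle between two sparse weight vectors (dicts keyed by protein)."""
    shared = set(weights_u) & set(weights_v)
    dot = sum(weights_u[p] * weights_v[p] for p in shared)
    norm_u = math.sqrt(sum(w * w for w in weights_u.values()))
    norm_v = math.sqrt(sum(w * w for w in weights_v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat the similarity as 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up protein weights for two diseases.
disease_1 = {"P1": 3.0, "P2": 4.0}
disease_2 = {"P1": 4.0, "P2": 3.0}
similarity = cosine_similarity(disease_1, disease_2)  # 24 / (5 * 5)
```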
Affiliation(s)
- Mehdi Bagheri Hamaneh
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States of America
- Yi-Kuo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States of America
8
Stojmirović A, Yu YK. Building a hierarchical organization of protein complexes out of protein association data. PLoS One 2014; 9:e100098. [PMID: 24978199 PMCID: PMC4076247 DOI: 10.1371/journal.pone.0100098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2014] [Accepted: 05/22/2014] [Indexed: 11/18/2022]
Abstract
Organizing experimentally determined protein associations as a hierarchy can be a good approach to elucidating the content of protein complexes and the modularity of subcomplexes. Several challenges exist. First, intrinsically sticky proteins, such as chaperones, are often falsely assigned to many functionally unrelated complexes. Second, the reported collections of proteins may not be true "complexes" in the sense that they bind together and perform a joint cellular function. Third, due to imperfect sensitivity of protein detection methods, both false positive and false negative assignments of a protein to complexes may occur. We mitigate the first issue by down-weighting sticky proteins by their occurrence frequencies. We approach the other two problems by merging nearly identical complexes and by constructing a directed acyclic graph (DAG) based on the relationship of partial inclusion. The constructed DAG, within which smaller complexes form parts of the larger, can reveal how different complexes are joined. By merging almost identical complexes one can deemphasize the influence of false positives, while allowing false negatives to be rescued by other nearly identical association data. We investigate several protein weighting schemes and compare their corresponding DAGs using yeast and human complexes. We find that the scheme incorporating weights based on information flow in the network of direct protein-protein interactions produces biologically most meaningful DAGs. In either yeast or human, isolated nodes form a large proportion of the final hierarchy. While most connected components encompass very few nodes, the largest one for each species contains a sizable portion of all nodes. By considering examples of subgraphs composed of nodes containing a specified protein, we illustrate that the graphs' topological features can correctly suggest the biological roles of protein complexes. 
The input data, final results and the source code are available at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbpmn/ProteinComplexDAG/.
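Two ideas from this abstract, down-weighting frequently occurring ("sticky") proteins and linking complexes by partial inclusion, can be sketched together. The inverse-frequency weight, the 0.8 threshold and the function names are assumptions made for illustration; the paper compares several weighting schemes that are not reproduced here.

```python
from collections import Counter


def protein_weights(complexes):
    """Weight each protein by the inverse of the number of complexes it appears in."""
    counts = Counter(p for members in complexes.values() for p in set(members))
    return {protein: 1.0 / n for protein, n in counts.items()}


def inclusion_edges(complexes, threshold=0.8):
    """Return (parent, child) edges where the child is mostly contained in the parent.

    Edges only point from a complex of larger total weight to one of strictly
    smaller total weight, so the resulting graph cannot contain a cycle.
    """
    weights = protein_weights(complexes)
    total = {name: sum(weights[p] for p in set(members))
             for name, members in complexes.items()}
    edges = []
    for child in sorted(complexes):
        if total[child] == 0:
            continue  # Skip empty complexes.
        for parent in sorted(complexes):
            if child == parent or total[child] >= total[parent]:
                continue
            shared = set(complexes[child]) & set(complexes[parent])
            inclusion = sum(weights[p] for p in shared) / total[child]
            if inclusion >= threshold:
                edges.append((parent, child))
    return edges


# Toy data: HSP stands in for a sticky chaperone present in every complex.
toy_complexes = {
    "C1": {"A", "B", "HSP"},
    "C2": {"A", "B", "C", "HSP"},
    "C3": {"X", "Y", "HSP"},
}
dag_edges = inclusion_edges(toy_complexes)
```

In the toy data, sharing only the sticky protein is not enough to link C3 to the other complexes, while C1 is fully contained in C2.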
Affiliation(s)
- Aleksandar Stojmirović
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States
- Yi-Kuo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States
9
Teku GN, Ortutay C, Vihinen M. Identification of core T cell network based on immunome interactome. BMC Syst Biol 2014; 8:17. [PMID: 24528953 PMCID: PMC3937033 DOI: 10.1186/1752-0509-8-17] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/05/2013] [Accepted: 02/05/2014] [Indexed: 12/03/2022]
Abstract
Background: Data-driven studies on the dynamics of reconstructed protein-protein interaction (PPI) networks facilitate investigation and identification of proteins important for particular processes or diseases and reduce the time and cost of experimental verification. Modeling the dynamics of very large PPI networks is computationally costly.
Results: To circumvent this problem, we created a link-weighted human immunome interactome and performed filtering. We reconstructed the immunome interactome and weighted the links using jackknife gene expression correlation of integrated, time course gene expression data. Statistical significance of the links was computed using the Global Statistical Significance (GloSS) filtering algorithm. P-values from GloSS were computed for the integrated, time course gene expression data. We filtered the immunome interactome to identify core components of the T cell PPI network (TPPIN). The interconnectedness of the major pathways for T cell survival and response, including the T cell receptor, MAPK and JAK-STAT pathways, is maintained in the TPPIN network. The obtained TPPIN network is supported by both Gene Ontology term enrichment analysis and essential gene enrichment analysis.
Conclusions: By integrating gene expression data into the immunome interactome and using a weighted network filtering method, we identified the T cell PPI immune response network. This network reveals the most central and crucial network in T cells. The approach is general and applicable to any dataset that contains sufficient information.
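A jackknife correlation can be sketched as a leave-one-out Pearson correlation. The abstract does not say how the leave-one-out values are combined, so the variant below, which keeps the minimum so that a single outlying time point cannot inflate a link weight, is an assumption; the expression values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def jackknife_correlation(x, y):
    """Minimum Pearson correlation over all leave-one-out subsamples."""
    x, y = list(x), list(y)
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length with at least 3 points")
    leave_one_out = [
        pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))
    ]
    return min(leave_one_out)


# Invented expression profiles: the last time point is a shared outlier.
gene_u = [0.0, 0.1, -0.1, 0.05, 10.0]
gene_v = [0.0, -0.1, 0.1, 0.0, 10.0]
full = pearson(gene_u, gene_v)
robust = jackknife_correlation(gene_u, gene_v)
```

Here the full-sample correlation is dominated by the outlier, while the jackknife value reflects the profiles without it.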
Affiliation(s)
- Mauno Vihinen
  - Department of Experimental Medical Science, Lund University, Lund, Sweden.
10
Abstract
Digenic inheritance (DI) is the simplest form of inheritance for genetically complex diseases. By contrast with the thousands of reports that mutations in single genes cause human diseases, there are only dozens of human disease phenotypes with evidence for DI in some pedigrees. The advent of high-throughput sequencing (HTS) has made it simpler to identify monogenic disease causes and could similarly simplify proving DI because one can simultaneously find mutations in two genes in the same sample. However, through 2012, I could find only one example of human DI in which HTS was used; in that example, HTS found only the second of the two genes. To explore the gap between expectation and reality, I tried to collect all examples of human DI with a narrow definition and characterise them according to the types of evidence collected, and whether there has been replication. Two strong trends are that knowledge of candidate genes and knowledge of protein–protein interactions (PPIs) have been helpful in most published examples of human DI. By contrast, the positional method of genetic linkage analysis has been mostly unsuccessful in identifying genes underlying human DI. Based on the empirical data, I suggest that combining HTS with growing networks of established PPIs may expedite future discoveries of human DI and strengthen the evidence for them.
11
Mora A, Donaldson IM. iRefR: an R package to manipulate the iRefIndex consolidated protein interaction database. BMC Bioinformatics 2011; 12:455. [PMID: 22115179 PMCID: PMC3282787 DOI: 10.1186/1471-2105-12-455] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.9] [Received: 07/22/2011] [Accepted: 11/24/2011] [Indexed: 11/19/2022]
Abstract
Background: The iRefIndex addresses the need to consolidate protein interaction data into a single uniform data resource. iRefR provides the user with access to this data source from an R environment.
Results: The iRefR package includes tools for selecting specific subsets of interest from the iRefIndex by criteria such as organism, source database, experimental method, protein accessions and publication identifier. Data may be converted between three representations (MITAB, edgeList and graph) for use with other R packages such as igraph, graph and RBGL. The user may choose between different methods for resolving redundancies in interaction data and how n-ary data is represented. In addition, we describe a function to identify binary interaction records that possibly represent protein complexes. We show that the user choice of data selection, redundancy resolution and n-ary data representation all have an impact on graphical analysis.
Conclusions: The package allows the user to control how these issues are dealt with and communicate them via an R-script written using the iRefR package - this will facilitate communication of methods, reproducibility of network analyses and further modification and comparison of methods by researchers.
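The point that the representation of n-ary data affects graph analysis can be illustrated with two widely used conventions for turning a complex into binary edges: the spoke model (bait to each prey) and the matrix model (all pairs). The sketch is in Python rather than R and does not use or mirror the iRefR API; the function names and protein labels are hypothetical.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Connect the bait to every prey: n - 1 edges for a complex of n proteins."""
    return {tuple(sorted((bait, prey))) for prey in preys if prey != bait}


def matrix_expansion(members):
    """Connect every pair of members: n * (n - 1) / 2 edges for n proteins."""
    return {tuple(sorted(pair)) for pair in combinations(set(members), 2)}


# One four-protein complex purified with bait "A" (labels are placeholders).
spoke_edges = spoke_expansion("A", ["B", "C", "D"])
matrix_edges = matrix_expansion(["A", "B", "C", "D"])
```

The same n-ary record yields 3 edges under the spoke model and 6 under the matrix model, so degree and clustering statistics computed on the resulting graphs will differ.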
Affiliation(s)
- Antonio Mora
  - Department for Molecular Biosciences, University of Oslo, P.O. Box 1041 Blindern, 0316 Oslo, Norway