1
Li Q, Wang Y, Wang J, Zhao C. Improving drug repositioning accuracy using non-negative matrix tri-factorization. Sci Rep 2025; 15:7840. [PMID: 40050702 PMCID: PMC11885831 DOI: 10.1038/s41598-025-91757-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2024] [Accepted: 02/24/2025] [Indexed: 03/09/2025] Open
Abstract
Drug repositioning is a transformative approach in drug discovery, offering a pathway to repurpose existing drugs for new therapeutic uses. In this study, we introduce the IDDNMTF model designed to predict drug repositioning opportunities with greater precision. The IDDNMTF model integrates multiple datasets, allowing for a more comprehensive analysis of drug-disease associations. We evaluated the IDDNMTF model using various combinations of datasets and found that its performance, as measured by AUC, AUPR, and F1 scores, improved with the inclusion of more data. This trend underscores the importance of data diversity in strengthening predictive capabilities. Comparatively, the IDDNMTF model demonstrated superior performance against the NMF model, solidifying its potential in drug repositioning. In summary, the IDDNMTF model offers a promising tool for identifying new therapeutic uses for existing drugs. Its predictive accuracy and interpretability are poised to accelerate the transition from bench to bedside, contributing to personalized medicine and the development of targeted treatments.
Affiliation(s)
- Qingmei Li
  - Honghui Hospital, Xi'an Jiaotong University, Xi'an, 710054, China
- Yangyang Wang
  - School of Electronics and Information, Northwestern Polytechnical University, Xi'an, 710129, China
- Jihan Wang
  - Shaanxi Provincial Key Laboratory of Infection and Immune Diseases, Shaanxi Provincial People's Hospital, Xi'an, 710068, China
- Congzhe Zhao
  - Honghui Hospital, Xi'an Jiaotong University, Xi'an, 710054, China
2
Ortjohann M, Leippe M. Molecular characterization of two newly recognized lysozymes of the protist Dictyostelium discoideum. Dev Comp Immunol 2025; 164:105334. [PMID: 39909204 DOI: 10.1016/j.dci.2025.105334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 01/23/2025] [Accepted: 02/02/2025] [Indexed: 02/07/2025]
Abstract
The model organism Dictyostelium discoideum functions as a social amoeba that can aggregate, eventually forming a fruiting body composed of a fixed number of cells. This behavior requires a soluble counting factor (CF) complex, which plays a key role in group size determination and has been identified earlier. The CF complex comprises, among others, the proteins CF45-1 and CF50. Although both proteins share sequence similarities with characterized Chalaropsis- and Entamoeba-type lysozymes, enzymatic activity has not been confirmed until now. CF lysozymes have unusual sequence characteristics, consisting of an N-terminal glycoside hydrolase family 25 (GH25) domain and a C-terminal low-complexity region rich in serine, glycine, alanine, and asparagine residues. In this study, we present the production and purification of soluble recombinant CF lysozymes and demonstrate notable enzymatic activity, in particular for CF50. Additionally, a truncated version of CF50, which lacks the C-terminal low-complexity region, displayed significantly enhanced lysozyme activity compared to the full-length enzyme. Both CF lysozymes exhibited strict pH dependence, with maximal activity observed under acidic conditions at pH 3.0-3.5. Moreover, the enzymes displayed the highest activity at low ionic strengths and were stable only at relatively low temperatures. Using structural modeling and site-directed mutagenesis, we identified a glutamic acid residue essential for catalysis. In conclusion, we propose a neighboring group catalytic mechanism analogous to that of other GH25 lysozymes.
Affiliation(s)
- Marius Ortjohann
  - Comparative Immunobiology, Zoological Institute, Christian-Albrechts-Universität Kiel, Am Botanischen Garten 1-9, D-24118, Kiel, Germany
- Matthias Leippe
  - Comparative Immunobiology, Zoological Institute, Christian-Albrechts-Universität Kiel, Am Botanischen Garten 1-9, D-24118, Kiel, Germany
3
Luo H, Yang H, Zhang G, Wang J, Luo J, Yan C. KGRDR: a deep learning model based on knowledge graph and graph regularized integration for drug repositioning. Front Pharmacol 2025; 16:1525029. [PMID: 40008124 PMCID: PMC11850324 DOI: 10.3389/fphar.2025.1525029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2024] [Accepted: 01/13/2025] [Indexed: 02/27/2025] Open
Abstract
Computational drug repositioning, serving as an effective alternative to traditional drug discovery, plays a key role in optimizing drug development. This approach can accelerate the development of new therapeutic options while reducing costs and mitigating risks. In this study, we propose a novel deep learning-based framework, KGRDR, which combines multi-similarity integration and knowledge graph learning to predict potential drug-disease interactions. Specifically, a graph regularized approach is applied to integrate multiple drug and disease similarity information, which can effectively eliminate noisy data and obtain integrated similarity features of drugs and diseases. Then, topological feature representations of drugs and diseases are learned from constructed biomedical knowledge graphs (KGs), which encompass known drug-related and disease-related interactions. Next, the similarity features and topological features are fused using an attention-based feature fusion method. Finally, drug-disease associations are predicted using a graph convolutional network. Experimental results demonstrate that KGRDR achieves better performance than state-of-the-art drug-disease prediction methods. Moreover, case study results further validate the effectiveness of KGRDR in predicting novel drug-disease interactions.
Affiliation(s)
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
  - Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
- Hui Yang
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
  - Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
- Ge Zhang
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
  - Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
- Jianlin Wang
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
  - Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
- Junwei Luo
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
  - Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, China
  - Academy for Advanced Interdisciplinary Studies, Henan University, Zhengzhou, China
4
Korlepara DB, C S V, Srivastava R, Pal PK, Raza SH, Kumar V, Pandit S, Nair AG, Pandey S, Sharma S, Jeurkar S, Thakran K, Jaglan R, Verma S, Ramachandran I, Chatterjee P, Nayar D, Priyakumar UD. PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications. Sci Data 2024; 11:180. [PMID: 38336857 PMCID: PMC10858175 DOI: 10.1038/s41597-023-02872-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Accepted: 12/21/2023] [Indexed: 02/12/2024] Open
Abstract
Computing binding affinities is of great importance in the drug discovery pipeline, and their prediction using advanced machine learning methods remains a major challenge, as existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we have developed the PLAS-20k dataset, an extension of the previously developed PLAS-5k, with 97,500 independent simulations on a total of 19,500 different protein-ligand complexes. Our results show good correlation with the available experimental values, performing better than docking scores. This holds true even for a subset of ligands that follow Lipinski's rule, and for diverse clusters of complex structures, thereby highlighting the importance of the PLAS-20k dataset in developing new ML models. In addition, our dataset is better than docking at classifying strong and weak binders. Further, the OnionNet model has been retrained on the PLAS-20k dataset and is provided as a baseline for the prediction of binding affinities. We believe that large-scale MD-based datasets along with trajectories will form new synergy, paving the way for accelerating drug discovery.
Affiliation(s)
- Divya B Korlepara
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
  - Division of Physics, School of Advanced Sciences, Vellore Institute of Technology, Chennai, 600127, India
- Vasavi C S
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
  - Department of Artificial Intelligence, School of Artificial Intelligence, Amrita Vishwa Vidyapeetham, Bengaluru, 560035, India
- Rakesh Srivastava
  - Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Pradeep Kumar Pal
  - Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Saalim H Raza
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
- Vishal Kumar
  - Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Shivam Pandit
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
- Aathira G Nair
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
- Sanjana Pandey
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
- Shubham Sharma
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
- Shruti Jeurkar
  - Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
- Kavita Thakran
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
- Reena Jaglan
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
- Shivangi Verma
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
- Indhu Ramachandran
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
- Prathit Chatterjee
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
- Divya Nayar
  - Department of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, 110016, India
- U Deva Priyakumar
  - IHub-Data, International Institute of Information Technology, Hyderabad, 500032, India
  - Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, India
5
Li MM, Huang K, Zitnik M. Graph representation learning in biomedicine and healthcare. Nat Biomed Eng 2022; 6:1353-1369. [PMID: 36316368 PMCID: PMC10699434 DOI: 10.1038/s41551-022-00942-x] [Citation(s) in RCA: 72] [Impact Index Per Article: 24.0] [Received: 10/30/2021] [Accepted: 08/09/2022] [Indexed: 11/11/2022]
Abstract
Networks, or graphs, are universal descriptors of systems of interacting elements. In biomedicine and healthcare, they can represent, for example, molecular interactions, signalling pathways, disease co-morbidities or healthcare systems. In this Perspective, we posit that representation learning can realize principles of network medicine, discuss successes and current limitations of the use of representation learning on graphs in biomedicine and healthcare, and outline algorithmic strategies that leverage the topology of graphs to embed them into compact vectorial spaces. We argue that graph representation learning will keep pushing forward machine learning for biomedicine and healthcare applications, including the identification of genetic variants underlying complex traits, the disentanglement of single-cell behaviours and their effects on health, the assistance of patients in diagnosis and treatment, and the development of safe and effective medicines.
Affiliation(s)
- Michelle M Li
  - Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Kexin Huang
  - Health Data Science Program, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Marinka Zitnik
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Harvard Data Science Initiative, Cambridge, MA, USA
6
Zhang X, Wang W, Ren CX, Dai DQ. Learning representation for multiple biological networks via a robust graph regularized integration approach. Brief Bioinform 2021; 23:6381251. [PMID: 34607360 DOI: 10.1093/bib/bbab409] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/23/2021] [Revised: 08/23/2021] [Accepted: 09/06/2021] [Indexed: 01/18/2023] Open
Abstract
Learning node representation is a fundamental problem in biological network analysis, as compact representation features reveal complicated network structures and carry useful information for downstream tasks such as link prediction and node classification. Recently, multiple networks that profile objects from different aspects have increasingly accumulated, providing the opportunity to learn objects from multiple perspectives. However, the complex common and specific information across different networks poses challenges to node representation methods. Moreover, ubiquitous noise in networks calls for more robust representation. To deal with these problems, we present a representation learning method for multiple biological networks. First, we accommodate the noise and spurious edges in networks using denoised diffusion, providing robust connectivity structures for the subsequent representation learning. Then, we introduce a graph regularized integration model to combine refined networks and compute common representation features. By using the regularized decomposition technique, the proposed model can effectively preserve the common structural property of different networks and simultaneously accommodate their specific information, leading to a consistent representation. A simulation study shows the superiority of the proposed method on different levels of noisy networks. Three network-based inference tasks, including drug-target interaction prediction, gene function identification and fine-grained species categorization, are conducted using representation features learned from our method. Biological networks at different scales and levels of sparsity are involved. Experimental results on real-world data show that the proposed method has robust performance compared with alternatives. Overall, by eliminating noise and integrating effectively, the proposed method is able to learn useful representations from multiple biological networks.
Affiliation(s)
- Xiwen Zhang
  - Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
- Weiwen Wang
  - Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
- Chuan-Xian Ren
  - Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
- Dao-Qing Dai
  - Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
7
Shu J, Li Y, Wang S, Xi B, Ma J. Disease gene prediction with privileged information and heteroscedastic dropout. Bioinformatics 2021; 37:i410-i417. [PMID: 34252957 PMCID: PMC8275341 DOI: 10.1093/bioinformatics/btab310] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Accepted: 04/24/2021] [Indexed: 11/19/2022] Open
Abstract
Motivation: Recently, machine learning models have achieved tremendous success in prioritizing candidate genes for genetic diseases. These models are able to accurately quantify the similarity between diseases and genes based on the intuition that similar genes are more likely to be associated with similar diseases. However, the genetic features these methods rely on are often hard to collect due to high experimental cost and various other technical limitations. Existing solutions to this problem significantly increase the risk of overfitting and decrease the generalizability of the models. Results: In this work, we propose a graph neural network (GNN) version of the Learning under Privileged Information paradigm to predict new disease gene associations. Unlike previous gene prioritization approaches, our model does not require the genetic features to be the same at training and test stages. If a genetic feature is hard to measure and therefore missing at the test stage, our model could still efficiently incorporate its information during the training process. To implement this, we develop a Heteroscedastic Gaussian Dropout algorithm, where the dropout probability of the GNN model is determined by another GNN model with a mirrored GNN architecture. To evaluate our method, we compared it with four state-of-the-art methods on the Online Mendelian Inheritance in Man dataset to prioritize candidate disease genes. Extensive evaluations show that our model could improve the prediction accuracy when all the features are available compared to other methods. More importantly, our model could make very accurate predictions when >90% of the features are missing at the test stage. Availability and implementation: Our method is realized with Python 3.7 and PyTorch 1.5.0, and the method and data are freely available at: https://github.com/juanshu30/Disease-Gene-Prioritization-with-Privileged-Information-and-Heteroscedastic-Dropout.
Affiliation(s)
- Juan Shu
  - Department of Statistics, Purdue University, West Lafayette, IN 47906, USA
- Yu Li
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China
- Sheng Wang
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
- Bowei Xi
  - Department of Statistics, Purdue University, West Lafayette, IN 47906, USA
- Jianzhu Ma
  - Institute for Artificial Intelligence, Peking University, Beijing 100871, China
8
Raimondi D, Simm J, Arany A, Moreau Y. A novel method for data fusion over Entity-Relation graphs and its application to protein-protein interaction prediction. Bioinformatics 2021; 37:2275-2281. [PMID: 33560405 DOI: 10.1093/bioinformatics/btab092] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/26/2020] [Revised: 01/14/2021] [Accepted: 02/04/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Modern Bioinformatics is facing increasingly complex problems to solve, and we are indeed rapidly approaching an era in which the ability to seamlessly integrate heterogeneous sources of information will be crucial for scientific progress. Here we present a novel non-linear data fusion framework that generalizes the conventional Matrix Factorization paradigm, allowing inference over arbitrary Entity-Relation graphs, and we applied it to the prediction of Protein-Protein Interactions (PPIs). Improving our knowledge of PPI networks at the proteome scale is indeed crucial to understand protein function, physiological and disease states and cell life in general. RESULTS: We devised three data-fusion based models for the proteome-level prediction of PPIs, and we show that our method outperforms state-of-the-art approaches on common benchmarks. Moreover, we investigate its predictions on newly published PPIs, showing that this new data has a clear shift in its underlying distributions, and we thus train and test our models on this extended dataset. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jaak Simm
  - ESAT-STADIUS, KU Leuven, 3001 Leuven, Belgium
- Adam Arany
  - ESAT-STADIUS, KU Leuven, 3001 Leuven, Belgium
- Yves Moreau
  - ESAT-STADIUS, KU Leuven, 3001 Leuven, Belgium
9
Ata SK, Wu M, Fang Y, Ou-Yang L, Kwoh CK, Li XL. Recent advances in network-based methods for disease gene prediction. Brief Bioinform 2020; 22:6023077. [PMID: 33276376 DOI: 10.1093/bib/bbaa303] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 07/20/2020] [Revised: 09/29/2020] [Accepted: 10/10/2020] [Indexed: 01/28/2023] Open
Abstract
Disease-gene association through genome-wide association study (GWAS) is an arduous task for researchers. Investigating single nucleotide polymorphisms that correlate with specific diseases needs statistical analysis of associations. Considering the huge number of possible mutations, in addition to its high cost, another important drawback of GWAS analysis is the large number of false positives. Thus, researchers search for more evidence to cross-check their results through different sources. To provide the researchers with alternative and complementary low-cost disease-gene association evidence, computational approaches come into play. Since molecular networks are able to capture complex interplay among molecules in diseases, they become one of the most extensively used data for disease-gene association prediction. In this survey, we aim to provide a comprehensive and up-to-date review of network-based methods for disease gene prediction. We also conduct an empirical analysis on 14 state-of-the-art methods. To summarize, we first elucidate the task definition for disease gene prediction. Secondly, we categorize existing network-based efforts into network diffusion methods, traditional machine learning methods with handcrafted graph features and graph representation learning methods. Thirdly, an empirical analysis is conducted to evaluate the performance of the selected methods across seven diseases. We also provide distinguishing findings about the discussed methods based on our empirical analysis. Finally, we highlight potential research directions for future studies on disease gene prediction.
Affiliation(s)
- Sezin Kircali Ata
  - School of Computer Science and Engineering, Nanyang Technological University (NTU)
- Min Wu
  - Institute for Infocomm Research (I2R), A*STAR, Singapore
- Yuan Fang
  - School of Information Systems, Singapore Management University, Singapore
- Le Ou-Yang
  - College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Xiao-Li Li
  - Department head and principal scientist at I2R, A*STAR, Singapore
10
Lamrabet O, Jauslin T, Lima WC, Leippe M, Cosson P. The multifarious lysozyme arsenal of Dictyostelium discoideum. Dev Comp Immunol 2020; 107:103645. [PMID: 32061941 DOI: 10.1016/j.dci.2020.103645] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/09/2020] [Revised: 02/12/2020] [Accepted: 02/12/2020] [Indexed: 06/10/2023]
Abstract
Dictyostelium discoideum is a free-living soil amoeba which feeds upon bacteria. To bind, ingest, and kill bacteria, D. discoideum uses molecular mechanisms analogous to those found in professional phagocytic cells of multicellular organisms. D. discoideum is equipped with a large arsenal of antimicrobial peptides and proteins including amoebapore-like peptides and lysozymes. This review describes the family of lysozymes in D. discoideum. We identified 22 genes potentially encoding four different types of lysozymes in the D. discoideum genome. Although most of these genes are also present in the genomes of other amoebal species, no other organism is as well-equipped with lysozyme genes as D. discoideum.
Affiliation(s)
- Otmane Lamrabet
  - Faculty of Medicine, University of Geneva, Centre Médical Universitaire, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
- Tania Jauslin
  - Faculty of Medicine, University of Geneva, Centre Médical Universitaire, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
- Wanessa Cristina Lima
  - Faculty of Medicine, University of Geneva, Centre Médical Universitaire, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
- Matthias Leippe
  - Zoological Institute, Comparative Immunobiology, University of Kiel, Kiel, Germany
- Pierre Cosson
  - Faculty of Medicine, University of Geneva, Centre Médical Universitaire, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
11
Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities. Inf Fusion 2019; 50:71-91. [PMID: 30467459 PMCID: PMC6242341 DOI: 10.1016/j.inffus.2018.09.012] [Citation(s) in RCA: 262] [Impact Index Per Article: 43.7] [Indexed: 05/10/2023]
Abstract
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.
Affiliation(s)
- Marinka Zitnik
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Francis Nguyen
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
  - Princess Margaret Cancer Centre, Toronto, ON, Canada
- Bo Wang
  - Hikvision Research Institute, Santa Clara, CA, USA
- Jure Leskovec
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- Anna Goldenberg
  - Genetics & Genome Biology, SickKids Research Institute, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Vector Institute, Toronto, ON, Canada
- Michael M. Hoffman
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
  - Princess Margaret Cancer Centre, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Vector Institute, Toronto, ON, Canada
12
Zakeri P, Simm J, Arany A, ElShal S, Moreau Y. Gene prioritization using Bayesian matrix factorization with genomic and phenotypic side information. Bioinformatics 2019; 34:i447-i456. [PMID: 29949967 PMCID: PMC6022676 DOI: 10.1093/bioinformatics/bty289] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Indexed: 01/23/2023] Open
Abstract
Motivation: Most gene prioritization methods model each disease or phenotype individually, but this fails to capture patterns common to several diseases or phenotypes. To overcome this limitation, we formulate the gene prioritization task as the factorization of a sparsely filled gene-phenotype matrix, where the objective is to predict the unknown matrix entries. To deliver more accurate gene-phenotype matrix completion, we extend classical Bayesian matrix factorization to work with multiple side information sources. The availability of side information allows us to make non-trivial predictions for genes for which no previous disease association is known. Results: Our gene prioritization method can innovatively not only integrate data sources describing genes, but also data sources describing Human Phenotype Ontology terms. Experimental results on our benchmarks show that our proposed model can effectively improve accuracy over the well-established gene prioritization method, Endeavour. In particular, our proposed method offers promising results on diseases of the nervous system; diseases of the eye and adnexa; endocrine, nutritional and metabolic diseases; and congenital malformations, deformations and chromosomal abnormalities, when compared to Endeavour. Availability and implementation: The Bayesian data fusion method is implemented as a Python/C++ package: https://github.com/jaak-s/macau. It is also available as a Julia package: https://github.com/jaak-s/BayesianDataFusion.jl. All data and benchmarks generated or analyzed during this study can be downloaded at https://owncloud.esat.kuleuven.be/index.php/s/UGb89WfkZwMYoTn. Supplementary information: Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Pooya Zakeri
- Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven and imec, Kapeldreef Leuven, Belgium
| | - Jaak Simm
- Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven and imec, Kapeldreef Leuven, Belgium
| | - Adam Arany
- Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven and imec, Kapeldreef Leuven, Belgium
| | - Sarah ElShal
- Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven and imec, Kapeldreef Leuven, Belgium
| | - Yves Moreau
- Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven and imec, Kapeldreef Leuven, Belgium
| |
|
13
|
Guala D, Ogris C, Müller N, Sonnhammer ELL. Genome-wide functional association networks: background, data & state-of-the-art resources. Brief Bioinform 2019; 21:1224-1237. [PMID: 31281921 PMCID: PMC7373183 DOI: 10.1093/bib/bbz064] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 02/05/2019] [Revised: 04/29/2019] [Accepted: 05/04/2019] [Indexed: 02/06/2023] Open
Abstract
The vast amount of experimental data from recent advances in the field of high-throughput biology begs for integration into more complex data structures such as genome-wide functional association networks. Such networks have been used for elucidation of the interplay of intra-cellular molecules to make advances ranging from the basic science understanding of evolutionary processes to the more translational field of precision medicine. The allure of the field has resulted in rapid growth of the number of available network resources, each with unique attributes exploitable to answer different biological questions. Unfortunately, the high volume of network resources makes it impossible for the intended user to select an appropriate tool for their particular research question. The aim of this paper is to provide an overview of the underlying data and representative network resources as well as to mention methods of integration, allowing a customized approach to resource selection. Additionally, this report will provide a primer for researchers venturing into the field of network integration.
Affiliation(s)
- Dimitri Guala
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
| | - Christoph Ogris
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
| | - Nikola Müller
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
| | - Erik L L Sonnhammer
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
| |
|
14
|
Kumar AA, Van Laer L, Alaerts M, Ardeshirdavani A, Moreau Y, Laukens K, Loeys B, Vandeweyer G. pBRIT: gene prioritization by correlating functional and phenotypic annotations through integrative data fusion. Bioinformatics 2018; 34:2254-2262. [PMID: 29452392 PMCID: PMC6022555 DOI: 10.1093/bioinformatics/bty079] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 09/16/2017] [Revised: 01/25/2018] [Accepted: 02/12/2018] [Indexed: 12/31/2022] Open
Abstract
Motivation: Computational gene prioritization can aid in disease gene identification. Here, we propose pBRIT (prioritization using Bayesian Ridge regression and Information Theoretic model), a novel adaptive and scalable prioritization tool that integrates PubMed abstracts, Gene Ontology, sequence similarities, Mammalian and Human Phenotype Ontology, pathways, interactions, Disease Ontology, the Gene Association database and the Human Genome Epidemiology database into the prediction model. We explore and address effects of sparsity and inter-feature dependencies within annotation sources, and the impact of bias towards specific annotations. Results: pBRIT models feature dependencies and sparsity by an information-theoretic (data-driven) approach and applies intermediate-integration-based data fusion. Following the hypothesis that genes underlying similar diseases will share functional and phenotype characteristics, it incorporates Bayesian Ridge regression to learn a linear mapping between functional and phenotype annotations. Genes are prioritized on phenotypic concordance to the training genes. We evaluated pBRIT against nine existing methods, and on over 2000 HPO-gene associations retrieved after construction of pBRIT data sources. We achieve maximum AUC scores ranging from 0.92 to 0.96 against benchmark datasets and of 0.80 against the time-stamped HPO entries, indicating good performance with high sensitivity and specificity. Our model shows stable performance with regard to changes in the underlying annotation data, and is fast and scalable for implementation in routine pipelines. Availability and implementation: http://biomina.be/apps/pbrit/; https://bitbucket.org/medgenua/pbrit. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ajay Anand Kumar
- Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium
- Biomedical Informatics Research Network Antwerp (biomina), University of Antwerp, Antwerp, Belgium
| | - Lut Van Laer
- Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium
| | - Maaike Alaerts
- Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium
| | - Amin Ardeshirdavani
- Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Belgium
- imec, Leuven, Belgium
| | - Yves Moreau
- Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Belgium
- imec, Leuven, Belgium
| | - Kris Laukens
- Biomedical Informatics Research Network Antwerp (biomina), University of Antwerp, Antwerp, Belgium
- ADReM Data Laboratory, University of Antwerp, Antwerp, Belgium
| | - Bart Loeys
- Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium
| | - Geert Vandeweyer
- Center of Medical Genetics, University of Antwerp and Antwerp University Hospital, Antwerp, Belgium
- Biomedical Informatics Research Network Antwerp (biomina), University of Antwerp, Antwerp, Belgium
| |
|
15
|
Chen J, Zhang S. Matrix Integrative Analysis (MIA) of Multiple Genomic Data for Modular Patterns. Front Genet 2018; 9:194. [PMID: 29910825 PMCID: PMC5992392 DOI: 10.3389/fgene.2018.00194] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/12/2018] [Accepted: 05/11/2018] [Indexed: 11/13/2022] Open
Abstract
The increasing availability of high-throughput biological data, especially multi-dimensional genomic data across the same samples, has created an urgent need for modular and integrative analysis tools that can reveal the relationships among different layers of cellular activities. To this end, we present a MATLAB package, Matrix Integrative Analysis (MIA), which implements and extends four published methods based on two classical techniques: non-negative matrix factorization (NMF) and partial least squares (PLS). This package can integrate diverse types of genomic data (e.g., copy number variation, DNA methylation, gene expression, microRNA expression profiles, and/or gene network data) to identify the underlying modular patterns with each method. In particular, we demonstrate the differences between these two classes of methods, which gives users guidance on how to select a suitable method in the MIA package. MIA is a flexible tool that can handle a wide range of biological problems and data types. We also provide an executable version for users without a MATLAB license.
Affiliation(s)
- Jinyu Chen
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, CAS, Beijing, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Shihua Zhang
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, CAS, Beijing, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China
| |
|
16
|
Vitali F, Marini S, Pala D, Demartini A, Montoli S, Zambelli A, Bellazzi R. Patient similarity by joint matrix trifactorization to identify subgroups in acute myeloid leukemia. JAMIA Open 2018; 1:75-86. [PMID: 31984320 PMCID: PMC6951984 DOI: 10.1093/jamiaopen/ooy008] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 12/20/2017] [Revised: 03/07/2018] [Accepted: 03/20/2018] [Indexed: 12/31/2022] Open
Abstract
Objective: Computing patients’ similarity is of great interest in precision oncology since it supports clustering and subgroup identification, eventually leading to tailored therapies. The availability of large amounts of biomedical data, characterized by large feature sets and sparse content, motivates the development of new methods to compute patient similarities that can fuse heterogeneous data sources with the available knowledge. Materials and Methods: In this work, we developed a data integration approach based on matrix trifactorization to compute patient similarities by integrating several sources of data and knowledge. We assess the accuracy of the proposed method: (1) on several synthetic data sets whose similarity structures are affected by increasing levels of noise and data sparsity, and (2) on a real data set coming from an acute myeloid leukemia (AML) study. The results obtained are finally compared with those of traditional similarity calculation methods. Results: In the analysis of the synthetic data set, where the ground truth is known, we measured the capability of reconstructing the correct clusters, while in the AML study we evaluated the Kaplan-Meier curves obtained with the different clusters and measured their statistical difference by means of the log-rank test. In the presence of noise and sparse data, our data integration method outperforms other techniques, both in the synthetic and in the AML data. Discussion: In the case of multiple heterogeneous data sources, a matrix trifactorization technique can successfully fuse all the information in a joint model. We demonstrated how this approach can be efficiently applied to discover meaningful patient similarities and therefore may be considered a reliable data-driven strategy for the definition of new research hypotheses for precision oncology.
Conclusion: The better performance of the proposed approach presents an advantage over previous methods to provide accurate patient similarities supporting precision medicine.
Affiliation(s)
- F Vitali
- Center for Biomedical Informatics and Biostatistics, The University of Arizona, Tucson, Arizona, USA; BIO5 Institute, The University of Arizona, Tucson, Arizona, USA; Department of Medicine, The University of Arizona, Tucson, AZ, USA
| | - S Marini
- Department of Computational Biology and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
| | - D Pala
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, PV, Italy
| | - A Demartini
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, PV, Italy; Centre for Health Technologies, University of Pavia, PV, Italy
| | - S Montoli
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, PV, Italy; Centre for Health Technologies, University of Pavia, PV, Italy
| | - A Zambelli
- Oncology Unit, ASST Papa Giovanni XXIII, Bergamo, BG, Italy
| | - R Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, PV, Italy; Centre for Health Technologies, University of Pavia, PV, Italy; IRCCS Istituti Clinici Scientifici Maugeri, Pavia, PV, Italy
| |
|
17
|
Carreras-Puigvert J, Zitnik M, Jemth AS, Carter M, Unterlass JE, Hallström B, Loseva O, Karem Z, Calderón-Montaño JM, Lindskog C, Edqvist PH, Matuszewski DJ, Ait Blal H, Berntsson RPA, Häggblad M, Martens U, Studham M, Lundgren B, Wählby C, Sonnhammer ELL, Lundberg E, Stenmark P, Zupan B, Helleday T. A comprehensive structural, biochemical and biological profiling of the human NUDIX hydrolase family. Nat Commun 2017; 8:1541. [PMID: 29142246 PMCID: PMC5688067 DOI: 10.1038/s41467-017-01642-w] [Citation(s) in RCA: 112] [Impact Index Per Article: 14.0] [Received: 10/12/2016] [Accepted: 10/06/2017] [Indexed: 01/04/2023] Open
Abstract
The NUDIX enzymes are involved in cellular metabolism and homeostasis, as well as mRNA processing. Although highly conserved throughout all organisms, their biological roles and biochemical redundancies remain largely unclear. To address this, we globally resolve their individual properties and inter-relationships. We purify 18 of the human NUDIX proteins and screen 52 substrates, providing a substrate redundancy map. Using crystal structures, we generate sequence alignment analyses revealing four major structural classes. To a certain extent, their substrate preference redundancies correlate with structural classes, thus linking structure and activity relationships. To elucidate interdependence among the NUDIX hydrolases, we pairwise deplete them generating an epistatic interaction map, evaluate cell cycle perturbations upon knockdown in normal and cancer cells, and analyse their protein and mRNA expression in normal and cancer tissues. Using a novel FUSION algorithm, we integrate all data creating a comprehensive NUDIX enzyme profile map, which will prove fundamental to understanding their biological functionality.
Affiliation(s)
- Jordi Carreras-Puigvert
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden.
| | - Marinka Zitnik
- Faculty of Computer and Information Science, University of Ljubljana, SI-1000, Ljubljana, Slovenia
- Department of Computer Science, Stanford University, Palo Alto, CA, 94305, USA
| | - Ann-Sofie Jemth
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden
| | - Megan Carter
- Department of Biochemistry and Biophysics, Stockholm University, 106 91, Stockholm, Sweden
| | - Judith E Unterlass
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden
| | - Björn Hallström
- Cell Profiling-Affinity Proteomics, Science for Life Laboratory, KTH-Royal Institute of Technology, Stockholm, 17165, Sweden
| | - Olga Loseva
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden
| | - Zhir Karem
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden
| | - José Manuel Calderón-Montaño
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden
| | - Cecilia Lindskog
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory, 751 85, Uppsala, Sweden
| | - Per-Henrik Edqvist
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory, 751 85, Uppsala, Sweden
| | - Damian J Matuszewski
- Centre for Image Analysis and Science for Life Laboratory, Uppsala University, Uppsala, 751 05, Sweden
| | - Hammou Ait Blal
- Cell Profiling-Affinity Proteomics, Science for Life Laboratory, KTH-Royal Institute of Technology, Stockholm, 17165, Sweden
| | - Ronnie P A Berntsson
- Department of Biochemistry and Biophysics, Stockholm University, 106 91, Stockholm, Sweden
| | - Maria Häggblad
- Biochemical and Cellular Screening Facility, Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, 171 65, Sweden
| | - Ulf Martens
- Biochemical and Cellular Screening Facility, Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, 171 65, Sweden
| | - Matthew Studham
- Stockholm Bioinformatics Center, Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 171 21, Solna, Sweden
| | - Bo Lundgren
- Biochemical and Cellular Screening Facility, Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, 171 65, Sweden
| | - Carolina Wählby
- Centre for Image Analysis and Science for Life Laboratory, Uppsala University, Uppsala, 751 05, Sweden
| | - Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 171 21, Solna, Sweden
| | - Emma Lundberg
- Cell Profiling-Affinity Proteomics, Science for Life Laboratory, KTH-Royal Institute of Technology, Stockholm, 17165, Sweden
| | - Pål Stenmark
- Department of Biochemistry and Biophysics, Stockholm University, 106 91, Stockholm, Sweden
| | - Blaz Zupan
- Faculty of Computer and Information Science, University of Ljubljana, SI-1000, Ljubljana, Slovenia
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Thomas Helleday
- Division of Translational Medicine and Chemical Biology, Science for Life Laboratory, Department of Molecular Biochemistry and Biophysics, Karolinska Institutet, Stockholm, 171 65, Sweden.
| |
|
18
|
Abstract
Motivation: The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects, such as a gene and a disease, can be related in different ways, for example, directly via gene–disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. However, traditional methods disregard these semantics and thus cannot fully exploit their value in data modeling. Results: We present Medusa, an approach to detect size-k modules of objects that, taken together, appear most significant to another set of objects. Medusa operates on large-scale collections of heterogeneous datasets and explicitly distinguishes between diverse data semantics. It advances research along two dimensions: it builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program. Medusa is flexible in choosing or combining semantic meanings and provides theoretical guarantees about detection quality. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We demonstrate that in predicting gene–disease associations Medusa compares favorably to methods that ignore diverse semantic meanings. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics. Availability and implementation: Source code is at http://github.com/marinkaz/medusa. Contact: marinka@cs.stanford.edu, blaz.zupan@fri.uni-lj.si
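The idea of growing a size-k module through submodular optimization can be illustrated with the standard greedy algorithm. The sketch below is not Medusa's objective or code: it assumes a simple coverage function (how many pivot objects a set of candidates is linked to), which is monotone and submodular, and it adds one candidate at a time by largest marginal gain. The gene and disease labels and the function name are hypothetical.

```python
def greedy_module(candidates, k):
    """Greedily pick up to k candidates that maximize coverage.

    candidates maps a candidate name to the set of pivot objects it is
    linked to. Coverage of a module is the size of the union of those
    sets, an assumed stand-in for a real significance score.
    """
    selected, covered = [], set()
    for _ in range(min(k, len(candidates))):
        remaining = [c for c in sorted(candidates) if c not in selected]
        # Marginal gain: pivot objects newly covered by adding c.
        # max() keeps the first maximum, so ties break alphabetically.
        best = max(remaining, key=lambda c: len(candidates[c] - covered))
        selected.append(best)
        covered |= candidates[best]
    return selected, covered

# Hypothetical links between candidate genes and diseases.
links = {
    "g1": {"d1", "d2", "d3"},
    "g2": {"d3", "d4"},
    "g3": {"d1", "d2"},
    "g4": {"d5"},
}
module, covered = greedy_module(links, k=2)
print(module, sorted(covered))
```

For monotone submodular objectives under a size constraint, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal value, which is the usual source of quality guarantees for methods of this kind.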
Affiliation(s)
- Marinka Zitnik
- Department of Computer Science, Stanford University, CA 94305, USA; Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia 1000
| | - Blaz Zupan
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia 1000; Department of Molecular and Human Genetics, Baylor College of Medicine, TX 77030, USA
| |
|
19
|
Cho H, Berger B, Peng J. Compact Integration of Multi-Network Topology for Functional Analysis of Genes. Cell Syst 2016; 3:540-548.e5. [PMID: 27889536 DOI: 10.1016/j.cels.2016.10.017] [Citation(s) in RCA: 156] [Impact Index Per Article: 17.3] [Received: 06/09/2016] [Revised: 08/14/2016] [Accepted: 10/19/2016] [Indexed: 01/18/2023]
Abstract
The topological landscape of molecular or functional interaction networks provides a rich source of information for inferring functional patterns of genes or proteins. However, a pressing yet-unsolved challenge is how to combine multiple heterogeneous networks, each having different connectivity patterns, to achieve more accurate inference. Here, we describe the Mashup framework for scalable and robust network integration. In Mashup, the diffusion in each network is first analyzed to characterize the topological context of each node. Next, the high-dimensional topological patterns in individual networks are canonically represented using low-dimensional vectors, one per gene or protein. These vectors can then be plugged into off-the-shelf machine learning methods to derive functional insights about genes or proteins. We present tools based on Mashup that achieve state-of-the-art performance in three diverse functional inference tasks: protein function prediction, gene ontology reconstruction, and genetic interaction prediction. Mashup enables deeper insights into the structure of rapidly accumulating and diverse biological network data and can be broadly applied to other network science domains.
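The first step described above, characterizing each node's topological context by diffusion, can be sketched with a random walk with restart, one common way to model network diffusion. The code below is a minimal illustration under that assumption, not the Mashup implementation: the function name, the restart probability, and the toy path graph are all made up for the example, and the later dimensionality-reduction step is not shown.

```python
def diffusion_state(adj, start, restart=0.5, tol=1e-10, max_iter=1000):
    """Stationary distribution of a random walk with restart from `start`.

    adj is a symmetric adjacency matrix given as a list of lists of
    non-negative weights. At each step the walker returns to `start`
    with probability `restart`, otherwise it moves to a neighbor chosen
    in proportion to edge weight.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    p = [0.0] * n
    p[start] = 1.0
    for _ in range(max_iter):
        nxt = [0.0] * n
        nxt[start] = restart
        for i in range(n):
            if degree[i] == 0:
                # An isolated node has nowhere to go: send its mass home.
                nxt[start] += (1 - restart) * p[i]
                continue
            for j in range(n):
                if adj[i][j]:
                    nxt[j] += (1 - restart) * p[i] * adj[i][j] / degree[i]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p

# Toy path graph 0 - 1 - 2 - 3.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
state = diffusion_state(path, start=0)
print([round(x, 4) for x in state])
```

Computing one such vector per node gives a high-dimensional description of each node's neighborhood, which a method in this family would then compress into a low-dimensional vector per gene or protein.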
Affiliation(s)
- Hyunghoon Cho
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA 02139, USA
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA 02139, USA; Department of Mathematics, MIT, Cambridge, MA 02139, USA.
| | - Jian Peng
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA 02139, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA.
| |
|