1
Zhao X, Wang Q, Zhang Y, He C, Yin M, Zhao X. CBKG-DTI: Multi-Level Knowledge Distillation and Biomedical Knowledge Graph for Drug-Target Interaction Prediction. IEEE J Biomed Health Inform 2025; 29:2284-2296. [PMID: 40030432 DOI: 10.1109/jbhi.2024.3500027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2025]
Abstract
The prediction of drug-target interactions (DTIs) has emerged as a vital step in drug discovery. Recently, biomedical knowledge graphs have enabled the use of multi-omics resources for modelling complex biological systems and have further improved the overall performance of specific predictive tasks. However, because biomedical knowledge graphs are large and general-purpose, it is necessary to capture task-specific knowledge from them for DTI prediction. Moreover, although biomedical knowledge graphs contain rich interactions between biological entities, the structural information of drugs or targets cannot be ignored and still needs to be incorporated through multi-modal fusion. To this end, we develop a novel DTI identification framework, CBKG-DTI, which aims to distill task-specific knowledge from the complex knowledge graph into a lightweight DTI prediction model. Specifically, CBKG-DTI first introduces a hierarchy-aware knowledge graph embedding as the teacher model to capture the semantic hierarchy information of the biomedical knowledge graph. Then, to further improve model performance, CBKG-DTI integrates information from multiple aspects, such as relational and structural information, by constructing a heterogeneous network, and employs a heterogeneous graph attention network as the lightweight student model. Moreover, we design a multi-level distillation mechanism that improves the representation and prediction ability of the lightweight student model by capturing the representations and logit distribution of the teacher model. Finally, we conduct extensive comparison experiments and reach an AUC of 0.9751 and an AUPR of 0.6310 under 5-fold cross-validation. This not only demonstrates the superiority of CBKG-DTI in DTI prediction but also, more importantly, validates the effectiveness of the framework in capturing task-specific knowledge from the biomedical knowledge graph.
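The idea of distilling both representations and logits from a teacher into a student can be illustrated with a minimal sketch. This is a generic teacher-student objective, not the loss published for CBKG-DTI: the function names, the mean-squared-error term for representations, the temperature-softened KL term for logits, and the `alpha` and `temperature` parameters are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def representation_loss(teacher_emb, student_emb):
    """Mean squared error between teacher and student embeddings.

    Assumes both embeddings already have the same dimension
    (a real system would typically project one onto the other).
    """
    return sum((t - s) ** 2 for t, s in zip(teacher_emb, student_emb)) / len(teacher_emb)


def multi_level_distillation_loss(teacher_emb, student_emb,
                                  teacher_logits, student_logits,
                                  temperature=2.0, alpha=0.5):
    """Weighted sum of a representation-level and a logit-level term.

    alpha and temperature are hypothetical hyperparameters.
    """
    rep_term = representation_loss(teacher_emb, student_emb)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    logit_term = kl_divergence(p_teacher, p_student)
    return alpha * rep_term + (1.0 - alpha) * logit_term


# Usage: a student that matches the teacher exactly incurs zero loss.
teacher_embedding = [0.2, -0.1, 0.4]
print(multi_level_distillation_loss(teacher_embedding, teacher_embedding,
                                    [2.0, -1.0], [2.0, -1.0]))
```

In practice such a term would be added to the student's own supervised loss on known DTI labels; the weighting between the two is a design choice the abstract does not specify.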
2
Wang Y, Yin Z. Prediction of miRNA-disease association based on multisource inductive matrix completion. Sci Rep 2024; 14:27503. [PMID: 39528650 PMCID: PMC11555322 DOI: 10.1038/s41598-024-78212-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2024] [Accepted: 10/29/2024] [Indexed: 11/16/2024] Open
Abstract
MicroRNAs (miRNAs) are endogenous non-coding RNAs approximately 23 nucleotides in length, playing significant roles in various cellular processes. Numerous studies have shown that miRNAs are involved in the regulation of many human diseases. Accurate prediction of miRNA-disease associations is crucial for early diagnosis, treatment, and prognosis assessment of diseases. In this paper, we propose the Autoencoder Inductive Matrix Completion (AEIMC) model to identify potential miRNA-disease associations. The model captures interaction features from multiple similarity networks, including miRNA functional similarity, miRNA sequence similarity, disease semantic similarity, disease ontology similarity, and Gaussian interaction kernel similarity between miRNAs and diseases. Autoencoders are used to extract more complex and abstract data representations, which are then input into the inductive matrix completion model for association prediction. The effectiveness of the model is validated through cross-validation, stratified threshold evaluation, and case studies, while ablation experiments further confirm the necessity of introducing sequence and ontology similarities for the first time.
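As an illustration of the Gaussian interaction kernel similarity mentioned in this abstract, the sketch below computes a Gaussian interaction profile kernel from a binary association matrix. It assumes the commonly used form K(i, j) = exp(-γ‖a_i − a_j‖²) with γ set to `gamma_prime` divided by the mean squared norm of the profiles; the abstract does not state AEIMC's exact settings, so the function name and the default `gamma_prime = 1.0` are assumptions.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Each row is one entity's interaction profile, e.g. one miRNA's
    0/1 associations with every disease. Pass the transposed matrix
    to obtain the disease-side kernel.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm  # bandwidth normalised by profile size
    kernel = []
    for i in range(n):
        row = []
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            row.append(math.exp(-gamma * sq_dist))
        kernel.append(row)
    return kernel


# Usage: three miRNAs, three diseases (toy data, not from the paper).
associations = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
for row in gip_kernel(associations):
    print([round(value, 4) for value in row])
```

Entities with identical profiles receive similarity 1, and similarity decays as the profiles diverge.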
Affiliation(s)
- YaWei Wang
- School of Mathematics, Physics and Statistics, Institute for Frontier Medical Technology, Center of Intelligent Computing and Applied Statistics, Shanghai University of Engineering Science, Shanghai, 201620, China
- ZhiXiang Yin
- School of Mathematics, Physics and Statistics, Institute for Frontier Medical Technology, Center of Intelligent Computing and Applied Statistics, Shanghai University of Engineering Science, Shanghai, 201620, China.
3
Zhang P, Lin P, Li D, Wang W, Qi X, Li J, Xiong J. MGACL: Prediction Drug-Protein Interaction Based on Meta-Graph Association-Aware Contrastive Learning. Biomolecules 2024; 14:1267. [PMID: 39456200 PMCID: PMC11505808 DOI: 10.3390/biom14101267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2024] [Revised: 09/20/2024] [Accepted: 10/02/2024] [Indexed: 10/28/2024] Open
Abstract
The identification of drug-target interaction (DTI) is crucial for drug discovery. However, how to reduce the graph neural network's false positives due to its bias and negative transfer in the original bipartite graph remains to be clarified. Considering that the impact of heterogeneous auxiliary information on DTI varies depending on the drug and target, we established an adaptive enhanced personalized meta-knowledge transfer network named Meta Graph Association-Aware Contrastive Learning (MGACL), which can transfer personalized heterogeneous auxiliary information from different nodes and reduce data bias. Meanwhile, we propose a novel DTI association-aware contrastive learning strategy that aligns high-frequency drug representations with learned auxiliary graph representations to prevent negative transfer. Our study improves the DTI prediction performance by about 3%, evaluated by analyzing the area under the curve (AUC) and area under the precision-recall curve (AUPRC) compared with existing methods, which is more conducive to accurately identifying drug targets for the development of new drugs.
Affiliation(s)
- Pinglu Zhang
- Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266003, China; (P.Z.); (W.W.)
- Peng Lin
- Key Laboratory of Marine Drugs, Chinese Ministry of Education, School of Medicine and Pharmacy, Ocean University of China, Qingdao 266003, China; (P.L.); (D.L.); (X.Q.)
- Dehai Li
- Key Laboratory of Marine Drugs, Chinese Ministry of Education, School of Medicine and Pharmacy, Ocean University of China, Qingdao 266003, China; (P.L.); (D.L.); (X.Q.)
- Wanchun Wang
- Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266003, China; (P.Z.); (W.W.)
- Xin Qi
- Key Laboratory of Marine Drugs, Chinese Ministry of Education, School of Medicine and Pharmacy, Ocean University of China, Qingdao 266003, China; (P.L.); (D.L.); (X.Q.)
- Jing Li
- Key Laboratory of Marine Drugs, Chinese Ministry of Education, School of Medicine and Pharmacy, Ocean University of China, Qingdao 266003, China; (P.L.); (D.L.); (X.Q.)
- Jianshe Xiong
- Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266003, China; (P.Z.); (W.W.)
4
Zitnik M, Li MM, Wells A, Glass K, Morselli Gysi D, Krishnan A, Murali TM, Radivojac P, Roy S, Baudot A, Bozdag S, Chen DZ, Cowen L, Devkota K, Gitter A, Gosline SJC, Gu P, Guzzi PH, Huang H, Jiang M, Kesimoglu ZN, Koyuturk M, Ma J, Pico AR, Pržulj N, Przytycka TM, Raphael BJ, Ritz A, Sharan R, Shen Y, Singh M, Slonim DK, Tong H, Yang XH, Yoon BJ, Yu H, Milenković T. Current and future directions in network biology. Bioinformatics Advances 2024; 4:vbae099. [PMID: 39143982 PMCID: PMC11321866 DOI: 10.1093/bioadv/vbae099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2023] [Revised: 05/31/2024] [Accepted: 07/08/2024] [Indexed: 08/16/2024]
Abstract
Summary: Network biology is an interdisciplinary field bridging computational and biological sciences that has proved pivotal in advancing the understanding of cellular functions and diseases across biological systems and scales. Although the field has been around for two decades, it remains nascent. It has witnessed rapid evolution, accompanied by emerging challenges. These stem from various factors, notably the growing complexity and volume of data together with the increased diversity of data types describing different tiers of biological organization. We discuss prevailing research directions in network biology, focusing on molecular/cellular networks but also on other biological network types such as biomedical knowledge graphs, patient similarity networks, brain networks, and social/contact networks relevant to disease spread. In more detail, we highlight areas of inference and comparison of biological networks, multimodal data integration and heterogeneous networks, higher-order network analysis, machine learning on networks, and network-based personalized medicine. Following the overview of recent breakthroughs across these five areas, we offer a perspective on future directions of network biology. Additionally, we discuss scientific communities, educational initiatives, and the importance of fostering diversity within the field. This article establishes a roadmap for an immediate and long-term vision for network biology. Availability and implementation: Not applicable.
Affiliation(s)
- Marinka Zitnik
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
- Michelle M Li
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
- Aydin Wells
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
- Lucy Family Institute for Data and Society, University of Notre Dame, Notre Dame, IN 46556, United States
- Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, United States
- Kimberly Glass
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, United States
- Deisy Morselli Gysi
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, United States
- Department of Statistics, Federal University of Paraná, Curitiba, Paraná 81530-015, Brazil
- Department of Physics, Northeastern University, Boston, MA 02115, United States
- Arjun Krishnan
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, United States
- T M Murali
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States
- Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, United States
- Sushmita Roy
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715, United States
- Wisconsin Institute for Discovery, Madison, WI 53715, United States
- Anaïs Baudot
- Aix Marseille Université, INSERM, MMG, Marseille, France
- Serdar Bozdag
- Department of Computer Science and Engineering, University of North Texas, Denton, TX 76203, United States
- Department of Mathematics, University of North Texas, Denton, TX 76203, United States
- Danny Z Chen
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
- Lenore Cowen
- Department of Computer Science, Tufts University, Medford, MA 02155, United States
- Kapil Devkota
- Department of Computer Science, Tufts University, Medford, MA 02155, United States
- Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715, United States
- Morgridge Institute for Research, Madison, WI 53715, United States
- Sara J C Gosline
- Biological Sciences Division, Pacific Northwest National Laboratory, Seattle, WA 98109, United States
- Pengfei Gu
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
- Pietro H Guzzi
- Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Catanzaro, 88100, Italy
- Heng Huang
- Department of Computer Science, University of Maryland College Park, College Park, MD 20742, United States
- Meng Jiang
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
- Ziynet Nesibe Kesimoglu
- Department of Computer Science and Engineering, University of North Texas, Denton, TX 76203, United States
- National Center of Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20814, United States
- Mehmet Koyuturk
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, United States
- Jian Ma
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Alexander R Pico
- Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA 94158, United States
- Nataša Pržulj
- Department of Computer Science, University College London, London, WC1E 6BT, England
- ICREA, Catalan Institution for Research and Advanced Studies, Barcelona, 08010, Spain
- Barcelona Supercomputing Center (BSC), Barcelona, 08034, Spain
- Teresa M Przytycka
- National Center of Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20814, United States
- Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, NJ 08544, United States
- Anna Ritz
- Department of Biology, Reed College, Portland, OR 97202, United States
- Roded Sharan
- School of Computer Science, Tel Aviv University, Tel Aviv, 69978, Israel
- Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
- Mona Singh
- Department of Computer Science, Princeton University, Princeton, NJ 08544, United States
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, United States
- Donna K Slonim
- Department of Computer Science, Tufts University, Medford, MA 02155, United States
- Hanghang Tong
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Xinan Holly Yang
- Department of Pediatrics, University of Chicago, Chicago, IL 60637, United States
- Byung-Jun Yoon
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
- Computational Science Initiative, Brookhaven National Laboratory, Upton, NY 11973, United States
- Haiyuan Yu
- Department of Computational Biology, Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, United States
- Tijana Milenković
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
- Lucy Family Institute for Data and Society, University of Notre Dame, Notre Dame, IN 46556, United States
- Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, United States
5
Meng Z, Liu S, Liang S, Jani B, Meng Z. Heterogeneous biomedical entity representation learning for gene-disease association prediction. Brief Bioinform 2024; 25:bbae380. [PMID: 39154194 PMCID: PMC11330343 DOI: 10.1093/bib/bbae380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 05/29/2024] [Accepted: 07/22/2024] [Indexed: 08/19/2024] Open
Abstract
Understanding the genetic basis of disease is a fundamental aspect of medical research, as genes are the classic units of heredity and play a crucial role in biological function. Identifying associations between genes and diseases is critical for diagnosis, prevention, prognosis, and drug development. Genes that encode proteins with similar sequences are often implicated in related diseases, as proteins causing identical or similar diseases tend to show limited variation in their sequences. Predicting gene-disease association (GDA) requires time-consuming and expensive experiments on a large number of potential candidate genes. Although methods have been proposed to predict associations between genes and diseases using traditional machine learning algorithms and graph neural networks, these approaches struggle to capture the deep semantic information within the genes and diseases and are dependent on training data. To alleviate this issue, we propose a novel GDA prediction model named FusionGDA, which utilizes a pre-training phase with a fusion module to enrich the gene and disease semantic representations encoded by pre-trained language models. Multi-modal representations are generated by the fusion module, which includes rich semantic information about two heterogeneous biomedical entities: protein sequences and disease descriptions. Subsequently, the pooling aggregation strategy is adopted to compress the dimensions of the multi-modal representation. In addition, FusionGDA employs a pre-training phase leveraging a contrastive learning loss to extract potential gene and disease features by training on a large public GDA dataset. To rigorously evaluate the effectiveness of the FusionGDA model, we conduct comprehensive experiments on five datasets and compare our proposed model with five competitive baseline models on the DisGeNet-Eval dataset. Notably, our case study further demonstrates the ability of FusionGDA to discover hidden associations effectively. 
The complete code and datasets of our experiments are available at https://github.com/ZhaohanM/FusionGDA.
Affiliation(s)
- Zhaohan Meng
- School of Computing Science, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8RZ, UK
- Siwei Liu
- School of Natural and Computing Science, University of Aberdeen King’s College, Aberdeen, AB24 3FX, UK
- Shangsong Liang
- Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence, Building 1B, Masdar City, Abu Dhabi 000000, UAE
- Bhautesh Jani
- School of Computing Science, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8RZ, UK
- Zaiqiao Meng
- School of Computing Science, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8RZ, UK
6
Cao J, Chen Q, Qiu J, Wang Y, Lan W, Du X, Tan K. NGCN: Drug-target interaction prediction by integrating information and feature learning from heterogeneous network. J Cell Mol Med 2024; 28:e18224. [PMID: 38509739 PMCID: PMC10955156 DOI: 10.1111/jcmm.18224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Revised: 02/14/2024] [Accepted: 02/26/2024] [Indexed: 03/22/2024] Open
Abstract
Drug-target interaction (DTI) prediction is essential for new drug design and development. Constructing a heterogeneous network from diverse information about drugs, proteins and diseases provides new opportunities for DTI prediction. However, the inherent complexity, high dimensionality and noise of such a network prevent us from taking full advantage of its characteristics. This article proposes a novel method, NGCN, to predict drug-target interactions from an integrated heterogeneous network, extracting relevant biological properties and association information while maintaining the topology information. Unlike traditional methods, it focuses on learning the low-dimensional topology representation of drugs and targets via a graph-based convolutional neural network to improve the performance of DTI prediction. NGCN achieves substantial performance improvements over other state-of-the-art methods, such as a nearly 1.0% increase in AUPR value. Moreover, we verify the robustness of NGCN through benchmark tests, and the experimental results demonstrate that it is an extensible framework capable of combining heterogeneous information for DTI prediction.
Affiliation(s)
- Junyue Cao
- College of Life Science and Technology, Guangxi University, Nanning, China
- Qingfeng Chen
- School of Computer, Electronics and Information, Guangxi University, Nanning, China
- Junlai Qiu
- School of Computer, Electronics and Information, Guangxi University, Nanning, China
- Yiming Wang
- School of Computer, Electronics and Information, Guangxi University, Nanning, China
- Wei Lan
- School of Computer, Electronics and Information, Guangxi University, Nanning, China
- Xiaojing Du
- School of Computer, Electronics and Information, Guangxi University, Nanning, China
- Kai Tan
- School of Computer, Electronics and Information, Guangxi University, Nanning, China
7
Tayebi J, BabaAli B. EKGDR: An End-to-End Knowledge Graph-Based Method for Computational Drug Repurposing. J Chem Inf Model 2024; 64:1868-1881. [PMID: 38483449 DOI: 10.1021/acs.jcim.3c01925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/26/2024]
Abstract
The lengthy and expensive process of developing new drugs from scratch, coupled with a high failure rate, has prompted the emergence of drug repurposing/repositioning as a more efficient and cost-effective approach. This approach involves identifying new therapeutic applications for existing approved drugs, leveraging the extensive drug-related data already gathered. However, the diversity and heterogeneity of data, along with the limited availability of known drug-disease interactions, pose significant challenges to computational drug design. To address these challenges, this study introduces EKGDR, an end-to-end knowledge graph-based approach for computational drug repurposing. EKGDR utilizes the power of a drug knowledge graph, a comprehensive repository of drug-related information that encompasses known drug interactions and various categorization information, as well as structural molecular descriptors of drugs. EKGDR employs graph neural networks, a cutting-edge graph representation learning technique, to embed the drug knowledge graph (nodes and relations) in an end-to-end manner. By doing so, EKGDR can effectively learn the underlying causes (intents) behind drug-disease interactions and recursively aggregate and combine relational messages between nodes along different multihop neighborhood paths (relational paths). This process generates representations of disease and drug nodes, enabling EKGDR to predict the interaction probability for each drug-disease pair in an end-to-end manner. The obtained results demonstrate that EKGDR outperforms previous models in all three evaluation metrics: area under the receiver operating characteristic curve (AUROC = 0.9475), area under the precision-recall curve (AUPRC = 0.9490), and recall at the top-200 recommendations (Recall@200 = 0.8315). To further validate EKGDR's effectiveness, we evaluated the top-20 candidate drugs suggested for each of Alzheimer's and Parkinson's diseases.
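For readers unfamiliar with the Recall@200 figure quoted above, the sketch below shows one plain way to compute recall at the top-K recommendations. It assumes a single pooled ranking of candidate drug-disease pairs with binary ground-truth labels; the abstract does not say whether EKGDR ranks per disease or over all pairs, so this is an illustrative definition and the function name is hypothetical.

```python
def recall_at_k(scores, labels, k):
    """Fraction of all true positives that appear among the k highest scores.

    scores: predicted interaction probabilities, one per candidate pair.
    labels: 1 for a known interaction, 0 otherwise (same order as scores).
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("recall is undefined without any positive label")
    # Indices of candidates sorted from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / total_positives


# Usage with toy data (not taken from the paper).
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.1]
toy_labels = [1, 0, 1, 0, 1]
print(recall_at_k(toy_scores, toy_labels, k=2))
print(recall_at_k(toy_scores, toy_labels, k=5))
```

Unlike AUROC and AUPRC, this metric only rewards positives that land near the top of the ranking, which matches how a short candidate list would be used in practice.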
Affiliation(s)
- Javad Tayebi
- School of Mathematics, Statistics and Computer Science, University of Tehran, Tehran 141556455, Iran
- Bagher BabaAli
- School of Mathematics, Statistics and Computer Science, University of Tehran, Tehran 141556455, Iran
8
Jin S, Zhang Y, Yu H, Lu M. SADR: Self-Supervised Graph Learning With Adaptive Denoising for Drug Repositioning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:265-277. [PMID: 38190661 DOI: 10.1109/tcbb.2024.3351079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/10/2024]
Abstract
Traditional drug development is often high-risk and time-consuming. A promising alternative is to reuse or reposition approved drugs. Recently, some methods based on graph representation learning have started to be used for drug repositioning. These models learn the low-dimensional embeddings of drug and disease nodes from the drug-disease interaction network to predict the potential association between drugs and diseases. However, these methods have strict requirements for the dataset, and if the dataset is sparse, their performance will be severely affected. At the same time, these methods have poor robustness to noise in the dataset. In response to the above challenges, we propose a drug repositioning model based on self-supervised graph learning with adaptive denoising, called SADR. SADR uses data augmentation and contrastive learning strategies to learn feature representations of nodes, which can effectively solve the problems caused by sparse datasets. SADR includes an adaptive denoising training (ADT) component that can effectively identify noisy data during the training process and remove the impact of noise on the model. We have conducted comprehensive experiments on three datasets and have achieved better prediction accuracy compared to multiple baseline models. At the same time, we propose the top 10 newly predicted approved drugs for treating two diseases. This demonstrates the ability of our model to identify potential drug candidates for disease indications.
9
Luo Y, Liu XY, Yang K, Huang K, Hong M, Zhang J, Wu Y, Nie Z. Toward Unified AI Drug Discovery with Multimodal Knowledge. Health Data Science 2024; 4:0113. [PMID: 38486623 PMCID: PMC10886071 DOI: 10.34133/hds.0113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Accepted: 01/25/2024] [Indexed: 03/17/2024]
Abstract
Background: In real-world drug discovery, human experts typically grasp molecular knowledge of drugs and proteins from multimodal sources including molecular structures, structured knowledge from knowledge bases, and unstructured knowledge from biomedical literature. Existing multimodal approaches in AI drug discovery integrate either structured or unstructured knowledge independently, which compromises the holistic understanding of biomolecules. Besides, they fail to address the missing modality problem, where multimodal information is missing for novel drugs and proteins. Methods: In this work, we present KEDD, a unified, end-to-end deep learning framework that jointly incorporates both structured and unstructured knowledge for vast AI drug discovery tasks. The framework first incorporates independent representation learning models to extract the underlying characteristics from each modality. Then, it applies a feature fusion technique to calculate the prediction results. To mitigate the missing modality problem, we leverage sparse attention and a modality masking technique to reconstruct the missing features based on top relevant molecules. Results: Benefiting from structured and unstructured knowledge, our framework achieves a deeper understanding of biomolecules. KEDD outperforms state-of-the-art models by an average of 5.2% on drug-target interaction prediction, 2.6% on drug property prediction, 1.2% on drug-drug interaction prediction, and 4.1% on protein-protein interaction prediction. Through qualitative analysis, we reveal KEDD's promising potential in assisting real-world applications. Conclusions: By incorporating biomolecular expertise from multimodal knowledge, KEDD bears promise in accelerating drug discovery.
Affiliation(s)
- Yizhen Luo
- Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
- Department of Computer Science and Technology, Tsinghua University, Beijing, China
- Xing Yi Liu
- Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
- Kai Yang
- Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
- Kui Huang
- Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
- School of Software and Microelectronics, Peking University, Beijing, China
- Massimo Hong
- Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
- Department of Computer Science and Technology, Tsinghua University, Beijing, China
- Jiahuan Zhang
- Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
- Yushuai Wu
- Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
- Zaiqing Nie
- Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
- Beijing Academy of Artificial Intelligence (BAAI), Beijing, China
10
Wei J, Lu L, Shen T. Predicting drug-protein interactions by preserving the graph information of multi source data. BMC Bioinformatics 2024; 25:10. [PMID: 38177981 PMCID: PMC10768380 DOI: 10.1186/s12859-023-05620-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/13/2023] [Accepted: 12/15/2023] [Indexed: 01/06/2024] Open
Abstract
Examining potential drug-target interactions (DTIs) is a pivotal component of drug discovery and repurposing. Recently, there has been a significant rise in the use of computational techniques to predict DTIs. Nevertheless, previous investigations have predominantly concentrated on assessing either the connections between nodes or the consistency of the network's topological structure in isolation. Such one-sided approaches could severely hinder the accuracy of DTI predictions. In this study, we propose a novel method called TTGCN, which combines heterogeneous graph convolutional neural networks (GCN) and graph attention networks (GAT) to address the task of DTI prediction. TTGCN employs a two-tiered feature learning strategy, utilizing GAT and residual GCN (R-GCN) to extract drug and target embeddings from the diverse network, respectively. These drug and target embeddings are then fused through a mean-pooling layer. Finally, we employ an inductive matrix completion technique to forecast DTIs while preserving the network's node connectivity and topological structure. Our approach demonstrates superior performance in terms of area under the curve and area under the precision-recall curve in experimental comparisons, highlighting its significant advantages in predicting DTIs. Furthermore, case studies provide additional evidence of its ability to identify potential DTIs.
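The inductive matrix completion step named in this abstract can be sketched in its standard bilinear form, where the score of a drug-target pair is xᵀ W Hᵀ y for a drug feature vector x and a target feature vector y. This is a generic illustration, not TTGCN's implementation: the abstract gives no formula, so the function name, the factor matrices `W` and `H`, and the toy values are assumptions, and in a real model the factors would be learned from known interactions.

```python
def imc_score(drug_features, target_features, W, H):
    """Bilinear inductive matrix completion score x^T W H^T y.

    drug_features:   list of length f_d (e.g. a learned drug embedding).
    target_features: list of length f_t (e.g. a learned target embedding).
    W: f_d x k factor matrix, H: f_t x k factor matrix (lists of rows).
    """
    rank = len(W[0])
    if len(H[0]) != rank:
        raise ValueError("W and H must share the same latent rank k")
    # Project each side into the shared k-dimensional latent space.
    drug_latent = [sum(x * W[i][k] for i, x in enumerate(drug_features))
                   for k in range(rank)]
    target_latent = [sum(y * H[j][k] for j, y in enumerate(target_features))
                     for k in range(rank)]
    # The interaction score is the inner product of the two projections.
    return sum(d * t for d, t in zip(drug_latent, target_latent))


# Usage with toy, hand-picked factors (not learned, not from the paper).
W = [[1.0, 0.0],
     [0.0, 1.0]]
H = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
print(imc_score([1.0, 2.0], [1.0, 0.0, 1.0], W, H))
```

Because the score depends only on feature vectors, the same factors can score drugs or targets that were absent from the training interactions, which is what makes the completion inductive.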
Affiliation(s)
- Jiahao Wei
- School of Mathematical Sciences, Guizhou Normal University, Guiyang, 550025, China
- Linzhang Lu
- School of Mathematical Sciences, Guizhou Normal University, Guiyang, 550025, China.
- School of Mathematical Sciences, Xiamen University, Xiamen, 361005, China.
- Tie Shen
- Key Laboratory of Information and Computing Science Guizhou Province, Guizhou Normal University, Guizhou, 550001, China.
11
Sharma D, Xu W. ReGeNNe: genetic pathway-based deep neural network using canonical correlation regularizer for disease prediction. Bioinformatics 2023; 39:btad679. [PMID: 37963055 PMCID: PMC10666205 DOI: 10.1093/bioinformatics/btad679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 10/06/2023] [Accepted: 11/13/2023] [Indexed: 11/16/2023] Open
Abstract
MOTIVATION Common human diseases result from the interplay of genes and their biologically associated pathways. Genetic pathway analyses provide more biological insight than conventional gene-based analysis. In this article, we propose a framework combining genetic data into pathway structure and using an ensemble of convolutional neural networks (CNNs) along with a Canonical Correlation Regularizer layer for comprehensive prediction of disease risk. The novelty of our approach lies in our two-step framework: (i) utilizing the CNN's effectiveness to extract the complex gene associations within individual genetic pathways and (ii) fusing features from the ensemble of CNNs through a Canonical Correlation Regularization layer to incorporate the interactions between pathways which share common genes. During prediction, we also address the important issues of interpretability of neural network models, and identifying the pathways and genes playing an important role in prediction. RESULTS Applying our methodology to three real cancer genetic datasets for different prediction tasks validates our model's generalizability and robustness. Compared with conventional models, our methodology provides consistently better performance, with AUC improvements of 11% on predicting early/late-stage kidney cancer, 10% on predicting kidney versus liver cancer type and 7% on predicting survival status in ovarian cancer over the next best conventional machine learning model. The robust performance of our deep learning algorithm indicates that disease prediction using neural networks in multiple functionally related genes across different pathways improves genetic data-based prediction and the understanding of molecular mechanisms of diseases. AVAILABILITY AND IMPLEMENTATION https://github.com/divya031090/ReGeNNe.
Affiliation(s)
- Divya Sharma
- Biostatistics Department, Princess Margaret Cancer Center, University Health Network, Toronto, ON M5G2C4, Canada
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
| | - Wei Xu
- Biostatistics Department, Princess Margaret Cancer Center, University Health Network, Toronto, ON M5G2C4, Canada
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
| |
|
12
|
Sheng N, Huang L, Gao L, Cao Y, Xie X, Wang Y. A Survey of Computational Methods and Databases for lncRNA-MiRNA Interaction Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2810-2826. [PMID: 37030713 DOI: 10.1109/tcbb.2023.3264254] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
Long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) are two prevalent non-coding RNAs in current research. They play critical regulatory roles in the life processes of animals and plants. Studies have shown that lncRNAs can interact with miRNAs to participate in post-transcriptional regulatory processes, mainly involved in regulating cancer development, metastatic progression, and drug resistance. Additionally, these interactions have significant effects on plant growth, development, and responses to biotic and abiotic stresses. Deciphering the potential relationships between lncRNAs and miRNAs may provide new insights into our understanding of the biological functions of lncRNAs and miRNAs, and the pathogenesis of complex diseases. However, gathering information on lncRNA-miRNA interactions (LMIs) through biological experiments is expensive and time-consuming. With the accumulation of multi-omics data, computational models are extremely attractive for systematically exploring potential LMIs. To the best of our knowledge, this is the first comprehensive review of computational methods for identifying LMIs. Specifically, we first summarized the available public databases for predicting animal and plant LMIs. Second, we comprehensively reviewed the computational methods for predicting LMIs and classified them into two categories: network-based methods and sequence-based methods. Third, we analyzed the standard evaluation methods and metrics used in LMI prediction. Finally, we pointed out some problems in current studies and discussed future research directions. Relevant databases and the latest advances in LMI prediction are summarized in a GitHub repository https://github.com/sheng-n/lncRNA-miRNA-interaction-methods, and we'll keep it updated.
|
13
|
Chen P, Zheng H. Drug-target interaction prediction based on spatial consistency constraint and graph convolutional autoencoder. BMC Bioinformatics 2023; 24:151. [PMID: 37069493 PMCID: PMC10109239 DOI: 10.1186/s12859-023-05275-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2023] [Accepted: 04/05/2023] [Indexed: 04/19/2023] Open
Abstract
BACKGROUND Drug-target interaction (DTI) prediction plays an important role in drug discovery and repositioning. However, most of the computational methods used for identifying relevant DTIs do not consider the invariance of the nearest neighbour relationships between drugs or targets. In other words, they do not take into account the invariance of the topological relationships between nodes during representation learning. This may limit the performance of DTI prediction methods. RESULTS Here, we propose a novel graph convolutional autoencoder-based model, named SDGAE, to predict DTIs. As the graph convolutional network cannot handle isolated nodes in a network, a pre-processing step was applied to reduce the number of isolated nodes in the heterogeneous network and facilitate effective exploitation of the graph convolutional network. By maintaining the graph structure during representation learning, the nearest neighbour relationships between nodes in the embedding space remained as close as possible to those in the original space. CONCLUSIONS Overall, we demonstrated that SDGAE can automatically learn more informative and robust feature vectors of drugs and targets, thus exhibiting significantly improved predictive accuracy for DTIs.
Affiliation(s)
- Peng Chen
- School of Computer Science and Technology, University of Science and Technology of China, Jinzhai Road 96, Hefei, 230027, People's Republic of China
- Anhui Key Laboratory of Software Engineering in Computing and Communication, University of Science and Technology of China, Jinzhai Road 96, Hefei, 230027, People's Republic of China
| | - Haoran Zheng
- School of Computer Science and Technology, University of Science and Technology of China, Jinzhai Road 96, Hefei, 230027, People's Republic of China.
- Anhui Key Laboratory of Software Engineering in Computing and Communication, University of Science and Technology of China, Jinzhai Road 96, Hefei, 230027, People's Republic of China.
| |
|
14
|
Wang Z, Gu Y, Zheng S, Yang L, Li J. MGREL: A multi-graph representation learning-based ensemble learning method for gene-disease association prediction. Comput Biol Med 2023; 155:106642. [PMID: 36805231 DOI: 10.1016/j.compbiomed.2023.106642] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2022] [Revised: 01/15/2023] [Accepted: 02/05/2023] [Indexed: 02/12/2023]
Abstract
The identification of gene-disease associations plays an important role in the exploration of pathogenic mechanisms and therapeutic targets. Computational methods have been regarded as an effective way to discover potential gene-disease associations in recent years. However, most of them ignored the combination of abundant genetic and therapeutic information with gene-disease network topology. To this end, we re-organized the current gene-disease association benchmark dataset by extracting the newest gene-disease associations from the OMIM database. Then, we developed a multi-graph representation learning-based ensemble model, named MGREL, to predict gene-disease associations. MGREL integrated two feature generation channels to extract gene and disease features, including a knowledge extraction channel which learned high-order representations from genetic and therapeutic information, and a graph learning channel which acquired network topological representations through multiple advanced graph representation learning methods. Then, an ensemble learning method with 5 machine learning models was used as the classifier to predict the gene-disease association. Comprehensive experiments have demonstrated the significant performance achieved by MGREL compared to 5 state-of-the-art methods. For the major measurements (AUC = 0.925, AUPR = 0.935), the relative improvements of MGREL compared to the suboptimal methods are 3.24% and 2.75%, respectively. MGREL also achieved impressive improvements in the challenging tasks of predicting potential associations for unknown genes/diseases. In addition, case studies implied potential applications for MGREL in the discovery of potential therapeutic targets.
Affiliation(s)
- Ziyang Wang
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, 100020, China
| | - Yaowen Gu
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, 100020, China
| | - Si Zheng
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, 100020, China; Institute for Artificial Intelligence, Department of Computer Science and Technology, BNRist, Tsinghua University, Beijing, 100084, China
| | - Lin Yang
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, 100020, China
| | - Jiao Li
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, 100020, China.
| |
|
15
|
DRaW: prediction of COVID-19 antivirals by deep learning-an objection on using matrix factorization. BMC Bioinformatics 2023; 24:52. [PMID: 36793010 PMCID: PMC9931173 DOI: 10.1186/s12859-023-05181-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2022] [Accepted: 02/09/2023] [Indexed: 02/17/2023] Open
Abstract
BACKGROUND Due to the high resource consumption of introducing a new drug, drug repurposing plays an essential role in drug discovery. To do this, researchers examine current drug-target interactions (DTIs) to predict new interactions for approved drugs. Matrix factorization methods have received much attention and are widely used in DTI prediction. However, they suffer from some drawbacks. METHODS We explain why matrix factorization is not the best choice for DTI prediction. Then, we propose a deep learning model (DRaW) to predict DTIs without input data leakage. We compare our model with several matrix factorization methods and a deep model on three COVID-19 datasets. In addition, to validate DRaW, we evaluate it on benchmark datasets. Furthermore, as an external validation, we conduct a docking study on the drugs recommended for COVID-19. RESULTS In all cases, the results confirm that DRaW outperforms matrix factorization and deep models. The docking results support the top-ranked drugs recommended for COVID-19. CONCLUSIONS In this paper, we show that matrix factorization may not be the best choice for DTI prediction. Matrix factorization methods suffer from some intrinsic issues, e.g., sparsity in the domain of bioinformatics applications and the fixed, unchanging size of the matrix-related paradigm. Therefore, we propose an alternative method (DRaW) that uses feature vectors rather than matrix factorization and demonstrates better performance than other well-known methods on three COVID-19 and four benchmark datasets.
|
16
|
Mai TT. From Bilinear Regression to Inductive Matrix Completion: A Quasi-Bayesian Analysis. ENTROPY (BASEL, SWITZERLAND) 2023; 25:333. [PMID: 36832699 PMCID: PMC9955477 DOI: 10.3390/e25020333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/27/2023] [Revised: 02/08/2023] [Accepted: 02/09/2023] [Indexed: 06/18/2023]
Abstract
In this paper, we study the problem of bilinear regression, a type of statistical modeling that deals with multiple variables and multiple responses. One of the main difficulties that arise in this problem is the presence of missing data in the response matrix, a problem known as inductive matrix completion. To address these issues, we propose a novel approach that combines elements of Bayesian statistics with a quasi-likelihood method. Our proposed method starts by addressing the problem of bilinear regression using a quasi-Bayesian approach. The quasi-likelihood method that we employ in this step allows us to handle the complex relationships between the variables in a more robust way. Next, we adapt our approach to the context of inductive matrix completion. We make use of a low-rankness assumption and leverage the powerful PAC-Bayes bound technique to provide statistical properties for our proposed estimators and for the quasi-posteriors. To compute the estimators, we propose a Langevin Monte Carlo method to obtain approximate solutions to the problem of inductive matrix completion in a computationally efficient manner. To demonstrate the effectiveness of our proposed methods, we conduct a series of numerical studies. These studies allow us to evaluate the performance of our estimators under different conditions and provide a clear illustration of the strengths and limitations of our approach.
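For orientation, the two problems named in this abstract can be written down in a generic form. The notation below (responses Y, covariate matrices X and Z, coefficient matrix M, noise E, observed index set Ω, rank bound r) is assumed for illustration and may differ from the paper's own symbols.

```latex
% Bilinear regression: every response entry is observed.
Y = X M Z^{\top} + E,
\qquad Y \in \mathbb{R}^{n \times q},\;
X \in \mathbb{R}^{n \times p},\;
Z \in \mathbb{R}^{q \times k},\;
M \in \mathbb{R}^{p \times k}.

% Inductive matrix completion: only entries indexed by \Omega are observed,
% and M is assumed to be low-rank.
Y_{ij} = \left( X M Z^{\top} \right)_{ij} + E_{ij}
\quad \text{for } (i,j) \in \Omega,
\qquad \operatorname{rank}(M) \le r .
```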
Affiliation(s)
- The Tien Mai
- Department of Mathematical Sciences, Norwegian University of Science and Technology, 7034 Trondheim, Norway
| |
|
17
|
Ren ZH, You ZH, Zou Q, Yu CQ, Ma YF, Guan YJ, You HR, Wang XF, Pan J. DeepMPF: deep learning framework for predicting drug-target interactions based on multi-modal representation with meta-path semantic analysis. J Transl Med 2023; 21:48. [PMID: 36698208 PMCID: PMC9876420 DOI: 10.1186/s12967-023-03876-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2022] [Accepted: 01/05/2023] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND Drug-target interaction (DTI) prediction has become a crucial prerequisite in drug design and drug discovery. However, traditional biological experiments are time-consuming and expensive, as there are abundant complex interactions in the large genomic and chemical spaces. To alleviate this problem, many computational methods have been developed to complement biological experiments and narrow the search space to a preferred candidate domain. However, most previous approaches cannot fully consider the semantic information of association behavior based on several schemas to represent the complex structure of heterogeneous biological networks. Additionally, the prediction of DTI based on single modalities cannot satisfy the demand for prediction accuracy. METHODS We propose a multi-modal representation framework, 'DeepMPF', based on meta-path semantic analysis, which effectively utilizes heterogeneous information to predict DTI. Specifically, we first construct protein-drug-disease heterogeneous networks composed of three entities. Then the feature information is obtained under three views: sequence modality, heterogeneous structure modality and similarity modality. We propose six representative meta-path schemas to preserve the high-order nonlinear structure and capture hidden structural information of the heterogeneous network. Finally, DeepMPF generates highly representative comprehensive feature descriptors and calculates the probability of interaction through joint learning. RESULTS To evaluate the predictive performance of DeepMPF, comparison experiments are conducted on four gold standard datasets. Our method obtains competitive performance on all datasets. We also explore the influence of different feature embedding dimensions, learning strategies and classification methods. Notably, the drug repositioning experiments on COVID-19 and HIV demonstrate that DeepMPF can be applied to real-world problems and help drug discovery. Further analysis with molecular docking experiments enhances the credibility of the drug candidates predicted by DeepMPF. CONCLUSIONS All the results demonstrate the effective predictive capability of DeepMPF for drug-target interactions. It can be utilized as a useful tool to prescreen the most promising drug candidates for a protein. The web server of the DeepMPF predictor is freely available at http://120.77.11.78/DeepMPF/ , which can help relevant researchers in further study.
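The meta-path idea in this abstract can be made concrete with a small sketch. It assumes the textbook construction in which the number of path instances along a schema such as drug-protein-disease is obtained by multiplying adjacency matrices; the toy matrices and variable names are hypothetical and do not reproduce the six schemas used by DeepMPF.

```python
import numpy as np

# Toy heterogeneous network (hypothetical): 2 drugs, 2 proteins, 3 diseases.
drug_protein = np.array([[1, 0],
                         [1, 1]])        # drug i interacts with protein j
protein_disease = np.array([[1, 1, 0],
                            [0, 1, 1]])  # protein j is associated with disease k

# Meta-path drug -> protein -> disease: entry (i, k) counts the path instances.
drug_disease_paths = drug_protein @ protein_disease

# Meta-path drug -> protein -> drug: entry (i, i') counts shared proteins.
drug_drug_paths = drug_protein @ drug_protein.T
```

Path-count matrices of this kind are often normalised and used as structural features for a downstream classifier.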
Affiliation(s)
- Zhong-Hao Ren
- School of Information Engineering, Xijing University, Xi’an, 710100 China
| | - Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129 China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054 China
| | - Chang-Qing Yu
- School of Information Engineering, Xijing University, Xi’an, 710100 China
| | - Yan-Fang Ma
- Department of Galactophore, The Third People’s Hospital of Gansu Province, Lanzhou, 730020 China
| | - Yong-Jian Guan
- School of Information Engineering, Xijing University, Xi’an, 710100 China
| | - Hai-Ru You
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129 China
| | - Xin-Fei Wang
- School of Information Engineering, Xijing University, Xi’an, 710100 China
| | - Jie Pan
- School of Information Engineering, Xijing University, Xi’an, 710100 China
| |
|
18
|
Wang H, Wang X, Liu W, Xie X, Peng S. deepDGA: Biomedical Heterogeneous Network-based Deep Learning Framework for Disease-Gene Association Predictions. 2022 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM) 2022:601-606. [DOI: 10.1109/bibm55620.2022.9995651] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/03/2025]
Affiliation(s)
- Hong Wang
- Hunan University, College of Computer Science and Electronic Engineering, Changsha, China
| | - Xiaoqi Wang
- Hunan University, College of Computer Science and Electronic Engineering, Changsha, China
| | - Wenjuan Liu
- Hunan University, College of Computer Science and Electronic Engineering, Changsha, China
| | - Xiaolan Xie
- Guilin University of Technology, College of Information Science and Engineering, Guilin, China
| | - Shaoliang Peng
- Hunan University, College of Computer Science and Electronic Engineering, Changsha, China
| |
|
19
|
Huang L, Zhang L, Chen X. Updated review of advances in microRNAs and complex diseases: towards systematic evaluation of computational models. Brief Bioinform 2022; 23:6712303. [PMID: 36151749 DOI: 10.1093/bib/bbac407] [Citation(s) in RCA: 58] [Impact Index Per Article: 19.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2022] [Revised: 08/11/2022] [Accepted: 08/20/2022] [Indexed: 12/14/2022] Open
Abstract
Currently, there exist no generally accepted strategies of evaluating computational models for microRNA-disease associations (MDAs). Though K-fold cross validations and case studies seem to be must-have procedures, the value of K, the evaluation metrics, and the choice of query diseases as well as the inclusion of other procedures (such as parameter sensitivity tests, ablation studies and computational cost reports) are all determined on a case-by-case basis and depending on the researchers' choices. In the current review, we include a comprehensive analysis on how 29 state-of-the-art models for predicting MDAs were evaluated. Based on the analytical results, we recommend a feasible evaluation workflow that would suit any future model to facilitate fair and systematic assessment of predictive performance.
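The K-fold cross-validation procedure discussed in this abstract can be sketched as follows. This is a minimal illustration, not the workflow recommended by the review: the fold splitting is unstratified, AUC is computed from its pairwise-ranking definition, and the `fit` and `predict` callables are hypothetical placeholders for whatever model is being evaluated.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size))

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(features, labels, fit, predict, k=5, seed=0):
    """Return one AUC per fold; each fold is held out once for testing."""
    folds = k_fold_indices(len(labels), k, seed)
    aucs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(features[train_idx], labels[train_idx])
        aucs.append(auc_score(labels[test_idx], predict(model, features[test_idx])))
    return aucs
```

Reporting the per-fold values (or their mean and standard deviation) together with K and the random seed makes results easier to compare across papers. Each test fold must contain both classes for the AUC to be defined, so stratified splitting is the safer choice for highly imbalanced association data.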
Affiliation(s)
- Li Huang
- Academy of Arts and Design, Tsinghua University, Beijing, 10084, China; The Future Laboratory, Tsinghua University, Beijing, 10084, China
| | - Li Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
| | - Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China; Artificial Intelligence Research Institute, China University of Mining and Technology, Xuzhou, 221116, China
| |
|
20
|
Ye C, Swiers R, Bonner S, Barrett I. A Knowledge Graph-Enhanced Tensor Factorisation Model for Discovering Drug Targets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3070-3080. [PMID: 35939454 DOI: 10.1109/tcbb.2022.3197320] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
The drug discovery and development process is a long and expensive one, costing over 1 billion USD on average per drug and taking 10-15 years. To reduce the high levels of attrition throughout the process, there has been a growing interest in applying machine learning methodologies to various stages of drug discovery and development in the recent decade, especially at the earliest stage - identification of druggable disease genes. In this paper, we have developed a new tensor factorisation model to predict potential drug targets (genes or proteins) for treating diseases. We created a three-dimensional data tensor consisting of 1,048 gene targets, 860 diseases and 230,011 evidence attributes and clinical outcomes connecting them, using data extracted from the Open Targets and PharmaProjects databases. We enriched the data with gene target representations learned from a drug discovery-oriented knowledge graph and applied our proposed method to predict the clinical outcomes for unseen gene target and disease pairs. We designed three evaluation strategies to measure the prediction performance and benchmarked several commonly used machine learning classifiers together with Bayesian matrix and tensor factorisation methods. The result shows that incorporating knowledge graph embeddings significantly improves the prediction accuracy and that training tensor factorisation alongside a dense neural network outperforms all other baselines. In summary, our framework combines two actively studied machine learning approaches to disease target identification, namely tensor factorisation and knowledge graph representation learning, which could be a promising avenue for further exploration in data-driven drug discovery.
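To illustrate what a factorised gene-disease-attribute tensor looks like, the sketch below assumes a plain CP (canonical polyadic) factorisation, in which each tensor entry is a sum over latent components of the product of three factor values. This is an assumed, generic form and not necessarily the model proposed in the paper; all sizes and factor matrices are hypothetical.

```python
import numpy as np

def cp_scores(gene_factors, disease_factors, attribute_factors):
    """Reconstruct a 3-way tensor: T[g, d, a] = sum_r G[g, r] * D[d, r] * A[a, r]."""
    return np.einsum("gr,dr,ar->gda", gene_factors, disease_factors, attribute_factors)

# Hypothetical small example: 4 gene targets, 3 diseases, 2 attributes, rank 2.
rng = np.random.default_rng(0)
G = rng.normal(size=(4, 2))  # could be initialised from knowledge graph embeddings
D = rng.normal(size=(3, 2))
A = rng.normal(size=(2, 2))

tensor = cp_scores(G, D, A)  # predicted value for every (gene, disease, attribute)
```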
|
21
|
Zhang Y, Wang Y, Li X, Liu Y, Chen M. Identifying lncRNA–disease association based on GAT multiple-operator aggregation and inductive matrix completion. Front Genet 2022; 13:1029300. [DOI: 10.3389/fgene.2022.1029300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2022] [Accepted: 10/03/2022] [Indexed: 11/13/2022] Open
Abstract
Computational models, as a fundamental alternative to traditional biological experiments, have been applied to inferring lncRNA–disease associations (LDAs) for many years, without the limitations of being time-consuming and laborious. However, the sparsity inherent in known heterogeneous bio-data is an obstacle to further improving the prediction accuracy of computational models. Therefore, a new computational model composed of multiple mechanisms for lncRNA–disease association (MM-LDA) prediction was proposed, based on the fusion of the graph attention network (GAT) and inductive matrix completion (IMC). MM-LDA has two key steps to improve prediction accuracy: first, a multiple-operator aggregation was designed in the n-heads attention mechanism of the GAT. With this step, features of lncRNA nodes and disease nodes were enhanced. Second, IMC was applied to the enhanced node features obtained in the first step, and then the LDA network was reconstructed to solve the cold start problem that arises when an entire row or column of the known association matrix is missing. Our MM-LDA achieved the following progress: first, using the Adam optimizer, which adaptively adjusts the learning rate, increased convergence speed and helped avoid local optima. Second, better predictive ability was achieved than with other similar models (with an AUC value of 0.9395 and an AUPR value of 0.8057 obtained from 5-fold cross-validation). Third, the time cost was 6.45% lower than that of the advanced model GAMCLDA. In short, our MM-LDA achieved a more comprehensive prediction performance in terms of prediction accuracy and time cost.
|
22
|
Mariappan R, Jayagopal A, Sien HZ, Rajan V. Neural Collective Matrix Factorization for integrated analysis of heterogeneous biomedical data. Bioinformatics 2022; 38:4554-4561. [PMID: 35929808 DOI: 10.1093/bioinformatics/btac543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2022] [Revised: 06/30/2022] [Accepted: 08/03/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION In many biomedical studies, there arises the need to integrate data from multiple directly or indirectly related sources. Collective matrix factorization (CMF) and its variants are models designed to collectively learn from arbitrary collections of matrices. The latent factors learnt are rich integrative representations that can be used in downstream tasks, such as clustering or relation prediction with standard machine-learning models. Previous CMF-based methods have numerous modeling limitations. They do not adequately capture complex non-linear interactions and do not explicitly model varying sparsity and noise levels in the inputs, and some cannot model inputs with multiple datatypes. These inadequacies limit their use on many biomedical datasets. RESULTS To address these limitations, we develop Neural Collective Matrix Factorization (NCMF), the first fully neural approach to CMF. We evaluate NCMF on relation prediction tasks of gene-disease association prediction and adverse drug event prediction, using multiple datasets. In each case, data are obtained from heterogeneous publicly available databases and used to learn representations to build predictive models. NCMF is found to outperform previous CMF-based methods and several state-of-the-art graph embedding methods for representation learning in our experiments. Our experiments illustrate the versatility and efficacy of NCMF in representation learning for seamless integration of heterogeneous data. AVAILABILITY AND IMPLEMENTATION https://github.com/ajayago/NCMF_bioinformatics. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ragunathan Mariappan
- Department of Information Systems and Analytics, School of Computing, National University of Singapore, Singapore 117417, Singapore
| | - Aishwarya Jayagopal
- Department of Information Systems and Analytics, School of Computing, National University of Singapore, Singapore 117417, Singapore
| | - Ho Zong Sien
- Department of Information Systems and Analytics, School of Computing, National University of Singapore, Singapore 117417, Singapore
| | - Vaibhav Rajan
- Department of Information Systems and Analytics, School of Computing, National University of Singapore, Singapore 117417, Singapore
| |
|
23
|
Wang L, Wong L, Li Z, Huang Y, Su X, Zhao B, You Z. A machine learning framework based on multi-source feature fusion for circRNA-disease association prediction. Brief Bioinform 2022; 23:6693603. [PMID: 36070867 DOI: 10.1093/bib/bbac388] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2022] [Revised: 07/26/2022] [Accepted: 08/11/2022] [Indexed: 11/14/2022] Open
Abstract
Circular RNAs (circRNAs) are involved in the regulatory mechanisms of multiple complex diseases, and the identification of their associations is critical to the diagnosis and treatment of diseases. In recent years, many computational methods have been designed to predict circRNA-disease associations. However, most of the existing methods rely on single correlation data. Here, we propose a machine learning framework for circRNA-disease association prediction, called MLCDA, which effectively fuses multiple sources of heterogeneous information including circRNA sequences and disease ontology. Comprehensive evaluation in the gold standard dataset showed that MLCDA can successfully capture the complex relationships between circRNAs and diseases and accurately predict their potential associations. In addition, the results of case studies on real data show that MLCDA significantly outperforms other existing methods. MLCDA can serve as a useful tool for circRNA-disease association prediction, providing mechanistic insights for disease research and thus facilitating the progress of disease treatment.
Affiliation(s)
- Lei Wang
- Big Data and Intelligent Computing Research Center, Guangxi Academy of Sciences, Nanning, 530007, China
| | - Leon Wong
- Big Data and Intelligent Computing Research Center, Guangxi Academy of Sciences, Nanning, 530007, China
| | - Zhengwei Li
- Big Data and Intelligent Computing Research Center, Guangxi Academy of Sciences, Nanning, 530007, China
| | - Yuan Huang
- Department of Computing, Hong Kong Polytechnic University, Hong Kong 999077, China
| | - Xiaorui Su
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
| | - Bowei Zhao
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
| | - Zhuhong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
| |
|
24
|
Ren ZH, You ZH, Yu CQ, Li LP, Guan YJ, Guo LX, Pan J. A biomedical knowledge graph-based method for drug-drug interactions prediction through combining local and global features with deep neural networks. Brief Bioinform 2022; 23:6692550. [PMID: 36070624 DOI: 10.1093/bib/bbac363] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2022] [Revised: 07/23/2022] [Accepted: 08/02/2022] [Indexed: 11/12/2022] Open
Abstract
Drug-drug interactions (DDIs) prediction is a challenging task in drug development and clinical application. Due to the extremely large complete set of all possible DDIs, computer-aided DDIs prediction methods are getting lots of attention in the pharmaceutical industry and academia. However, most existing computational methods only use single perspective information and few of them conduct the task based on the biomedical knowledge graph (BKG), which can provide more detailed and comprehensive drug lateral side information flow. To this end, a deep learning framework, namely DeepLGF, is proposed to fully exploit BKG fusing local-global information to improve the performance of DDIs prediction. More specifically, DeepLGF first obtains chemical local information on drug sequence semantics through a natural language processing algorithm. Then a model of BFGNN based on graph neural network is proposed to extract biological local information on drug through learning embedding vector from different biological functional spaces. The global feature information is extracted from the BKG by our knowledge graph embedding method. In DeepLGF, for fusing local-global features well, we designed four aggregating methods to explore the most suitable ones. Finally, the advanced fusing feature vectors are fed into deep neural network to train and predict. To evaluate the prediction performance of DeepLGF, we tested our method in three prediction tasks and compared it with state-of-the-art models. In addition, case studies of three cancer-related and COVID-19-related drugs further demonstrated DeepLGF's superior ability for potential DDIs prediction. The webserver of the DeepLGF predictor is freely available at http://120.77.11.78/DeepLGF/.
Affiliation(s)
- Zhong-Hao Ren
  - School of Information Engineering, Xijing University, Xi'an 710100, China
  - School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
- Zhu-Hong You
  - School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
- Chang-Qing Yu
  - School of Information Engineering, Xijing University, Xi'an 710100, China
- Li-Ping Li
  - College of Grassland and Environment Sciences, Xinjiang Agricultural University, Urumqi 830052, China
- Yong-Jian Guan
  - School of Information Engineering, Xijing University, Xi'an 710100, China
- Lu-Xiang Guo
  - School of Information Engineering, Xijing University, Xi'an 710100, China
- Jie Pan
  - School of Information Engineering, Xijing University, Xi'an 710100, China
|
25
|
Huang L, Zhang L, Chen X. Updated review of advances in microRNAs and complex diseases: taxonomy, trends and challenges of computational models. Brief Bioinform 2022; 23:6686738. [PMID: 36056743 DOI: 10.1093/bib/bbac358] [Citation(s) in RCA: 68] [Impact Index Per Article: 22.7] [Received: 06/20/2022] [Revised: 07/24/2022] [Accepted: 07/30/2022] [Indexed: 12/12/2022]
Abstract
Since the problem was proposed in the late 2000s, microRNA-disease association (MDA) predictions have been implemented based on the data fusion paradigm. Integrating diverse data sources gains a more comprehensive research perspective, and brings a challenge to algorithm design for generating accurate, concise and consistent representations of the fused data. After more than a decade of research progress, a relatively simple algorithm like the score function or a single computation layer may no longer be sufficient for further improving predictive performance. Advanced model design has become more frequent in recent years, particularly in the form of reasonably combining multiple algorithms, a process known as model fusion. In the current review, we present 29 state-of-the-art models and introduce the taxonomy of computational models for MDA prediction based on model fusion and non-fusion. The new taxonomy exhibits notable changes in the algorithmic architecture of models, compared with that of earlier ones in the 2017 review by Chen et al. Moreover, we discuss the progress that has been made towards overcoming the obstacles to effective MDA prediction since 2017 and elaborate on how future models can be designed according to a set of new schemas. Lastly, we analyse the strengths and weaknesses of each model category in the proposed taxonomy and propose future research directions from diverse perspectives for enhancing model performance.
Affiliation(s)
- Li Huang
  - Academy of Arts and Design, Tsinghua University, Beijing, 10084, China
  - The Future Laboratory, Tsinghua University, Beijing, 10084, China
- Li Zhang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
- Xing Chen
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
  - Artificial Intelligence Research Institute, China University of Mining and Technology, Xuzhou, 221116, China
|
26
|
Hu S, Zhang B, Lv H, Chang F, Zhou C, Wu L, Zou G. Improving Network Representation Learning via Dynamic Random Walk, Self-Attention and Vertex Attributes-Driven Laplacian Space Optimization. Entropy (Basel, Switzerland) 2022; 24:1213. [PMID: 36141099 PMCID: PMC9498033 DOI: 10.3390/e24091213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Revised: 08/25/2022] [Accepted: 08/26/2022] [Indexed: 06/16/2023]
Abstract
Network data analysis is a crucial method for mining complicated object interactions. In recent years, random walk and neural-language-model-based network representation learning (NRL) approaches have been widely used for network data analysis. However, these NRL approaches suffer from the following deficiencies: firstly, because the random walk procedure is based on symmetric node similarity and fixed probability distribution, the sampled vertices' sequences may lose local community structure information; secondly, because the feature extraction capacity of the shallow neural language model is limited, they can only extract the local structural features of networks; and thirdly, these approaches require specially designed mechanisms for different downstream tasks to integrate vertex attributes of various types. We conducted an in-depth investigation to address the aforementioned issues and propose a novel general NRL framework called dynamic structure and vertex attribute fusion network embedding, which firstly defines an asymmetric similarity and h-hop dynamic random walk strategy to guide the random walk process to preserve the network's local community structure in walked vertex sequences. Next, we train a self-attention-based sequence prediction model on the walked vertex sequences to simultaneously learn the vertices' local and global structural features. Finally, we introduce an attributes-driven Laplacian space optimization to converge the process of structural feature extraction and attribute feature extraction. The proposed approach is exhaustively evaluated by means of node visualization and classification on multiple benchmark datasets, and achieves superior results compared to baseline approaches.
Affiliation(s)
- Shengxiang Hu
  - School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
- Bofeng Zhang
  - School of Computer and Information Engineering, Shanghai Polytechnic University, Shanghai 201209, China
  - School of Computer Science and Technology, Kashi University, Kashi 844008, China
- Hehe Lv
  - School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
- Furong Chang
  - School of Information Engineering, Yangzhou Polytechnic Institute, Yangzhou 225127, China
- Chenyang Zhou
  - School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
- Liangrui Wu
  - School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
- Guobing Zou
  - School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
|
27
|
Chen J, Gong Z, Wang W, Liu W, Dong X. CRL: Collaborative Representation Learning by Coordinating Topic Modeling and Network Embeddings. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:3765-3777. [PMID: 33566768 DOI: 10.1109/tnnls.2021.3054422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Network representation learning (NRL) has shown its effectiveness in many tasks, such as vertex classification, link prediction, and community detection. In many applications, vertices of social networks contain textual information, e.g., citation networks, which form a text corpus and can be applied to the typical representation learning methods. The global context in the text corpus can be utilized by topic models to discover the topic structures of vertices. Nevertheless, most existing NRL approaches focus on learning representations from the local neighbors of vertices and ignore the global structure of the associated textual information in networks. In this article, we propose a unified model based on matrix factorization (MF), named collaborative representation learning (CRL), which: 1) considers complementary global and local information simultaneously and 2) models topics and learns network embeddings collaboratively. Moreover, we incorporate the Fletcher-Reeves (FR) MF, a conjugate gradient method, to optimize the embedding matrices in an alternative mode. We call this parameter learning method AFR; it can achieve convergence after a small number of iterations. Also, by evaluating CRL on topic coherence and vertex classification using several real-world data sets, our experimental study shows that this collaborative model not only can improve the performance of topic discovery over the baseline topic models but also can learn better network representations than the state-of-the-art context-aware NRL models.
|
28
|
Yang M, Huang ZA, Gu W, Han K, Pan W, Yang X, Zhu Z. Prediction of biomarker-disease associations based on graph attention network and text representation. Brief Bioinform 2022; 23:6651308. [PMID: 35901464 DOI: 10.1093/bib/bbac298] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/03/2022] [Revised: 06/28/2022] [Accepted: 06/30/2022] [Indexed: 02/06/2023]
Abstract
MOTIVATION: The associations between biomarkers and human diseases play a key role in understanding complex pathology and developing targeted therapies. Wet lab experiments for biomarker discovery are costly, laborious and time-consuming. Computational prediction methods can be used to greatly expedite the identification of candidate biomarkers. RESULTS: Here, we present a novel computational model named GTGenie for predicting the biomarker-disease associations based on graph and text features. In GTGenie, a graph attention network is utilized to characterize diverse similarities of biomarkers and diseases from heterogeneous information resources. Meanwhile, a pretrained BERT-based model is applied to learn the text-based representation of biomarker-disease relation from biomedical literature. The captured graph and text features are then integrated in a bimodal fusion network to model the hybrid entity representation. Finally, inductive matrix completion is adopted to infer the missing entries for reconstructing relation matrix, with which the unknown biomarker-disease associations are predicted. Experimental results on HMDD, HMDAD and LncRNADisease data sets showed that GTGenie can obtain competitive prediction performance with other state-of-the-art methods. AVAILABILITY: The source code of GTGenie and the test data are available at: https://github.com/Wolverinerine/GTGenie.
Affiliation(s)
- Minghao Yang
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518000, China
- Zhi-An Huang
  - Center for Computer Science and Information Technology, City University of Hong Kong Dongguan Research Institute, Dongguan, China
- Wenhao Gu
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518000, China
  - GeneGenieDx Corp, 160 E Tasman Dr, San Jose, CA 95134
- Kun Han
  - GeneGenieDx Corp, 160 E Tasman Dr, San Jose, CA 95134
- Wenying Pan
  - GeneGenieDx Corp, 160 E Tasman Dr, San Jose, CA 95134
- Xiao Yang
  - GeneGenieDx Corp, 160 E Tasman Dr, San Jose, CA 95134
- Zexuan Zhu
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518000, China
|
29
|
Luo Q, Yu D, Maradapu Vera Venkata Sai A, Cai Z, Cheng X. A survey of structural representation learning for social networks. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
30
|
Luo J, Ouyang W, Shen C, Cai J. Multi-relation graph embedding for predicting miRNA-target gene interactions by integrating gene sequence information. IEEE J Biomed Health Inform 2022; 26:4345-4353. [PMID: 35439150 DOI: 10.1109/jbhi.2022.3168008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Abstract
Accumulated studies have found that miRNAs are in charge of many complex diseases such as cancers by modulating gene expression. Predicting miRNA-target interactions is beneficial for uncovering the crucial roles of miRNAs in regulating target genes and the progression of diseases. The emergence of large-scale genomic and biological data as well as the recent development in heterogeneous networks provides new opportunities for miRNA target identification. Compared with conventional methods, computational methods become a decent solution for high efficiency. Thus, designing a method that could excavate valid information from the heterogeneous network and gene sequences is in great demand for improving the prediction accuracy. In this study, we proposed a graph-based model named MRMTI for the prediction of miRNA-target interactions. MRMTI utilized the multi-relation graph convolution module and the Bi-LSTM module to incorporate both network topology and sequential information. The learned embeddings of miRNAs and genes were then used to calculate the prediction scores of miRNA-target pairs. Comparisons with other state-of-the-art graph embedding methods and existing bioinformatic tools illustrated the superiority of MRMTI under multiple criteria metrics. Three variants of MRMTI implied the positive effect of multi-relation. The experimental results of case studies further demonstrated the prominent ability of MRMTI in predicting novel associations.
|
31
|
Krämer A, Green J, Billaud JN, Pasare NA, Jones M, Tugendreich S. Mining hidden knowledge: embedding models of cause-effect relationships curated from the biomedical literature. Bioinformatics Advances 2022; 2:vbac022. [PMID: 36699407 PMCID: PMC9710590 DOI: 10.1093/bioadv/vbac022] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/05/2022] [Revised: 03/04/2022] [Accepted: 04/06/2022] [Indexed: 01/28/2023]
Abstract
Motivation: We explore the use of literature-curated signed causal gene expression and gene-function relationships to construct unsupervised embeddings of genes, biological functions and diseases. Our goal is to prioritize and predict activating and inhibiting functional associations of genes and to discover hidden relationships between functions. As an application, we are particularly interested in the automatic construction of networks that capture relevant biology in a given disease context. Results: We evaluated several unsupervised gene embedding models leveraging literature-curated signed causal gene expression findings. Using linear regression, we show that, based on these gene embeddings, gene-function relationships can be predicted with about 95% precision for the highest scoring genes. Function embedding vectors, derived from parameters of the linear regression model, allow inference of relationships between different functions or diseases. We show for several diseases that gene and function embeddings can be used to recover key drivers of pathogenesis, as well as underlying cellular and physiological processes. These results are presented as disease-centric networks of genes and functions. To illustrate the applicability of our approach to other machine learning tasks, we also computed embeddings for drug molecules, which were then tested using a simple neural network to predict drug-disease associations. Availability and implementation: Python implementations of the gene and function embedding algorithms operating on a subset of our literature-curated content as well as other code used for this paper are made available as part of the Supplementary data. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Jeff Green
  - QIAGEN Digital Insights, Redwood City, CA 94063, USA
- Martin Jones
  - QIAGEN Digital Insights, Redwood City, CA 94063, USA
|
32
|
Shao K, Zhang Y, Wen Y, Zhang Z, He S, Bo X. DTI-HETA: prediction of drug-target interactions based on GCN and GAT on heterogeneous graph. Brief Bioinform 2022; 23:6563180. [PMID: 35380622 DOI: 10.1093/bib/bbac109] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 12/27/2021] [Revised: 02/14/2022] [Accepted: 03/03/2022] [Indexed: 12/19/2022]
Abstract
Drug-target interaction (DTI) prediction plays an important role in drug repositioning, drug discovery and drug design. However, due to the large size of the chemical and genomic spaces and the complex interactions between drugs and targets, experimental identification of DTIs is costly and time-consuming. In recent years, the emerging graph neural network (GNN) has been applied to DTI prediction because DTIs can be represented effectively using graphs. However, some of these methods are only based on homogeneous graphs, and some consist of two decoupled steps that cannot be trained jointly. To further explore GNN-based DTI prediction by integrating heterogeneous graph information, this study regards DTI prediction as a link prediction problem and proposes an end-to-end model based on HETerogeneous graph with Attention mechanism (DTI-HETA). In this model, a heterogeneous graph is first constructed based on the drug-drug and target-target similarity matrices and the DTI matrix. Then, the graph convolutional neural network is utilized to obtain the embedded representation of the drugs and targets. To highlight the contribution of different neighborhood nodes to the central node in aggregating the graph convolution information, a graph attention mechanism is introduced into the node embedding process. Afterward, an inner product decoder is applied to predict DTIs. To evaluate the performance of DTI-HETA, experiments are conducted on two datasets. The experimental results show that our model is superior to the state-of-the-art methods. Also, the identification of novel DTIs indicates that DTI-HETA can serve as a powerful tool for integrating heterogeneous graph information to predict DTIs.
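As a rough illustration of the graph construction and decoding steps this abstract describes, the sketch below stacks a drug-drug similarity matrix, a target-target similarity matrix and a DTI matrix into one heterogeneous adjacency matrix, then scores drug-target pairs with an inner product decoder. The block layout, the function names `build_hetero_adjacency` and `inner_product_decoder`, the sigmoid on the decoder output and the random stand-in embeddings are all assumptions made for illustration; they are not taken from the DTI-HETA paper or its code, and the GCN and attention layers are omitted.

```python
import numpy as np


def build_hetero_adjacency(drug_sim, target_sim, dti):
    """Assumed block layout: [[drug_sim, dti], [dti.T, target_sim]]."""
    n_drugs, n_targets = dti.shape
    assert drug_sim.shape == (n_drugs, n_drugs)
    assert target_sim.shape == (n_targets, n_targets)
    top = np.hstack([drug_sim, dti])
    bottom = np.hstack([dti.T, target_sim])
    return np.vstack([top, bottom])


def inner_product_decoder(drug_emb, target_emb):
    """Score every drug-target pair as sigmoid(<drug embedding, target embedding>)."""
    logits = drug_emb @ target_emb.T
    return 1.0 / (1.0 + np.exp(-logits))


# Toy data: 2 drugs, 3 targets (values are made up).
drug_sim = np.array([[1.0, 0.2],
                     [0.2, 1.0]])
target_sim = np.array([[1.0, 0.5, 0.1],
                       [0.5, 1.0, 0.3],
                       [0.1, 0.3, 1.0]])
dti = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 1.0]])

adj = build_hetero_adjacency(drug_sim, target_sim, dti)

# Random embeddings stand in for the output of the (omitted) graph encoder.
rng = np.random.default_rng(0)
node_emb = rng.normal(size=(adj.shape[0], 4))
scores = inner_product_decoder(node_emb[:2], node_emb[2:])
```

In this layout the first rows and columns index drugs and the remaining ones index targets, so `scores[i, j]` is the predicted interaction probability for drug `i` and target `j`.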
Affiliation(s)
- Yuqi Wen
  - Beijing Institute of Radiation Medicine, Beijing, China
- Song He
  - Beijing Institute of Radiation Medicine, Beijing, China
- Xiaochen Bo
  - Beijing Institute of Radiation Medicine, Beijing, China
|
33
|
Shang Y, Ye X, Futamura Y, Yu L, Sakurai T. Multiview network embedding for drug-target Interactions prediction by consistent and complementary information preserving. Brief Bioinform 2022; 23:6544850. [PMID: 35262678 DOI: 10.1093/bib/bbac059] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/30/2021] [Revised: 02/01/2022] [Accepted: 02/06/2022] [Indexed: 01/02/2023]
Abstract
Accurate prediction of drug-target interactions (DTIs) can reduce the cost and time of drug repositioning and drug discovery. Many current methods integrate information from multiple data sources of drug and target to improve DTIs prediction accuracy. However, these methods do not consider the complex relationship between different data sources. In this study, we propose a novel computational framework, called MccDTI, to predict the potential DTIs by multiview network embedding, which can integrate the heterogenous information of drug and target. MccDTI learns high-quality low-dimensional representations of drug and target by preserving the consistent and complementary information between multiview networks. Then MccDTI adopts matrix completion scheme for DTIs prediction based on drug and target representations. Experimental results on two datasets show that the prediction accuracy of MccDTI outperforms four state-of-the-art methods for DTIs prediction. Moreover, literature verification for DTIs prediction shows that MccDTI can predict the reliable potential DTIs. These results indicate that MccDTI can provide a powerful tool to predict new DTIs and accelerate drug discovery. The code and data are available at: https://github.com/ShangCS/MccDTI.
Affiliation(s)
- Yifan Shang
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Xiucai Ye
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Yasunori Futamura
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi, 710071, China
- Tetsuya Sakurai
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
|
34
|
Li J, Wang J, Lv H, Zhang Z, Wang Z. IMCHGAN: Inductive Matrix Completion With Heterogeneous Graph Attention Networks for Drug-Target Interactions Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:655-665. [PMID: 34115592 DOI: 10.1109/tcbb.2021.3088614] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Indexed: 06/12/2023]
Abstract
Identification of targets among known drugs plays an important role in drug repurposing and discovery. Computational approaches for prediction of drug-target interactions (DTIs) are highly desired in comparison to traditional biological experiments because they are fast and inexpensive. Moreover, recent advances in systems biology approaches have generated large-scale heterogeneous biological information network data, which offer opportunities for machine learning-based identification of DTIs. We present a novel Inductive Matrix Completion with Heterogeneous Graph Attention Network approach (IMCHGAN) for predicting DTIs. IMCHGAN first adopts a two-level neural attention mechanism to learn drug and target latent feature representations from the DTI heterogeneous network, respectively. Then, the learned latent features are fed into the Inductive Matrix Completion (IMC) prediction score model, which computes the best projection from drug space onto target space and outputs the DTI score via the inner product of projected drug and target feature representations. IMCHGAN is an end-to-end neural network learning framework where the parameters of both the prediction score model and the feature representation learning model are simultaneously optimized via backpropagation under the supervision of the observed known drug-target interaction data. We compare IMCHGAN with other state-of-the-art baselines on two real DTI experimental datasets. The results show that our method is superior to existing methods in terms of AUC and AUPR. Moreover, IMCHGAN also shows strong predictive power for novel (unknown) DTIs. All datasets and code can be obtained from https://github.com/ljatynu/IMCHGAN/.
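The scoring step described in this abstract, a projection from drug space onto target space followed by an inner product, can be sketched as a bilinear form. This is a minimal sketch under stated assumptions: the function name `imc_scores`, the feature dimensions and the random matrices are hypothetical stand-ins, and in the method as described the features and the projection are learned jointly by backpropagation rather than sampled.

```python
import numpy as np


def imc_scores(drug_feats, target_feats, projection):
    """Bilinear score: project drug features into target space, then take inner products."""
    projected = drug_feats @ projection   # shape (n_drugs, target_dim)
    return projected @ target_feats.T     # shape (n_drugs, n_targets)


# Hypothetical sizes: 3 drugs with 4-dim features, 5 targets with 6-dim features.
rng = np.random.default_rng(0)
drug_feats = rng.normal(size=(3, 4))
target_feats = rng.normal(size=(5, 6))
projection = rng.normal(size=(4, 6))      # stands in for the learned projection matrix

scores = imc_scores(drug_feats, target_feats, projection)
```

Each entry `scores[i, j]` equals `drug_feats[i] @ projection @ target_feats[j]`, so unseen drugs or targets can be scored from their features alone, which is what makes the completion inductive.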
|
35
|
Wang S, Li J, Wang Y. M2PP: a novel computational model for predicting drug-targeted pathogenic proteins. BMC Bioinformatics 2022; 23:7. [PMID: 34983358 PMCID: PMC8728953 DOI: 10.1186/s12859-021-04522-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2021] [Accepted: 12/07/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Detecting pathogenic proteins is a primary way to understand disease mechanisms and resist the invasion of diseases, making pathogenic protein prediction an urgent problem. Genome-wide protein prediction is not necessarily conducive to rapidly curing diseases, because developing new drugs specifically for a predicted pathogenic protein always requires major expenditures of time and cost. To facilitate disease treatment, computational methods that predict pathogenic proteins targeted by existing drugs should be developed. RESULTS: In this study, we proposed a novel computational model, named M2PP, to predict drug-targeted pathogenic proteins. Three types of features were derived from our constructed heterogeneous network (including target proteins, diseases and drugs), based on neighborhood similarity information, drug-inferred information and path information. Then, a random forest regression model was trained to score unconfirmed target-disease pairs. A five-fold cross-validation experiment was conducted to evaluate the model's prediction performance, in which M2PP achieved advantageous results compared with other state-of-the-art methods. In addition, M2PP accurately predicted highly ranked pathogenic proteins for common diseases, with public biomedical literature as supporting evidence, indicating its excellent ability. CONCLUSIONS: M2PP is an effective and accurate model for predicting drug-targeted pathogenic proteins, which could provide convenience for future biological research.
Affiliation(s)
- Shiming Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001, China
- Jie Li
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001, China
|
36
|
Predicting miRNA-Disease Association Based on Neural Inductive Matrix Completion with Graph Autoencoders and Self-Attention Mechanism. Biomolecules 2022; 12:biom12010064. [PMID: 35053212 PMCID: PMC8774034 DOI: 10.3390/biom12010064] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 12/14/2021] [Revised: 12/29/2021] [Accepted: 12/31/2021] [Indexed: 02/06/2023]
Abstract
Many studies have clarified that microRNAs (miRNAs) are associated with many human diseases. Therefore, it is essential to predict potential miRNA-disease associations for disease pathogenesis and treatment. Numerous machine learning and deep learning approaches have been applied to this problem. In this paper, we propose a Neural Inductive Matrix completion-based method with Graph Autoencoders (GAE) and Self-Attention mechanism for miRNA-disease associations prediction (NIMGSA). Some of the previous works based on matrix completion ignore the importance of the label propagation procedure for inferring miRNA-disease associations, while others cannot integrate matrix completion and label propagation effectively. Unlike previous studies, NIMGSA unifies inductive matrix completion and label propagation via a neural network architecture, through the collaborative training of two graph autoencoders. This neural inductive matrix completion-based method is also an implementation of the self-attention mechanism for miRNA-disease associations prediction. This end-to-end framework can strengthen the robustness and preciseness of both matrix completion and label propagation. Cross validations indicate that NIMGSA outperforms current miRNA-disease prediction methods. Case studies demonstrate that NIMGSA is competent in detecting potential miRNA-disease associations.
|
37
|
Du J, Lin D, Yuan R, Chen X, Liu X, Yan J. Graph Embedding Based Novel Gene Discovery Associated With Diabetes Mellitus. Front Genet 2021; 12:779186. [PMID: 34899863 PMCID: PMC8657768 DOI: 10.3389/fgene.2021.779186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2021] [Accepted: 10/20/2021] [Indexed: 11/25/2022]
Abstract
Diabetes mellitus is a group of complex metabolic disorders which has affected hundreds of millions of patients worldwide. The underlying pathogenesis of various types of diabetes is still unclear, which hinders the development of more efficient therapies. Although many genes have been found to be associated with diabetes mellitus, more novel genes still need to be discovered to form a complete picture of the underlying mechanism. With the development of complex molecular networks, network-based disease-gene prediction methods have been widely proposed. However, most existing methods are based on the hypothesis of guilt-by-association and often handcraft node features based on local topological structures. Advances in graph embedding techniques have enabled automatic global feature extraction from molecular networks. Inspired by the successful applications of cutting-edge graph embedding methods on complex diseases, we proposed a computational framework to investigate novel genes associated with diabetes mellitus. There are three main steps in the framework: network feature extraction based on graph embedding methods; feature denoising and regeneration using stacked autoencoder; and disease-gene prediction based on machine learning classifiers. We compared the performance by using different graph embedding methods and machine learning classifiers and designed the best workflow for predicting genes associated with diabetes mellitus. Functional enrichment analysis based on Human Phenotype Ontology (HPO), KEGG, and GO biological process and publication search further evaluated the predicted novel genes.
Affiliation(s)
- Jing Yan
  - Zhejiang Hospital, Hangzhou, China
  - Zhejiang Provincial Key Lab of Geriatrics, Zhejiang Hospital, Hangzhou, China
|
38
|
Amara A, Hadj Taieb MA, Ben Aouicha M. Network representation learning systematic review: Ancestors and current development state. Machine Learning with Applications 2021. [DOI: 10.1016/j.mlwa.2021.100130] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/26/2022]
|
39
|
Ou-Yang L, Lu F, Zhang ZC, Wu M. Matrix factorization for biomedical link prediction and scRNA-seq data imputation: an empirical survey. Brief Bioinform 2021; 23:6447434. [PMID: 34864871 DOI: 10.1093/bib/bbab479] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/09/2021] [Revised: 09/25/2021] [Accepted: 10/18/2021] [Indexed: 02/02/2023]
Abstract
Advances in high-throughput experimental technologies promote the accumulation of vast amounts of biomedical data. Biomedical link prediction and single-cell RNA-sequencing (scRNA-seq) data imputation are two essential tasks in biomedical data analyses, which can facilitate various downstream studies and gain insights into the mechanisms of complex diseases. Both tasks can be transformed into matrix completion problems. For a variety of matrix completion tasks, matrix factorization has shown promising performance. However, the sparseness and high dimensionality of biomedical networks and scRNA-seq data have raised new challenges. To resolve these issues, various matrix factorization methods have emerged recently. In this paper, we present a comprehensive review of such matrix factorization methods and their usage in biomedical link prediction and scRNA-seq data imputation. Moreover, we select representative matrix factorization methods and conduct a systematic empirical comparison on 15 real data sets to evaluate their performance under different scenarios. By summarizing the experimental results, we provide general guidelines for selecting matrix factorization methods for different biomedical matrix completion tasks and point out some future directions to further improve the performance for biomedical link prediction and scRNA-seq data imputation.
Affiliation(s)
- Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China; Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, 518172, China
| | - Fan Lu
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
| | - Zi-Chao Zhang
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, 200433, China
| | - Min Wu
- Institute for Infocomm Research (I2R), A*STAR, 138632, Singapore
| |
|
40
|
Ruan D, Ji S, Yan C, Zhu J, Zhao X, Yang Y, Gao Y, Zou C, Dai Q. Exploring complex and heterogeneous correlations on hypergraph for the prediction of drug-target interactions. Patterns 2021; 2:100390. [PMID: 34950907 PMCID: PMC8672193 DOI: 10.1016/j.patter.2021.100390] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/11/2021] [Revised: 07/23/2021] [Accepted: 10/21/2021] [Indexed: 01/04/2023]
Abstract
The continuous emergence of drug-target interaction data provides an opportunity to construct a biological network for systematically discovering unknown interactions. However, this is challenging due to complex and heterogeneous correlations between drug and target. Here, we describe a heterogeneous hypergraph-based framework for drug-target interaction (HHDTI) predictions by modeling biological networks through a hypergraph, where each vertex represents a drug or a target and a hyperedge indicates existing similar interactions or associations between the connected vertices. The hypergraph is then trained to generate suitably structured embeddings for discovering unknown interactions. Comprehensive experiments performed on four public datasets demonstrate that HHDTI achieves significant and consistently improved predictions compared with state-of-the-art methods. Our analysis indicates that this superior performance is due to the ability to integrate heterogeneous high-order information from the hypergraph learning. These results suggest that HHDTI is a scalable and practical tool for uncovering novel drug-target interactions.
- A hypergraph framework to model high-order correlations in heterogeneous biological networks
- An embedding learning method for drugs and targets using hypergraphs
- High-order correlation between drugs and targets can contribute to DTI predictions
The prediction of drug-target interactions (DTIs) plays a crucial role in drug discovery. In this work, we discover that the high-order correlations in heterogeneous biological networks are essential for DTI predictions. The hypergraph structure is utilized to model the high-order correlations in the biological networks, and embeddings are then generated for the drugs and targets, respectively. Finally, the interaction between them can be predicted according to the similarity of the embeddings. Our proposed method has been evaluated on multiple public datasets, and the improved performance demonstrates that the high-order correlations among drugs and targets contribute significantly to DTI predictions, and that other associations besides DTIs are also useful in this task. Our method can also be used in other scenarios containing complex correlations.
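As a rough illustration of the hypergraph modelling described in this abstract, the sketch below builds a vertex-by-hyperedge incidence matrix from a binary drug-target matrix. The construction rule (one hyperedge per target, containing the drugs known to interact with it), the toy matrix `Y`, and the function name are assumptions for illustration only and are not taken from the HHDTI implementation.

```python
import numpy as np

def incidence_from_interactions(Y):
    """Build a drug-by-hyperedge incidence matrix from a binary drug x target matrix.

    Assumed rule: each target with at least one known drug defines one
    hyperedge that connects all drugs interacting with that target.
    """
    keep = Y.sum(axis=0) > 0          # drop targets with no known drug
    return Y[:, keep].astype(float)

# Hypothetical interactions: 4 drugs x 3 targets.
Y = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 0],
              [0, 0, 0]])

H = incidence_from_interactions(Y)
vertex_degree = H.sum(axis=1)         # number of hyperedges each drug belongs to
edge_degree = H.sum(axis=0)           # number of drugs inside each hyperedge
co_membership = H @ H.T               # shared hyperedges between pairs of drugs
```

A hyperedge can connect any number of vertices at once, which is what lets this representation carry higher-order information than a pairwise drug-drug similarity graph.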
Affiliation(s)
- Ding Ruan
- School of Automation, Hangzhou Dianzi University, Hangzhou, China
| | - Shuyi Ji
- School of Software, KLISS, BNRist, Tsinghua University, Beijing, China
- Institute for Brain and Cognitive Sciences, Tsinghua University, Beijing, China
| | - Chenggang Yan
- School of Automation, Hangzhou Dianzi University, Hangzhou, China
| | - Junjie Zhu
- School of Software, KLISS, BNRist, Tsinghua University, Beijing, China
| | - Xibin Zhao
- School of Software, KLISS, BNRist, Tsinghua University, Beijing, China
| | - Yuedong Yang
- School of Computer Science, Sun Yat-sen University, Guangzhou, China
| | - Yue Gao
- School of Software, KLISS, BNRist, Tsinghua University, Beijing, China
- Institute for Brain and Cognitive Sciences, Tsinghua University, Beijing, China
- Corresponding author
| | - Changqing Zou
- Huawei Vancouver Research Center, Huawei Canada Technologies, Vancouver, Canada
- Corresponding author
| | - Qionghai Dai
- Institute for Brain and Cognitive Sciences, Tsinghua University, Beijing, China
- Department of Automation, Tsinghua University, Beijing, China
- Corresponding author
| |
|
41
|
Abstract
Link prediction is a paradigmatic problem in network science, which aims at estimating the existence likelihoods of nonobserved links, based on known topology. After a brief introduction of the standard problem and evaluation metrics of link prediction, this review will summarize representative progresses about local similarity indices, link predictability, network embedding, matrix completion, ensemble learning, and some others, mainly extracted from related publications in the last decade. Finally, this review will outline some long-standing challenges for future studies.
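Two of the local similarity indices commonly used for this problem, common neighbours and the Jaccard index, can be sketched in a few lines. The adjacency dictionary and the function name below are hypothetical, and the review itself covers many more indices than these two.

```python
from itertools import combinations

def local_scores(adj):
    """Score every unlinked node pair by (common neighbours, Jaccard index).

    adj : dict mapping each node to the set of its neighbours (undirected graph)
    """
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue                      # already linked, nothing to predict
        common = adj[u] & adj[v]
        union = adj[u] | adj[v]
        jaccard = len(common) / len(union) if union else 0.0
        scores[(u, v)] = (len(common), jaccard)
    return scores

# Toy undirected graph (hypothetical).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
scores = local_scores(adj)
# Candidate links can then be ranked by either index, highest first.
ranked = sorted(scores, key=lambda pair: scores[pair], reverse=True)
```

Both indices use only the immediate neighbourhood of the two endpoints, which is why they are cheap to compute on large networks.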
Affiliation(s)
- Tao Zhou
- CompleX Lab, University of Electronic Science and Technology of China, Chengdu 611731, People’s Republic of China
| |
|
42
|
Cao H, Zhang L, Jin B, Cheng S, Wei X, Che C. Enriching limited information on rare diseases from heterogeneous networks for drug repositioning. BMC Med Inform Decis Mak 2021; 21:304. [PMID: 34789254 PMCID: PMC8596891 DOI: 10.1186/s12911-021-01664-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/09/2021] [Accepted: 10/11/2021] [Indexed: 11/10/2022]
Abstract
Background: Historical data on rare diseases are very scarce in reality, so performing drug repositioning for rare diseases is a great challenge. Most existing drug repositioning methods for rare diseases neglect father–son information, so it is extremely difficult to predict drugs for rare diseases. Method: In this paper, we focus on father–son information mining for rare diseases. We propose the GRU-Cooperation-Attention-Network (GCAN) to predict drugs for rare diseases. We construct two heterogeneous networks for information enhancement: one network contains the father-nodes of the rare disease and the other contains the son-node information. To bridge the two heterogeneous networks, we set a mapping to connect them. Furthermore, we use a biased random walk mechanism to collect information smoothly from the two heterogeneous networks, and employ a cooperation attention mechanism to enhance the repositioning ability of the network. Result: Compared with traditional methods, GCAN makes full use of father–son information. The experimental results on real drug data from hospitals show that GCAN outperforms state-of-the-art machine learning methods for drug repositioning. Conclusion: The performance of GCAN for drug repositioning is mainly limited by the insufficient scale and poor quality of the data. In future work, we will focus on how to utilize more data, such as drug molecule information and protein molecule information, for the drug repositioning of rare diseases.
Affiliation(s)
- Hongkui Cao
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116622, China
| | - Liang Zhang
- International Business College, Dongbei University of Finance and Economics, Dalian, 116025, China
| | - Bo Jin
- School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian, 116024, China
| | - Shicheng Cheng
- School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian, 116024, China
| | - Xiaopeng Wei
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116622, China; School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
| | - Chao Che
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116622, China.
| |
|
43
|
Zhang X, Wang W, Ren CX, Dai DQ. Learning representation for multiple biological networks via a robust graph regularized integration approach. Brief Bioinform 2021; 23:6381251. [PMID: 34607360 DOI: 10.1093/bib/bbab409] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/23/2021] [Revised: 08/23/2021] [Accepted: 09/06/2021] [Indexed: 01/18/2023]
Abstract
Learning node representation is a fundamental problem in biological network analysis, as compact representation features reveal complicated network structures and carry useful information for downstream tasks such as link prediction and node classification. Recently, multiple networks that profile objects from different aspects are increasingly accumulated, providing the opportunity to learn objects from multiple perspectives. However, the complex common and specific information across different networks pose challenges to node representation methods. Moreover, ubiquitous noise in networks calls for more robust representation. To deal with these problems, we present a representation learning method for multiple biological networks. First, we accommodate the noise and spurious edges in networks using denoised diffusion, providing robust connectivity structures for the subsequent representation learning. Then, we introduce a graph regularized integration model to combine refined networks and compute common representation features. By using the regularized decomposition technique, the proposed model can effectively preserve the common structural property of different networks and simultaneously accommodate their specific information, leading to a consistent representation. A simulation study shows the superiority of the proposed method on different levels of noisy networks. Three network-based inference tasks, including drug-target interaction prediction, gene function identification and fine-grained species categorization, are conducted using representation features learned from our method. Biological networks at different scales and levels of sparsity are involved. Experimental results on real-world data show that the proposed method has robust performance compared with alternatives. Overall, by eliminating noise and integrating effectively, the proposed method is able to learn useful representations from multiple biological networks.
Affiliation(s)
- Xiwen Zhang
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
| | - Weiwen Wang
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
| | - Chuan-Xian Ren
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
| | - Dao-Qing Dai
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
| |
|
44
|
Basher ARMA, Mclaughlin RJ, Hallam SJ. Metabolic Pathway Prediction Using Non-Negative Matrix Factorization with Improved Precision. J Comput Biol 2021; 28:1075-1103. [PMID: 34520674 DOI: 10.1089/cmb.2021.0258] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/28/2022]
Abstract
Machine learning provides a probabilistic framework for metabolic pathway inference from genomic sequence information at different levels of complexity and completion. However, several challenges, including pathway features engineering, multiple mapping of enzymatic reactions, and emergent or distributed metabolism within populations or communities of cells, can limit prediction performance. In this article, we present triUMPF (triple non-negative matrix factorization [NMF] with community detection for metabolic pathway inference), which combines three stages of NMF to capture myriad relationships between enzymes and pathways within a graph network. This is followed by community detection to extract a higher-order structure based on the clustering of vertices that share similar statistical properties. We evaluated triUMPF performance by using experimental datasets manifesting diverse multi-label properties, including Tier 1 genomes from the BioCyc collection of organismal Pathway/Genome Databases and low complexity microbial communities. Resulting performance metrics equaled or exceeded other prediction methods on organismal genomes with improved precision on multi-organismal datasets.
Affiliation(s)
- Abdur Rahman M A Basher
- Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, Vancouver, British Columbia, Canada
| | - Ryan J Mclaughlin
- Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, Vancouver, British Columbia, Canada
| | - Steven J Hallam
- Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, Vancouver, British Columbia, Canada; Department of Microbiology & Immunology, University of British Columbia, Vancouver, British Columbia, Canada; Genome Science and Technology Program, University of British Columbia, Vancouver, British Columbia, Canada; Life Sciences Institute, University of British Columbia, Vancouver, British Columbia, Canada; ECOSCOPE Training Program, University of British Columbia, Vancouver, British Columbia, Canada
| |
|
45
|
Yang H, Ding Y, Tang J, Guo F. Identifying potential association on gene-disease network via dual hypergraph regularized least squares. BMC Genomics 2021; 22:605. [PMID: 34372777 PMCID: PMC8351363 DOI: 10.1186/s12864-021-07864-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/09/2021] [Accepted: 06/29/2021] [Indexed: 12/27/2022]
Abstract
BACKGROUND: Identifying potential associations between genes and diseases via biomedical experiments is time-consuming and expensive. Computational technologies based on machine learning models have been widely utilized to explore genetic information related to complex diseases. Importantly, gene-disease association detection can be defined as a link prediction problem in a bipartite network. However, many existing methods do not utilize multiple sources of biological information; additionally, they do not extract higher-order relationships among genes and diseases. RESULTS: In this study, we propose a novel method called Dual Hypergraph Regularized Least Squares (DHRLS) with Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL) to detect all potential gene-disease associations. First, we construct multiple kernels based on various biological data sources in the gene and disease spaces, respectively. After that, we use CKA-MKL to obtain the optimal kernels in the two spaces, respectively. Specifically, hypergraphs are employed to establish higher-order relationships. Finally, our DHRLS model is solved by the Alternating Least Squares algorithm (ALSA) to predict gene-disease associations. CONCLUSION: Compared with many outstanding prediction tools, DHRLS achieves the best performance on the gene-disease association network under two types of cross validation. In a robustness check, our proposed approach also shows excellent prediction performance on six real-world networks. Our research work can effectively discover potential disease-associated genes and provide guidance for the follow-up verification methods of complex diseases.
Affiliation(s)
- Hongpeng Yang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Yijie Ding
- Yangtze Delta Region Institute, University of Electronic Science and Technology of China, Quzhou, China.
| | - Jijun Tang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Fei Guo
- School of Computer Science and Engineering, Central South University, Changsha, China.
| |
|
46
|
Wang X, Yang Y, Li K, Li W, Li F, Peng S. BioERP: biomedical heterogeneous network-based self-supervised representation learning approach for entity relationship predictions. Bioinformatics 2021; 37:4793-4800. [PMID: 34329382 DOI: 10.1093/bioinformatics/btab565] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/20/2021] [Revised: 07/18/2021] [Accepted: 07/29/2021] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Predicting entity relationships can greatly benefit important biomedical problems. Recently, a large number of biomedical heterogeneous networks (BioHNs) have been generated, offering opportunities for developing network-based learning approaches to predict relationships among entities. However, current research has only slightly explored BioHN-based self-supervised representation learning methods, and existing methods struggle to capture local- and global-level association information among entities simultaneously. RESULTS: In this study, we propose a biomedical heterogeneous network-based self-supervised representation learning approach for entity relationship predictions, termed BioERP. A self-supervised meta path detection mechanism is proposed to train a deep Transformer encoder model that can capture the global structure and semantic features in BioHNs. Meanwhile, a biomedical entity mask learning strategy is designed to reflect local associations of vertices. Finally, the representations from different task models are concatenated to generate two-level representation vectors for predicting relationships among entities. The results on eight datasets show that BioERP outperforms 30 state-of-the-art methods. In particular, BioERP achieves results close to 1 in terms of AUC and AUPR on drug-target interaction prediction. In summary, BioERP is a promising bio-entity relationship prediction approach. AVAILABILITY: Source code and data can be downloaded from https://github.com/pengsl-lab/BioERP.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaoqi Wang
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
| | - Yaning Yang
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
| | - Kenli Li
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
| | - Wentao Li
- School of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - Fei Li
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100850, China
| | - Shaoliang Peng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China; School of Computer Science, National University of Defense Technology, Changsha, 410073, China; Peng Cheng Lab, Shenzhen 518000, China
| |
|
47
|
Shu J, Li Y, Wang S, Xi B, Ma J. Disease gene prediction with privileged information and heteroscedastic dropout. Bioinformatics 2021; 37:i410-i417. [PMID: 34252957 PMCID: PMC8275341 DOI: 10.1093/bioinformatics/btab310] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Accepted: 04/24/2021] [Indexed: 11/19/2022]
Abstract
Motivation: Recently, machine learning models have achieved tremendous success in prioritizing candidate genes for genetic diseases. These models are able to accurately quantify the similarity among diseases and genes based on the intuition that similar genes are more likely to be associated with similar diseases. However, the genetic features these methods rely on are often hard to collect due to high experimental cost and various other technical limitations. Existing solutions to this problem significantly increase the risk of overfitting and decrease the generalizability of the models. Results: In this work, we propose a graph neural network (GNN) version of the Learning under Privileged Information paradigm to predict new disease gene associations. Unlike previous gene prioritization approaches, our model does not require the genetic features to be the same at training and test stages. If a genetic feature is hard to measure and therefore missing at the test stage, our model could still efficiently incorporate its information during the training process. To implement this, we develop a Heteroscedastic Gaussian Dropout algorithm, where the dropout probability of the GNN model is determined by another GNN model with a mirrored GNN architecture. To evaluate our method, we compared it with four state-of-the-art methods on the Online Mendelian Inheritance in Man dataset to prioritize candidate disease genes. Extensive evaluations show that our model could improve the prediction accuracy when all the features are available compared to other methods. More importantly, our model could make very accurate predictions when >90% of the features are missing at the test stage. Availability and implementation: Our method is realized with Python 3.7 and PyTorch 1.5.0, and the method and data are freely available at: https://github.com/juanshu30/Disease-Gene-Prioritization-with-Privileged-Information-and-Heteroscedastic-Dropout.
Affiliation(s)
- Juan Shu
- Department of Statistics, Purdue University, West Lafayette, IN 47906, USA
| | - Yu Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China
| | - Sheng Wang
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
| | - Bowei Xi
- Department of Statistics, Purdue University, West Lafayette, IN 47906, USA
| | - Jianzhu Ma
- Institute for Artificial Intelligence, Peking University, Beijing 100871, China
| |
|
48
|
Zhang T, Zhang SW, Li Y. Identifying Driver Genes for Individual Patients through Inductive Matrix Completion. Bioinformatics 2021; 37:4477-4484. [PMID: 34175939 DOI: 10.1093/bioinformatics/btab477] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/31/2020] [Revised: 04/30/2021] [Accepted: 06/25/2021] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Driver genes play a key role in the evolutionary process of cancer. Effectively identifying these driver genes is crucial to cancer diagnosis and treatment. However, due to the high heterogeneity of cancers, it remains challenging to identify the driver genes for individual patients. Although some computational methods have been proposed to tackle this problem, they seldom consider the fact that genes functionally similar to well-established driver genes are likely to play similar roles in the cancer process, which could aid driver gene identification. Thus, we developed a novel approach, IMCDriver, to improve driver gene identification for both cohorts and individual patients. RESULTS: IMCDriver first considers the well-established driver genes as prior information, and uses multi-omics data (e.g., somatic mutation, gene expression and protein-protein interaction) to compute the similarity between patients/genes. Then, IMCDriver prioritizes the personalized mutated genes according to their functional similarity to the well-established driver genes via Inductive Matrix Completion. Finally, IMCDriver identifies the highly ranked genes as the personalized driver genes. The results on five cancer datasets from TCGA show that IMCDriver outperforms other existing state-of-the-art methods in both cohort and patient-specific driver gene identification. IMCDriver also reveals some novel driver genes that potentially drive cancer development. In addition, even for driver genes rarely mutated among a population, IMCDriver can still identify them and rank them with high priority. AVAILABILITY: Code available at https://github.com/NWPU-903PR/IMCDriver. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tong Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China; School of Electrical and Mechanical Engineering, Pingdingshan University, Pingdingshan, China
| | - Shao-Wu Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China
| | - Yan Li
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China
| |
|
49
|
Wang X, Xin B, Tan W, Xu Z, Li K, Li F, Zhong W, Peng S. DeepR2cov: deep representation learning on heterogeneous drug networks to discover anti-inflammatory agents for COVID-19. Brief Bioinform 2021; 22:6296505. [PMID: 34117734 PMCID: PMC8344611 DOI: 10.1093/bib/bbab226] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 01/17/2021] [Revised: 05/14/2021] [Accepted: 05/24/2021] [Indexed: 02/06/2023]
Abstract
Recent studies have demonstrated that the excessive inflammatory response is an important factor of death in coronavirus disease 2019 (COVID-19) patients. In this study, we propose a deep representation on heterogeneous drug networks, termed DeepR2cov, to discover potential agents for treating the excessive inflammatory response in COVID-19 patients. This work explores the multi-hub characteristic of a heterogeneous drug network integrating eight unique networks. Inspired by the multi-hub characteristic, we design 3 billion special meta paths to train a deep representation model for learning low-dimensional vectors that integrate long-range structure dependency and complex semantic relation among network nodes. Based on the representation vectors and transcriptomics data, we predict 22 drugs that bind to tumor necrosis factor-α or interleukin-6, whose therapeutic associations with the inflammation storm in COVID-19 patients, and molecular binding model are further validated via data from PubMed publications, ongoing clinical trials and a docking program. In addition, the results on five biomedical applications suggest that DeepR2cov significantly outperforms five existing representation approaches. In summary, DeepR2cov is a powerful network representation approach and holds the potential to accelerate treatment of the inflammatory responses in COVID-19 patients. The source code and data can be downloaded from https://github.com/pengsl-lab/DeepR2cov.git.
Affiliation(s)
- Xiaoqi Wang
- College of Computer Science and Electronic Engineering, Hunan University, China
| | - Bin Xin
- College of Computer Science and Electronic Engineering, Hunan University, China
| | - Weihong Tan
- Chinese Academy of Sciences in the College of Chemistry and Chemical Engineering, College of Biology, Hunan University, China
| | - Zhijian Xu
- Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, China
| | - Kenli Li
- College of Computer Science and Electronic Engineering, Hunan University, China
| | - Fei Li
- Computer Network Information Center, Chinese Academy of Sciences, China
| | - Wu Zhong
- National Engineering Research Center for the Emergency Drug, Beijing Institute of Pharmacology and Toxicology, China
| | - Shaoliang Peng
- College of Computer Science and Electronic Engineering, Hunan University, China
| |
|
50
|
Ding Y, Lei X, Liao B, Wu FX. Predicting miRNA-Disease Associations Based on Multi-View Variational Graph Auto-Encoder with Matrix Factorization. IEEE J Biomed Health Inform 2021; 26:446-457. [PMID: 34111017 DOI: 10.1109/jbhi.2021.3088342] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Indexed: 11/10/2022]
Abstract
MicroRNAs (miRNAs) have been proved to play critical roles in diverse biological processes, including the human disease development process. Exploring the potential associations between miRNAs and diseases can help us better understand complex disease mechanisms. Given that traditional biological experiments are expensive and time-consuming, computational models can serve as efficient means to uncover potential miRNA-disease associations. This study presents a new computational model based on variational graph auto-encoder with matrix factorization (VGAMF) for miRNA-disease association prediction. More specifically, VGAMF first integrates four different types of information about miRNAs into an miRNA comprehensive similarity network and two types of information about diseases into a disease comprehensive similarity network, respectively. Then, VGAMF gets the non-linear representations of miRNAs and diseases, respectively, from those two comprehensive similarity networks with variational graph auto-encoders. Simultaneously, a non-negative matrix factorization is conducted on the miRNA-disease association matrix to get the linear representations of miRNAs and diseases. Finally, a fully connected neural network combines linear and non-linear representations of miRNAs and diseases to get the final predicted association score for all miRNA-disease pairs. In the 10-fold cross-validation experiments, VGAMF achieves an average AUC of 0.9280 on HMDD v2.0 and 0.9470 on HMDD v3.2, which outperforms other competing methods. Besides, the case studies on colon cancer and esophageal cancer further demonstrate the effectiveness of VGAMF in predicting novel miRNA-disease associations.
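The non-negative matrix factorization step mentioned in this abstract can be sketched with the standard multiplicative update rules. This is a generic NMF on a toy association matrix, not the VGAMF code; the function name `nmf`, the rank, the iteration count, and the example matrix are all assumptions made for illustration.

```python
import numpy as np

def nmf(A, rank=2, iters=500, eps=1e-9, seed=0):
    """Approximate a non-negative matrix A as W @ H with W, H >= 0.

    Uses multiplicative updates that minimize the Frobenius reconstruction
    error; eps guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], rank))
    H = rng.random((rank, A.shape[1]))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical miRNA x disease association matrix (1 = known association).
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0]], dtype=float)

W, H = nmf(A)
reconstruction = W @ H    # rows of W and columns of H act as linear representations
```

In a pipeline like the one the abstract describes, factors of this kind would be combined with the non-linear representations from the graph auto-encoders before the final scoring network; that combination is outside the scope of this sketch.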
|