1
Kim K, Kim J, Kim M, Lee H, Song G. Therapeutic gene target prediction using novel deep hypergraph representation learning. Brief Bioinform 2024; 26:bbaf019. [PMID: 39841592 PMCID: PMC11752618 DOI: 10.1093/bib/bbaf019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2024] [Revised: 12/18/2024] [Accepted: 01/07/2025] [Indexed: 01/24/2025] Open
Abstract
Identifying therapeutic genes is crucial for developing treatments targeting genetic causes of diseases, but experimental trials are costly and time-consuming. Although many deep learning approaches aim to identify biomarker genes, predicting therapeutic target genes remains challenging due to the limited number of known targets. To address this, we propose HIT (Hypergraph Interaction Transformer), a deep hypergraph representation learning model that identifies a gene's therapeutic potential, biomarker status, or lack of association with diseases. HIT uses hypergraph structures of genes, ontologies, diseases, and phenotypes, employing attention-based learning to capture complex relationships. Experiments demonstrate HIT's state-of-the-art performance, explainability, and ability to identify novel therapeutic targets.
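To make the idea of attention over a hypergraph concrete, here is a minimal sketch, not the HIT architecture itself. It assumes each hyperedge (for example a disease or phenotype) groups several gene nodes, and it pools member features with softmax attention against a query vector. The gene names, feature values, and the `attend_hyperedge` helper are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_hyperedge(member_features, query):
    """Pool the features of a hyperedge's member nodes with dot-product attention."""
    scores = [sum(q * x for q, x in zip(query, feat)) for feat in member_features]
    weights = softmax(scores)
    dim = len(query)
    pooled = [sum(w * feat[d] for w, feat in zip(weights, member_features))
              for d in range(dim)]
    return pooled, weights

# Hypothetical toy data: three genes, two hyperedges that each join several genes.
node_features = {"GENE_A": [1.0, 0.0], "GENE_B": [0.0, 1.0], "GENE_C": [1.0, 1.0]}
hyperedges = {
    "disease_X": ["GENE_A", "GENE_B", "GENE_C"],
    "phenotype_Y": ["GENE_B", "GENE_C"],
}
query = [1.0, 1.0]  # stands in for a learned attention query (assumption)

edge_embeddings = {}
edge_attention = {}
for name, members in hyperedges.items():
    pooled, weights = attend_hyperedge([node_features[m] for m in members], query)
    edge_embeddings[name] = pooled
    edge_attention[name] = dict(zip(members, weights))
```

The per-member attention weights in `edge_attention` are the kind of quantity that can be inspected for explainability, since they show which genes contribute most to each hyperedge embedding.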
Affiliation(s)
- Kibeom Kim
  - Division of Artificial Intelligence, Pusan National University, 2 Busandaehak-ro 63beon-gil, Geumjeong-gu, Busan 46241, South Korea
- Juseong Kim
  - Division of Artificial Intelligence, Pusan National University, 2 Busandaehak-ro 63beon-gil, Geumjeong-gu, Busan 46241, South Korea
- Minwook Kim
  - Division of Artificial Intelligence, Pusan National University, 2 Busandaehak-ro 63beon-gil, Geumjeong-gu, Busan 46241, South Korea
- Hyewon Lee
  - Department of Cardiology, Medical Research Institute, Pusan National University Hospital, 179 Gudeok-ro, Busan 49241, South Korea
  - College of Medicine, Pusan National University, 20 Geumo-ro, Yangsan 50612, Gyeongsangnam-do, South Korea
- Giltae Song
  - Division of Artificial Intelligence, Pusan National University, 2 Busandaehak-ro 63beon-gil, Geumjeong-gu, Busan 46241, South Korea
  - Department of Electrical and Computer Engineering, School of Computer Science and Engineering, Pusan National University, 2 Busandaehak-ro 63beon-gil, Geumjeong-gu, Busan 46241, South Korea
  - Center for Artificial Intelligence Research, Pusan National University, 2 Busandaehak-ro 63beon-gil, Geumjeong-gu, Busan 46241, South Korea
2
He F, Liu K, Yang Z, Chen Y, Hammer RD, Xu D, Popescu M. pathCLIP: Detection of Genes and Gene Relations From Biological Pathway Figures Through Image-Text Contrastive Learning. IEEE J Biomed Health Inform 2024; 28:5007-5019. [PMID: 38568768 PMCID: PMC11363067 DOI: 10.1109/jbhi.2024.3383610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2024]
Abstract
In biomedical literature, biological pathways are commonly described through a combination of images and text. These pathways contain valuable information, including genes and their relationships, which provide insight into biological mechanisms and precision medicine. Curating pathway information across the literature enables the integration of this information to build a comprehensive knowledge base. While some studies have extracted pathway information from images and text independently, they often overlook the correspondence between the two modalities. In this paper, we present a pathway figure curation system named pathCLIP for identifying genes and gene relations from pathway figures. Our key innovation is the use of an image-text contrastive learning model to learn coordinated embeddings of image snippets and text descriptions of genes and gene relations, thereby improving curation. Our validation results, using pathway figures from PubMed, showed that our multimodal model outperforms models using only a single modality. Additionally, our system effectively curates genes and gene relations from multiple literature sources. Two case studies on extracting pathway information from literature of non-small cell lung cancer and Alzheimer's disease further demonstrate the usefulness of our curated pathway information in enhancing related pathways in the KEGG database.
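The image-text contrastive objective mentioned above can be illustrated with a small sketch. This is a generic CLIP-style symmetric loss, not pathCLIP's actual implementation: matched image and text embeddings share an index, and each direction is scored with cross-entropy over cosine similarities. The function names, the toy embeddings, and the temperature of 0.07 are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy losses."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the text side
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy check: correctly paired embeddings give a much lower loss than swapped pairs.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each image snippet toward its own text description and pushes it away from the other descriptions in the batch, which is what "coordinated embeddings" refers to.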
Affiliation(s)
- Fei He
  - School of Information Science and Technology, Northeast Normal University, Changchun 130000, China
  - Department of Electrical Engineer and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia Missouri, MO 65211 USA
- Kai Liu
  - School of Information Science and Technology, Northeast Normal University, Changchun 130000, China
- Zhiyuan Yang
  - School of Information Science and Technology, Northeast Normal University, Changchun 130000, China
- Yibo Chen
  - Department of Electrical Engineer and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia Missouri, MO 65211 USA
- Richard D. Hammer
  - School of Medicine, University of Missouri, Columbia Missouri, MO 65211 USA
- Dong Xu
  - Department of Electrical Engineer and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia Missouri, MO 65211 USA
- Mihail Popescu
  - School of Medicine, University of Missouri, Columbia Missouri, MO 65211 USA
3
Meng Z, Liu S, Liang S, Jani B, Meng Z. Heterogeneous biomedical entity representation learning for gene-disease association prediction. Brief Bioinform 2024; 25:bbae380. [PMID: 39154194 PMCID: PMC11330343 DOI: 10.1093/bib/bbae380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 05/29/2024] [Accepted: 07/22/2024] [Indexed: 08/19/2024] Open
Abstract
Understanding the genetic basis of disease is a fundamental aspect of medical research, as genes are the classic units of heredity and play a crucial role in biological function. Identifying associations between genes and diseases is critical for diagnosis, prevention, prognosis, and drug development. Genes that encode proteins with similar sequences are often implicated in related diseases, as proteins causing identical or similar diseases tend to show limited variation in their sequences. Predicting gene-disease association (GDA) requires time-consuming and expensive experiments on a large number of potential candidate genes. Although methods have been proposed to predict associations between genes and diseases using traditional machine learning algorithms and graph neural networks, these approaches struggle to capture the deep semantic information within the genes and diseases and are dependent on training data. To alleviate this issue, we propose a novel GDA prediction model named FusionGDA, which utilizes a pre-training phase with a fusion module to enrich the gene and disease semantic representations encoded by pre-trained language models. Multi-modal representations are generated by the fusion module, which includes rich semantic information about two heterogeneous biomedical entities: protein sequences and disease descriptions. Subsequently, the pooling aggregation strategy is adopted to compress the dimensions of the multi-modal representation. In addition, FusionGDA employs a pre-training phase leveraging a contrastive learning loss to extract potential gene and disease features by training on a large public GDA dataset. To rigorously evaluate the effectiveness of the FusionGDA model, we conduct comprehensive experiments on five datasets and compare our proposed model with five competitive baseline models on the DisGeNet-Eval dataset. Notably, our case study further demonstrates the ability of FusionGDA to discover hidden associations effectively. 
The complete code and datasets of our experiments are available at https://github.com/ZhaohanM/FusionGDA.
Affiliation(s)
- Zhaohan Meng
  - School of Computing Science, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8RZ, UK
- Siwei Liu
  - School of Natural and Computing Science, University of Aberdeen King’s College, Aberdeen, AB24 3FX, UK
- Shangsong Liang
  - Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence, Building 1B, Masdar City, Abu Dhabi 000000, UAE
- Bhautesh Jani
  - School of Computing Science, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8RZ, UK
- Zaiqiao Meng
  - School of Computing Science, University of Glasgow, 18 Lilybank Gardens, Glasgow G12 8RZ, UK
4
Alvarez-Mamani E, Dechant R, Beltran-Castañón CA, Ibáñez AJ. Graph embedding on mass spectrometry- and sequencing-based biomedical data. BMC Bioinformatics 2024; 25:1. [PMID: 38166530 PMCID: PMC10763173 DOI: 10.1186/s12859-023-05612-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 12/11/2023] [Indexed: 01/04/2024] Open
Abstract
Graph embedding techniques use deep learning algorithms in data analysis to solve problems such as node classification, link prediction, community detection, and visualization. Although typically used in the context of guessing friendships in social media, several applications for graph embedding techniques in biomedical data analysis have emerged. While these approaches remain computationally demanding, several developments over the last years facilitate their application to study biomedical data and thus may help advance biological discoveries. Therefore, in this review, we discuss the principles of graph embedding techniques and explore their usefulness for understanding biological network data derived from mass spectrometry and sequencing experiments, the current workhorses of systems biology studies. In particular, we focus on recent examples of characterizing protein-protein interaction networks and predicting novel drug functions.
Affiliation(s)
- Edwin Alvarez-Mamani
  - Engineering Department, Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
  - Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
- Reinhard Dechant
  - Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
  - Calico Life Sciences, 1170 Veterans Blvd, San Francisco, CA, 94080, USA
- Alfredo J Ibáñez
  - Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru.
  - Science Department, Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru.
5
Shi W, Feng H, Li J, Liu T, Liu Z. DapBCH: a disease association prediction model Based on Cross-species and Heterogeneous graph embedding. Front Genet 2023; 14:1222346. [PMID: 37811150 PMCID: PMC10556742 DOI: 10.3389/fgene.2023.1222346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2023] [Accepted: 09/11/2023] [Indexed: 10/10/2023] Open
Abstract
The study of comorbidity can provide new insights into the pathogenesis of the disease and has important economic significance in the clinical evaluation of treatment difficulty, medical expenses, length of stay, and prognosis of the disease. In this paper, we propose a disease association prediction model DapBCH, which constructs a cross-species biological network and applies heterogeneous graph embedding to predict disease association. First, we combine the human disease-gene network, mouse gene-phenotype network, human-mouse homologous gene network, and human protein-protein interaction network to reconstruct a heterogeneous biological network. Second, we apply heterogeneous graph embedding based on meta-path aggregation to generate the feature vector of disease nodes. Finally, we employ link prediction to obtain the similarity of disease pairs. The experimental results indicate that our model is highly competitive in predicting the disease association and is promising for finding potential disease associations.
Affiliation(s)
- Wanqi Shi
  - School of Mathematics and Computer Science, Zhejiang A & F University, Hangzhou, Zhejiang, China
- Hailin Feng
  - School of Mathematics and Computer Science, Zhejiang A & F University, Hangzhou, Zhejiang, China
- Jian Li
  - School of Mathematics and Computer Science, Zhejiang A & F University, Hangzhou, Zhejiang, China
- Tongcun Liu
  - School of Mathematics and Computer Science, Zhejiang A & F University, Hangzhou, Zhejiang, China
- Zhe Liu
  - College of Media Engineering, Zhejiang University of Media and Communications, Hangzhou, Zhejiang, China
6
Wang Z, Gu Y, Zheng S, Yang L, Li J. MGREL: A multi-graph representation learning-based ensemble learning method for gene-disease association prediction. Comput Biol Med 2023; 155:106642. [PMID: 36805231 DOI: 10.1016/j.compbiomed.2023.106642] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/24/2022] [Revised: 01/15/2023] [Accepted: 02/05/2023] [Indexed: 02/12/2023]
Abstract
The identification of gene-disease associations plays an important role in the exploration of pathogenic mechanisms and therapeutic targets. Computational methods have been regarded as an effective way to discover potential gene-disease associations in recent years. However, most of them ignore the combination of abundant genetic and therapeutic information with gene-disease network topology. To this end, we re-organized the current gene-disease association benchmark dataset by extracting the newest gene-disease associations from the OMIM database. Then, we developed a multi-graph representation learning-based ensemble model, named MGREL, to predict gene-disease associations. MGREL integrated two feature generation channels to extract gene and disease features: a knowledge extraction channel, which learned high-order representations from genetic and therapeutic information, and a graph learning channel, which acquired network topological representations through multiple advanced graph representation learning methods. Then, an ensemble learning method with 5 machine learning models was used as the classifier to predict the gene-disease association. Comprehensive experiments have demonstrated the significant performance achieved by MGREL compared to 5 state-of-the-art methods. For the major measurements (AUC = 0.925, AUPR = 0.935), the relative improvements of MGREL compared to the suboptimal methods are 3.24% and 2.75%, respectively. MGREL also achieved impressive improvements in the challenging tasks of predicting potential associations for unknown genes/diseases. In addition, case studies implied potential applications for MGREL in the discovery of potential therapeutic targets.
Affiliation(s)
- Ziyang Wang
  - Institute of Medical Information IMI, Chinese Academy of Medical Sciences and Peking Union Medical College CAMS & PUMC, Beijing, 100020, China
- Yaowen Gu
  - Institute of Medical Information IMI, Chinese Academy of Medical Sciences and Peking Union Medical College CAMS & PUMC, Beijing, 100020, China
- Si Zheng
  - Institute of Medical Information IMI, Chinese Academy of Medical Sciences and Peking Union Medical College CAMS & PUMC, Beijing, 100020, China; Institute for Artificial Intelligence, Department of Computer Science and Technology, BNRist, Tsinghua University, Beijing, 100084, China
- Lin Yang
  - Institute of Medical Information IMI, Chinese Academy of Medical Sciences and Peking Union Medical College CAMS & PUMC, Beijing, 100020, China
- Jiao Li
  - Institute of Medical Information IMI, Chinese Academy of Medical Sciences and Peking Union Medical College CAMS & PUMC, Beijing, 100020, China.
7
Zhang Y, Xiang J, Tang L, Yang J, Li J. PGAGP: Predicting pathogenic genes based on adaptive network embedding algorithm. Front Genet 2023; 13:1087784. [PMID: 36744177 PMCID: PMC9895109 DOI: 10.3389/fgene.2022.1087784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2022] [Accepted: 12/09/2022] [Indexed: 01/21/2023] Open
Abstract
The study of disease-gene associations is an important topic in the field of computational biology. The accumulation of massive amounts of biomedical data provides new possibilities for exploring potential relations between diseases and genes through computational strategies, but extracting valuable information from the data to predict pathogenic genes accurately and rapidly remains a challenging and meaningful task. Therefore, we present a novel computational method called PGAGP for inferring potential pathogenic genes based on an adaptive network embedding algorithm. The PGAGP algorithm first extracts initial features of nodes from a heterogeneous network of diseases and genes efficiently and effectively by Gaussian random projection and then optimizes the features of nodes by an adaptive refining process. These low-dimensional features are used to improve the disease-gene heterogeneous network, and we apply network propagation to the improved heterogeneous network to predict pathogenic genes more effectively. Through a series of experiments, we study the effect of PGAGP's parameters and integrated strategies on predictive performance and confirm that PGAGP is better than the state-of-the-art algorithms. Case studies show that many of the predicted candidate genes for specific diseases have been implied to be related to these diseases by literature verification and enrichment analysis, which further verifies the effectiveness of PGAGP. Overall, this work provides a useful solution for mining disease-gene heterogeneous networks to predict pathogenic genes more effectively.
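Two of the steps named in this abstract, Gaussian random projection and network propagation, can be sketched on a toy graph. This is a simplified illustration under stated assumptions, not the PGAGP algorithm: the adaptive refining step and the network-improvement step are omitted, and random walk with restart is used here as one common form of network propagation. The function names, the four-node path graph, and the restart probability of 0.5 are all hypothetical.

```python
import random

def gaussian_random_projection(adjacency, dim, seed=0):
    """Project each node's adjacency row to `dim` dimensions with a Gaussian matrix."""
    rng = random.Random(seed)
    n = len(adjacency)
    scale = 1.0 / dim ** 0.5
    proj = [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)] for _ in range(n)]
    return [[sum(row[k] * proj[k][j] for k in range(n)) for j in range(dim)]
            for row in adjacency]

def random_walk_with_restart(adjacency, seeds, restart=0.5, iters=100):
    """Propagate scores from seed nodes over a column-normalized adjacency matrix."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(iters):
        p = [(1 - restart) * sum(adjacency[i][j] / col_sums[j] * p[j]
                                 for j in range(n) if col_sums[j] > 0)
             + restart * p0[i]
             for i in range(n)]
    return p

# Toy path graph 0 - 1 - 2 - 3; node 0 plays the role of a known disease gene.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
features = gaussian_random_projection(adjacency, dim=2)
scores = random_walk_with_restart(adjacency, seeds={0})
```

In this toy setting, nodes closer to the seed receive higher propagation scores, which is the intuition behind ranking candidate genes by their proximity to known disease genes.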
Affiliation(s)
- Yan Zhang
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - School of Information Science and Engineering, Changsha Medical University, Changsha, China
  - Academician Workstation, Changsha Medical University, Changsha, China
- Ju Xiang
  - School of Computer Science and Engineering, Central South University, Changsha, China
  - School of Information Science and Engineering, Changsha Medical University, Changsha, China
  - Academician Workstation, Changsha Medical University, Changsha, China
  - School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, China
  - Department of Basic Medical Sciences and Neuroscience Research Center, Changsha Medical University, Changsha, China
- Liang Tang
  - Academician Workstation, Changsha Medical University, Changsha, China
  - Department of Basic Medical Sciences and Neuroscience Research Center, Changsha Medical University, Changsha, China
- Jialiang Yang
  - Academician Workstation, Changsha Medical University, Changsha, China
  - Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
  - Geneis Beijing Co., Ltd, Beijing, China
- Jianming Li
  - Academician Workstation, Changsha Medical University, Changsha, China
  - Department of Basic Medical Sciences and Neuroscience Research Center, Changsha Medical University, Changsha, China
8
Huang HH, Rao H, Miao R, Liang Y. A novel meta-analysis based on data augmentation and elastic data shared lasso regularization for gene expression. BMC Bioinformatics 2022; 23:353. [PMID: 35999505 PMCID: PMC9396780 DOI: 10.1186/s12859-022-04887-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/08/2022] [Accepted: 08/10/2022] [Indexed: 12/22/2022] Open
Abstract
Background: Gene expression analysis can provide useful information for analyzing complex biological mechanisms. However, many reported findings are unrepeatable due to small sample sizes relative to a large number of genes and the low signal-to-noise ratios of most gene expression datasets. Results: Meta-analysis of multi-data sets is an efficient method for tackling the above problem. To improve the performance of meta-analysis, we propose a novel meta-analysis framework. It consists of two parts: (1) a novel data augmentation strategy. Various cross-platform normalization methods exist, which can preserve original biological information of gene expression datasets from different angles and add different “perturbations” to the dataset. Using such perturbation, we provide a feasible means for gene expression data augmentation; (2) elastic data shared lasso (DSL-$L_2$). The DSL-$L_2$ method spans the continuum between individual models for each dataset and one model for all datasets. It also overcomes the shortcomings of the data shared lasso method when dealing with highly correlated features. Comprehensive simulation experiment results show that the proposed method has high prediction and gene selection performance. We then apply the proposed method to non-small cell lung cancer (NSCLC) blood gene expression data in order to identify key tumor-related genes. The outcomes of our experiment indicate that the method could be used for identifying a set of robust disease-related gene signatures that may be used for NSCLC early diagnosis or prognosis or even targeting. Conclusion: We propose a novel and effective meta-analysis method for biological research, extrapolating and integrating information from multiple gene expression datasets.
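The abstract does not state the objective function, so the following is only an assumed form, written to show how a shared-plus-specific lasso can span the range between one pooled model and separate per-dataset models. It starts from the standard data shared lasso, in which observation $i$ from dataset $g_i$ is fitted with a shared coefficient vector $\beta$ plus a dataset-specific offset $\Delta_{g_i}$, and adds a ridge term in the manner of the elastic net. The symbols $\lambda_1$, $\lambda_2$, and $r_g$, and the exact placement of the ridge term, are assumptions rather than the paper's definition.

```latex
\min_{\beta,\,\Delta_1,\dots,\Delta_G}\;
\frac{1}{2}\sum_{i=1}^{n}\Bigl(y_i - x_i^{\top}\bigl(\beta + \Delta_{g_i}\bigr)\Bigr)^{2}
+ \lambda_1\Bigl(\lVert\beta\rVert_1 + \sum_{g=1}^{G} r_g\,\lVert\Delta_g\rVert_1\Bigr)
+ \lambda_2\Bigl(\lVert\beta\rVert_2^{2} + \sum_{g=1}^{G} r_g\,\lVert\Delta_g\rVert_2^{2}\Bigr)
```

Under this form, large $r_g$ drives every $\Delta_g$ to zero and leaves a single model for all datasets, while small $r_g$ makes the shared $\beta$ comparatively expensive and yields essentially independent models per dataset. The squared $\ell_2$ term is what would let correlated genes be selected together instead of arbitrarily picking one of them.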
Affiliation(s)
- Hai-Hui Huang
  - Provincial Demonstration Software Institute, Shaoguan University, Shaoguan, China
- Hao Rao
  - Provincial Demonstration Software Institute, Shaoguan University, Shaoguan, China
- Rui Miao
  - Faculty of Information Technology, Macau University of Science and Technology, Macau, China
- Yong Liang
  - The Peng Cheng Laboratory, Shenzhen, China.