1
Xiao Z, Sun H, Wei A, Zhao W, Jiang X. A Novel Framework for Predicting Phage-Host Interactions via Host Specificity-Aware Graph Autoencoder. IEEE J Biomed Health Inform 2025; 29:3069-3078. [PMID: 40030240 DOI: 10.1109/jbhi.2024.3500137] [Indexed: 04/05/2025]
Abstract
Due to the abuse of antibiotics, some pathogenic bacteria have developed resistance to most antibiotics, leading to the emergence of antibiotic-resistant superbugs. Researchers have therefore turned to phage therapy for bacterial infections. The fundamental step in phage therapy is to accurately identify phage-host interactions. Although various methods have been proposed, existing methods suffer from two shortcomings: 1) they fail to make full use of genetic information, including both the genome and protein sequences of phages; 2) the host specificity of phages is not explicitly utilized when learning representations of phages and bacteria. In this paper, we present an efficient computational method called PHISGAE for predicting phage-host interactions, in which host specificity is explicitly employed. First, initial phage-phage connections are efficiently constructed using phage genome and protein sequences. Then, a refined heterogeneous network is derived by applying a K-nearest neighbor strategy, which keeps the more meaningful local semantics among phages and bacteria. Finally, a host specificity-aware graph autoencoder is proposed to learn high-quality representations of phages and bacteria for predicting phage-host interactions. Experimental results show that PHISGAE outperforms state-of-the-art methods at predicting phage-host interactions at both the species level and the genus level (AUC values of 94.73% and 96.32%, respectively). Moreover, the results of a case study demonstrate that PHISGAE is able to identify candidate hosts with high probability for previously unseen phages identified from metagenomics, effectively predicting potential phage-host interactions in real-world applications.
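Illustrative note: the K-nearest neighbor refinement step mentioned in this abstract can be sketched as below. This is a minimal sketch and not the PHISGAE implementation; the function name `knn_refine`, the rule of keeping an edge when either endpoint selects it, and the toy similarity matrix are all assumptions made for illustration.

```python
def knn_refine(sim, k):
    """Keep, for every node, only the edges to its k most similar neighbors.

    `sim` is a square similarity matrix given as a list of lists. The
    symmetrization rule (keep an edge if either endpoint selects it) is an
    assumption made for this sketch.
    """
    n = len(sim)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other nodes by similarity to node i, highest first.
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        for j in ranked[:k]:
            adj[i][j] = sim[i][j]
            adj[j][i] = sim[j][i]
    return adj


# Toy similarity matrix over four hypothetical phages.
toy_sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
refined = knn_refine(toy_sim, k=1)
```

With `k=1`, only the strongest link of each node survives, so weak cross-cluster similarities are dropped from the network.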
2
Selote R, Makhijani R. A knowledge graph approach to drug repurposing for Alzheimer's, Parkinson's and Glioma using drug-disease-gene associations. Comput Biol Chem 2025; 115:108302. [PMID: 39693851 DOI: 10.1016/j.compbiolchem.2024.108302] [Received: 06/12/2024] [Revised: 11/06/2024] [Accepted: 11/26/2024] [Indexed: 12/20/2024]
Abstract
Drug repurposing finds new uses for previously developed drugs rather than developing new drugs from scratch. During the pandemic in particular, drug repurposing attracted much attention as a way to find new applications for previously approved drugs. In this research, we provide a novel method for drug repurposing based on feature learning from a drug-disease-gene network, with the aim of finding drug candidates that can be repurposed for neurodegenerative diseases and glioma. We collected association data between drugs, diseases and genes from public resources and primarily examined the data related to Alzheimer's disease, Parkinson's disease and glioma. We created a knowledge graph using Neo4j by integrating all these datasets and applied the scalable feature learning algorithm node2vec to create node embeddings. These embeddings were then used to predict unknown associations between diseases and their candidate drugs by computing the cosine similarity between disease and drug node embeddings. We obtained a definitive set of candidate drugs for repurposing. These results were validated against the literature and with the CodReS online tool, which was used to rank the candidate drugs. Additionally, we verified the status of the candidate drugs in pharmaceutical knowledge databases to confirm their significance.
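Illustrative note: ranking candidate drugs by cosine similarity between a disease embedding and drug embeddings, as this abstract describes, might look like the sketch below. The embedding vectors, the drug names, and the helper names `cosine_similarity` and `rank_drugs` are hypothetical; real node2vec embeddings would come from the knowledge graph and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


def rank_drugs(disease_vec, drug_vecs):
    """Return (drug, score) pairs sorted from most to least similar."""
    scores = {name: cosine_similarity(disease_vec, vec)
              for name, vec in drug_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical 2-dimensional embeddings, for illustration only.
disease_embedding = [1.0, 0.0]
drug_embeddings = {
    "drug_a": [2.0, 0.0],
    "drug_b": [0.0, 3.0],
    "drug_c": [1.0, 1.0],
}
ranking = rank_drugs(disease_embedding, drug_embeddings)
```

Cosine similarity ignores vector length, so `drug_a` scores highest here even though its embedding has a larger norm than the disease embedding.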
Affiliation(s)
- Ruchira Selote
  - Department of Computer Science and Engineering, Indian Institute of Information Technology, Nagpur, India.
- Richa Makhijani
  - Department of Computer Science and Engineering, Indian Institute of Information Technology, Nagpur, India.
3
Ye C, Li K, Sun W, Jiang Y, Zhang W, Zhang P, Hu YJ, Han Y, Li L. Biological Prior Knowledge-Embedded Deep Neural Network for Plant Genomic Prediction. Genes (Basel) 2025; 16:411. [PMID: 40282370 PMCID: PMC12027452 DOI: 10.3390/genes16040411] [Received: 02/24/2025] [Revised: 03/23/2025] [Accepted: 03/26/2025] [Indexed: 04/29/2025] Open
Abstract
Background/Objectives: Genomic prediction is a powerful approach that predicts phenotypic traits from genotypic information, enabling the acceleration of trait improvement in plant breeding. Traditional genomic prediction methods have primarily relied on linear mixed models, such as Genomic Best Linear Unbiased Prediction (GBLUP), and conventional machine learning methods like Support Vector Regression (SVR). Traditional methods are limited in handling high-dimensional data and nonlinear relationships. Thus, deep learning methods have also been applied to genomic prediction in recent years. Methods: We proposed iADEP, an Integrated Additive, Dominant, and Epistatic Prediction model based on deep learning. Specifically, single nucleotide polymorphism (SNP) data integrating latent genetic interactions and genome-wide association study results as biological prior knowledge are fused to an SNP embedding block, which is then input to a local encoder. The local encoder is fused with an omic-data-incorporated global decoder through a multi-head attention mechanism, followed by multilayer perceptrons. Results: First, we demonstrated through experiments on four datasets that iADEP outperforms existing methods in genotype-to-phenotype prediction. Second, we validated the effectiveness of SNP embedding through ablation experiments. Third, we provided a module for combining other omics data in iADEP and proposed a novel method for fusing them. Fourth, we explored the impact of feature selection on iADEP performance and concluded that utilizing the full set of SNPs generally provides optimal results. Finally, by altering the partition of training and testing sets, we investigated the differences between transductive learning and inductive learning. Conclusions: iADEP provides a new approach for AI breeding, a promising method that integrates biological prior knowledge and enables combination with other omics data.
Affiliation(s)
- Chonghang Ye
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
- Kai Li
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Weicheng Sun
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Yiwei Jiang
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Weihan Zhang
  - Hubei Hongshan Laboratory, Wuhan 430070, China
  - State Key Laboratory of Plant Diversity and Specialty Crops, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
- Ping Zhang
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
  - School of Computer, BaoJi University of Arts and Sciences, Baoji 721016, China
- Yi-Juan Hu
  - Department of Biostatistics, School of Public Health, Peking University, Beijing 100191, China
  - Beijing International Center for Mathematical Research, Peking University, Beijing 100871, China
- Yuepeng Han
  - Hubei Hongshan Laboratory, Wuhan 430070, China
  - State Key Laboratory of Plant Diversity and Specialty Crops, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
- Li Li
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
  - Hubei Hongshan Laboratory, Wuhan 430070, China
4
Liu T, Chen Q, Liu R, Sun Y, Wang Y, Zhu Y, Zhao T. DMGAT: predicting ncRNA-drug resistance associations based on diffusion map and heterogeneous graph attention network. Brief Bioinform 2025; 26:bbaf179. [PMID: 40251829 PMCID: PMC12008124 DOI: 10.1093/bib/bbaf179] [Received: 01/02/2025] [Revised: 03/26/2025] [Accepted: 03/30/2025] [Indexed: 04/21/2025] Open
Abstract
Non-coding RNAs (ncRNAs) play crucial roles in drug resistance and sensitivity, making them important biomarkers and therapeutic targets. However, predicting ncRNA-drug associations is challenging due to issues such as dataset imbalance and sparsity, limiting the identification of robust biomarkers. Existing models often fall short in capturing local and global sequence information, limiting the reliability of predictions. This study introduces DMGAT (diffusion map and heterogeneous graph attention network), a novel deep learning model designed to predict ncRNA-drug associations. DMGAT integrates diffusion maps for sequence embedding, graph convolutional networks for feature extraction, and GAT for heterogeneous information fusion. To address dataset imbalance, the model incorporates sensitivity associations and employs a random forest classifier to select reliable negative samples. DMGAT embeds ncRNA sequences and drug SMILES using the word2vec technique, capturing local and global sequence information. The model constructs a heterogeneous network by combining sequence similarity and Gaussian Interaction Profile kernel similarity, providing a comprehensive representation of ncRNA-drug interactions. Evaluated through five-fold cross-validation on a curated dataset from NoncoRNA and ncDR, DMGAT outperforms seven state-of-the-art methods, achieving the highest area under the receiver operating characteristic curve (0.8964), area under the precision-recall curve (0.8984), recall (0.9576), and F1-score (0.8285). The raw data are released to Zenodo with identifier 13929676. The source code of DMGAT is available at https://github.com/liutingyu0616/DMGAT/tree/main.
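Illustrative note: the Gaussian Interaction Profile (GIP) kernel similarity named in this abstract is commonly computed from the rows of a binary association matrix as K(i, j) = exp(-gamma * ||row_i - row_j||^2), with gamma set to a bandwidth parameter divided by the mean squared row norm. The sketch below follows that common definition; it is not taken from the DMGAT code, and the function name `gip_kernel`, the default `gamma_prime=1.0`, and the toy matrix are assumptions.

```python
import math


def gip_kernel(assoc, gamma_prime=1.0):
    """Gaussian Interaction Profile kernel between the rows of `assoc`.

    `assoc` is a binary association matrix (list of lists), for example
    ncRNAs in rows and drugs in columns. Each row is one interaction profile.
    """
    n = len(assoc)
    squared_norms = [sum(x * x for x in row) for row in assoc]
    mean_squared_norm = sum(squared_norms) / n
    if mean_squared_norm == 0.0:
        # No known associations at all: only self-similarity is defined.
        return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    gamma = gamma_prime / mean_squared_norm
    kernel = []
    for i in range(n):
        row = []
        for j in range(n):
            squared_distance = sum((a - b) ** 2
                                   for a, b in zip(assoc[i], assoc[j]))
            row.append(math.exp(-gamma * squared_distance))
        kernel.append(row)
    return kernel


# Toy matrix: three hypothetical ncRNAs against three hypothetical drugs.
toy_assoc = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
gip = gip_kernel(toy_assoc)
```

Rows with identical interaction profiles get similarity 1, and the similarity decays as profiles disagree on more drugs.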
Affiliation(s)
- Tingyu Liu
  - School of Medicine and Health, Harbin Institute of Technology, 150000, Nangang District, Xidazhi Street No. 90, Harbin, China
- Qiuhao Chen
  - Zhengzhou Research Institute, Harbin Institute of Technology, 150000, Nangang District, Xidazhi Street No. 90, Harbin, Heilongjiang, China
- Renjie Liu
  - Zhengzhou Research Institute, Harbin Institute of Technology, 150000, Nangang District, Xidazhi Street No. 90, Harbin, Heilongjiang, China
- Yuzhi Sun
  - School of Computer Science and Technology, Harbin Institute of Technology, 150000, Nangang District, Xidazhi Street No. 90, Harbin, Heilongjiang, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, 150000, Nangang District, Xidazhi Street No. 90, Harbin, Heilongjiang, China
- Yan Zhu
  - College of Veterinary Medicine, Northeast Agricultural University, 150038, Xiangfang District, Changjiang Road No. 600, Harbin, China
- Tianyi Zhao
  - School of Medicine and Health, Harbin Institute of Technology, 150000, Nangang District, Xidazhi Street No. 90, Harbin, China
  - Zhengzhou Research Institute, Harbin Institute of Technology, 150000, Nangang District, Xidazhi Street No. 90, Harbin, Heilongjiang, China
5
Wang Y, Cheng J. Reconstructing 3D chromosome structures from single-cell Hi-C data with SO(3)-equivariant graph neural networks. NAR Genom Bioinform 2025; 7:lqaf027. [PMID: 40124711 PMCID: PMC11928942 DOI: 10.1093/nargab/lqaf027] [Received: 10/13/2024] [Revised: 02/23/2025] [Accepted: 03/05/2025] [Indexed: 03/25/2025] Open
Abstract
The spatial conformation of chromosomes and genomes of single cells is relevant to cellular function and useful for elucidating the mechanism underlying gene expression and genome methylation. The chromosomal contacts (i.e. chromosomal regions in spatial proximity) entailing the three-dimensional (3D) structure of the genome of a single cell can be obtained by single-cell chromosome conformation capture techniques, such as single-cell Hi-C (ScHi-C). However, due to the sparsity of chromosomal contacts in ScHi-C data, it is still challenging for traditional 3D conformation optimization methods to reconstruct the 3D chromosome structures from ScHi-C data. Here, we present a machine learning method based on a novel SO(3)-equivariant graph neural network (HiCEGNN) to reconstruct the 3D structures of chromosomes of single cells from ScHi-C data. HiCEGNN consistently outperforms both the traditional optimization methods and the only other deep learning method across diverse cells, different structural resolutions, and different noise levels of the data. Moreover, HiCEGNN is robust against the noise in the ScHi-C data.
Affiliation(s)
- Yanli Wang
  - Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO 65211, United States
- Jianlin Cheng
  - Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO 65211, United States
6
Wang H, Zhao L, Yu Z, Zeng X, Shi S. CoNglyPred: Accurate Prediction of N-Linked Glycosylation Sites Using ESM-2 and Structural Features With Graph Network and Co-Attention. Proteomics 2025; 25:e202400210. [PMID: 39361250 DOI: 10.1002/pmic.202400210] [Received: 06/15/2024] [Revised: 08/17/2024] [Accepted: 09/20/2024] [Indexed: 03/18/2025]
Abstract
N-Linked glycosylation is crucial for various biological processes such as protein folding, immune response, and cellular transport. Traditional experimental methods for determining N-linked glycosylation sites entail substantial time and labor investment, which has led to the development of computational approaches as a more efficient alternative. However, due to the limited availability of 3D structural data, existing prediction methods often struggle to fully utilize structural information and fall short in integrating sequence and structural information effectively. Motivated by the progress of protein pretrained language models (pLMs) and the breakthrough in protein structure prediction, we introduced a high-accuracy model called CoNglyPred. Having compared various pLMs, we opt for the large-scale pLM ESM-2 to extract sequence embeddings, thus mitigating certain limitations associated with manual feature extraction. Meanwhile, our approach employs a graph transformer network to process the 3D protein structures predicted by AlphaFold2. The final graph output and ESM-2 embedding are intricately integrated through a co-attention mechanism. Among a series of comprehensive experiments on the independent test dataset, CoNglyPred outperforms state-of-the-art models and demonstrates exceptional performance in case study. In addition, we are the first to report the uncertainty of N-linked glycosylation predictors using expected calibration error and expected uncertainty calibration error.
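Illustrative note: expected calibration error (ECE), one of the two uncertainty measures named in this abstract, is usually computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence per bin, weighted by bin size. The sketch below follows that usual definition with equal-width bins; it is not the CoNglyPred evaluation code, and the function name, the bin boundary convention, and the toy inputs are assumptions. The expected uncertainty calibration error is not covered here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    `confidences` holds the predicted probability of the chosen class and
    `correct` holds booleans saying whether each prediction was right.
    A confidence of exactly 1.0 is placed in the last bin.
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # Empty bins contribute nothing.
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece


# Toy predictions: two confident and correct, two uncertain with one error.
toy_confidences = [0.95, 0.95, 0.55, 0.55]
toy_correct = [True, True, True, False]
toy_ece = expected_calibration_error(toy_confidences, toy_correct)
```

A perfectly calibrated predictor would have accuracy equal to mean confidence in every bin and therefore an ECE of zero.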
Affiliation(s)
- Hongmei Wang
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, China
- Long Zhao
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, China
- Ziyuan Yu
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, China
- Ximin Zeng
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, China
- Shaoping Shi
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, China
7
Cui Z, Wu Y, Zhang QH, Wang SG, Guo ZH. NPENN: A Noise Perturbation Ensemble Neural Network for Microbiome Disease Phenotype Prediction. IEEE J Biomed Health Inform 2025; 29:2210-2221. [PMID: 40030297 DOI: 10.1109/jbhi.2024.3507789] [Indexed: 03/08/2025]
Abstract
With advances in microbiomics, the crucial role of microbes in disease progression is increasingly recognized. However, predicting disease phenotypes using microbiome data remains challenging due to data complexity, heterogeneity, and limited model generalization. Current methods often depend on specific datasets and are vulnerable to adversarial attacks. To address these issues, this paper introduces a novel Noise Perturbation Ensemble Neural Network model (NPENN), which combines noise mechanisms with Gradient Boosting (GB) techniques for robust neural network ensemble learning. NPENN, validated on multiple microbiome datasets, shows superior accuracy and generalization compared to traditional methods, effectively handling data complexity and variability. This approach enhances model robustness and feature learning by integrating GB prior knowledge. Additionally, the study explores microbial community roles in various diseases, providing insights into disease mechanisms and potential biomarkers for personalized precision diagnosis and treatment strategies.
8
Beltrán JF, Herrera-Belén L, Yáñez AJ, Jimenez L. Prediction of viral oncoproteins through the combination of generative adversarial networks and machine learning techniques. Sci Rep 2024; 14:27108. [PMID: 39511292 PMCID: PMC11543823 DOI: 10.1038/s41598-024-77028-y] [Received: 07/07/2024] [Accepted: 10/18/2024] [Indexed: 11/15/2024] Open
Abstract
Viral oncoproteins play crucial roles in transforming normal cells into cancer cells, representing a significant factor in the etiology of various cancers. Traditionally, identifying these oncoproteins is both time-consuming and costly. With advancements in computational biology, bioinformatics tools based on machine learning have emerged as effective methods for predicting biological activities. Here, for the first time, we propose an innovative approach that combines Generative Adversarial Networks (GANs) with supervised learning methods to enhance the accuracy and generalizability of viral oncoprotein prediction. Our methodology evaluated multiple machine learning models, including Random Forest, Multilayer Perceptron, Light Gradient Boosting Machine, eXtreme Gradient Boosting, and Support Vector Machine. In ten-fold cross-validation on our training dataset, the GAN-enhanced Random Forest model demonstrated superior performance metrics: 0.976 accuracy, 0.976 F1 score, 0.977 precision, 0.976 sensitivity, and 1.0 AUC. During independent testing, this model achieved 0.982 accuracy, 0.982 F1 score, 0.982 precision, 0.982 sensitivity, and 1.0 AUC. These results establish our new tool, VirOncoTarget, accessible via a web application. We anticipate that VirOncoTarget will be a valuable resource for researchers, enabling rapid and reliable viral oncoprotein prediction and advancing our understanding of their role in cancer biology.
Affiliation(s)
- Jorge F Beltrán
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco, Chile.
- Lisandra Herrera-Belén
  - Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Temuco, Chile
- Alejandro J Yáñez
  - Departamento de Investigación y Desarrollo, Greenvolution SpA, Puerto Varas, Chile
  - Interdisciplinary Center for Aquaculture Research (INCAR), Concepcion, Chile
- Luis Jimenez
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco, Chile
9
Geng YQ, Lai FL, Luo H, Gao F. Nmix: a hybrid deep learning model for precise prediction of 2'-O-methylation sites based on multi-feature fusion and ensemble learning. Brief Bioinform 2024; 25:bbae601. [PMID: 39550226 PMCID: PMC11568878 DOI: 10.1093/bib/bbae601] [Received: 09/02/2024] [Revised: 10/12/2024] [Accepted: 11/04/2024] [Indexed: 11/18/2024] Open
Abstract
RNA 2'-O-methylation (Nm) is a crucial post-transcriptional modification with significant biological implications. However, experimental identification of Nm sites is challenging and resource-intensive. While multiple computational tools have been developed to identify Nm sites, their predictive performance, particularly in terms of precision and generalization capability, remains deficient. We introduced Nmix, an advanced computational tool for precise prediction of Nm sites in human RNA. We constructed the largest, low-redundancy dataset of experimentally verified Nm sites and employed an innovative multi-feature fusion approach, combining one-hot, Z-curve and RNA secondary structure encoding. Nmix utilizes a meticulously designed hybrid deep learning architecture, integrating 1D/2D convolutional neural networks, self-attention mechanism and residual connection. We implemented asymmetric loss function and Bayesian optimization-based ensemble learning, substantially improving predictive performance on imbalanced datasets. Rigorous testing on two benchmark datasets revealed that Nmix significantly outperforms existing state-of-the-art methods across various metrics, particularly in precision, with average improvements of 33.1% and 60.0%, and Matthews correlation coefficient, with average improvements of 24.7% and 51.1%. Notably, Nmix demonstrated exceptional cross-species generalization capability, accurately predicting 93.8% of experimentally verified Nm sites in rat RNA. We also developed a user-friendly web server (https://tubic.org/Nm) and provided standalone prediction scripts to facilitate widespread adoption. We hope that by providing a more accurate and robust tool for Nm site prediction, we can contribute to advancing our understanding of Nm mechanisms and potentially benefit the prediction of other RNA modification sites.
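Illustrative note: two of the sequence encodings named in this abstract, one-hot and Z-curve, can be sketched as below. The Z-curve is shown in its standard cumulative form (purine vs pyrimidine, amino vs keto, weak vs strong hydrogen bonding); whether Nmix uses this form or a frequency-based variant is not stated in the abstract, so treat the details, the function names, and the `ACGU` column order as assumptions.

```python
ALPHABET = "ACGU"  # Column order for the one-hot encoding (an assumption).


def normalize(seq):
    """Upper-case the sequence and write DNA-style T as RNA-style U."""
    return seq.upper().replace("T", "U")


def one_hot(seq):
    """One row per nucleotide; unknown symbols become an all-zero row."""
    return [[1 if base == ref else 0 for ref in ALPHABET]
            for base in normalize(seq)]


def z_curve(seq):
    """Cumulative Z-curve coordinates (x, y, z) after each nucleotide."""
    x = y = z = 0
    points = []
    for base in normalize(seq):
        if base not in ALPHABET:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        x += 1 if base in "AG" else -1  # purine (A, G) vs pyrimidine (C, U)
        y += 1 if base in "AC" else -1  # amino (A, C) vs keto (G, U)
        z += 1 if base in "AU" else -1  # weak (A, U) vs strong (G, C) bonds
        points.append((x, y, z))
    return points


encoded = one_hot("AC")
curve = z_curve("AGCU")
```

In a multi-feature fusion setting, encodings like these would typically be stacked or fed to separate network branches; that fusion step is not shown here.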
Affiliation(s)
- Yu-Qing Geng
  - Department of Physics, School of Science, Tianjin University, No. 92 Weijin Road, Nankai District, Tianjin 300072, China
- Fei-Liao Lai
  - Department of Physics, School of Science, Tianjin University, No. 92 Weijin Road, Nankai District, Tianjin 300072, China
- Hao Luo
  - Department of Physics, School of Science, Tianjin University, No. 92 Weijin Road, Nankai District, Tianjin 300072, China
- Feng Gao
  - Department of Physics, School of Science, Tianjin University, No. 92 Weijin Road, Nankai District, Tianjin 300072, China
  - Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, No. 92 Weijin Road, Nankai District, Tianjin 300072, China
  - SynBio Research Platform, Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), No. 92 Weijin Road, Nankai District, Tianjin 300072, China
10
Yu Q, Zhang Z, Liu G, Li W, Tang Y. ToxGIN: an In silico prediction model for peptide toxicity via graph isomorphism networks integrating peptide sequence and structure information. Brief Bioinform 2024; 25:bbae583. [PMID: 39530430 PMCID: PMC11555482 DOI: 10.1093/bib/bbae583] [Received: 08/14/2024] [Revised: 10/22/2024] [Accepted: 10/29/2024] [Indexed: 11/16/2024] Open
Abstract
Peptide drugs have demonstrated enormous potential in treating a variety of diseases, yet toxicity prediction remains a significant challenge in drug development. Existing models for prediction of peptide toxicity largely rely on sequence information and often neglect the three-dimensional (3D) structures of peptides. This study introduced a novel model for short peptide toxicity prediction, named ToxGIN. The model utilizes Graph Isomorphism Network (GIN), integrating the underlying amino acid sequence composition and the 3D structures of peptides. ToxGIN comprises three primary modules: (i) Sequence processing module, converting peptide 3D structures and sequences into information of nodes and edges; (ii) Feature extraction module, utilizing GIN to learn discriminative features from nodes and edges; (iii) Classification module, employing a fully connected classifier for toxicity prediction. ToxGIN performed well on the independent test set with F1 score = 0.83, AUROC = 0.91, and Matthews correlation coefficient = 0.68, better than existing models for prediction of peptide toxicity. These results validated the effectiveness of integrating 3D structural information with sequence data using GIN for peptide toxicity prediction. The proposed ToxGIN and data can be freely accessible at https://github.com/cihebiyql/ToxGIN.
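Illustrative note: the F1 score and Matthews correlation coefficient (MCC) reported in this abstract are both derived from the binary confusion matrix. The sketch below uses the standard formulas; the counts are made up for illustration and are not ToxGIN results, and the function name `f1_and_mcc` is hypothetical.

```python
import math


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return f1, mcc


# Made-up counts: 40 true positives, 10 false positives,
# 10 false negatives, 40 true negatives.
toy_f1, toy_mcc = f1_and_mcc(tp=40, fp=10, fn=10, tn=40)
```

Unlike F1, MCC also uses the true-negative count, which is why the two can diverge on imbalanced test sets.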
Affiliation(s)
- Qiule Yu
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Zhixing Zhang
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Guixia Liu
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Weihua Li
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Yun Tang
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
11
Rathore AS, Choudhury S, Arora A, Tijare P, Raghava GPS. ToxinPred 3.0: An improved method for predicting the toxicity of peptides. Comput Biol Med 2024; 179:108926. [PMID: 39038391 DOI: 10.1016/j.compbiomed.2024.108926] [Received: 08/24/2023] [Revised: 05/17/2024] [Accepted: 07/17/2024] [Indexed: 07/24/2024]
Abstract
Toxicity emerges as a prominent challenge in the design of therapeutic peptides, causing the failure of numerous peptides during clinical trials. In 2013, our group developed ToxinPred, a computational method that has been extensively adopted by the scientific community for predicting peptide toxicity. In this paper, we propose a refined variant of ToxinPred that shows improved reliability and accuracy in predicting peptide toxicity. Initially, we utilized a similarity/alignment-based approach employing BLAST to predict toxic peptides, which yielded satisfactory accuracy; however, the method suffered from inadequate coverage. Subsequently, we employed a motif-based approach using MERCI software to uncover specific patterns or motifs that are exclusively observed in toxic peptides. The search for these motifs in peptides allowed us to predict toxic peptides with a high level of specificity but poor sensitivity. To overcome the coverage limitations, we developed alignment-free methods using machine/deep learning techniques to balance sensitivity and specificity of prediction. A deep learning model (ANN - LSTM with fixed sequence length) developed using one-hot encoding achieved a maximum AUROC of 0.93 with an MCC of 0.71 on an independent dataset. A machine learning model (extra tree) developed using compositional features of peptides achieved a maximum AUROC of 0.95 with an MCC of 0.78. We also developed large language models and achieved a maximum AUC of 0.93 using ESM2-t33. Finally, we developed hybrid or ensemble methods combining two or more methods to enhance performance. Our specific hybrid method, which combines a motif-based approach with a machine learning-based model, achieved a maximum AUROC of 0.98 with an MCC of 0.81 on an independent dataset. In this study, all models were trained and tested on 80% of the data using five-fold cross-validation and evaluated on the remaining 20% of the data, called the independent dataset. The evaluation of all methods on the independent dataset revealed that the method proposed in this study exhibited better performance than existing methods. To cater to the needs of the scientific community, we have developed standalone software, a pip package and a web-based server, ToxinPred3 (https://github.com/raghavagps/toxinpred3 and https://webs.iiitd.edu.in/raghava/toxinpred3/).
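Illustrative note: the evaluation protocol described in this abstract (80% of the data for five-fold cross-validation, the remaining 20% held out as an independent dataset) can be sketched with index bookkeeping alone. This is a minimal sketch, not the ToxinPred 3.0 pipeline; the function name, the fixed seed, and the unstratified shuffle are assumptions, and a real study might stratify by class.

```python
import random


def split_and_folds(n_samples, test_fraction=0.2, n_folds=5, seed=0):
    """Return (independent_indices, [(train_indices, validation_indices), ...])."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # Reproducible, unstratified shuffle.
    n_test = int(round(n_samples * test_fraction))
    independent = indices[:n_test]
    development = indices[n_test:]
    # Deal the development indices into n_folds interleaved folds.
    folds = [development[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        splits.append((training, validation))
    return independent, splits


independent_set, cv_splits = split_and_folds(100)
```

The independent indices are set aside before any fold is built, so no held-out sample can leak into cross-validation.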
Affiliation(s)
- Anand Singh Rathore
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
- Shubham Choudhury
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
- Akanksha Arora
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
- Purva Tijare
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India.
Collapse
|
12
|
Nguyen VN, Ho TT, Doan TD, Le NQK. Using a hybrid neural network architecture for DNA sequence representation: A study on N4-methylcytosine sites. Comput Biol Med 2024; 178:108664. [PMID: 38875905 DOI: 10.1016/j.compbiomed.2024.108664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Revised: 05/11/2024] [Accepted: 05/26/2024] [Indexed: 06/16/2024]
Abstract
N4-methylcytosine (4mC) is a modified form of cytosine found in DNA, contributing to epigenetic regulation. It exists in various genomes, including the Rosaceae family encompassing significant fruit crops like apples, cherries, and roses. Previous investigations have examined the distribution and functional implications of 4mC sites within the Rosaceae genome, focusing on their potential roles in gene expression regulation, environmental adaptation, and evolution. This research aims to improve the accuracy of predicting 4mC sites within the genome of Fragaria vesca, a Rosaceae plant species. Building upon the original 4mc-w2vec method, which combines word embedding processing and a convolutional neural network (CNN), we have incorporated additional feature encoding techniques and leveraged pre-trained natural language processing (NLP) models with different deep learning architectures including different forms of CNN, recurrent neural networks (RNN) and long short-term memory (LSTM). Our assessments have shown that the best model is derived from a CNN model using fastText encoding. This model demonstrates enhanced performance, achieving a sensitivity of 0.909, specificity of 0.77, and accuracy of 0.879 on an independent dataset. Furthermore, our model surpasses previously published works on the same dataset, thus showcasing its superior predictive capabilities.
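Word-embedding approaches such as word2vec and fastText operate on tokens, so a DNA sequence first has to be split into "words". One common way to do this is with overlapping k-mers, sketched below. The abstract does not state the tokenization or the value of k used in the study, so the choice of k = 3, the function name `kmer_tokens`, and the example sequence are assumptions for illustration only.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA sequence into k-mers taken every `stride` bases."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


# Hypothetical 8-base fragment; real inputs would usually be longer windows
# around a candidate site.
tokens = kmer_tokens("ACGTCCGA", k=3)
print(tokens)
```

The resulting token list plays the role of a sentence, which is the form of input an embedding model expects before its vectors are passed to a CNN or recurrent network.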
Affiliation(s)
- Van-Nui Nguyen
  - University of Information and Communication Technology, Thai Nguyen University, Thai Nguyen, Viet Nam
- Trang-Thi Ho
  - Department of Computer Science and Information Engineering, TamKang University, New Taipei, 251301, Taiwan
- Thu-Dung Doan
  - International Degree Program in Animal Vaccine Technology, International College, National Pingtung University of Science and Technology, Pingtung, Taiwan
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, 110, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, 110, Taiwan; AIBioMed Research Group, Taipei Medical University, Taipei, 110, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, 110, Taiwan.
|
13
|
Zhou X, Yang J, Luo Y, Shen X. HNCGAT: a method for predicting plant metabolite-protein interaction using heterogeneous neighbor contrastive graph attention network. Brief Bioinform 2024; 25:bbae397. [PMID: 39162311 PMCID: PMC11730448 DOI: 10.1093/bib/bbae397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2024] [Revised: 07/15/2024] [Accepted: 07/27/2024] [Indexed: 08/21/2024] Open
Abstract
The prediction of metabolite-protein interactions (MPIs) plays an important role in understanding basic life functions in plants. Compared with the traditional experimental methods and the high-throughput genomics methods using statistical correlation, applying heterogeneous graph neural networks to the prediction of MPIs in plants can reduce the cost of manpower, resources, and time. However, to the best of our knowledge, applying heterogeneous graph neural networks to the prediction of MPIs in plants still remains under-explored. In this work, we propose a novel model named heterogeneous neighbor contrastive graph attention network (HNCGAT), for the prediction of MPIs in Arabidopsis. The HNCGAT employs a type-specific attention-based neighborhood aggregation mechanism to learn node embeddings of proteins, metabolites, and functional annotations, and designs a novel heterogeneous neighbor contrastive learning framework to preserve heterogeneous network topological structures. Extensive experimental results and an ablation study demonstrate the effectiveness of the HNCGAT model for MPI prediction. In addition, a case study on our MPI prediction results supports that the HNCGAT model can effectively predict potential MPIs in plants.
Affiliation(s)
- Xi Zhou
  - School of Tropical Agriculture and Forestry, Hainan University, 58 Renmin Avenue, Haikou 570228, Hainan, China
- Jing Yang
  - School of Tropical Agriculture and Forestry, Hainan University, 58 Renmin Avenue, Haikou 570228, Hainan, China
- Yin Luo
  - School of Tropical Agriculture and Forestry, Hainan University, 58 Renmin Avenue, Haikou 570228, Hainan, China
- Xiao Shen
  - School of Computer Science and Technology, Hainan University, 58 Renmin Avenue, Haikou 570228, Hainan, China
|
14
|
Arif R, Kanwal S, Ahmed S, Kabir M. A Computational Predictor for Accurate Identification of Tumor Homing Peptides by Integrating Sequential and Deep BiLSTM Features. Interdiscip Sci 2024; 16:503-518. [PMID: 38733473 DOI: 10.1007/s12539-024-00628-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 03/16/2024] [Accepted: 03/27/2024] [Indexed: 05/13/2024]
Abstract
Cancer remains a severe illness, and current research indicates that tumor homing peptides (THPs) play an important part in cancer therapy. The identification of THPs can provide crucial insights for the drug-discovery and pharmaceutical industries, as they allow for tailored medication delivery towards cancer cells. These peptides have a high affinity for particular receptors present on tumor surfaces, allowing for the creation of precision medications that reduce off-target consequences and enhance cancer patient treatment results. Wet-lab techniques are considered essential tools for studying THPs; however, they are labor-intensive and time-consuming, making the prediction of THPs a challenging task for researchers. Computational techniques, on the other hand, are considered significant tools for identifying THPs from sequence data. Although many strategies have been presented to predict new THPs, there is still a need to develop a robust method with higher rates of success. In this paper, we developed a novel framework, THP-DF, for accurately identifying THPs on a large scale. Firstly, the peptide sequences are encoded through various sequential features. Secondly, each feature is passed to BiLSTM and attention layers to extract simplified deep features. Finally, an ensemble framework is formed by integrating sequential and deep features, which are fed to a support vector machine, with 10-fold cross-validation used to validate its efficiency. The experimental results showed that THP-DF worked better on both [Formula: see text] and [Formula: see text] datasets by achieving an accuracy of > 95%, which is higher than that of existing predictors on both datasets. This indicates that the proposed predictor could be a beneficial tool to precisely and rapidly identify THPs and will contribute to cutting-edge cancer treatment strategies and pharmaceuticals.
Affiliation(s)
- Roha Arif
  - School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
- Sameera Kanwal
  - School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
- Saeed Ahmed
  - School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
- Muhammad Kabir
  - School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan.
|
15
|
Beltrán JF, Herrera-Belén L, Parraguez-Contreras F, Farías JG, Machuca-Sepúlveda J, Short S. MultiToxPred 1.0: a novel comprehensive tool for predicting 27 classes of protein toxins using an ensemble machine learning approach. BMC Bioinformatics 2024; 25:148. [PMID: 38609877 PMCID: PMC11010298 DOI: 10.1186/s12859-024-05748-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 03/14/2024] [Indexed: 04/14/2024] Open
Abstract
Protein toxins are defense mechanisms and adaptations found in various organisms and microorganisms, and their use in scientific research as therapeutic candidates is gaining relevance due to their effectiveness and specificity against cellular targets. However, discovering these toxins is time-consuming and expensive. In silico tools, particularly those based on machine learning and deep learning, have emerged as valuable resources to address this challenge. Existing tools primarily focus on binary classification, determining whether a protein is a toxin or not, and occasionally identifying specific types of toxins. For the first time, we propose a novel approach capable of classifying protein toxins into 27 distinct categories based on their mode of action within cells. To accomplish this, we assessed multiple machine learning techniques and found that an ensemble model incorporating the Light Gradient Boosting Machine and Quadratic Discriminant Analysis algorithms exhibited the best performance. During the tenfold cross-validation on the training dataset, our model exhibited notable metrics: 0.840 accuracy, 0.827 F1 score, 0.836 precision, 0.840 sensitivity, and 0.989 AUC. In the testing stage, using an independent dataset, the model achieved 0.846 accuracy, 0.838 F1 score, 0.847 precision, 0.849 sensitivity, and 0.991 AUC. These results present a powerful next-generation tool called MultiToxPred 1.0, accessible through a web application. We believe that MultiToxPred 1.0 has the potential to become an indispensable resource for researchers, facilitating the efficient identification of protein toxins. By leveraging this tool, scientists can accelerate their search for these toxins and advance their understanding of their therapeutic potential.
Affiliation(s)
- Jorge F Beltrán
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile.
- Lisandra Herrera-Belén
  - Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Temuco, Chile
- Fernanda Parraguez-Contreras
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Jorge G Farías
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Jorge Machuca-Sepúlveda
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Stefania Short
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
|
16
|
Lee TF, Chang CH, Shao JC, Liu YH, Chiu CL, Hsieh YW, Lee SH, Chao PJ, Yeh SA. Revolution of Medical Review: The Application of Meta-Analysis and Convolutional Neural Network-Natural Language Processing in Classifying the Literature for Head and Neck Cancer Radiotherapy. Cancer Control 2024; 31:10732748241286688. [PMID: 39323027 PMCID: PMC11439162 DOI: 10.1177/10732748241286688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 08/20/2024] [Accepted: 09/06/2024] [Indexed: 09/27/2024] Open
Abstract
This study explored the application of meta-analysis and convolutional neural network-natural language processing (CNN-NLP) technologies in classifying literature concerning radiotherapy for head and neck cancer. It aims to enhance both the efficiency and accuracy of literature reviews. By integrating statistical analysis with deep learning, this research successfully identified key studies related to normal tissue complication probability (NTCP) from a vast corpus of literature. This demonstrates the advantages of these technologies in recognizing professional terminology and extracting relevant information. The findings not only improve the quality of literature reviews but also offer new insights for future research on optimizing medical studies through AI technologies. Despite the challenges related to data quality and model generalization, this work provides clear directions for future research.
Affiliation(s)
- Tsair-Fwu Lee
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
  - Graduate Institute of Clinical Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
  - Department of Medical Imaging and Radiological Sciences, Kaohsiung Medical University, Kaohsiung, Taiwan
- Chu-Ho Chang
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Jen-Chung Shao
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Yen-Hsien Liu
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Chien-Liang Chiu
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Yang-Wei Hsieh
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Shen-Hao Lee
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Pei-Ju Chao
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
  - Department of Radiation Oncology, E-DA Hospital, Kaohsiung, Taiwan
- Shyh-An Yeh
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
  - Department of Medical Imaging and Radiological Sciences, I-Shou University, Kaohsiung, Taiwan
  - Department of Radiation Oncology, E-DA Hospital, Kaohsiung, Taiwan
|
17
|
Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023; 23:e2300011. [PMID: 37381841 DOI: 10.1002/pmic.202300011] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 05/14/2023] [Revised: 06/13/2023] [Accepted: 06/13/2023] [Indexed: 06/30/2023]
Abstract
In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer-based NLP models have gained significant attention for their ability to process variable-length input sequences in parallel, using self-attention mechanisms to capture long-range dependencies. In this review paper, we discuss the recent advancements in transformer-based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications to improve the accuracy and efficiency of various tasks. Additionally, we highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides valuable insights into the potential of transformer-based NLP models to revolutionize proteome bioinformatics.
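The self-attention mechanism mentioned in this abstract can be illustrated with a small scaled dot-product attention routine. This is a bare sketch in plain Python that assumes the queries, keys, and values are given directly; it omits the learned projection matrices, multiple heads, and positional encodings of a real transformer. The function names and the toy input vectors are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this position to every position, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Toy "sequence" of three 2-dimensional token vectors (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(x, x, x)
```

Every position attends to every other position in a single step regardless of how far apart they are, which is why the mechanism can capture long-range dependencies and why all positions can be processed in parallel.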
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, Taipei, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, Taiwan
|