1
Li N, Qiao J, Gao F, Wang Y, Shi H, Zhang Z, Cui F, Zhang L, Wei L. GICL: A Cross-Modal Drug Property Prediction Framework Based on Knowledge Enhancement of Large Language Models. J Chem Inf Model 2025. [PMID: 40432191 DOI: 10.1021/acs.jcim.5c00895] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2025]
Abstract
Deep learning models have demonstrated their potential in learning effective molecular representations critical for drug property prediction and drug discovery. Despite significant advancements in leveraging multimodal drug molecule semantics, existing approaches often struggle with challenges such as low-quality data and structural complexity. Large language models (LLMs) excel in generating high-quality molecular representations due to their robust characterization capabilities. In this work, we introduce GICL, a cross-modal contrastive learning framework that integrates LLM-derived embeddings with molecular image representations. Specifically, LLMs extract feature representations from the SMILES strings of drug molecules, which are then contrasted with graphical representations of molecular images to achieve a holistic understanding of molecular features. Experimental results demonstrate that GICL achieves state-of-the-art performance on the ADMET task while offering interpretable insights into drug properties, thereby facilitating more efficient drug design and discovery.
Affiliation(s)
- Na Li
  - School of Computer and Information Engineering, Qilu Institute of Technology, Jinan 250200, China
- Jianbo Qiao
  - School of Software, Shandong University, Jinan 250100, China
- Fei Gao
  - School of Computer and Information Engineering, Qilu Institute of Technology, Jinan 250200, China
- Yanling Wang
  - School of Computer and Information Engineering, Qilu Institute of Technology, Jinan 250200, China
- Hua Shi
  - School of Optoelectronic and Communication Engineering, Xiamen University of Technology, Xiamen 361005, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Lichao Zhang
  - School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen 518172, China
- Leyi Wei
  - Macao Polytechnic University, Faculty of Applied Science, Centre for Artificial Intelligence Driven Drug Discovery, Macau 999078, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250100, China
2
Li J, Xiong S, Shi H, Cui F, Zhang Z, Wei L. NeuroPred-AIMP: Multimodal Deep Learning for Neuropeptide Prediction via Protein Language Modeling and Temporal Convolutional Networks. J Chem Inf Model 2025; 65:4740-4750. [PMID: 40258183 DOI: 10.1021/acs.jcim.5c00444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/23/2025]
Abstract
Neuropeptides are key signaling molecules that regulate fundamental physiological processes ranging from metabolism to cognitive function. Accurate identification of neuropeptides is critical for advancing neurological disease therapeutics and peptide-based drug design, but it remains a major challenge because of sequence heterogeneity, obscured functional motifs, and limited experimentally validated data. Existing neuropeptide identification methods rely on manually engineered features combined with traditional machine learning methods, which struggle to capture deep sequence patterns. To address these limitations, we propose NeuroPred-AIMP (adaptive integrated multimodal predictor), an interpretable model that combines the global semantic representation of a protein language model (ESM) with the multiscale structural features of a temporal convolutional network (TCN). The model introduces a residual-enhanced adaptive feature fusion mechanism that dynamically recalibrates feature contributions to achieve robust integration of evolutionary and local sequence information. Experimental results show that the proposed model performs well on the independent test set, with an accuracy of 92.3% and an AUROC of 0.974. The model is also well balanced in identifying positive and negative samples, with a sensitivity of 92.6% and a specificity of 92.1%, a difference of less than 0.5%. These results confirm the effectiveness of the multimodal feature strategy for neuropeptide recognition.
Affiliation(s)
- Jinjin Li
  - Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao 999078, China
- Shuwen Xiong
  - Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao 999078, China
- Hua Shi
  - School of Optoelectronic and Communication Engineering, Xiamen University of Technology, Xiamen 361024, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Leyi Wei
  - Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao 999078, China
  - School of Software, Shandong University, Jinan 250101, China
3
Xiao Z, Li Y, Ding Y, Yu L. EPIPDLF: a pretrained deep learning framework for predicting enhancer-promoter interactions. Bioinformatics 2025; 41:btae716. [PMID: 40036975 PMCID: PMC12057809 DOI: 10.1093/bioinformatics/btae716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2024] [Revised: 11/04/2024] [Accepted: 02/26/2025] [Indexed: 03/06/2025] Open
Abstract
MOTIVATION: Enhancers and promoters, as regulatory DNA elements, play pivotal roles in gene expression, homeostasis, and disease development across various biological processes. Research has shown that distal enhancers may engage with nearby promoters to modulate the expression of target genes, a finding with significant implications for understanding various biological mechanisms. In recent years, numerous high-throughput wet-lab techniques have been developed to detect possible interactions between enhancers and promoters. However, these experimental methods are often time-intensive and costly.
RESULTS: To tackle this issue, we developed a deep learning approach, EPIPDLF, which predicts enhancer-promoter interactions (EPIs) from genomic sequences alone in an interpretable manner. Comparative evaluations across six benchmark datasets demonstrate that EPIPDLF consistently exhibits superior performance in EPI prediction. Additionally, by incorporating interpretable analysis mechanisms, our model enables the elucidation of learned features, aiding in the identification and biological analysis of important sequences.
AVAILABILITY AND IMPLEMENTATION: The source code and data are available at: https://github.com/xzc196/EPIPDLF.
Affiliation(s)
- Zhichao Xiao
  - School of Computer Science and Technology, Xidian University, Xi'an 710075, China
- Yan Li
  - School of Management, Xi'an Polytechnic University, Xi'an 710075, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an 710075, China
4
Zhang R, Zhang X, Zhao S, Zou Q, Ding Y, Guo X, Wu H. Beyond ST-246: Unveiling Potential Inhibitors Targeting VP37 Protein in Silico From Herb and Marine Databases. J Comput Chem 2025; 46:e70111. [PMID: 40271912 DOI: 10.1002/jcc.70111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2025] [Revised: 03/28/2025] [Accepted: 04/06/2025] [Indexed: 04/25/2025]
Abstract
To identify structurally novel inhibitors for treating monkeypox virus (MPXV), we targeted the VP37 protein, which is bioactive in response to ST-246, in search of pharmaceutical molecules specifically tailored to combat the virus. We employed structure-based semi-flexible molecular docking, molecular dynamics simulation, and ADME screening to screen compounds from CMNPD and TCM in silico. These methods allowed us to find potential candidates based on their binding values and interactions with the binding site of main protease. To further evaluate the stability of these interactions, we conducted molecular dynamics simulations and calculated binding energies. Using binding energy calculations, comparative analyses, and molecular dynamics simulations for activity computations, the six top hits were validated as five kinds of good inhibitors that surpass the reference compound ST-246, making them better candidates for in vitro testing against MPXV.
Affiliation(s)
- Runhua Zhang
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Xin Zhang
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, China
- Shulin Zhao
  - The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Xiaoyi Guo
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Hongjie Wu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
  - Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, China
5
Yu D, Yang X, Shang Y, Yuan S, Liu Y, Liu Y. Drug-target interaction prediction based on metapaths and simplified neighbor aggregation. Methods 2025; 240:154-164. [PMID: 40288620 DOI: 10.1016/j.ymeth.2025.04.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2025] [Revised: 04/02/2025] [Accepted: 04/20/2025] [Indexed: 04/29/2025] Open
Abstract
Drug-target interaction (DTI) prediction is critical in drug repositioning and discovery. In current metapath-based prediction methods, attention mechanisms are often used to differentiate the importance of various neighbors, enhancing the model's expressiveness. However, in biological networks with small-scale imbalanced data, attention mechanisms are prone to interference from noise and missing data, leading to instability in weight learning, reduced efficiency, and an increased risk of overfitting. To address these issues, we propose SNADTI, a simplified mean aggregation method for DTI prediction. By using average aggregation, the approach reduces noise interference, lowers model complexity, and helps prevent overfitting, making it especially suitable for current biological networks. Extensive testing on three heterogeneous biological datasets shows that SNADTI outperforms 12 leading methods across two evaluation metrics while significantly reducing training time. Complexity analysis reveals that our method offers a substantial computational speed advantage over other methods on the same dataset. Experimental results demonstrate that SNADTI excels in prediction accuracy, stability, and reproducibility, confirming its practicality and effectiveness in DTI prediction.
Affiliation(s)
- Di Yu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, Hunan, China
- Xinyu Yang
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, Hunan, China
- Yifan Shang
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, Hunan, China; Department of Biomedical Engineering, The Chinese University of Hong Kong, 999077, Hong Kong, China.
- Sisi Yuan
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, 28223, NC, USA
- Yuansheng Liu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, Hunan, China
- Yiping Liu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, Hunan, China
6
Cao C, Li M, Wang C, Xu L, Zou Q, Wang Y, Han W. DGCLCMI: a deep graph collaboration learning method to predict circRNA-miRNA interactions. BMC Biol 2025; 23:104. [PMID: 40264118 PMCID: PMC12016396 DOI: 10.1186/s12915-025-02197-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2025] [Accepted: 03/25/2025] [Indexed: 04/24/2025] Open
Abstract
BACKGROUND: Numerous studies have shown that circRNA can act as a miRNA sponge, competitively binding to miRNAs and thereby regulating gene expression and disease progression. Because traditional wet-lab experiments are costly and time-consuming, analyzing circRNA-miRNA associations is often inefficient and labor-intensive. Although some computational models have been developed to identify these associations, they fail to capture the deep collaborative features between circRNA and miRNA interactions and do not guide the training of feature extraction networks based on these high-order relationships, leading to poor prediction performance.
RESULTS: To address these issues, we propose a novel deep graph collaboration learning method for circRNA-miRNA interaction, called DGCLCMI. First, it uses word2vec to encode sequences into word embeddings. Next, we present a joint model that combines an improved neural graph collaborative filtering method with a feature extraction network for optimization. Deep interaction information is embedded as informative features within the sequence representations for prediction. Comprehensive experiments on three well-established datasets across seven metrics demonstrate that our algorithm significantly outperforms previous models, achieving an average AUC of 0.960. In addition, a case study reveals that 18 out of 20 predicted unknown CMI data points are accurate.
CONCLUSIONS: DGCLCMI improves circRNA and miRNA feature representation by capturing deep collaborative information, achieving superior performance compared to prior methods. It facilitates the discovery of unknown associations and sheds light on their roles in physiological processes.
Affiliation(s)
- Chao Cao
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, 611731, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, 324003, China
- Mengli Li
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, 324003, China
- Chunyu Wang
  - Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, 150001, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic University, Shenzhen, Guangdong, 518055, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, 611731, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, 324003, China
- Yansu Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, 611731, China
- Wu Han
  - Department of Statistics, Stanford University, Stanford, CA, 94043, USA.
7
Sheng N, Qiao J, Wei L, Shi H, Guo H, Yang C. Computational models for prediction of m6A sites using deep learning. Methods 2025; 240:113-124. [PMID: 40268153 DOI: 10.1016/j.ymeth.2025.04.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 04/02/2025] [Accepted: 04/07/2025] [Indexed: 04/25/2025] Open
Abstract
RNA modifications play a crucial role in enhancing the structural and functional diversity of RNA molecules and regulating various stages of the RNA life cycle. Among these modifications, N6-Methyladenosine (m6A) is the most common internal modification in eukaryotic mRNAs and has been extensively studied over the past decade. Accurate identification of m6A modification sites is essential for understanding their function and underlying mechanisms. Traditional methods predominantly rely on machine learning techniques to recognize m6A sites, which often fail to capture the contextual features of these sites comprehensively. In this study, we comprehensively summarize previously published methods based on machine learning and deep learning. We also validate multiple deep learning approaches on a benchmark dataset, including methods previously underutilized in m6A site prediction, pre-trained models specifically designed for biological sequences, and other basic deep learning methods. Additionally, we analyze the dataset features and interpret the model's predictions to enhance understanding. Our experimental results clearly demonstrate the effectiveness of the deep learning models and their strong potential for accurately recognizing m6A modification sites.
Affiliation(s)
- Nan Sheng
  - School of Software, Shandong University, Jinan 250101, PR China
- Jianbo Qiao
  - School of Software, Shandong University, Jinan 250101, PR China
- Leyi Wei
  - School of Software, Shandong University, Jinan 250101, PR China
- Hua Shi
  - School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, PR China
- Huannan Guo
  - Beidahuang Industry Group General Hospital, PR China.
- Changshun Yang
  - Department of Gastrointestinal Surgery, Fuzhou University Affiliated Provincial Hospital, Fuzhou 350004, PR China.
8
Xiong S, Cai J, Shi H, Cui F, Zhang Z, Wei L. UMPPI: Unveiling Multilevel Protein-Peptide Interaction Prediction via Language Models. J Chem Inf Model 2025; 65:3789-3799. [PMID: 40077987 DOI: 10.1021/acs.jcim.4c02365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/14/2025]
Abstract
Protein-peptide interactions are essential to cellular processes and disease mechanisms. Identifying protein-peptide binding residues is critical for understanding peptide function and advancing drug discovery. However, experimental methods are costly and time-intensive, while existing computational approaches often predict interactions or binding residues separately, lack effective feature integration, or rely heavily on limited high-quality structural data. To address these challenges, we propose UMPPI (Unveiling Multilevel Protein-Peptide Interaction), a multiobjective framework based on the pretrained protein language model ESM2. UMPPI simultaneously predicts binary protein-peptide interactions and binding residues on both peptides and proteins through a multiobjective optimization strategy. By integrating ESM2 to encode sequences and extract latent structural information, UMPPI bridges the gap between sequence-based and structure-based methods. Extensive experiments demonstrated that UMPPI successfully captured binary interactions between peptides and proteins and identified the binding residues on peptides and proteins. UMPPI can serve as a useful tool for protein-peptide interaction prediction and identification of critical binding residues, thereby facilitating the peptide drug discovery process.
Affiliation(s)
- Shuwen Xiong
  - Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao 999078, China
- Jiajie Cai
  - School of Software, Shandong University, Jinan 250101, China
- Hua Shi
  - School of Optoelectronic and Communication Engineering, Xiamen University of Technology, Xiamen 361005, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Leyi Wei
  - Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao 999078, China
  - School of Software, Shandong University, Jinan 250101, China
9
Lai H, Luo D, Yang M, Zhu T, Yang H, Luo X, Wei Y, Xie S, Hong F, Shu K, Dao F, Ding H. PBertKla: a protein large language model for predicting human lysine lactylation sites. BMC Biol 2025; 23:95. [PMID: 40189537 PMCID: PMC11974188 DOI: 10.1186/s12915-025-02202-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2025] [Accepted: 03/31/2025] [Indexed: 04/09/2025] Open
Abstract
BACKGROUND: Lactylation is a newly discovered type of post-translational modification, primarily occurring on lysine (K) residues of both histones and non-histones to exert diverse effects on target proteins. Research has shown that lysine lactylation (Kla) modification is ubiquitous in different cells and participates in the determination of cell function and fate, as well as in the initiation and progression of various diseases. Precise identification of Kla sites is fundamental for elucidating their biological functions and uncovering their application potential.
RESULTS: Here, we propose a novel human Kla site predictor (named PBertKla), built by curating a reliable benchmark dataset with proper sample length and sequence identity threshold and using it to train a protein large language model with optimal hyperparameters. Extensive experimental results consistently demonstrated that our model possesses robust human Kla site prediction ability, achieving an AUC (area under the receiver operating characteristic curve) value of over 0.880 on the independent validation data. Feature visualization analysis further validated the effectiveness of PBertKla in feature learning and representation from Kla sequences. Moreover, we benchmarked PBertKla against other cutting-edge models on an independent testing dataset from different sources, highlighting its superiority and transferability.
CONCLUSIONS: All results indicate that PBertKla excels as an automatic predictor of human Kla sites, and that it can advance the investigation of lactylation modifications and their significance in health and disease.
Affiliation(s)
- Hongyan Lai
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Diyu Luo
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Mi Yang
  - Clinical Hospital of Chengdu Brain Science Institute, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Tao Zhu
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Huan Yang
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, 324000, China
- Xinwei Luo
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Yijie Wei
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Sijia Xie
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Feitong Hong
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Kunxian Shu
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
- Fuying Dao
  - School of Biological Sciences, Nanyang Technological University, Singapore, 639798, Singapore.
- Hui Ding
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China.
10
Meng C, Pei Y, Bu Y, Liu Q, Li Q, Zou Q, Zhang Y. IIFS2.0: An Improved Incremental Feature Selection Method for Protein Sequence Processing Based on a Caching Strategy. J Mol Biol 2025; 437:168741. [PMID: 39122168 DOI: 10.1016/j.jmb.2024.168741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 07/08/2024] [Accepted: 08/05/2024] [Indexed: 08/12/2024]
Abstract
The purpose of feature selection in protein sequence recognition problems is to select the optimal feature set and use it as training input for classifiers and discover key sequence features of specific proteins. In the feature selection process, relevant features associated with the target task will be retained, and irrelevant and redundant features will be removed. Therefore, in an ideal state, a feature combination with smaller feature dimensions and higher performance indicators is desired. This paper proposes an algorithm called IIFS2.0 based on the cache elimination strategy, which takes the local optimal combination of cached feature subsets as a breakthrough point. It searches for a new feature combination method through the cache elimination strategy to avoid the drawbacks of human factors and excessive reliance on feature sorting results. We validated and analyzed its effectiveness on the protein dataset, demonstrating that IIFS2.0 significantly reduces the dimensionality of feature combinations while also improving various evaluation indicators. In addition, we provide IIFS2.0 on https://112.124.26.17:8006/ for researchers to use.
Affiliation(s)
- Chaolu Meng
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, China
- Yue Pei
  - Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yongbo Bu
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, China
- Qing Liu
  - Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China
- Qun Li
  - Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, China; Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China.
- Ying Zhang
  - Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China; Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China.
11
Lu R, Qiao J, Li K, Zhao Y, Jin J, Cui F, Zhang Z, Manavalan B, Wei L. ERNIE-ac4C: A Novel Deep Learning Model for Effectively Predicting N4-acetylcytidine Sites. J Mol Biol 2025; 437:168978. [PMID: 39900287 DOI: 10.1016/j.jmb.2025.168978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Revised: 01/05/2025] [Accepted: 01/28/2025] [Indexed: 02/05/2025]
Abstract
RNA modifications are known to play a critical role in gene regulation and cellular processes. Specifically, N4-acetylcytidine (ac4C) modification has emerged as a significant marker involved in mRNA translation efficiency, stability, and various diseases. Accurate identification of ac4C modification sites is essential for unraveling its functional implications. However, currently available experimental methods suffer from drawbacks such as lengthy detection times, complexity, and high costs, resulting in low efficiency and accuracy in prediction. Although several bioinformatics methods have been proposed and have advanced the prediction of ac4C modification sites, there is still ample room for improvement. In this research, we propose a novel deep learning model, ERNIE-ac4C, which combines the ERNIE-RNA language model and a two-dimensional Convolutional Neural Network (CNN). ERNIE-ac4C utilizes the fusion of sequence features and attention map features to predict ac4C modification sites. ERNIE-ac4C surpasses other state-of-the-art deep learning methods, demonstrating superior accuracy and effectiveness. The availability of the code on GitHub (https://github.com/lrlbcxdd/ERNIEac4C.git) and our openness to feedback from the research community contribute to the model's accessibility and its potential for further advancements. Our study provides valuable insights into ac4C research and enhances our understanding of the functional consequences of RNA modifications.
Collapse
Affiliation(s)
- Ronglin Lu
- School of Software, Shandong University, Jinan 250101 China
| | - Jianbo Qiao
- School of Software, Shandong University, Jinan 250101 China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101 China
| | - Kefei Li
- School of Software, Shandong University, Jinan 250101 China
| | - Yanxi Zhao
- School of Software, Shandong University, Jinan 250101 China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101 China
| | - Junru Jin
- School of Software, Shandong University, Jinan 250101 China
| | - Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou 570228 China
| | - Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228 China
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419 Gyeonggi-do, Republic of Korea.
| | - Leyi Wei
- Centre for Artificial Intelligence driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao SAR, China.
| |
|
12
|
Zhang HQ, Arif M, Thafar MA, Albaradei S, Cai P, Zhang Y, Tang H, Lin H. PMPred-AE: a computational model for the detection and interpretation of pathological myopia based on artificial intelligence. Front Med (Lausanne) 2025; 12:1529335. [PMID: 40182849 PMCID: PMC11965940 DOI: 10.3389/fmed.2025.1529335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2024] [Accepted: 02/27/2025] [Indexed: 04/05/2025] Open
Abstract
Introduction: Pathological myopia (PM) is a serious visual impairment that may lead to irreversible visual damage or even blindness. Timely diagnosis and effective management of PM are of great significance. Given the increasing number of myopia cases worldwide, there is an urgent need to develop an automated, accurate, and highly interpretable PM diagnostic technology. Methods: We proposed a computational model called PMPred-AE based on EfficientNetV2-L with attention mechanism optimization. In addition, gradient-weighted class activation mapping (Grad-CAM) was used to provide an intuitive, visual interpretation of the model's decision-making process. Results: The experimental results demonstrated that PMPred-AE achieved excellent performance in automatically detecting PM, with accuracies of 98.50%, 98.25%, and 97.25% on the training, validation, and test datasets, respectively. In addition, PMPred-AE can focus on specific areas of a PM image when making detection decisions. Discussion: The developed PMPred-AE model is capable of reliably providing accurate PM detection, and the Grad-CAM visualizations give an intuitive interpretation of its decision-making process. This approach provides healthcare professionals with an effective tool for interpretable AI decision-making.
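The Grad-CAM step mentioned in the abstract can be illustrated with a small sketch. It assumes the standard Grad-CAM formulation (each feature map is weighted by the spatial average of its gradient, the weighted maps are summed, and a ReLU is applied); the function name `grad_cam` and the toy inputs are hypothetical, and this is not the PMPred-AE implementation.

```python
def grad_cam(activations, gradients):
    """Standard Grad-CAM heatmap from K feature maps and their gradients.

    activations, gradients: lists of K maps, each an H x W nested list.
    In a real model both would come from the last convolutional layer;
    here they are plain Python lists for illustration only.
    """
    heatmap = None
    for fmap, grad in zip(activations, gradients):
        h, w = len(fmap), len(fmap[0])
        # Channel weight: global average pooling of the gradient map.
        weight = sum(sum(row) for row in grad) / (h * w)
        if heatmap is None:
            heatmap = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU keeps only regions that push the class score up.
    return [[max(0.0, v) for v in row] for row in heatmap]


# Toy input (hypothetical values): two 2 x 2 feature maps.
acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
print(grad_cam(acts, grads))
```

In practice the heatmap would be upsampled to the input image size and overlaid on the fundus image; that step is omitted here.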
Affiliation(s)
- Hong-Qi Zhang
- The Clinical Hospital of Chengdu Brain Science Institute, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | - Maha A. Thafar
- Computer Science Department, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
| | - Somayah Albaradei
- Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Peiling Cai
- School of Basic Medical Sciences, Chengdu University, Chengdu, China
| | - Yang Zhang
- Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Hua Tang
- School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Central Nervous System Drug Key Laboratory of Sichuan Province, Luzhou, China
| | - Hao Lin
- The Clinical Hospital of Chengdu Brain Science Institute, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
13
|
Zuo Y, Fang X, Chen J, Ji J, Li Y, Wu Z, Liu X, Zeng X, Deng Z, Yin H, Zhao A. MlyPredCSED: based on extreme point deviation compensated clustering combined with cross-scale convolutional neural networks to predict multiple lysine sites in human. Brief Bioinform 2025; 26:bbaf189. [PMID: 40285360 PMCID: PMC12031725 DOI: 10.1093/bib/bbaf189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 03/27/2025] [Accepted: 04/03/2025] [Indexed: 04/29/2025] Open
Abstract
In post-translational modification, covalent bonds on lysine and attached chemical groups significantly change proteins' physical and chemical properties. They shape protein structures, enhance function and stability, and are vital for physiological processes, affecting health and disease through mechanisms like gene expression, signal transduction, protein degradation, and cell metabolism. Although lysine (K) modification sites are considered among the most common types of post-translational modifications in proteins, research on K-PTMs has largely overlooked the synergistic effects between different modifications and lacked the techniques to address the problem of sample imbalance. Based on this, the Extreme Point Deviation Compensated Clustering (EPDCC) Undersampling algorithm was proposed in this study and combined with Cross-Scale Convolutional Neural Networks (CSCNNs) to develop a novel computational tool, MlyPredCSED, for simultaneously predicting multiple lysine modification sites. MlyPredCSED employs Multi-Label Position-Specific Triad Amino Acid Propensity and the physicochemical properties of amino acids to enhance the richness of sequence information. To address the challenge of sample imbalance, the innovative EPDCC Undersampling technique was introduced to adjust the majority class samples. The model's training and testing phase relies on the advanced CSCNN framework. MlyPredCSED, through cross-validation and testing, outperformed existing models, especially in complex categories with multiple modification sites. This research not only provides an efficient method for the identification of lysine modification sites but also demonstrates its value in biological research and drug development. To facilitate efficient use of MlyPredCSED by researchers, we have specifically developed an accessible free web tool: http://www.mlypredcsed.com.
Affiliation(s)
- Yun Zuo
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Xingze Fang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Jiankang Chen
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Jiayi Ji
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Yuwen Li
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Zeyu Wu
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Xiangrong Liu
- Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen 361005, China
| | - Xiangxiang Zeng
- School of Information Science and Engineering, Hunan University, Changsha, China
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Hongwei Yin
- Department of Oncology, The First Affiliated Hospital of Naval Military Medical University, Shanghai 200000, China
| | - Anjing Zhao
- Department of Oncology, The First Affiliated Hospital of Naval Military Medical University, Shanghai 200000, China
| |
|
14
|
da Silva JR, Castro-Amorim J, Mukherjee AK, Ramos MJ, Fernandes PA. The application of snake venom in anticancer drug discovery: an overview of the latest developments. Expert Opin Drug Discov 2025:1-19. [PMID: 40012249 DOI: 10.1080/17460441.2025.2465364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2024] [Accepted: 02/07/2025] [Indexed: 02/28/2025]
Abstract
INTRODUCTION: Snake venom is a rich source of toxins with great potential for therapeutic applications. In addition to its efficacy in treating hypertension, acute coronary syndrome, and other heart conditions, research has shown that this potent enzymatic cocktail is capable of selectively targeting and destroying cancer cells in many cases while sparing healthy cells. AREAS COVERED: The authors begin by acknowledging the emerging trends in snake-derived targeted therapies in battling cancer. They then present an extensive literature review examining the effects of various snake venom toxins on cancer cell lines and highlighting the specific cancer hallmarks each toxin targets. Furthermore, the authors emphasize the emerging potential of artificial intelligence in accelerating snake venom-based drug discovery for cancer treatment, showcasing several innovative software applications in this field. EXPERT OPINION: Research on snake venom toxins indicates promising potential for cancer treatment, as many of the discussed toxins can specifically target cancer cells. Nevertheless, variations in the composition of venoms, ethical issues, and delivery barriers limit their development into effective therapies. Thus, advances in biotechnology, molecular engineering, and in silico methods are crucial for refining venom-derived compounds, improving their specificity, and overcoming these challenges, ultimately enhancing their therapeutic potential in cancer therapy.
Affiliation(s)
- Joana R da Silva
- LAQV, REQUIMTE, Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
| | - Juliana Castro-Amorim
- LAQV, REQUIMTE, Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
| | - Ashis K Mukherjee
- Vigyan Path Garchuk, Paschim Boragaon institution, Institute of Advanced Study in Science and Technology, Guwahati, India
| | - Maria João Ramos
- LAQV, REQUIMTE, Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
| | - Pedro A Fernandes
- LAQV, REQUIMTE, Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
| |
|
15
|
Li R, Yu J, Ye D, Liu S, Zhang H, Lin H, Feng J, Deng K. Conotoxins: Classification, Prediction, and Future Directions in Bioinformatics. Toxins (Basel) 2025; 17:78. [PMID: 39998095 PMCID: PMC11860864 DOI: 10.3390/toxins17020078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2024] [Revised: 01/25/2025] [Accepted: 02/07/2025] [Indexed: 02/26/2025] Open
Abstract
Conotoxins, a diverse family of disulfide-rich peptides derived from the venom of Conus species, have gained prominence in biomedical research due to their highly specific interactions with ion channels, receptors, and neurotransmitter systems. Their pharmacological properties make them valuable molecular tools and promising candidates for therapeutic development. However, traditional conotoxin classification and functional characterization remain labor-intensive, necessitating the increasing adoption of computational approaches. In particular, machine learning (ML) techniques have facilitated advancements in sequence-based classification, functional prediction, and de novo peptide design. This review explores recent progress in applying ML and deep learning (DL) to conotoxin research, comparing key databases, feature extraction techniques, and classification models. Additionally, we discuss future research directions, emphasizing the integration of multimodal data and the refinement of predictive frameworks to enhance therapeutic discovery.
Affiliation(s)
| | | | | | | | | | | | | | - Kejun Deng
- The Clinical Hospital of Chengdu Brain Science Institute, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (R.L.); (J.Y.); (D.Y.); (S.L.); (H.Z.); (H.L.); (J.F.)
| |
|
16
|
Wu CY, Xu ZX, Li N, Qi DY, Wu HY, Ding H, Jin YT. Predicting cyclins based on key features and machine learning methods. Methods 2025; 234:112-119. [PMID: 39694304 DOI: 10.1016/j.ymeth.2024.12.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2024] [Revised: 12/12/2024] [Accepted: 12/15/2024] [Indexed: 12/20/2024] Open
Abstract
Cyclins are a group of proteins that regulate the cell cycle process by modulating various stages of cell division to ensure correct cell proliferation, differentiation, and apoptosis. Research on cyclins is crucial for understanding the biological functions and pathological states of cells. However, current research on cyclin identification based on machine learning focuses only on accuracy, ignoring the interpretability of features. Therefore, in this study, we pay more attention to the interpretation and analysis of key features associated with cyclins. Firstly, we developed an SVM-based model for identifying cyclins with an accuracy of 92.8% through 5-fold cross-validation. Then we analyzed the physicochemical properties of the 14 key features used in the model construction and identified the G and charged C1 features as critical for distinguishing cyclins from non-cyclins. Furthermore, we constructed an SVM-based model using only these two features, with an accuracy of 81.3% through leave-one-out cross-validation. Our study shows that cyclins differ from non-cyclins in their physicochemical properties and that using only two features can achieve good prediction accuracy.
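The two evaluation schemes named in the abstract differ only in how samples are partitioned. The sketch below generates index splits for k-fold and leave-one-out cross-validation; the helper names, the shuffling seed, and the sample counts are assumptions for illustration, and the SVM training itself is left out.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def leave_one_out_indices(n_samples):
    """Leave-one-out is k-fold with k equal to the number of samples."""
    return k_fold_indices(n_samples, n_samples)


# Hypothetical dataset of 10 samples.
for train, test in k_fold_indices(10, 5):
    print(len(train), len(test))
```

Each sample is tested exactly once under both schemes; leave-one-out simply trains `n_samples` models, which is affordable for small datasets and a two-feature classifier.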
Affiliation(s)
- Cheng-Yan Wu
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teachers College, Baotou 014010, China.
| | - Zhi-Xue Xu
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teachers College, Baotou 014010, China.
| | - Nan Li
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teachers College, Baotou 014010, China.
| | - Dan-Yang Qi
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teachers College, Baotou 014010, China.
| | - Hong-Ye Wu
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teachers College, Baotou 014010, China.
| | - Hui Ding
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
| | - Yan-Ting Jin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
| |
|
17
|
Basith S, Manavalan B, Lee G. AntiT2DMP-Pred: Leveraging feature fusion and optimization for superior machine learning prediction of type 2 diabetes mellitus. Methods 2025; 234:264-274. [PMID: 39798942 DOI: 10.1016/j.ymeth.2025.01.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2024] [Revised: 12/26/2024] [Accepted: 01/04/2025] [Indexed: 01/15/2025] Open
Abstract
Pancreatic α-amylase breaks down starch into isomaltose and maltose, which are further hydrolyzed by α-glucosidase in the intestine into monosaccharides, rapidly raising blood sugar levels and contributing to type 2 diabetes mellitus (T2DM). Synthetic inhibitors of carbohydrate-digesting enzymes are used to manage T2DM but may harm organ function over time. Bioactive peptides offer a safer alternative, avoiding such adverse effects. Computational methods for predicting antidiabetic peptides (ADPs) can significantly reduce the time and cost of experimental testing. While machine learning (ML) has been applied to identify ADPs, advancements in data analysis and algorithms continue to drive progress in the field. To address this, we developed AntiT2DMP-Pred, the first ML-based tool specifically designed for predicting type 2 antidiabetic peptides (T2ADPs). This tool employs a feature fusion strategy, combining ten highly discriminative feature descriptors chosen from a pool of 32 descriptors and eight ML algorithms, tested across a range of baseline models. AntiT2DMP-Pred demonstrated excellent performance, surpassing both baseline and feature-optimized models, with an accuracy (ACC) and Matthews' correlation coefficient (MCC) of 0.976 and 0.953 on the training dataset, and an ACC and MCC of 0.957 and 0.851 on the independent dataset. The web server (https://balalab-skku.org/AntiT2DMP-Pred) is freely accessible, enabling researchers worldwide to utilize it in their experimental workflows and contribute to the discovery and understanding of T2ADPs, ultimately supporting peptide-based therapeutic development for diabetes management.
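The two metrics quoted above follow directly from the confusion matrix. As a minimal sketch, the functions below compute accuracy and Matthews' correlation coefficient from true/false positive and negative counts; the counts in the usage line are hypothetical and are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews' correlation coefficient; 0.0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion counts for illustration only.
tp, tn, fp, fn = 45, 48, 2, 5
print(accuracy(tp, tn, fp, fn), mcc(tp, tn, fp, fn))
```

Because MCC uses all four cells of the confusion matrix, it can sit well below accuracy when the classes are imbalanced, which is one common reason both values are reported together.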
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon 16499 Republic of Korea.
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419 Republic of Korea.
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon 16499 Republic of Korea; Department of Molecular Science and Technology, Ajou University, Suwon 16499 Republic of Korea.
| |
|
18
|
Yang C, Hu X, Feng Z, Hao S, Zhang G, Chen S, Guo G. The optimised model of predicting protein-metal ion ligand binding residues. IET Syst Biol 2025; 19:e70001. [PMID: 39873344 PMCID: PMC11773433 DOI: 10.1049/syb2.70001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2024] [Revised: 12/27/2024] [Accepted: 01/08/2025] [Indexed: 01/30/2025] Open
Abstract
Metal ions are significant ligands that bind to proteins and play crucial roles in cell metabolism, material transport, and signal transduction. Predicting the protein-metal ion ligand binding residues (PMILBRs) accurately is a challenging task in theoretical calculations. In this study, the authors employed fused amino acids and their derived information as feature parameters to predict PMILBRs using three classical machine learning algorithms, yielding favourable prediction results. Subsequently, a deep learning algorithm was incorporated in the prediction, resulting in improved results for the sets of Ca²⁺ and Mg²⁺ compared to previous studies. The validation matrix provided the optimal prediction model for each ionic ligand binding residue, exhibiting the capability of effectively predicting the binding sites of metal ion ligands for real protein chains.
Affiliation(s)
- Caiyun Yang
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Xiuzhen Hu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Zhenxing Feng
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Sixi Hao
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | | | - Shaohua Chen
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Guodong Guo
- School of Computer Science and Technology, Baotou Medical College, Baotou, China
| |
|
19
|
Ghulam A, Arif M, Unar A, Thafar MA, Albaradei S, Worachartcheewan A. StackAHTPs: An explainable antihypertensive peptides identifier based on heterogeneous features and stacked learning approach. IET Syst Biol 2025; 19:e70002. [PMID: 39905861 PMCID: PMC11794993 DOI: 10.1049/syb2.70002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2024] [Revised: 12/30/2024] [Accepted: 01/15/2025] [Indexed: 02/06/2025] Open
Abstract
Hypertension, often known as high blood pressure, is a major concern for millions of individuals globally. Recent studies have demonstrated the significant efficacy of naturally derived peptides in reducing blood pressure. Hypertension is one of the risks associated with cardiovascular disorders and other health problems. Naturally sourced bioactive peptides possessing antihypertensive properties offer considerable potential as viable substitutes for conventional pharmaceutical medications. Currently, thorough examination of antihypertensive peptides (AHTPs) using traditional wet-lab methods is highly expensive and laborious. Therefore, in-silico approaches, especially machine-learning (ML) algorithms, are favourable because they save time and cost in the discovery of AHTPs. In this study, a novel ML-based predictor called StackAHTPs was developed for accurately predicting AHTPs from sequence only. The proposed method utilises two types of feature descriptors, Pseudo-Amino Acid Composition and Dipeptide Composition, to encode the local and global hidden information from peptide sequences. Furthermore, the encoded features are serially merged and ranked through the SHapley Additive exPlanations (SHAP) algorithm. Then, the top-ranked features are fed into three different ensemble classifiers (Bagging, Boosting, and Stacking) to enhance the prediction performance of the model. The StackAHTPs method achieved superior performance compared to other ML classifiers (AdaBoost, XGBoost, Light Gradient Boosting (LightGBM), Bagging and Boosting) on 10-fold cross-validation and the independent test. The experimental outcomes demonstrate that our proposed method outperformed the existing methods and achieved an accuracy of 92.25% and an F1-score of 89.67% on the independent test for predicting AHTPs and non-AHTPs. The authors believe this research will contribute remarkably to the large-scale characterisation of AHTPs and accelerate the drug discovery process.
The datasets and features used are available at https://github.com/ali-ghulam/StackAHTPs.
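Of the two descriptors named in the abstract, Dipeptide Composition is the simpler to illustrate. The sketch below assumes the common definition (the fraction of each of the 400 ordered amino acid pairs among all adjacent pairs in the sequence); the function name and the example peptide are hypothetical, and the paper's exact normalisation may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in a fixed order that defines the feature vector.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide frequencies."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    n_pairs = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            n_pairs += 1
    if n_pairs == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / n_pairs for d in DIPEPTIDES]


# Hypothetical peptide used only to exercise the function.
vector = dipeptide_composition("IPPIPP")
print(len(vector), vector[DIPEPTIDES.index("IP")])
```

A vector like this would then be concatenated with the other descriptor before feature ranking; that ranking and the ensemble classifiers are outside the scope of this sketch.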
Affiliation(s)
- Ali Ghulam
- Information Technology Centre, Sindh Agriculture University, Tandojam, Sindh, Pakistan
| | - Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | - Ahsanullah Unar
- Department of Precision Medicine, University of Campania ‘L. Vanvitelli’, Naples, Italy
| | - Maha A. Thafar
- Department of Computer Science, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
| | - Somayah Albaradei
- Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Apilak Worachartcheewan
- Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
| |
|
20
|
Cao Y, Li X, Chen X, Xu K, Zhang J, Lin H, Liu Y. Identification of Eight Histone Methylation Modification Regulators Associated With Breast Cancer Prognosis. IET Syst Biol 2025; 19:e70012. [PMID: 40260909 PMCID: PMC12012758 DOI: 10.1049/syb2.70012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2024] [Revised: 02/10/2025] [Accepted: 02/23/2025] [Indexed: 04/24/2025] Open
Abstract
Histone methylation is an important epigenetic modification process coordinated by histone methyltransferases, histone demethylases and histone methylation reader proteins, and it plays a key role in the occurrence and development of cancer. This study constructed a risk scoring model around histone methylation modification regulators and conducted a multidimensional comprehensive analysis to reveal its potential role in breast cancer prognosis and drug sensitivity. First, 144 histone methylation modification regulators (HMMRs) were subjected to differential analysis and univariate Cox regression analysis, and nine differentially expressed HMMRs associated with survival were screened out. Next, a risk scoring model consisting of eight HMMRs was constructed using the LASSO regression algorithm, exhibiting independent predictive value in the training and validation cohorts. Then, immune analysis showed that patients in the high-risk group defined by the risk scoring model had a weakened immune response. In addition, through functional analysis of differentially expressed genes (DEGs) between the high-risk and low-risk groups, we confirmed that the DEGs mainly affected the nucleoplasm and tumour microenvironment. Finally, drug sensitivity analysis demonstrated that our model could be useful for drug screening and could identify potential drugs for treating BRCA patients. In conclusion, these eight HMMRs may be key factors in the prognosis and drug sensitivity of BRCA patients.
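A risk scoring model of the kind described above is usually a weighted sum of gene expression values, with patients then split into high- and low-risk groups. The sketch below assumes that linear form and a median cutoff; the abstract states neither, and the gene names, coefficients, and patient values are hypothetical placeholders rather than the eight HMMRs or their fitted weights.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Weighted sum of expression values over the genes in the model."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def stratify(scores):
    """Label each patient 'high' or 'low' relative to the median score.

    The median cutoff is an assumption; other cutoffs are also used.
    """
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical coefficients and expression profiles.
coefs = {"GENE_A": 0.5, "GENE_B": -0.25}
patients = {
    "p1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "p2": {"GENE_A": 3.0, "GENE_B": 1.0},
    "p3": {"GENE_A": 1.0, "GENE_B": 6.0},
}
scores = {pid: risk_score(expr, coefs) for pid, expr in patients.items()}
print(stratify(scores))
```

In a LASSO-based workflow the coefficients would come from the penalised regression fit, and genes whose coefficients shrink to zero drop out of the sum.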
Affiliation(s)
- Yan‐Ni Cao
- School of Artificial Intelligence, Anhui University of Science and Technology, Huainan, China
| | - Xiao‐Hui Li
- School of Artificial Intelligence, Anhui University of Science and Technology, Huainan, China
| | - Xing‐Jie Chen
- School of Artificial Intelligence, Anhui University of Science and Technology, Huainan, China
| | - Kang‐Cheng Xu
- School of Artificial Intelligence, Anhui University of Science and Technology, Huainan, China
| | - Jun‐Yuan Zhang
- School of Artificial Intelligence, Anhui University of Science and Technology, Huainan, China
| | - Hao Lin
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yu‐Xian Liu
- School of Artificial Intelligence, Anhui University of Science and Technology, Huainan, China
| |
|
21
|
Meng C, Hou Y, Zou Q, Shi L, Su X, Ju Y. Rore: robust and efficient antioxidant protein classification via a novel dimensionality reduction strategy based on learning of fewer features. Genomics Inform 2024; 22:29. [PMID: 39633440 PMCID: PMC11616364 DOI: 10.1186/s44342-024-00026-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2024] [Accepted: 10/03/2024] [Indexed: 12/07/2024] Open
Abstract
In protein identification, researchers increasingly aim to achieve efficient classification using fewer features. While many feature selection methods effectively reduce the number of model features, they often incur information loss by merely selecting or discarding features, which limits classifier performance. To address this issue, we present Rore, an algorithm based on a feature-dimensionality reduction strategy. By mapping the original features to a latent space, Rore retains all relevant feature information while using fewer representations of the latent features. This approach largely preserves the original information and overcomes the information loss problem associated with previous feature selection methods. Through extensive experimental validation and analysis, Rore demonstrated excellent performance on an antioxidant protein dataset, achieving an accuracy of 95.88% and an MCC of 91.78% using vectors of only 15 features. The Rore algorithm is available online at http://112.124.26.17:8021/Rore.
Affiliation(s)
- Chaolu Meng
- College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, Hohhot, China
| | - Yongqi Hou
- School of Computer Science, Inner Mongolia University, Hohhot, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Huangpu District, No. 415, Fengyang Road, Shanghai, China
| | - Xi Su
- Foshan Women and Children Hospital, Foshan, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China.
| |
|
22
|
Yong X, Hu X, Kang T, Deng Y, Li S, Yu S, Hou Y, You J, Dai X, Zhang J, Zhang J, Zhou J, Zhang S, Zheng J, Yang Q, Li J. Identification of CCR7 and CBX6 as key biomarkers in abdominal aortic aneurysm: Insights from multi-omics data and machine learning analysis. IET Syst Biol 2024; 18:250-260. [PMID: 39602349 DOI: 10.1049/syb2.12106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2024] [Revised: 09/30/2024] [Accepted: 10/25/2024] [Indexed: 11/29/2024] Open
Abstract
Abdominal aortic aneurysm (AAA) is a severe vascular condition marked by the progressive dilation of the abdominal aorta, which can lead to rupture if untreated. The objective of this study was to identify key biomarkers and decipher the immune mechanisms underlying AAA utilising multi-omics data analysis and machine learning techniques. Single-cell RNA sequencing disclosed a heightened presence of macrophages and CD8-positive alpha-beta T cells in AAA, highlighting their critical role in disease pathogenesis. Analysis of cell-cell communication highlighted augmented interactions between macrophages and dendritic cells derived from monocytes. Enrichment analysis of differentially expressed genes indicated substantial involvement of immune and metabolic pathways in AAA pathogenesis. Machine learning techniques identified CCR7 and CBX6 as key candidate biomarkers. In AAA, CCR7 expression is upregulated, whereas CBX6 expression is downregulated, both showing significant correlations with immune cell infiltration. These findings provide valuable insights into the molecular mechanisms underlying AAA and suggest potential biomarkers for diagnosis and therapeutic intervention.
Affiliation(s)
- Xi Yong
- Department of Vascular Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- The First Affiliated Hospital, Jinan University, Guangzhou, China
- Hepatobiliary, Pancreatic and Intestinal Research Institute of North Sichuan Medical College, Nanchong, China
| | - Xuerui Hu
- Department of Endocrine, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
| | - Tengyao Kang
- Department of Vascular Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Yanpiao Deng
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Sixuan Li
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Shuihan Yu
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Yani Hou
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Jin You
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Xiaohe Dai
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Jialin Zhang
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Junjia Zhang
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Junlin Zhou
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Siyu Zhang
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Jianghua Zheng
- Department of Vascular Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| | - Qin Yang
- Department of Infectious Diseases, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
| | - Jingdong Li
- Hepatobiliary, Pancreatic and Intestinal Research Institute of North Sichuan Medical College, Nanchong, China
- Department of Clinical Medicine, North Sichuan Medical College, Nanchong, China
| |
|
23
|
Zhang Y, Yang T, Yang Y, Xu D, Hu Y, Zhang S, Luo N, Ning L, Ren L. siRNAEfficacyDB: An experimentally supported small interfering RNA efficacy database. IET Syst Biol 2024; 18:199-207. [PMID: 39541343 DOI: 10.1049/syb2.12102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2024] [Revised: 09/26/2024] [Accepted: 10/19/2024] [Indexed: 11/16/2024] Open
Abstract
Small interfering RNA (siRNA) has revolutionised biomedical research and drug development through precise post-transcriptional gene silencing technology. Despite its immense potential, siRNA therapy still faces technical challenges, such as delivery efficiency, targeting specificity, and molecular stability. To address these challenges and facilitate siRNA drug development, we have developed siRNAEfficacyDB, a comprehensive database that integrates experimentally validated siRNA efficacy data. This database contains 3544 siRNA records, covering 42 target genes and 5 cell lines. It provides detailed information on siRNA sequences, target genes, inhibition efficiencies, experimental techniques, cell lines, siRNA concentrations, and incubation times. siRNAEfficacyDB offers a user-friendly web interface that makes it easy to query, browse and analyse data, enabling efficient access to siRNA-related information. In summary, siRNAEfficacyDB provides a useful data foundation for siRNA drug design and optimisation, serving as a valuable resource for advancing computer-aided siRNA design research and nucleic acid drug development. siRNAEfficacyDB is freely available at https://cellknowledge.com.cn/siRNAEfficacy for non-commercial use.
Affiliation(s)
- Yang Zhang
- Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Ting Yang
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| | - Yu Yang
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| | - Dongsheng Xu
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| | - Yucheng Hu
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| | - Shuo Zhang
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| | - Nanchao Luo
- School of Computer Science and Technology, Aba Teachers College, Aba, Sichuan, China
| | - Lin Ning
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| | - Liping Ren
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| |
|
24
|
Wu CY, Xu ZX, Li N, Qi DY, Hao ZH, Wu HY, Gao R, Jin YT. Accurately identifying positive and negative regulation of apoptosis using fusion features and machine learning methods. Comput Biol Chem 2024; 113:108207. [PMID: 39265463 DOI: 10.1016/j.compbiolchem.2024.108207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Revised: 08/20/2024] [Accepted: 09/06/2024] [Indexed: 09/14/2024]
Abstract
Apoptotic proteins play a crucial role in the apoptosis process, ensuring a balance between cell proliferation and death. Thus, further elucidating the regulatory mechanisms of apoptosis will enhance our understanding of their functions. However, the development of computational methods to accurately identify positive and negative regulation of apoptosis remains a significant challenge. This work proposes a machine learning model based on multi-feature fusion to effectively identify the roles of positive and negative regulation of apoptosis. Initially, we constructed a reliable benchmark dataset containing 200 proteins involved in positive regulation of apoptosis and 241 proteins involved in negative regulation of apoptosis. Subsequently, we developed a classifier that combines the support vector machine (SVM) with pseudo composition of k-spaced amino acid pairs (PseCKSAAP), composition transition distribution (CTD), dipeptide deviation from expected mean (DDE), and PSSM-composition to identify these proteins. Analysis of variance (ANOVA) was employed to select the optimized features that yield the maximum prediction performance. Evaluation of the proposed model on independent data yielded an accuracy of 0.781 with an AUROC of 0.837, demonstrating the model's strong predictive capability.
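The ANOVA-based feature selection mentioned in this abstract can be illustrated with a minimal sketch. The function names `anova_f` and `rank_features` and the toy data are assumptions made for illustration, not the authors' code; the sketch only shows the general idea of scoring each feature by its one-way ANOVA F-statistic across classes and ranking features by that score.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a single feature.

    `groups` is a list of lists: the feature's values, one list per class.
    """
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own class mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within if ms_within else float("inf")


def rank_features(X, y):
    """Return feature indices sorted from most to least discriminative."""
    scores = []
    for j in range(len(X[0])):
        by_class = {}
        for row, label in zip(X, y):
            by_class.setdefault(label, []).append(row[j])
        scores.append((anova_f(list(by_class.values())), j))
    return [j for _, j in sorted(scores, reverse=True)]


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[1, 5], [2, 4], [3, 6], [4, 5], [5, 4], [6, 6]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_features(X, y)
```

In a pipeline of the kind described, one would typically keep the top-ranked features and choose the cutoff by cross-validated performance; that selection loop is omitted here.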
Affiliation(s)
- Cheng-Yan Wu
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
| | - Zhi-Xue Xu
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
| | - Nan Li
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
| | - Dan-Yang Qi
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
| | - Zhi-Hong Hao
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
| | - Hong-Ye Wu
- Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
| | - Ru Gao
- The People's Hospital of Wenjiang, Chengdu, Sichuan 611130, China.
| | - Yan-Ting Jin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
| |
|
25
|
Zhang ZY, Fan YE, Huang CB, Du MZ. Human essential gene identification based on feature fusion and feature screening. IET Syst Biol 2024; 18:227-237. [PMID: 39578676 DOI: 10.1049/syb2.12105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2024] [Revised: 09/11/2024] [Accepted: 10/25/2024] [Indexed: 11/24/2024] Open
Abstract
Essential genes are necessary to sustain the life of a species under adequate nutritional conditions. These genes have attracted significant attention for their potential as drug targets, especially in developing broad-spectrum antibacterial drugs. However, studying essential genes remains challenging due to their variability in specific environmental conditions. In this study, the authors aim to develop a powerful prediction model for identifying essential genes in humans. The authors first obtained the essential gene data from human cancer cell lines and characterised gene sequences using 7 feature encoding methods such as Kmer, the Composition of K-spaced Nucleic Acid Pairs, and Z-curve. Subsequently, feature fusion and feature optimisation strategies were employed to select the impactful features. Finally, machine learning algorithms were applied to construct the prediction models and evaluate their performance. The single-feature-based model achieved the highest area under the Receiver Operating Characteristic curve (AUC) of 0.830. After fusing and filtering these features, the classical machine learning models achieved the highest AUC at 0.823 while the deep learning model reached 0.860. Results obtained by the authors show that compared to using individual features, feature fusion and feature optimisation strategies significantly improved model performance. Moreover, the study provided an advantageous method for essential gene identification compared to other methods.
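As a rough illustration of the Kmer encoding named in this abstract, the sketch below turns a nucleotide sequence into a fixed-length vector of normalized k-mer frequencies. The function name `kmer_frequencies`, the choice k=2, and the example sequence are assumptions for illustration; the study's exact encoding settings are not given here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies keyed by k-mer, in a fixed order."""
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Windows containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Hypothetical example sequence.
features = kmer_frequencies("ACGTACGT", k=2)
```

Vectors produced this way for several values of k, or by several different encoders, can be concatenated to form the fused feature set that a later selection step then filters.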
Affiliation(s)
- Zhao-Yue Zhang
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
- School of Medicine, University of Electronic Science and Technology of China, Chengdu, China
| | - Yue-Er Fan
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Cheng-Bing Huang
- School of Computer Science and Technology, ABa Teachers University, Chengdu, China
| | - Meng-Ze Du
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| |
|
26
|
Li J, He S, Zhang J, Zhang F, Zou Q, Ni F. T4Seeker: a hybrid model for type IV secretion effectors identification. BMC Biol 2024; 22:259. [PMID: 39543674 PMCID: PMC11566746 DOI: 10.1186/s12915-024-02064-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Accepted: 11/06/2024] [Indexed: 11/17/2024] Open
Abstract
BACKGROUND The type IV secretion system is widely present in various bacteria, such as Salmonella, Escherichia coli, and Helicobacter pylori. These bacteria use the type IV secretion system to secrete type IV secretion effectors (T4SEs), infect host cells, and disrupt or modulate the communication pathways. In this study, type III and type VI secretion effectors were used as negative samples to train a robust model. RESULTS The areas under the curve of T4Seeker on the validation and independent test sets were 0.947 and 0.970, respectively, demonstrating the strong predictive capacity and robustness of T4Seeker. After comparing it with classic and state-of-the-art T4SE identification models, we found that T4Seeker, which is based on traditional features and large language model features, had a higher predictive ability. CONCLUSION The T4Seeker proposed in this study demonstrates superior performance in the field of T4SE prediction. By integrating features at multiple levels, it achieves higher predictive accuracy and strong generalization capability, providing an effective tool for future T4SE research.
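The area under the ROC curve reported in this abstract can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following minimal sketch computes it that way; the function name `roc_auc` and the toy labels and scores are assumptions for illustration and are unrelated to the T4Seeker data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions: three of the four positive/negative pairs are ordered correctly.
auc_example = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; rank-based formulations give the same value more efficiently on large test sets.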
Affiliation(s)
- Jing Li
- Department of Microbiology, University of Hong Kong, Hong Kong, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 1 Chengdian Road, Quzhou, Zhejiang, China
- School of Biomedical Sciences, University of Hong Kong, Hong Kong, China
| | - Shida He
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 1 Chengdian Road, Quzhou, Zhejiang, China
- The Joint Innovation Center for Engineering in Medicine, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, 324000, China
- Department of Respiratory and Critical Care, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou, 324000, China
| | - Jian Zhang
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 1 Chengdian Road, Quzhou, Zhejiang, China
| | - Feng Zhang
- The Joint Innovation Center for Engineering in Medicine, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, 324000, China
- Department of Respiratory and Critical Care, Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou, 324000, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 1 Chengdian Road, Quzhou, Zhejiang, China
| | - Fengming Ni
- Department of Gastroenterology, The First Hospital of Jilin University, Changchun, 130021, China.
| |
|
27
|
Ke S, Huang Y, Wang D, Jiang Q, Luo Z, Li B, Yan D, Zhou J. BreCML: identifying breast cancer cell state in scRNA-seq via machine learning. Front Med (Lausanne) 2024; 11:1482726. [PMID: 39574916 PMCID: PMC11579858 DOI: 10.3389/fmed.2024.1482726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2024] [Accepted: 10/15/2024] [Indexed: 11/24/2024] Open
Abstract
Breast cancer is a prevalent malignancy and one of the leading causes of cancer-related mortality among women worldwide. This disease typically manifests through the abnormal proliferation and dissemination of malignant cells within breast tissue. Current diagnostic and therapeutic strategies face significant challenges in accurately identifying and localizing specific subtypes of breast cancer. In this study, we developed a novel machine learning-based predictor, BreCML, designed to accurately classify subpopulations of breast cancer cells and their associated marker genes. BreCML exhibits outstanding predictive performance, achieving an accuracy of 98.92% on the training dataset. Utilizing the XGBoost algorithm, BreCML demonstrates superior accuracy (98.67%), precision (99.15%), recall (99.49%), and F1-score (99.79%) on the test dataset. Through the application of machine learning and feature selection techniques, BreCML successfully identified new key genes. This predictor not only serves as a powerful tool for assessing breast cancer cellular status but also offers a rapid and efficient means to uncover potential biomarkers, providing critical insights for precision medicine and therapeutic strategies.
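For readers comparing the accuracy, precision, recall, and F1-score figures quoted in abstracts like this one, the sketch below shows how the four quantities follow from a binary confusion matrix. The function name `classification_metrics` and the counts are hypothetical; in a multi-class setting such as cell-subpopulation labelling, these values are usually computed per class and then averaged, which this sketch does not do.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
m = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

Because F1 is a harmonic mean, for a single class it always lies between the precision and the recall it is computed from.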
Affiliation(s)
- Shanbao Ke
- Department of Oncology, Henan Provincial People’s Hospital, Zhengzhou University People’s Hospital, Zhengzhou, China
| | - Yuxuan Huang
- Department of Neuroscience in the Behavioral Sciences, Duke University and Duke Kunshan University, Suzhou, China
| | - Dong Wang
- Pudong Institute for Health Development, Shanghai, China
| | - Qiang Jiang
- Department of Oncology, Henan Provincial People’s Hospital, Zhengzhou University People’s Hospital, Zhengzhou, China
| | - Zhangyang Luo
- Pudong Institute for Health Development, Shanghai, China
| | - Baiyu Li
- Department of Oncology, Henan Provincial People’s Hospital, Zhengzhou University People’s Hospital, Zhengzhou, China
| | - Danfang Yan
- Department of Radiation Oncology, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, China
| | - Jianwei Zhou
- Department of Oncology, Henan Provincial People’s Hospital, Zhengzhou University People’s Hospital, Zhengzhou, China
| |
|
28
|
Yu S, Liu L, Wang H, Yan S, Zheng S, Ning J, Luo R, Fu X, Deng X. AtML: An Arabidopsis thaliana root cell identity recognition tool for medicinal ingredient accumulation. Methods 2024; 231:61-69. [PMID: 39293728 DOI: 10.1016/j.ymeth.2024.09.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2024] [Revised: 08/05/2024] [Accepted: 09/12/2024] [Indexed: 09/20/2024] Open
Abstract
Arabidopsis thaliana synthesizes various medicinal compounds, and serves as a model plant for medicinal plant research. Single-cell transcriptomics technologies are essential for understanding the developmental trajectory of plant roots, facilitating the analysis of synthesis and accumulation patterns of medicinal compounds in different cell subpopulations. Although methods for interpreting single-cell transcriptomics data are rapidly advancing in Arabidopsis, challenges remain in precisely annotating cell identity due to the lack of marker genes for certain cell types. In this work, we trained a machine learning system, AtML, using sequencing datasets from six cell subpopulations, comprising a total of 6000 cells, to predict Arabidopsis root cell stages and identify biomarkers through complete model interpretability. Performance testing using an external dataset revealed that AtML achieved 96.50% accuracy and 96.51% recall. Through the interpretability provided by AtML, our model identified 160 important marker genes, contributing to the understanding of cell type annotations. In conclusion, we trained AtML to efficiently identify Arabidopsis root cell stages, providing a new tool for elucidating the mechanisms of medicinal compound accumulation in Arabidopsis roots.
Affiliation(s)
- Shicong Yu
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Rice Research Institute, Sichuan Agricultural University, Chengdu 611130, China
| | - Lijia Liu
- Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Hao Wang
- Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Shen Yan
- Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Shuqin Zheng
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Rice Research Institute, Sichuan Agricultural University, Chengdu 611130, China
| | - Jing Ning
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Rice Research Institute, Sichuan Agricultural University, Chengdu 611130, China
| | - Ruxian Luo
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Rice Research Institute, Sichuan Agricultural University, Chengdu 611130, China
| | - Xiangzheng Fu
- Research Institute of Hunan University in Chongqing, Chongqing 401120, China.
| | - Xiaoshu Deng
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Rice Research Institute, Sichuan Agricultural University, Chengdu 611130, China; Chongqing Academy of Chinese Materia Medica, Chongqing 400065, China.
| |
|
29
|
Wei L. Advanced deep learning approaches enable high-throughput biological and biomedicine data analysis. Methods 2024; 230:116-118. [PMID: 39154807 DOI: 10.1016/j.ymeth.2024.08.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/20/2024] Open
Affiliation(s)
- Leyi Wei
- Centre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao; School of Informatics, Xiamen University, Xiamen, China.
| |
|
30
|
Arif M, Musleh S, Ghulam A, Fida H, Alqahtani Y, Alam T. StackDPPred: Multiclass prediction of defensin peptides using stacked ensemble learning with optimized features. Methods 2024; 230:129-139. [PMID: 39173785 DOI: 10.1016/j.ymeth.2024.08.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Revised: 07/30/2024] [Accepted: 08/13/2024] [Indexed: 08/24/2024] Open
Abstract
Host defense peptides, or antimicrobial peptides (AMPs), are promising candidates for protecting the host against microbial pathogens such as bacteria, viruses, fungi, and yeast. Defensins are a type of AMP that act as potential therapeutic agents and play a vital role in various biological processes. Conventional experiments to identify defensin peptides (DPs) are time-consuming and expensive. Computational methods can therefore address the shortcomings of wet-lab experiments by accurately predicting the functional types of DPs. In this paper, we propose a novel multi-class ensemble-based prediction model called StackDPPred for identifying the properties of DPs. The peptide sequences are encoded using split amino acid composition (SAAC), segmented position-specific scoring matrix (SegPSSM), histogram of oriented gradients-based PSSM (HOGPSSM), and feature extraction based graphical and statistical (FEGS) descriptors. Next, principal component analysis (PCA) is used to select the best subset of attributes. After that, the optimized features are fed into single machine learning and stacking-based ensemble classifiers. Furthermore, the ablation study demonstrates the robustness and efficacy of the stacking approach using reduced features for predicting DPs and their families. The proposed StackDPPred method improves the overall accuracy by 13.41% and 7.62% compared to the existing DP predictors iDPF-PseRAAC and iDEF-PseRAAC, respectively, on the validation test. Additionally, we applied the local interpretable model-agnostic explanations (LIME) algorithm to understand the contribution of selected features to the overall prediction. We believe StackDPPred could serve as a valuable tool for accelerating the screening of large-scale DPs and the peptide-based drug discovery process.
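Of the encodings listed in this abstract, split amino acid composition (SAAC) is the simplest to sketch: the peptide is cut into an N-terminal, a middle, and a C-terminal segment, and the 20-residue composition is computed for each. The function names, the segment lengths `n_term=5` and `c_term=5`, and the example peptide are assumptions for illustration; the segment sizes used in the study are not stated here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """Fraction of each of the 20 standard amino acids in a segment."""
    n = len(segment)
    # Non-standard residues count toward the length but get no feature of their own.
    return [segment.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def split_aac(sequence, n_term=5, c_term=5):
    """Concatenate compositions of the N-terminal, middle and C-terminal parts."""
    sequence = sequence.upper()
    head = sequence[:n_term]
    tail = sequence[-c_term:] if c_term else ""
    # For very short peptides the middle part is empty and yields all zeros.
    middle = sequence[n_term:len(sequence) - c_term]
    return composition(head) + composition(middle) + composition(tail)


# Hypothetical 15-residue peptide: "ACDEF" + "GGGGG" + "KLMNP".
vec = split_aac("ACDEFGGGGGKLMNP", n_term=5, c_term=5)
```

The result is a 60-dimensional vector, which could then be concatenated with the other descriptors before a dimensionality-reduction step such as PCA.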
Affiliation(s)
- Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
| | - Saleh Musleh
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
| | - Ali Ghulam
- Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan
| | - Huma Fida
- Department of Microbiology, Abdul Wali Khan University Mardan, 23200, KPK, Pakistan
| | | | - Tanvir Alam
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar.
| |
|
31
|
Ahmed Z, Shahzadi K, Temesgen SA, Ahmad B, Chen X, Ning L, Zulfiqar H, Lin H, Jin YT. A protein pre-trained model-based approach for the identification of the liquid-liquid phase separation (LLPS) proteins. Int J Biol Macromol 2024; 277:134146. [PMID: 39067723 DOI: 10.1016/j.ijbiomac.2024.134146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2024] [Revised: 07/06/2024] [Accepted: 07/23/2024] [Indexed: 07/30/2024]
Abstract
Liquid-liquid phase separation (LLPS) regulates many biological processes including RNA metabolism, chromatin rearrangement, and signal transduction. Aberrant LLPS potentially leads to serious diseases. Therefore, the identification of LLPS proteins is crucial. Traditionally, biochemistry-based methods for identifying LLPS proteins are costly, time-consuming, and laborious. In contrast, artificial intelligence-based approaches are fast and cost-effective and can be a better alternative to biochemistry-based methods. Previous research methods employed word2vec in conjunction with machine learning or deep learning algorithms. Although word2vec captures word semantics and relationships, it might not be effective in capturing features relevant to protein classification, such as physicochemical properties, evolutionary relationships, or structural features. Additionally, other studies often focused on a limited set of features for model training, including planar π contact frequency, pi-pi, and β-pairing propensities. To overcome these shortcomings, this study first constructed a reliable dataset containing 1206 protein sequences, including 603 LLPS and 603 non-LLPS protein sequences. Then a computational model was proposed to efficiently identify LLPS proteins by perceiving the semantic information of protein sequences directly, using an ESM2-36 pre-trained model based on the transformer architecture in conjunction with a convolutional neural network. The model achieved accuracies of 85.68% and 89.67% on the training data and test data, respectively, surpassing the accuracy of previous studies. This performance demonstrates the potential of our computational methods as efficient alternatives for identifying LLPS proteins.
Affiliation(s)
- Zahoor Ahmed
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China.
| | - Kiran Shahzadi
- Department of Biotechnology, Women University of Azad Jammu and Kashmir, Bagh, Azad Kashmir, Pakistan.
| | - Sebu Aboma Temesgen
- School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China.
| | - Basharat Ahmad
- School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China.
| | - Xiang Chen
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China.
| | - Lin Ning
- School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China; School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China.
| | - Hasan Zulfiqar
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China.
| | - Hao Lin
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China.
| | - Yan-Ting Jin
- School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China.
| |
|
32
|
Zhang W, Ding Y, Wei L, Guo X, Ni F. Therapeutic peptides identification via kernel risk sensitive loss-based k-nearest neighbor model and multi-Laplacian regularization. Brief Bioinform 2024; 25:bbae534. [PMID: 39438076 PMCID: PMC11495874 DOI: 10.1093/bib/bbae534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2024] [Revised: 08/30/2024] [Accepted: 10/08/2024] [Indexed: 10/25/2024] Open
Abstract
Therapeutic peptides are therapeutic agents synthesized from natural amino acids, which can be used as carriers for precisely transporting drugs and can activate the immune system to prevent and treat various diseases. However, screening therapeutic peptides using biochemical assays is expensive, time-consuming, and limited by experimental conditions and biological samples, and there may be ethical considerations in the clinical stage. In contrast, screening therapeutic peptides using machine learning and computational methods is efficient, automated, and can accurately predict potential therapeutic peptides. In this study, a k-nearest neighbor model based on multi-Laplacian regularization and kernel risk sensitive loss was proposed; it introduces a kernel risk sensitive loss function into the K-local hyperplane distance nearest neighbor model and combines it with Laplacian regularization to predict therapeutic peptides. The findings indicated that the proposed approach achieved satisfactory results and could effectively predict therapeutic peptide sequences.
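The model in this abstract builds on the k-nearest neighbor idea. As background only, the sketch below shows a plain majority-vote k-nearest neighbor classifier with Euclidean distance; it does not implement the kernel risk sensitive loss, the local hyperplane distance, or the Laplacian regularization described above. The function name `knn_predict` and the toy points are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    neighbours = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors for two classes.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(train_X, train_y, (0.5, 0.5), k=3)
near_far_cluster = knn_predict(train_X, train_y, (5.5, 5.5), k=3)
```

Hyperplane-distance variants replace the distance to individual neighbours with the distance to a local hyperplane fitted through each class's nearest points, which is the component the proposed loss and regularizers act on.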
Affiliation(s)
- Wenyu Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 2006 Xiyuan Avenue, High tech Zone, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1 Chengdian Road, Kecheng District, Quzhou 324000, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1 Chengdian Road, Kecheng District, Quzhou 324000, China
| | - Leyi Wei
- Macao Polytechnic University, Gomes Street, Macau Peninsula, Macau 999078, China
| | - Xiaoyi Guo
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1 Chengdian Road, Kecheng District, Quzhou 324000, China
| | - Fengming Ni
- Department of Gastroenterology, The First Hospital of Jilin University, No. 71 Xinmin Street, Chaoyang District, Changchun 130021, China
| |
|
33
|
Sangaraju VK, Pham NT, Wei L, Yu X, Manavalan B. mACPpred 2.0: Stacked Deep Learning for Anticancer Peptide Prediction with Integrated Spatial and Probabilistic Feature Representations. J Mol Biol 2024; 436:168687. [PMID: 39237191 DOI: 10.1016/j.jmb.2024.168687] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/08/2024] [Revised: 05/28/2024] [Accepted: 06/20/2024] [Indexed: 09/07/2024]
Abstract
Anticancer peptides (ACPs) are naturally occurring molecules with remarkable potential to target and kill cancer cells. However, identifying ACPs based solely on their primary amino acid sequences remains a major hurdle in immunoinformatics. In the past, several web-based machine learning (ML) tools have been proposed to assist researchers in identifying potential ACPs for further testing. Notably, our meta-approach method, mACPpred, introduced in 2019, has significantly advanced the field of ACP research. Given the exponential growth in the number of characterized ACPs, there is now a pressing need to create an updated version of mACPpred. To develop mACPpred 2.0, we constructed an up-to-date benchmarking dataset by integrating all publicly available ACP datasets. We employed a large set of feature descriptors, encompassing both conventional feature descriptors and advanced pre-trained natural language processing (NLP)-based embeddings. We evaluated their ability to discriminate between ACPs and non-ACPs using eleven different classifiers. Subsequently, we employed a stacked deep learning (SDL) approach, incorporating 1D convolutional neural network (1D CNN) blocks and hybrid features. These features included the top seven performing NLP-based features and 90 probabilistic features, allowing us to identify hidden patterns within these diverse features and improve the accuracy of our ACP prediction model. This is the first study to integrate spatial and probabilistic feature representations for predicting ACPs. Rigorous cross-validation and independent tests conclusively demonstrated that mACPpred 2.0 not only surpassed its predecessor (mACPpred) but also outperformed the existing state-of-the-art predictors, highlighting the importance of advanced feature representation capabilities attained through SDL. To facilitate widespread use and accessibility, we have developed a user-friendly web server for mACPpred 2.0, available at https://balalab-skku.org/mACPpred2/.
Affiliation(s)
- Vinoth Kumar Sangaraju
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea
| | - Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea
| | - Leyi Wei
- Faculty of Applied Sciences, Macao Polytechnic University, Macau
| | - Xue Yu
- Beidahuang Industry Group General Hospital, 150001 Harbin, China.
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea.
| |
|
34
|
Sabir MJ, Kamli MR, Atef A, Alhibshi AM, Edris S, Hajarah NH, Bahieldin A, Manavalan B, Sabir JSM. Computational prediction of phosphorylation sites of SARS-CoV-2 infection using feature fusion and optimization strategies. Methods 2024; 229:1-8. [PMID: 38768932 DOI: 10.1016/j.ymeth.2024.04.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 03/15/2024] [Accepted: 04/30/2024] [Indexed: 05/22/2024] Open
Abstract
SARS-CoV-2's global spread has instigated a critical health and economic emergency, impacting countless individuals. Understanding the virus's phosphorylation sites is vital to unravel the molecular intricacies of the infection and subsequent changes in host cellular processes. Several computational methods have been proposed to identify phosphorylation sites, typically focusing on specific residue (S/T) or Y phosphorylation sites. Unfortunately, current predictive tools perform best on these specific residues and may not extend their efficacy to other residues, emphasizing the urgent need for enhanced methodologies. In this study, we developed a novel predictor that integrates phosphorylation site information for all three residues (S, T, and Y). We extracted ten different feature descriptors, primarily derived from composition, evolutionary, and position-specific information, and assessed their discriminative power through five classifiers. Our results indicated that Light Gradient Boosting (LGB) showed superior performance, and five descriptors displayed excellent discriminative capabilities. Subsequently, we found that the top two integrated features have high discriminative capability and used them to train LGB, yielding the final prediction model, LGB-IPs. The proposed approach shows excellent performance on 10-fold cross-validation, with ACC, MCC, and AUC values of 0.831, 0.662, and 0.907, respectively. Notably, this performance is replicated in the independent evaluation. Consequently, our approach may provide valuable insights into the phosphorylation mechanisms in SARS-CoV-2 infection for biomedical researchers.
Affiliation(s)
- Mumdooh J Sabir
- Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Majid Rasool Kamli
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ahmed Atef
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Alawiah M Alhibshi
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Sherif Edris
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Nahid H Hajarah
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ahmed Bahieldin
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea.
| | - Jamal S M Sabir
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia.
|
35
|
Lilhore UK, Simiaya S, Alhussein M, Faujdar N, Dalal S, Aurangzeb K. Optimizing protein sequence classification: integrating deep learning models with Bayesian optimization for enhanced biological analysis. BMC Med Inform Decis Mak 2024; 24:236. [PMID: 39192227 DOI: 10.1186/s12911-024-02631-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2024] [Accepted: 08/07/2024] [Indexed: 08/29/2024] Open
Abstract
Efforts to enhance the accuracy of protein sequence classification are of utmost importance in driving forward biological analyses and facilitating significant medical advancements. This study presents a model called ProtICNN-BiLSTM, which combines attention-based Improved Convolutional Neural Networks (ICNN) and Bidirectional Long Short-Term Memory (BiLSTM) units. Our main goal is to improve the accuracy of protein sequence classification by carefully optimizing performance through Bayesian optimization. ProtICNN-BiLSTM combines the power of CNN and BiLSTM architectures to effectively capture local and global protein sequence dependencies. In the proposed model, the ICNN component uses convolutional operations to identify local patterns, while the BiLSTM component captures long-range associations by analyzing sequence data in both forward and backward directions. Bayesian optimization tunes the model hyperparameters for efficiency and robustness. The model was extensively validated with PDB-14,189 and other protein data. We found that ProtICNN-BiLSTM outperforms traditional classification models. Bayesian optimization's fine-tuning and the seamless integration of local and global sequence information make it effective. The precision of ProtICNN-BiLSTM improves comparative protein sequence classification. The study improves computational bioinformatics for complex biological analysis. Good results from the ProtICNN-BiLSTM model improve protein sequence classification. This powerful tool could improve medical and biological research. Together, Bayesian optimization, ICNN, and BiLSTM analyze biological data accurately.
Affiliation(s)
- Umesh Kumar Lilhore
- School of Computing Science and Engineering, Galgotias University, Greater Noida, UP, India
| | - Sarita Simiaya
- School of Computing Science and Engineering, Galgotias University, Greater Noida, UP, India
| | - Musaed Alhussein
- Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, P. O. Box 51178, Riyadh, 11543, Saudi Arabia
| | - Neetu Faujdar
- Department of Computer Engineering and Applications, GLA University, 281406, UP, Mathura, India
| | | | - Khursheed Aurangzeb
- Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, P. O. Box 51178, Riyadh, 11543, Saudi Arabia
|
36
|
Zuo Y, Zhang B, Dong Y, He W, Bi Y, Liu X, Zeng X, Deng Z. Glypred: Lysine Glycation Site Prediction via CCU-LightGBM-BiLSTM Framework with Multi-Head Attention Mechanism. J Chem Inf Model 2024; 64:6699-6711. [PMID: 39121059 DOI: 10.1021/acs.jcim.4c01034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/11/2024]
Abstract
Glycation, a type of posttranslational modification, preferentially occurs on lysine and arginine residues, impairing protein functionality and altering characteristics. This process is linked to diseases such as Alzheimer's, diabetes, and atherosclerosis. Traditional wet lab experiments are time-consuming, whereas machine learning has significantly streamlined the prediction of protein glycation sites. Despite promising results, challenges remain, including data imbalance, feature redundancy, and suboptimal classifier performance. This research introduces Glypred, a lysine glycation site prediction model combining ClusterCentroids Undersampling (CCU), LightGBM, and bidirectional long short-term memory network (BiLSTM) methodologies, with an additional multihead attention mechanism integrated into the BiLSTM. To achieve this, the study undertakes several key steps: selecting diverse feature types to capture comprehensive protein information, employing a cluster-based undersampling strategy to balance the data set, using LightGBM for feature selection to enhance model performance, and implementing a bidirectional LSTM network for accurate classification. Together, these approaches ensure that Glypred effectively identifies glycation sites with high accuracy and robustness. For feature encoding, five distinct feature types (AAC, KMER, DR, PWAA, and EBGW) were selected to capture a broad spectrum of protein sequence and biological information. These encoded features were integrated and validated to ensure comprehensive protein information acquisition. To address the issue of highly imbalanced positive and negative samples, various undersampling algorithms, including random undersampling, NearMiss, edited nearest neighbor rule, and CCU, were evaluated. CCU was ultimately chosen to remove redundant nonglycated training data, establishing a balanced data set that enhances the model's accuracy and robustness.
For feature selection, the LightGBM ensemble learning algorithm was employed to reduce feature dimensionality by identifying the most significant features. This approach accelerates model training, enhances generalization capabilities, and ensures good transferability of the model. Finally, a bidirectional long short-term memory network was used as the classifier, with a network structure designed to capture glycation modification site features from both forward and backward directions. To prevent overfitting, appropriate regularization parameters and dropout rates were introduced, achieving efficient classification. Experimental results show that Glypred achieved optimal performance. This model provides new insights for bioinformatics and encourages the application of similar strategies in other fields. A lysine glycation site prediction software tool was also developed using the PyQt5 library, offering researchers an auxiliary screening tool to reduce workload and improve efficiency. The software and data sets are available on GitHub: https://github.com/ZBYnb/Glypred.
Affiliation(s)
- Yun Zuo
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Bangyi Zhang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Yinkang Dong
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Wenying He
- School of Artificial Intelligence, Hebei University of Technology, Tianjin 300130, China
| | - Yue Bi
- Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
| | - Xiangrong Liu
- Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen 361005, China
| | - Xiangxiang Zeng
- School of Information Science and Engineering, Hunan University, Changsha 410012, China
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
|
37
|
Zhu H, Hao H, Yu L. Identification of microbe-disease signed associations via multi-scale variational graph autoencoder based on signed message propagation. BMC Biol 2024; 22:172. [PMID: 39148051 PMCID: PMC11328394 DOI: 10.1186/s12915-024-01968-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Accepted: 08/01/2024] [Indexed: 08/17/2024] Open
Abstract
BACKGROUND: Plenty of clinical and biomedical research has unequivocally highlighted the tremendous significance of the human microbiome in relation to human health. Identifying microbes associated with diseases is crucial for early disease diagnosis and advancing precision medicine. RESULTS: Considering that the information about changes in microbial quantities under fine-grained disease states helps to enhance a comprehensive understanding of the overall data distribution, this study introduces MSignVGAE, a framework for predicting microbe-disease sign associations using signed message propagation. MSignVGAE employs a graph variational autoencoder to model noisy signed association data and extends the multi-scale concept to enhance representation capabilities. A novel strategy for propagating signed messages in signed networks addresses heterogeneity and consistency among nodes connected by signed edges. Additionally, we utilize the idea of a denoising autoencoder to handle the noise in similarity feature information, which helps overcome biases in the fused similarity data. MSignVGAE represents microbe-disease associations as a heterogeneous graph using similarity information as node features. The multi-class classifier XGBoost is utilized to predict sign associations between diseases and microbes. CONCLUSIONS: MSignVGAE achieves AUROC and AUPR values of 0.9742 and 0.9601, respectively. Case studies on three diseases demonstrate that MSignVGAE can effectively capture a comprehensive distribution of associations by leveraging signed information.
Affiliation(s)
- Huan Zhu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Hongxia Hao
- School of Computer Science and Technology, Xidian University, Xi'an, China.
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China.
|
38
|
Zhao Y, Jin J, Gao W, Qiao J, Wei L. Moss-m7G: A Motif-Based Interpretable Deep Learning Method for RNA N7-Methlguanosine Site Prediction. J Chem Inf Model 2024; 64:6230-6240. [PMID: 39011571 DOI: 10.1021/acs.jcim.4c00802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
N7-methylguanosine (m7G) modification plays a crucial role in various biological processes and is closely associated with the development and progression of many cancers. Accurate identification of m7G modification sites is essential for understanding their regulatory mechanisms and advancing cancer therapy. Previous studies often suffered from insufficient research data, underutilization of motif information, and lack of interpretability. In this work, we designed a novel motif-based interpretable method for m7G modification site prediction, called Moss-m7G. This approach enables the analysis of RNA sequences from a motif-centric perspective. Our proposed word-detection module and motif-embedding module within Moss-m7G extract motif information from sequences, transforming the raw sequences from base-level into motif-level and generating embeddings for these motif sequences. Compared with base sequences, motif sequences contain richer contextual information, which is further analyzed and integrated through the Transformer model. To address the data insufficiency noted in prior research, we constructed a comprehensive m7G data set for training and testing. Our experimental results affirm the effectiveness and superiority of Moss-m7G in predicting m7G modification sites. Moreover, the introduction of the word-detection module enhances the interpretability of the model, providing insights into the predictive mechanisms.
Affiliation(s)
- Yanxi Zhao
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Junru Jin
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Wenjia Gao
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Jianbo Qiao
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- School of Informatics, Xiamen University, Xiamen 361104, China
|
39
|
Zhang Y, Yang Y, Ren L, Ning L, Zou Q, Luo N, Zhang Y, Liu R. RDscan: Extracting RNA-disease relationship from the literature based on pre-training model. Methods 2024; 228:48-54. [PMID: 38789016 DOI: 10.1016/j.ymeth.2024.05.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Revised: 05/02/2024] [Accepted: 05/16/2024] [Indexed: 05/26/2024] Open
Abstract
With the rapid advancements in molecular biology and genomics, a multitude of connections between RNA and diseases has been unveiled, making the efficient and accurate extraction of RNA-disease (RD) relationships from extensive biomedical literature crucial for advancing research in this field. This study introduces RDscan, a novel text mining method based on the pre-training and fine-tuning strategy, aimed at automatically extracting RD-related information from a vast corpus of literature using pre-trained biomedical large language models (LLMs). Initially, we constructed a dedicated RD corpus by manual curation of the literature, comprising 2,082 positive and 2,000 negative sentences, alongside an independent test dataset (comprising 500 positive and 500 negative sentences) for training and evaluating RDscan. Subsequently, by fine-tuning the Bioformer and BioBERT pre-trained models, RDscan demonstrated exceptional performance in text classification and named entity recognition (NER) tasks. In 5-fold cross-validation, RDscan significantly outperformed traditional machine learning methods (Support Vector Machine, Logistic Regression, and Random Forest). In addition, we have developed an accessible webserver that assists users in extracting RD relationships from text. In summary, RDscan represents the first text mining tool specifically designed for RD relationship extraction, and is poised to emerge as an invaluable tool for researchers dedicated to exploring the intricate interactions between RNA and diseases. The RDscan webserver is freely available at https://cellknowledge.com.cn/RDscan/.
Affiliation(s)
- Yang Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China; School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China.
| | - Yu Yang
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
| | - Liping Ren
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
| | - Lin Ning
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Nanchao Luo
- School of Computer Science and Technology, Aba Teachers College, WenChuan, Sichuan, 623002, China
| | - Yinghui Zhang
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China.
| | - Ruijun Liu
- School of Software, Beihang University, Beijing 100191, China.
|
40
|
Arif M, Musleh S, Fida H, Alam T. PLMACPred prediction of anticancer peptides based on protein language model and wavelet denoising transformation. Sci Rep 2024; 14:16992. [PMID: 39043738 PMCID: PMC11266708 DOI: 10.1038/s41598-024-67433-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2024] [Accepted: 07/11/2024] [Indexed: 07/25/2024] Open
Abstract
Anticancer peptides (ACPs) play a promising role in the discovery of anticancer drugs. Research on ACPs as therapeutic agents is growing because of their minimal side effects. However, identifying novel ACPs using wet-lab experiments is generally time-consuming, labor-intensive, and expensive. Leveraging computational methods for fast and accurate prediction of ACPs would benefit the drug discovery process. Herein, a machine learning-based predictor, called PLMACPred, is developed for identifying ACPs from the peptide sequence only. PLMACPred adopts a set of encoding schemes representing evolutionary properties, composition properties, and protein language model (PLM) embeddings, i.e., evolutionary scale modeling (ESM-2)- and ProtT5-based embeddings, to encode peptides. Then, two-dimensional (2D) wavelet denoising (WD) was employed to remove noise from the extracted features. Finally, an ensemble-based cascade deep forest (CDF) model was developed to identify ACPs. The PLMACPred model attained superior performance on all three benchmark datasets, namely, ACPmain, ACPAlter, and ACP740, over tenfold cross-validation and on the independent dataset. PLMACPred outperformed the existing models and improved the prediction accuracy by 18.53%, 2.4%, and 7.59% on the ACPmain, ACPalter, and ACP740 datasets, respectively. We showed that embeddings from ProtT5 and ESM-2 were capable of capturing better contextual information from the entire sequence than the other encoding schemes for ACP prediction. For the explainability of the proposed model, the SHAP (SHapley Additive exPlanations) method was used to analyze the effect of features on the ACP prediction. A list of novel sequence motifs was proposed from the ACP sequences using the MEME suite. We believe PLMACPred will help accelerate the discovery of novel ACPs as well as other activities of microbial peptides.
Affiliation(s)
- Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | - Saleh Musleh
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | - Huma Fida
- Department of Microbiology, Abdul Wali Khan University, Mardan, KPK, Pakistan
| | - Tanvir Alam
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.
|
41
|
Liu L, Huang Y, Zheng Y, Liao Y, Ma S, Wang Q. ScnML models single-cell transcriptome to predict spinal cord neuronal cell status. Front Genet 2024; 15:1413484. [PMID: 38894722 PMCID: PMC11183327 DOI: 10.3389/fgene.2024.1413484] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2024] [Accepted: 05/20/2024] [Indexed: 06/21/2024] Open
Abstract
Injuries to the spinal cord nervous system often result in permanent loss of sensory, motor, and autonomic functions. Accurately identifying the cellular state of spinal cord nerves is extremely important and could facilitate the development of new therapeutic and rehabilitative strategies. Existing experimental techniques for identifying the development of spinal cord nerves are both labor-intensive and costly. In this study, we developed a machine learning predictor, ScnML, for predicting subpopulations of spinal cord nerve cells as well as identifying marker genes. The prediction performance of ScnML was evaluated on the training dataset with an accuracy of 94.33%. Based on XGBoost, ScnML achieved accuracy, precision, recall, and F1-measure scores of 94.08%, 94.24%, 94.26%, and 94.24%, respectively, on the test dataset. Importantly, ScnML identified new significant genes through model interpretation and biological landscape analysis. ScnML can be a powerful tool for predicting the status of spinal cord neuronal cells, revealing potential specific biomarkers quickly and efficiently, and providing crucial insights for precision medicine and rehabilitation recovery.
Affiliation(s)
- Lijia Liu
- School of Recreation and Community Sport, Capital University of Physical Education and Sports, Beijing, China
| | - Yuxuan Huang
- Department of Neuroscience in the Behavioral Sciences, Duke University and Duke Kunshan University, Suzhou, Jiangsu, China
| | - Yuan Zheng
- Taizhou Hospital of Zhejiang Province, Wenzhou Medical University, Luqiao, China
| | - Yihan Liao
- Taizhou Hospital of Zhejiang Province, Wenzhou Medical University, Luqiao, China
| | - Siyuan Ma
- School of Recreation and Community Sport, Capital University of Physical Education and Sports, Beijing, China
| | - Qian Wang
- Department of Neurology, The First Hospital of Tsinghua University, Beijing, China
|
42
|
Li Y, Wei X, Yang Q, Xiong A, Li X, Zou Q, Cui F, Zhang Z. msBERT-Promoter: a multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths. BMC Biol 2024; 22:126. [PMID: 38816885 PMCID: PMC11555825 DOI: 10.1186/s12915-024-01923-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2024] [Accepted: 05/21/2024] [Indexed: 06/01/2024] Open
Abstract
BACKGROUND: A promoter is a specific sequence in DNA that has transcriptional regulatory functions, playing a role in initiating gene expression. Identifying promoters and their strengths can provide valuable information related to human diseases. In recent years, computational methods have gained prominence as an effective means for identifying promoters, offering a more efficient alternative to labor-intensive biological approaches. RESULTS: In this study, a two-stage integrated predictor called "msBERT-Promoter" is proposed for identifying promoters and predicting their strengths. The model incorporates multi-scale sequence information through a tokenization strategy and fine-tunes the DNABERT model. Soft voting is then used to fuse the multi-scale information, effectively addressing the issue of insufficient DNA sequence information extraction in traditional models. To the best of our knowledge, this is the first time an integrated approach has been used in the DNABERT model for promoter identification and strength prediction. Our model achieves accuracy rates of 96.2% for promoter identification and 79.8% for promoter strength prediction, significantly outperforming existing methods. Furthermore, through attention mechanism analysis, we demonstrate that our model can effectively combine local and global sequence information, enhancing its interpretability. CONCLUSIONS: msBERT-Promoter provides an effective tool that successfully captures sequence-related attributes of DNA promoters and can accurately identify promoters and predict their strengths. This work paves a new path for the application of artificial intelligence in traditional biology.
Affiliation(s)
- Yazi Li
- School of Mathematics and Statistics, Hainan University, Haikou, 570228, China
| | - Xiaoman Wei
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China
| | - Qinglin Yang
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China
| | - An Xiong
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China
| | - Xingfeng Li
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China.
| | - Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China.
|
43
|
Jiao S, Ye X, Sakurai T, Zou Q, Liu R. Integrated convolution and self-attention for improving peptide toxicity prediction. Bioinformatics 2024; 40:btae297. [PMID: 38696758 PMCID: PMC11654579 DOI: 10.1093/bioinformatics/btae297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 04/02/2024] [Accepted: 04/30/2024] [Indexed: 05/04/2024] Open
Abstract
MOTIVATION: Peptides are promising agents for the treatment of a variety of diseases due to their specificity and efficacy. However, the development of peptide-based drugs is often hindered by the potential toxicity of peptides, which poses a significant barrier to their clinical application. Traditional experimental methods for evaluating peptide toxicity are time-consuming and costly, making the development process inefficient. Therefore, there is an urgent need for computational tools specifically designed to predict peptide toxicity accurately and rapidly, facilitating the identification of safe peptide candidates for drug development. RESULTS: We provide here a novel computational approach, CAPTP, which leverages the power of convolution and self-attention to enhance the prediction of peptide toxicity from amino acid sequences. CAPTP demonstrates outstanding performance, achieving a Matthews correlation coefficient of approximately 0.82 in both cross-validation settings and on independent test datasets. This performance surpasses that of existing state-of-the-art peptide toxicity predictors. Importantly, CAPTP maintains its robustness and generalizability even when dealing with data imbalances. Further analysis by CAPTP reveals that certain sequential patterns, particularly in the head and central regions of peptides, are crucial in determining their toxicity. This insight can significantly inform and guide the design of safer peptide drugs. AVAILABILITY AND IMPLEMENTATION: The source code for CAPTP is freely available at https://github.com/jiaoshihu/CAPTP.
Affiliation(s)
- Shihu Jiao
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Tetsuya Sakurai
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Ruijun Liu
- School of Software, Beihang University, Beijing 100191, China
|
44
|
Wei H, Gao L, Wu S, Jiang Y, Liu B. DiSMVC: a multi-view graph collaborative learning framework for measuring disease similarity. Bioinformatics 2024; 40:btae306. [PMID: 38715444 PMCID: PMC11256965 DOI: 10.1093/bioinformatics/btae306] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 04/19/2024] [Accepted: 05/05/2024] [Indexed: 05/30/2024] Open
Abstract
MOTIVATION: Exploring potential associations between diseases can help in understanding pathological mechanisms of diseases and facilitating the discovery of candidate biomarkers and drug targets, thereby promoting disease diagnosis and treatment. Some computational methods have been proposed for measuring disease similarity. However, these methods describe diseases without considering their latent multi-molecule regulation and valuable supervision signals, resulting in limited biological interpretability and limited efficiency in capturing association patterns. RESULTS: In this study, we propose a new computational method named DiSMVC. Different from existing predictors, DiSMVC designs a supervised graph collaborative framework to measure disease similarity. Multiple bio-entity associations related to genes and miRNAs are integrated via cross-view graph contrastive learning to extract informative disease representations, and then association pattern joint learning is implemented to compute disease similarity by incorporating phenotype-annotated disease associations. The experimental results show that DiSMVC can draw discriminative characteristics for disease pairs and outperforms other state-of-the-art methods. As a result, DiSMVC is a promising method for predicting disease associations with molecular interpretability. AVAILABILITY AND IMPLEMENTATION: Datasets and source codes are available at https://github.com/Biohang/DiSMVC.
Affiliation(s)
- Hang Wei
- School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi 710126, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi 710126, China
| | - Shuai Wu
- School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi 710126, China
| | - Yina Jiang
- Department of Basic Medicine, Shaanxi University of Chinese Medicine, Xianyang, Shaanxi 712046, China
| | - Bin Liu
- Faculty of Engineering, Shenzhen MSU-BIT University, Shenzhen, Guangdong 518172, China
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
| |
|
45
|
Ma X, Li Z, Du Z, Xu Y, Chen Y, Zhuo L, Fu X, Liu R. Advancing cancer driver gene detection via Schur complement graph augmentation and independent subspace feature extraction. Comput Biol Med 2024; 174:108484. [PMID: 38643595 DOI: 10.1016/j.compbiomed.2024.108484] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 03/18/2024] [Accepted: 04/15/2024] [Indexed: 04/23/2024]
Abstract
Accurately identifying cancer driver genes (CDGs) is crucial for guiding cancer treatment and has recently received great attention from researchers. However, the high complexity and heterogeneity of cancer gene regulatory networks limit the prediction accuracy of existing deep learning models. To address this, we introduce a model called SCIS-CDG that uses Schur complement graph augmentation and independent subspace feature extraction to predict potential CDGs. First, a random Schur complement strategy is adopted to generate two augmented views of the gene network within a graph contrastive learning framework. The rapid randomization of this strategy enhances the model's generalization and its ability to handle complex networks. Upholding the Schur complement principle in expectation helps preserve the vital structure of the original gene network in the augmented views. Subsequently, we extract features using multiple independent subspaces, each trained with independent weights to reduce inter-subspace dependence and improve the model's expressiveness. Concurrently, we introduce a feature expansion component based on the structure of the gene network to address issues arising from the limited dimensionality of node features. This component can also alleviate, to some extent, the challenges posed by the heterogeneity of cancer gene networks. Finally, we integrate a learnable attention weight mechanism into the graph neural network (GNN) encoder, using feature expansion to optimize the significance of the various feature levels in the prediction task. In extensive experimental validation, the SCIS-CDG model identified known CDGs with high efficiency and uncovered potential unknown CDGs in external datasets. Compared with previous conventional GNN models in particular, its performance improved significantly.
The code and data are publicly available at: https://github.com/mxqmxqmxq/SCIS-CDG.
Affiliation(s)
- Xinqian Ma
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, 325027, Wenzhou, China
| | - Zhen Li
- School of Computer Science of Information Technology, Qiannan Normal University for Nationalities, Duyun, Guizhou 558000, China; Institute of Computational Science and Technology, Guangzhou University, 510000, Guangzhou, China
| | - Zhenya Du
- Guangzhou Xinhua University, 510520, Guangzhou, China
| | - Yan Xu
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, 325027, Wenzhou, China
| | - Yifan Chen
- College of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha, Hunan, 410004, China
| | - Linlin Zhuo
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, 325027, Wenzhou, China.
| | - Xiangzheng Fu
- College of Computer Science and Electronic Engineering, Hunan University, 410012, Changsha, China
| | - Ruijun Liu
- School of Software, Beihang University, Beijing, China.
| |
|
46
|
Chen W, Zhang Y, Wu W, Yang H, Huang W. Machine learning-based predictive model for abdominal diseases using physical examination datasets. Comput Biol Med 2024; 173:108249. [PMID: 38531251 DOI: 10.1016/j.compbiomed.2024.108249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Revised: 02/21/2024] [Accepted: 03/06/2024] [Indexed: 03/28/2024]
Abstract
Abdominal ultrasound is a key non-invasive imaging method for diagnosing liver, kidney, and gallbladder diseases. Despite its clinical significance, not all individuals can undergo abdominal ultrasonography during routine health check-ups because of limitations in equipment, cost, and time. This study aims to use basic physical examination data to predict the risk of liver, kidney, and gallbladder diseases that can be diagnosed via abdominal ultrasound. The basic physical examination data comprise gender, age, height, weight, BMI, pulse, systolic blood pressure (SBP), diastolic blood pressure (DBP), high-density lipoprotein (HDL), low-density lipoprotein (LDL), total cholesterol, triglycerides, fasting blood glucose (FBG), and uric acid. From these data, we established seven single-label predictive models and one multi-label predictive model to predict a range of abdominal diseases. The single-label models, built with the XGBoost algorithm, targeted fatty liver (with an Area Under the Curve (AUC) of 0.9344), liver deposits (AUC: 0.8221), liver cysts (AUC: 0.7928), gallbladder polyps (AUC: 0.7508), kidney stones (AUC: 0.7853), kidney cysts (AUC: 0.8241), and kidney crystals (AUC: 0.7536). Furthermore, a multi-label model capable of predicting multiple conditions simultaneously was built with an FCN and achieved an AUC of 0.6344. We conducted interpretability analysis on these models to make them easier to understand and apply in clinical settings. The insights gained from this analysis are crucial for developing targeted disease prevention strategies. This study represents a significant advancement in using physical examination data to predict ultrasound results, offering a novel approach to the early diagnosis and prevention of abdominal diseases.
Affiliation(s)
- Wei Chen
- Zhejiang Academy of Traditional Chinese Medicine Culture, Zhejiang Chinese Medical University, Hangzhou, China; Four Provincial Marginal Traditional Chinese Medicine Hospitals (Quzhou Traditional Chinese Medicine Hospital) Affiliated to Zhejiang University of Traditional Chinese Medicine, Quzhou, China
| | - YuJie Zhang
- Zhejiang Academy of Traditional Chinese Medicine Culture, Zhejiang Chinese Medical University, Hangzhou, China
| | - Weili Wu
- Four Provincial Marginal Traditional Chinese Medicine Hospitals (Quzhou Traditional Chinese Medicine Hospital) Affiliated to Zhejiang University of Traditional Chinese Medicine, Quzhou, China
| | - Hui Yang
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China.
| | - Wenxiu Huang
- Zhejiang Academy of Traditional Chinese Medicine Culture, Zhejiang Chinese Medical University, Hangzhou, China.
| |
|
47
|
Ren L, Huang D, Liu H, Ning L, Cai P, Yu X, Zhang Y, Luo N, Lin H, Su J, Zhang Y. Applications of single‑cell omics and spatial transcriptomics technologies in gastric cancer (Review). Oncol Lett 2024; 27:152. [PMID: 38406595 PMCID: PMC10885005 DOI: 10.3892/ol.2024.14285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Accepted: 01/19/2024] [Indexed: 02/27/2024] Open
Abstract
Gastric cancer (GC) is a prominent contributor to global cancer-related mortalities, and a deeper understanding of its molecular characteristics and tumor heterogeneity is required. Single-cell omics and spatial transcriptomics (ST) technologies have revolutionized cancer research by enabling the exploration of cellular heterogeneity and molecular landscapes at the single-cell level. In the present review, an overview of the advancements in single-cell omics and ST technologies and their applications in GC research is provided. Firstly, multiple single-cell omics and ST methods are discussed, highlighting their ability to offer unique insights into gene expression, genetic alterations, epigenomic modifications, protein expression patterns and cellular location in tissues. Furthermore, a summary is provided of key findings from previous research on single-cell omics and ST methods used in GC, which have provided valuable insights into genetic alterations, tumor diagnosis and prognosis, tumor microenvironment analysis, and treatment response. In summary, the application of single-cell omics and ST technologies has revealed the levels of cellular heterogeneity and the molecular characteristics of GC, and holds promise for improving diagnostics, personalized treatments and patient outcomes in GC.
Affiliation(s)
- Liping Ren
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, Sichuan 611844, P.R. China
| | - Danni Huang
- Department of Radiology, Central South University Xiangya School of Medicine Affiliated Haikou People's Hospital, Haikou, Hainan 570208, P.R. China
| | - Hongjiang Liu
- School of Computer Science and Technology, Aba Teachers College, Aba, Sichuan 624099, P.R. China
| | - Lin Ning
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, Sichuan 611844, P.R. China
| | - Peiling Cai
- School of Basic Medical Sciences, Chengdu University, Chengdu, Sichuan 610106, P.R. China
| | - Xiaolong Yu
- Hainan Yazhou Bay Seed Laboratory, Sanya Nanfan Research Institute, Material Science and Engineering Institute of Hainan University, Sanya, Hainan 572025, P.R. China
| | - Yang Zhang
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, Sichuan 611137, P.R. China
| | - Nanchao Luo
- School of Computer Science and Technology, Aba Teachers College, Aba, Sichuan 624099, P.R. China
| | - Hao Lin
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, P.R. China
| | - Jinsong Su
- Research Institute of Integrated Traditional Chinese Medicine and Western Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu, Sichuan 611137, P.R. China
| | - Yinghui Zhang
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, Sichuan 611844, P.R. China
| |
|
48
|
Zulfiqar H, Ahmad RM, Raza A, Shahzad S, Lin H. Promoter Prediction in Agrobacterium tumefaciens Strain C58 by Using Artificial Intelligence Strategies. Methods Mol Biol 2024; 2844:33-44. [PMID: 39068330 DOI: 10.1007/978-1-0716-4063-0_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]
Abstract
Promoters are the genomic regions upstream of genes that RNA polymerase binds in order to initiate gene transcription. Understanding the regulation of gene expression depends on being able to identify promoters, because they are the most important component of gene expression. The goal of this study was to create a machine learning-based model to predict promoters in Agrobacterium tumefaciens (A. tumefaciens) strain C58. Nucleotide density (ND), k-mer, and one-hot encodings were used to represent the promoter sequences. A support vector machine (SVM) with fivefold cross-validation and incremental feature selection (IFS) was used to optimize the generated features. The optimized features were then fed into a random forest (RF) classifier to distinguish promoter sequences. Tenfold cross-validation (CV) analysis showed that the proposed model achieves an accuracy of 84.22%.
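The three sequence encodings named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the A/C/G/T column order, the choice of k, and the nucleotide density formula (the running frequency of each base up to its position, one commonly used definition) are all assumptions made for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumed column order for the encodings below


def one_hot(seq):
    """Encode each base as a 4-element indicator vector (A, C, G, T order)."""
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq.upper()]


def kmer_frequencies(seq, k=2):
    """Return normalized counts of all 4**k k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    return [counts[m] / total if total > 0 else 0.0 for m in kmers]


def nucleotide_density(seq):
    """Assumed ND definition: for position i (1-based), the fraction of the
    first i bases that equal the base at position i."""
    seen = {}
    densities = []
    for i, base in enumerate(seq.upper(), start=1):
        seen[base] = seen.get(base, 0) + 1
        densities.append(seen[base] / i)
    return densities
```

In a pipeline like the one described, the three vectors would typically be concatenated per sequence before feature selection; that step is omitted here because the abstract does not specify how the features are combined.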
Affiliation(s)
- Hasan Zulfiqar
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China.
| | - Ramala Masood Ahmad
- Department of Plant Breeding and Genetics, University of Agriculture Faisalabad, Faisalabad, Pakistan
| | - Ali Raza
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China
| | - Sana Shahzad
- Institute of Horticultural Sciences, University of Agriculture Faisalabad, Faisalabad, Pakistan
| | - Hao Lin
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China.
| |
|