1
Xiao Z, Li Y, Ding Y, Yu L. EPIPDLF: a pretrained deep learning framework for predicting enhancer-promoter interactions. Bioinformatics 2025; 41:btae716. [PMID: 40036975 PMCID: PMC12057809 DOI: 10.1093/bioinformatics/btae716] [Received: 09/09/2024] [Revised: 11/04/2024] [Accepted: 02/26/2025] [Indexed: 03/06/2025]
Abstract
MOTIVATION Enhancers and promoters, as regulatory DNA elements, play pivotal roles in gene expression, homeostasis, and disease development across various biological processes. With advancing research, it has been uncovered that distal enhancers may engage with nearby promoters to modulate the expression of target genes. This discovery holds significant implications for deepening our comprehension of various biological mechanisms. In recent years, numerous high-throughput wet-lab techniques have been created to detect possible interactions between enhancers and promoters. However, these experimental methods are often time-intensive and costly. RESULTS To tackle this issue, we have created an innovative deep learning approach, EPIPDLF, which utilizes advanced deep learning techniques to predict EPIs based solely on genomic sequences in an interpretable manner. Comparative evaluations across six benchmark datasets demonstrate that EPIPDLF consistently exhibits superior performance in EPI prediction. Additionally, by incorporating interpretable analysis mechanisms, our model enables the elucidation of learned features, aiding in the identification and biological analysis of important sequences. AVAILABILITY AND IMPLEMENTATION The source code and data are available at: https://github.com/xzc196/EPIPDLF.
Affiliation(s)
- Zhichao Xiao
  - School of Computer Science and Technology, Xidian University, Xi'an 710075, China
- Yan Li
  - School of Management, Xi'an Polytechnic University, Xi'an 710075, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an 710075, China
2
Temesgen SA, Ahmad B, Grace-Mercure BK, Liu M, Liu L, Lin H, Deng K. Exploring species taxonomic kingdom using information entropy and nucleotide compositional features of coding sequences based on machine learning methods. Methods 2025; 240:165-179. [PMID: 40280261 DOI: 10.1016/j.ymeth.2025.03.023] [Received: 01/28/2025] [Revised: 03/08/2025] [Accepted: 03/31/2025] [Indexed: 04/29/2025]
Abstract
The flow of genetic information from DNA to protein is governed by the central dogma of molecular biology. Genetic drift and mutations usually lead to changes in DNA composition, thereby affecting the coding sequences (CDS) that encode functional proteins. Analyzing the nucleotide distribution in the coding regions of species is crucial for understanding their evolution. In this study, we applied Markov processes to analyze codon formation in 37,031,061 CDSs across 3,735 species genomes, spanning viruses, archaea, bacteria, and eukaryotes, to explore compositional changes. Our results revealed species preferences for different nucleotides. Information entropies and Markov information densities show that eukaryotes exhibit higher redundancy, followed by viruses, suggesting more gene duplication in eukaryotes and high mutation rates in viruses. Evolutionary trends showed an increase in information entropy and a decrease in Markov entropy, with negative correlations between first- and second-order Markov information densities. Furthermore, uniform manifold approximation and projection (UMAP) was used to reduce information redundancy for revealing unique evolutionary patterns in species classification. The machine learning methods demonstrated excellent performance in species classification accuracy, providing profound insights into CDS evolution and protein synthesis.
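As a rough illustration of the two entropy notions mentioned in this abstract, the sketch below computes the Shannon entropy of single-nucleotide composition and a first-order Markov (conditional) entropy for one sequence. These are the textbook definitions; the abstract does not spell out the paper's exact formulas for "Markov information density", so the function names and definitions here are assumptions and not the authors' method.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the single-nucleotide composition."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def markov_entropy(seq: str) -> float:
    """First-order conditional entropy H(X_i | X_{i-1}) in bits.

    Textbook definition, assumed here for illustration only.
    """
    pair_counts = Counter(zip(seq, seq[1:]))  # counts of adjacent pairs
    prev_counts = Counter(seq[:-1])           # counts of the conditioning symbol
    total_pairs = len(seq) - 1
    h = 0.0
    for (prev, _cur), c in pair_counts.items():
        p_pair = c / total_pairs              # joint probability of the pair
        p_cond = c / prev_counts[prev]        # P(cur | prev)
        h -= p_pair * math.log2(p_cond)
    return h
```

A uniform composition such as `"ACGT"` reaches the maximum of 2 bits per nucleotide, while a strictly periodic sequence has zero conditional entropy because each nucleotide fully determines the next. Both functions assume a non-empty sequence (at least two symbols for `markov_entropy`).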
Affiliation(s)
- Sebu Aboma Temesgen
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Basharat Ahmad
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Minghao Liu
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Li Liu
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Hao Lin
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Kejun Deng
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China.
3
Sheng N, Qiao J, Wei L, Shi H, Guo H, Yang C. Computational models for prediction of m6A sites using deep learning. Methods 2025; 240:113-124. [PMID: 40268153 DOI: 10.1016/j.ymeth.2025.04.011] [Received: 01/07/2025] [Revised: 04/02/2025] [Accepted: 04/07/2025] [Indexed: 04/25/2025]
Abstract
RNA modifications play a crucial role in enhancing the structural and functional diversity of RNA molecules and regulating various stages of the RNA life cycle. Among these modifications, N6-Methyladenosine (m6A) is the most common internal modification in eukaryotic mRNAs and has been extensively studied over the past decade. Accurate identification of m6A modification sites is essential for understanding their function and underlying mechanisms. Traditional methods predominantly rely on machine learning techniques to recognize m6A sites, which often fail to capture the contextual features of these sites comprehensively. In this study, we comprehensively summarize previously published methods based on machine learning and deep learning. We also validate multiple deep learning approaches on a benchmark dataset, including methods previously underutilized in m6A site prediction, pre-trained models specifically designed for biological sequences, and other basic deep learning methods. Additionally, we analyze the dataset features and interpret the model's predictions to enhance understanding. Our experimental results clearly demonstrate the effectiveness of the deep learning models, elucidating their strong potential in accurately recognizing m6A modification sites.
Affiliation(s)
- Nan Sheng
  - School of Software, Shandong University, Jinan 250101, PR China
- Jianbo Qiao
  - School of Software, Shandong University, Jinan 250101, PR China
- Leyi Wei
  - School of Software, Shandong University, Jinan 250101, PR China
- Hua Shi
  - School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, PR China
- Huannan Guo
  - Beidahuang Industry Group General Hospital, PR China.
- Changshun Yang
  - Department of Gastrointestinal Surgery, Fuzhou University Affiliated Provincial Hospital, Fuzhou 350004, PR China.
4
Lu R, Qiao J, Li K, Zhao Y, Jin J, Cui F, Zhang Z, Manavalan B, Wei L. ERNIE-ac4C: A Novel Deep Learning Model for Effectively Predicting N4-acetylcytidine Sites. J Mol Biol 2025; 437:168978. [PMID: 39900287 DOI: 10.1016/j.jmb.2025.168978] [Received: 07/12/2024] [Revised: 01/05/2025] [Accepted: 01/28/2025] [Indexed: 02/05/2025]
Abstract
RNA modifications are known to play a critical role in gene regulation and cellular processes. Specifically, N4-acetylcytidine (ac4C) modification has emerged as a significant marker involved in mRNA translation efficiency, stability, and various diseases. Accurate identification of ac4C modification sites is essential for unraveling its functional implications. However, currently available experimental methods suffer from drawbacks such as lengthy detection times, complexity, and high costs, resulting in low efficiency and accuracy in prediction. Although several bioinformatics methods have been proposed and have advanced the prediction of ac4C modification sites, there is still ample room for improvement. In this research, we propose a novel deep learning model, ERNIE-ac4C, which combines the ERNIE-RNA language model and a two-dimensional Convolutional Neural Network (CNN). ERNIE-ac4C utilizes the fusion of sequence features and attention map features to predict ac4C modification sites. ERNIE-ac4C surpasses other state-of-the-art deep learning methods, demonstrating superior accuracy and effectiveness. The availability of the code on GitHub (https://github.com/lrlbcxdd/ERNIEac4C.git) and our openness to feedback from the research community contribute to the model's accessibility and its potential for further advancements. Our study provides valuable insights into ac4C research and enhances our understanding of the functional consequences of RNA modifications.
Affiliation(s)
- Ronglin Lu
  - School of Software, Shandong University, Jinan 250101 China
- Jianbo Qiao
  - School of Software, Shandong University, Jinan 250101 China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101 China
- Kefei Li
  - School of Software, Shandong University, Jinan 250101 China
- Yanxi Zhao
  - School of Software, Shandong University, Jinan 250101 China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101 China
- Junru Jin
  - School of Software, Shandong University, Jinan 250101 China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228 China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228 China
- Balachandran Manavalan
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419 Gyeonggi-do, Republic of Korea.
- Leyi Wei
  - Centre for Artificial Intelligence driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao SAR, China.
5
Zuo Y, Fang X, Chen J, Ji J, Li Y, Wu Z, Liu X, Zeng X, Deng Z, Yin H, Zhao A. MlyPredCSED: based on extreme point deviation compensated clustering combined with cross-scale convolutional neural networks to predict multiple lysine sites in human. Brief Bioinform 2025; 26:bbaf189. [PMID: 40285360 PMCID: PMC12031725 DOI: 10.1093/bib/bbaf189] [Received: 09/26/2024] [Revised: 03/27/2025] [Accepted: 04/03/2025] [Indexed: 04/29/2025]
Abstract
In post-translational modification, covalent bonds on lysine and attached chemical groups significantly change proteins' physical and chemical properties. They shape protein structures, enhance function and stability, and are vital for physiological processes, affecting health and disease through mechanisms like gene expression, signal transduction, protein degradation, and cell metabolism. Although lysine (K) modification sites are considered among the most common types of post-translational modifications in proteins, research on K-PTMs has largely overlooked the synergistic effects between different modifications and lacked the techniques to address the problem of sample imbalance. Based on this, the Extreme Point Deviation Compensated Clustering (EPDCC) Undersampling algorithm was proposed in this study and combined with Cross-Scale Convolutional Neural Networks (CSCNNs) to develop a novel computational tool, MlyPredCSED, for simultaneously predicting multiple lysine modification sites. MlyPredCSED employs Multi-Label Position-Specific Triad Amino Acid Propensity and the physicochemical properties of amino acids to enhance the richness of sequence information. To address the challenge of sample imbalance, the innovative EPDCC Undersampling technique was introduced to adjust the majority class samples. The model's training and testing phase relies on the advanced CSCNN framework. MlyPredCSED, through cross-validation and testing, outperformed existing models, especially in complex categories with multiple modification sites. This research not only provides an efficient method for the identification of lysine modification sites but also demonstrates its value in biological research and drug development. To facilitate efficient use of MlyPredCSED by researchers, we have specifically developed an accessible free web tool: http://www.mlypredcsed.com.
Affiliation(s)
- Yun Zuo
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Xingze Fang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Jiankang Chen
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Jiayi Ji
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Yuwen Li
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Zeyu Wu
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Xiangrong Liu
  - Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen 361005, China
- Xiangxiang Zeng
  - School of Information Science and Engineering, Hunan University, Changsha, China
- Zhaohong Deng
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Hongwei Yin
  - Department of Oncology, The First Affiliated Hospital of Naval Military Medical University, Shanghai 200000, China
- Anjing Zhao
  - Department of Oncology, The First Affiliated Hospital of Naval Military Medical University, Shanghai 200000, China
6
Wu CY, Xu ZX, Li N, Qi DY, Wu HY, Ding H, Jin YT. Predicting cyclins based on key features and machine learning methods. Methods 2025; 234:112-119. [PMID: 39694304 DOI: 10.1016/j.ymeth.2024.12.009] [Received: 08/03/2024] [Revised: 12/12/2024] [Accepted: 12/15/2024] [Indexed: 12/20/2024]
Abstract
Cyclins are a group of proteins that regulate the cell cycle process by modulating various stages of cell division to ensure correct cell proliferation, differentiation, and apoptosis. Research on cyclins is crucial for understanding the biological functions and pathological states of cells. However, current research on cyclin identification based on machine learning focuses only on accuracy, ignoring the interpretability of features. Therefore, in this study, we pay more attention to the interpretation and analysis of key features associated with cyclins. First, we developed an SVM-based model for identifying cyclins with an accuracy of 92.8% through 5-fold cross-validation. Then we analyzed the physicochemical properties of the 14 key features used in the model construction and identified the G and charged C1 features that are critical for distinguishing cyclins from non-cyclins. Furthermore, we constructed an SVM-based model using only these two features, with an accuracy of 81.3% through leave-one-out cross-validation. Our study shows that cyclins differ from non-cyclins in their physicochemical properties and that using only two features can achieve good prediction accuracy.
Affiliation(s)
- Cheng-Yan Wu
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teachers College, Baotou 014010, China.
- Zhi-Xue Xu
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teachers College, Baotou 014010, China.
- Nan Li
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teachers College, Baotou 014010, China.
- Dan-Yang Qi
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teachers College, Baotou 014010, China.
- Hong-Ye Wu
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teachers College, Baotou 014010, China.
- Hui Ding
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
- Yan-Ting Jin
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
7
Zuo Y, Wan M, Shen Y, Wang X, He W, Bi Y, Liu X, Deng Z. ILYCROsite: Identification of lysine crotonylation sites based on FCM-GRNN undersampling technique. Comput Biol Chem 2024; 113:108212. [PMID: 39277959 DOI: 10.1016/j.compbiolchem.2024.108212] [Received: 07/08/2024] [Revised: 09/02/2024] [Accepted: 09/12/2024] [Indexed: 09/17/2024]
Abstract
Protein lysine crotonylation is an important post-translational modification that regulates various cellular activities. For example, histone crotonylation affects chromatin structure and promotes histone replacement. Identification and understanding of lysine crotonylation sites is crucial in the field of protein research. However, due to the increasing amount of non-histone crotonylation sites, existing classifiers based on traditional machine learning may encounter performance limitations. In order to address this problem, a novel deep learning-based model for identifying crotonylation sites is presented in this study, given the unique advantages of deep learning techniques for sequence data analysis. In this study, an MLP-Attention-based model was developed for the identification of crotonylation sites. Firstly, three feature extraction strategies, namely Amino Acid Composition, K-mer, and Distance-based residue features extraction strategy, were used to encode crotonylated and non-crotonylated sequences. Then, in order to balance the training dataset, the FCM-GRNN undersampling algorithm combining fuzzy clustering and generalized neural network approaches was introduced. Finally, to improve the effectiveness of crotonylation site identification, we explored various classification algorithms, and based on the relevant experimental performance comparisons, the multilayer perceptron (MLP) combined with the superimposed self-attention mechanism was finally selected to construct the prediction model ILYCROsite. The results obtained from independent testing and five-fold cross-validation demonstrated that the model proposed in this study, ILYCROsite, had excellent performance. Notably, on the independent test set, ILYCROsite achieves an AUC value of 87.93 %, which is significantly better than the existing state-of-the-art models. 
In addition, SHAP (Shapley Additive exPlanations) values were used to analyze the importance of features and their impact on model predictions. Meanwhile, in order to facilitate researchers to use the prediction model constructed in this study, we developed a prediction program to identify the crotonylation sites in a given protein sequence. The data and code for this program are available at: https://github.com/wmqskr/ILYCROsite.
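Two of the encodings named in this abstract, Amino Acid Composition and K-mer, can be sketched in a few lines. This is a generic illustration of those standard encodings, not the ILYCROsite code; the function names, the 20-letter alphabet ordering, and the default `k=2` are assumptions made for the example.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide: str) -> list[float]:
    """Fraction of each standard amino acid in the peptide (20 values)."""
    counts = Counter(peptide)
    n = len(peptide)
    return [counts[a] / n for a in AMINO_ACIDS]


def kmer_frequencies(peptide: str, k: int = 2) -> dict[str, float]:
    """Relative frequency of every possible k-mer over the standard alphabet."""
    windows = [peptide[i:i + k] for i in range(len(peptide) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {
        "".join(p): counts["".join(p)] / total
        for p in product(AMINO_ACIDS, repeat=k)
    }
```

For a window centred on a candidate lysine, the two outputs would typically be concatenated into a single feature vector before classification. Both functions assume a non-empty peptide of length at least `k`.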
Affiliation(s)
- Yun Zuo
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China.
- Minquan Wan
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Yang Shen
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Xinheng Wang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Wenying He
  - School of Artificial Intelligence, Hebei University of Technology, Tianjin 300130, China
- Yue Bi
  - Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia
- Xiangrong Liu
  - Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen 361005, China
- Zhaohong Deng
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China.
8
Wu CY, Xu ZX, Li N, Qi DY, Hao ZH, Wu HY, Gao R, Jin YT. Accurately identifying positive and negative regulation of apoptosis using fusion features and machine learning methods. Comput Biol Chem 2024; 113:108207. [PMID: 39265463 DOI: 10.1016/j.compbiolchem.2024.108207] [Received: 06/28/2024] [Revised: 08/20/2024] [Accepted: 09/06/2024] [Indexed: 09/14/2024]
Abstract
Apoptotic proteins play a crucial role in the apoptosis process, ensuring a balance between cell proliferation and death. Thus, further elucidating the regulatory mechanisms of apoptosis will enhance our understanding of their functions. However, the development of computational methods to accurately identify positive and negative regulation of apoptosis remains a significant challenge. This work proposes a machine learning model based on multi-feature fusion to effectively identify the roles of positive and negative regulation of apoptosis. Initially, we constructed a reliable benchmark dataset containing 200 positive regulation of apoptosis proteins and 241 negative regulation of apoptosis proteins. Subsequently, we developed a classifier that combines the support vector machine (SVM) with pseudo composition of k-spaced amino acid pairs (PseCKSAAP), composition transition distribution (CTD), dipeptide deviation from expected mean (DDE), and PSSM-composition to identify these proteins. Analysis of variance (ANOVA) was employed to select optimized features that could yield the maximum prediction performance. Evaluation of the proposed model on independent data yielded an accuracy of 0.781 and an AUROC of 0.837, demonstrating our model's potent capabilities.
Affiliation(s)
- Cheng-Yan Wu
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
- Zhi-Xue Xu
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
- Nan Li
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
- Dan-Yang Qi
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
- Zhi-Hong Hao
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
- Hong-Ye Wu
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China.
- Ru Gao
  - The People's Hospital of Wenjiang, Chengdu, Sichuan 611130, China.
- Yan-Ting Jin
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
9
Liu L, Han L, Han K, Zhang Z, Zhang H, Zhang L. Identification of co-localised transcription factors based on paired motifs analysis. IET Syst Biol 2024; 18:238-249. [PMID: 39588827 DOI: 10.1049/syb2.12104] [Received: 09/08/2024] [Revised: 10/02/2024] [Accepted: 10/24/2024] [Indexed: 11/27/2024]
Abstract
The interaction of transcription factors (TFs) with DNA precisely regulates gene transcription. In mammalian cells, thousands of TFs often interact with DNA cis-regulatory elements in a combinatorial manner rather than acting alone. The identification of cooperativity between TFs can help to explore the mechanism of transcriptional regulation. However, little is known about the cooperative patterns of TFs in the genome. To identify which TFs prefer co-localisation, the authors conducted a paired motif analysis in the accessible regions of the human genome based on the Poisson background model. In particular, the authors distinguish cooperatively binding TFs from competitively binding TFs according to the distance between TF motifs. In the K562 cell line, the authors find that TFs from the same family, such as the FOS_JUN family, always compete for the same binding sites, whereas KLF family TFs show significant cooperative binding in the adjacent region. Furthermore, the comparative analysis across 16 human cell lines indicates that most TF combination patterns are conserved, but there are still some cell-line-specific patterns. Finally, in human prostate cancer cells (PC-3) and human prostate normal cells (RWPE-2), the authors investigate the specific TF combination patterns in the disease cell and normal cell. The results show that the cooperative binding TF pairs shared by PC-3 and RWPE-2 account for over 90%. The authors also identify 26 specific TF combination pairs in PC-3 cancer cells.
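A Poisson background model for motif pairs can be illustrated with a simple enrichment test. The sketch below assumes that, under independence, the expected number of regions containing both motifs is the product of the two individual counts divided by the number of regions, and it scores the observed co-occurrence with a Poisson upper-tail probability. The abstract does not give the exact model, so the function names and this independence assumption are hypothetical.

```python
import math


def expected_cooccurrence(n_regions: int, n_with_a: int, n_with_b: int) -> float:
    """Expected number of regions containing both motifs if they occur independently."""
    return n_with_a * n_with_b / n_regions


def poisson_sf(k: int, lam: float) -> float:
    """Upper-tail probability P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = 0.0
    term = math.exp(-lam)  # P(X = 0)
    for i in range(k):
        cdf += term            # accumulate P(X = i)
        term *= lam / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)
```

With 1000 accessible regions, 100 containing motif A and 50 containing motif B, the expected overlap is 5 regions; `poisson_sf(observed, 5.0)` then gives the probability of seeing at least `observed` co-occurrences by chance. A distance constraint between the two motifs, as used in the study to separate cooperative from competitive binding, is not modelled here.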
Affiliation(s)
- Li Liu
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Lu Han
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
- Kaiyuan Han
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zheng Zhang
  - Computer Science and Information Systems, Murray State University, Murray, USA
- Haojiang Zhang
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Lirong Zhang
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
10
Mansoor S, Hamid S, Tuan TT, Park JE, Chung YS. Advance computational tools for multiomics data learning. Biotechnol Adv 2024; 77:108447. [PMID: 39251098 DOI: 10.1016/j.biotechadv.2024.108447] [Received: 05/19/2024] [Revised: 09/01/2024] [Accepted: 09/05/2024] [Indexed: 09/11/2024]
Abstract
The burgeoning field of bioinformatics has seen a surge in computational tools tailored for omics data analysis, driven by the heterogeneous and high-dimensional nature of omics data. In biomedical and plant science research, multi-omics data has become pivotal for predictive analytics in the era of big data, necessitating sophisticated computational methodologies. This review explores a diverse array of computational approaches which play a crucial role in processing, normalizing, integrating, and analyzing omics data. Notable methods such as similarity-based methods, network-based approaches, correlation-based methods, Bayesian methods, fusion-based methods, and multivariate techniques, among others, are discussed in detail, each offering unique functionalities to address the complexities of multi-omics data. Furthermore, this review underscores the significance of computational tools in advancing our understanding of data and their transformative impact on research.
Affiliation(s)
- Sheikh Mansoor
  - Department of Plant Resources and Environment, Jeju National University, 63243, Republic of Korea
- Saira Hamid
  - Watson Crick Centre for Molecular Medicine, Islamic University of Science and Technology, Awantipora, Pulwama, J&K, India
- Thai Thanh Tuan
  - Department of Plant Resources and Environment, Jeju National University, 63243, Republic of Korea; Multimedia Communications Laboratory, University of Information Technology, Ho Chi Minh city 70000, Vietnam; Multimedia Communications Laboratory, Vietnam National University, Ho Chi Minh city 70000, Vietnam
- Jong-Eun Park
  - Department of Animal Biotechnology, College of Applied Life Science, Jeju National University, Jeju, Jeju-do, Republic of Korea.
- Yong Suk Chung
  - Department of Plant Resources and Environment, Jeju National University, 63243, Republic of Korea.
11
Wei Y, Zhou T, Zhai Y, Yu L, Zou Q. FORAlign: accelerating gap-affine DNA pairwise sequence alignment using FOR-blocks based on Four Russians approach with linear space complexity. Brief Bioinform 2024; 26:bbaf061. [PMID: 39987460 PMCID: PMC11846685 DOI: 10.1093/bib/bbaf061] [Received: 11/21/2024] [Revised: 12/30/2024] [Accepted: 02/19/2025] [Indexed: 02/25/2025]
Abstract
Pairwise sequence alignment (PSA) serves as the cornerstone in computational bioinformatics, facilitating multiple sequence alignment and phylogenetic analysis. This paper introduces the FORAlign algorithm, leveraging the Four Russians algorithm with identical upper-bound time and space complexity as the Hirschberg divide-and-conquer PSA algorithm, aimed at accelerating Hirschberg PSA algorithm in parallel. Particularly notable is its capability to achieve up to 16.79 times speedup when aligning sequences with low sequence similarity, compared to the conventional Needleman-Wunsch PSA method using non-heuristic methods. Empirical evaluations underscore FORAlign's superiority over existing wavefront alignment (WFA) series software, especially in scenarios characterized by low sequence similarity during PSA tasks. Our method is capable of directly aligning monkeypox sequences with other sequences using non-heuristic methods. The algorithm was implemented within the FORAlign library, providing functionality for PSA and foundational support for multiple sequence alignment and phylogenetic trees. The FORAlign library is freely available at https://github.com/malabz/FORAlign.
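For readers unfamiliar with gap-affine alignment, the sketch below computes the optimal global alignment score with affine gap penalties while keeping only one row of the dynamic-programming table in memory. It is the plain quadratic-time Gotoh recurrence, shown only to illustrate the scoring model and the linear-space idea; it is not the FORAlign implementation, it does not use Four Russians blocks, and it omits the Hirschberg-style traceback. The scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumed defaults, with a gap of length L costing `gap_open + L * gap_extend`.

```python
NEG_INF = float("-inf")


def affine_gap_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap_open: int = 2, gap_extend: int = 1) -> float:
    """Optimal global alignment score under affine gap penalties, O(len(b)) memory."""
    m = len(b)
    # Row 0: the only option is one leading gap of length j.
    h_prev = [0] + [-(gap_open + j * gap_extend) for j in range(1, m + 1)]
    f = [NEG_INF] * (m + 1)  # best score ending in a vertical gap, per column
    for i in range(1, len(a) + 1):
        # Column 0: one leading gap of length i.
        h_cur = [-(gap_open + i * gap_extend)] + [0] * m
        e = NEG_INF          # best score ending in a horizontal gap in this row
        for j in range(1, m + 1):
            # Extend an existing gap or open a new one.
            e = max(e - gap_extend, h_cur[j - 1] - gap_open - gap_extend)
            f[j] = max(f[j] - gap_extend, h_prev[j] - gap_open - gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(h_prev[j - 1] + sub, e, f[j])
        h_prev = h_cur
    return h_prev[m]
```

Aligning `"ACGT"` against `"AT"` with these defaults keeps the two matching ends and pays for a single gap of length two, which is cheaper than opening two separate gaps; that preference for one long gap over several short ones is what the affine model encodes.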
Affiliation(s)
- Yanming Wei
- School of Computer Science and Technology, No. 266, Xinglong Section of Xifeng Road, Chang'an Zone, Xidian University, Xi’an 710126, China
- Institute of Digital Health, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No. 1, Chengdian Road, Kecheng Zone, Quzhou 324003, China
| | - Tong Zhou
- Institute of Digital Health, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No. 1, Chengdian Road, Kecheng Zone, Quzhou 324003, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 2006, Xiyuan Avenue, Pidu Zone, Chengdu 610054, China
| | - Yixiao Zhai
- Institute of Digital Health, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No. 1, Chengdian Road, Kecheng Zone, Quzhou 324003, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 2006, Xiyuan Avenue, Pidu Zone, Chengdu 610054, China
| | - Liang Yu
- School of Computer Science and Technology, No. 266, Xinglong Section of Xifeng Road, Chang'an Zone, Xidian University, Xi’an 710126, China
| | - Quan Zou
- Institute of Digital Health, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No. 1, Chengdian Road, Kecheng Zone, Quzhou 324003, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 2006, Xiyuan Avenue, Pidu Zone, Chengdu 610054, China
|
12
|
Yu S, Liu L, Wang H, Yan S, Zheng S, Ning J, Luo R, Fu X, Deng X. AtML: An Arabidopsis thaliana root cell identity recognition tool for medicinal ingredient accumulation. Methods 2024; 231:61-69. [PMID: 39293728 DOI: 10.1016/j.ymeth.2024.09.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2024] [Revised: 08/05/2024] [Accepted: 09/12/2024] [Indexed: 09/20/2024] Open
Abstract
Arabidopsis thaliana synthesizes various medicinal compounds, and serves as a model plant for medicinal plant research. Single-cell transcriptomics technologies are essential for understanding the developmental trajectory of plant roots, facilitating the analysis of synthesis and accumulation patterns of medicinal compounds in different cell subpopulations. Although methods for interpreting single-cell transcriptomics data are rapidly advancing in Arabidopsis, challenges remain in precisely annotating cell identity due to the lack of marker genes for certain cell types. In this work, we trained a machine learning system, AtML, using sequencing datasets from six cell subpopulations, comprising a total of 6000 cells, to predict Arabidopsis root cell stages and identify biomarkers through complete model interpretability. Performance testing using an external dataset revealed that AtML achieved 96.50% accuracy and 96.51% recall. Through the interpretability provided by AtML, our model identified 160 important marker genes, contributing to the understanding of cell type annotations. In conclusion, we trained AtML to efficiently identify Arabidopsis root cell stages, providing a new tool for elucidating the mechanisms of medicinal compound accumulation in Arabidopsis roots.
Affiliation(s)
- Shicong Yu
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Rice Research Institute, Sichuan Agricultural University, Chengdu 611130, China
| | - Lijia Liu
- Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Hao Wang
- Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Shen Yan
- Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Shuqin Zheng
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Rice Research Institute, Sichuan Agricultural University, Chengdu 611130, China
| | - Jing Ning
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Rice Research Institute, Sichuan Agricultural University, Chengdu 611130, China
| | - Ruxian Luo
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Rice Research Institute, Sichuan Agricultural University, Chengdu 611130, China
| | - Xiangzheng Fu
- Research Institute of Hunan University in Chongqing, Chongqing 401120, China.
| | - Xiaoshu Deng
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Rice Research Institute, Sichuan Agricultural University, Chengdu 611130, China; Chongqing Academy of Chinese Materia Medica, Chongqing 400065, China.
|
13
|
Liu Y, Xia X, Gong Y, Song B, Zeng X. SSR-DTA: Substructure-aware multi-layer graph neural networks for drug-target binding affinity prediction. Artif Intell Med 2024; 157:102983. [PMID: 39321746 DOI: 10.1016/j.artmed.2024.102983] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 09/10/2024] [Accepted: 09/13/2024] [Indexed: 09/27/2024]
Abstract
Accurate prediction of drug-target binding affinity (DTA) is essential in the field of drug discovery. Recently, scientists have been attempting to utilize artificial intelligence prediction to screen out a significant number of ineffective compounds, thereby mitigating labor and financial losses. While graph neural networks (GNNs) have been applied to DTA, existing GNNs have limitations in effectively extracting substructural features across various sizes. Functional groups play a crucial role in modulating molecular properties, but existing GNNs struggle with feature extraction from certain motifs due to scale mismatches. Additionally, sequence-based models for target proteins lack the integration of structural information. To address these limitations, we present SSR-DTA, a multi-layer graph network capable of adapting to diverse structural sizes, which can extract richer biological features, thereby improving the robustness and accuracy of predictions. Multi-layer GNNs enable the capture of molecular motifs across different scales, ranging from atomic to macrocyclic motifs. Furthermore, we introduce BiGNN to simultaneously learn sequence and structural information. Sequence information corresponds to the primary structure of proteins, while graph information represents the tertiary structure. BiGNN assimilates richer information compared to sequence-based methods while mitigating the impact of errors from predicted structures, resulting in more accurate predictions. Through rigorous experimental evaluations conducted on four benchmark datasets, we demonstrate the superiority of SSR-DTA over state-of-the-art models. Particularly, in comparison to state-of-the-art models, SSR-DTA demonstrates an impressive 20% reduction in mean squared error on the Davis dataset and a 5% reduction on the KIBA dataset, underscoring its potential as a valuable tool for advancing DTA prediction.
Affiliation(s)
- Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410086, Hunan, China; Key Laboratory of Intelligent Computing & Signal Processing of Ministry of Education, Anhui University, Hefei, 230601, Anhui, China
| | - Xinyan Xia
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410086, Hunan, China
| | - Yongshun Gong
- School of Software, Shandong University, Jinan, 250100, Shandong, China
| | - Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410086, Hunan, China
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410086, Hunan, China.
|
14
|
Zuo Y, Fang X, Wan J, He W, Liu X, Zeng X, Deng Z. PreMLS: The undersampling technique based on ClusterCentroids to predict multiple lysine sites. PLoS Comput Biol 2024; 20:e1012544. [PMID: 39436947 PMCID: PMC11530015 DOI: 10.1371/journal.pcbi.1012544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2024] [Revised: 11/01/2024] [Accepted: 10/09/2024] [Indexed: 10/25/2024] Open
Abstract
The translated protein undergoes a specific modification process, which involves the formation of covalent bonds on lysine residues and the attachment of small chemical moieties. As a result, the protein's fundamental physicochemical properties change significantly. This change alters the protein's 3D structure and activity, enabling it to modulate key physiological processes, including inhibiting cancer cell growth, delaying ovarian aging, regulating metabolic diseases, and ameliorating depression. Consequently, identifying and understanding post-translational lysine modifications is of substantial value in biological research and drug development. Post-translational modifications (PTMs) at lysine (K) sites are among the most common protein modifications. However, research on K-PTMs has largely centered on identifying individual modification types, and techniques for balanced data analysis remain relatively scarce. In this study, a classification system is developed for the prediction of concurrent multiple modifications at a single lysine residue. First, a well-established multi-label position-specific triad amino acid propensity algorithm is used for feature encoding. Next, PreMLS, a novel ClusterCentroids undersampling algorithm based on MiniBatchKMeans, is introduced to eliminate redundant or similar majority-class samples, thereby mitigating class imbalance. A convolutional neural network architecture designed for biological sequence analysis is then constructed to predict multiple lysine modification sites. Evaluated through five-fold cross-validation and independent testing, the model significantly outperforms existing models such as iMul-kSite and predML-Site. These results help prioritize potential lysine modification sites, facilitating subsequent biological assays and advancing pharmaceutical research. To enhance accessibility, an open-access prediction script has been provided for the multi-label predictive model developed in this study.
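The cluster-centroid undersampling idea mentioned in this abstract can be sketched roughly as follows. This is not the authors' code: it assumes a binary majority/minority setting rather than the multi-label case, it swaps in plain Lloyd's k-means in place of MiniBatchKMeans to stay dependency-free, and all function names are hypothetical. The majority class is clustered into as many groups as there are minority samples, and each group is replaced by its centroid.

```python
import random


def kmeans_centroids(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumption, not MiniBatchKMeans).

    Returns k centroids as lists of floats.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the centroid with the smallest squared distance.
            nearest = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def undersample_majority(majority, minority, seed=0):
    """Replace the majority class with one centroid per minority sample."""
    k = min(len(minority), len(majority))
    return kmeans_centroids(majority, k, seed=seed), list(minority)


if __name__ == "__main__":
    majority = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    minority = [(5, 5), (6, 6)]
    reduced, kept = undersample_majority(majority, minority)
    print(len(reduced), len(kept))
```

After this step both classes have the same number of samples, and the retained majority points are synthetic centroids rather than original observations.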
Affiliation(s)
- Yun Zuo
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Xingze Fang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Jiayong Wan
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Wenying He
- School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
| | - Xiangrong Liu
- Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen, China
| | - Xiangxiang Zeng
- School of Information Science and Engineering, Hunan University, Changsha, China
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
|
15
|
Ahmed Z, Shahzadi K, Temesgen SA, Ahmad B, Chen X, Ning L, Zulfiqar H, Lin H, Jin YT. A protein pre-trained model-based approach for the identification of the liquid-liquid phase separation (LLPS) proteins. Int J Biol Macromol 2024; 277:134146. [PMID: 39067723 DOI: 10.1016/j.ijbiomac.2024.134146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2024] [Revised: 07/06/2024] [Accepted: 07/23/2024] [Indexed: 07/30/2024]
Abstract
Liquid-liquid phase separation (LLPS) regulates many biological processes including RNA metabolism, chromatin rearrangement, and signal transduction. Aberrant LLPS potentially leads to serious diseases. Therefore, the identification of the LLPS proteins is crucial. Traditionally, biochemistry-based methods for identifying LLPS proteins are costly, time-consuming, and laborious. In contrast, artificial intelligence-based approaches are fast and cost-effective and can be a better alternative to biochemistry-based methods. Previous research methods employed word2vec in conjunction with machine learning or deep learning algorithms. Although word2vec captures word semantics and relationships, it might not be effective in capturing features relevant to protein classification, like physicochemical properties, evolutionary relationships, or structural features. Additionally, other studies often focused on a limited set of features for model training, including planar π contact frequency, pi-pi, and β-pairing propensities. To overcome such shortcomings, this study first constructed a reliable dataset containing 1206 protein sequences, including 603 LLPS and 603 non-LLPS protein sequences. Then a computational model was proposed to efficiently identify the LLPS proteins by perceiving semantic information of protein sequences directly, using an ESM2-36 pre-trained model based on the transformer architecture in conjunction with a convolutional neural network. The model achieved accuracies of 85.68% and 89.67% on the training and test data, respectively, surpassing the accuracy of previous studies. The performance demonstrates the potential of our computational methods as efficient alternatives for identifying LLPS proteins.
Affiliation(s)
- Zahoor Ahmed
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China.
| | - Kiran Shahzadi
- Department of Biotechnology, Women University of Azad Jammu and Kashmir, Bagh, Azad Kashmir, Pakistan.
| | - Sebu Aboma Temesgen
- School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China.
| | - Basharat Ahmad
- School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China.
| | - Xiang Chen
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China.
| | - Lin Ning
- School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China; School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China.
| | - Hasan Zulfiqar
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China.
| | - Hao Lin
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China.
| | - Yan-Ting Jin
- School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China.
|
16
|
Zuo Y, Zhang B, He W, Bi Y, Liu X, Zeng X, Deng Z. MSlocPRED: deep transfer learning-based identification of multi-label mRNA subcellular localization. Brief Bioinform 2024; 25:bbae504. [PMID: 39401145 PMCID: PMC11472759 DOI: 10.1093/bib/bbae504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Revised: 08/19/2024] [Accepted: 09/30/2024] [Indexed: 10/17/2024] Open
Abstract
Subcellular localization of messenger ribonucleic acid (mRNA) is a universal mechanism for precise and efficient control of the translation process. Although many computational methods have been developed for predicting mRNA subcellular localization, very few are designed to handle multiple localization annotations, and their generalization performance could be improved. In this study, the prediction model MSlocPRED was constructed to identify multi-label mRNA subcellular localization. First, the preprocessed Dataset 1 and Dataset 2 are transformed into images. The proposed MDNDO-SMDU resampling technique is then used to balance the number of samples in each category of the training dataset. Finally, deep transfer learning is used to construct the predictive model MSlocPRED, which identifies subcellular localization for 16 classes (Dataset 1) and 18 classes (Dataset 2). Comparative tests of different resampling techniques show that the resampling technique proposed in this study is more effective in preprocessing for subcellular localization. The prediction results on datasets constructed by intercepting different NC end lengths (the NC ends are the 5' and 3' untranslated regions that flank the protein-coding sequence and influence mRNA function without encoding proteins themselves) show that, for both Dataset 1 and Dataset 2, prediction performance is best when the NC end is intercepted at 35 nucleotides. Both independent testing and five-fold cross-validation comparisons with established prediction tools show that MSlocPRED is significantly better than established tools at identifying multi-label mRNA subcellular localization. Additionally, SHapley Additive exPlanations was used to explain how the MSlocPRED model works during prediction. The predictive model and associated datasets are available on GitHub: https://github.com/ZBYnb1/MSlocPRED/tree/main.
Affiliation(s)
- Yun Zuo
- School of Artificial Intelligence and Computer Science, Jiangnan University, No. 1800 Lihu Avenue, Binhu District, Wuxi 214000, China
| | - Bangyi Zhang
- School of Artificial Intelligence and Computer Science, Jiangnan University, No. 1800 Lihu Avenue, Binhu District, Wuxi 214000, China
| | - Wenying He
- School of Artificial Intelligence, Hebei University of Technology, 5340 Xiping Road, Beichen District, Tianjin 300130, China
| | - Yue Bi
- Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Wellington Rd, Clayton VIC 3800, Australia
| | - Xiangrong Liu
- Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, 422 Siming South Road, Siming District, Xiamen City, Fujian 361005, China
| | - Xiangxiang Zeng
- School of Information Science and Engineering, Hunan University, Yuelu District, Changsha 410012, China
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science, Jiangnan University, No. 1800 Lihu Avenue, Binhu District, Wuxi 214000, China
|
17
|
Zhao Y, Jin J, Gao W, Qiao J, Wei L. Moss-m7G: A Motif-Based Interpretable Deep Learning Method for RNA N7-Methlguanosine Site Prediction. J Chem Inf Model 2024; 64:6230-6240. [PMID: 39011571 DOI: 10.1021/acs.jcim.4c00802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
N7-methylguanosine (m7G) modification plays a crucial role in various biological processes and is closely associated with the development and progression of many cancers. Accurate identification of m7G modification sites is essential for understanding their regulatory mechanisms and advancing cancer therapy. Previous studies often suffered from insufficient research data, underutilization of motif information, and lack of interpretability. In this work, we designed a novel motif-based interpretable method for m7G modification site prediction, called Moss-m7G. This approach enables the analysis of RNA sequences from a motif-centric perspective. Our proposed word-detection module and motif-embedding module within Moss-m7G extract motif information from sequences, transforming the raw sequences from base-level into motif-level and generating embeddings for these motif sequences. Compared with base sequences, motif sequences contain richer contextual information, which is further analyzed and integrated through the Transformer model. To address the data insufficiency noted in prior research, we constructed a comprehensive m7G data set for training and testing. Our experimental results affirm the effectiveness and superiority of Moss-m7G in predicting m7G modification sites. Moreover, the introduction of the word-detection module enhances the interpretability of the model, providing insights into the predictive mechanisms.
Affiliation(s)
- Yanxi Zhao
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Junru Jin
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Wenjia Gao
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Jianbo Qiao
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- School of Informatics, Xiamen University, Xiamen 361104, China
| |
Collapse
|
18
|
Jiao S, Ye X, Sakurai T, Zou Q, Liu R. Integrated convolution and self-attention for improving peptide toxicity prediction. Bioinformatics 2024; 40:btae297. [PMID: 38696758 PMCID: PMC11654579 DOI: 10.1093/bioinformatics/btae297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 04/02/2024] [Accepted: 04/30/2024] [Indexed: 05/04/2024] Open
Abstract
MOTIVATION Peptides are promising agents for the treatment of a variety of diseases due to their specificity and efficacy. However, the development of peptide-based drugs is often hindered by the potential toxicity of peptides, which poses a significant barrier to their clinical application. Traditional experimental methods for evaluating peptide toxicity are time-consuming and costly, making the development process inefficient. Therefore, there is an urgent need for computational tools specifically designed to predict peptide toxicity accurately and rapidly, facilitating the identification of safe peptide candidates for drug development. RESULTS We present a novel computational approach, CAPTP, which combines convolution and self-attention to enhance the prediction of peptide toxicity from amino acid sequences. CAPTP demonstrates outstanding performance, achieving a Matthews correlation coefficient of approximately 0.82 in both cross-validation settings and on independent test datasets. This performance surpasses that of existing state-of-the-art peptide toxicity predictors. Importantly, CAPTP maintains its robustness and generalizability even when dealing with data imbalances. Further analysis by CAPTP reveals that certain sequential patterns, particularly in the head and central regions of peptides, are crucial in determining their toxicity. This insight can significantly inform and guide the design of safer peptide drugs. AVAILABILITY AND IMPLEMENTATION The source code for CAPTP is freely available at https://github.com/jiaoshihu/CAPTP.
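For readers unfamiliar with the Matthews correlation coefficient reported above, the following minimal sketch shows how it is computed from a binary confusion matrix. The function name and the example counts are illustrative assumptions and are not taken from the CAPTP paper or its code.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the formula.
    print(mcc(tp=90, tn=90, fp=10, fn=10))
```

Unlike accuracy, the coefficient uses all four cells of the confusion matrix, which is why it is often preferred when classes are imbalanced.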
Affiliation(s)
- Shihu Jiao
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Tetsuya Sakurai
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Ruijun Liu
- School of Software, Beihang University, Beijing 100191, China
|
19
|
Dong S, Liu Y, Gong Y, Dong X, Zeng X. scCAN: Clustering With Adaptive Neighbor-Based Imputation Method for Single-Cell RNA-Seq Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:95-105. [PMID: 38285569 DOI: 10.1109/tcbb.2023.3337231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to study cellular heterogeneity in different samples. However, due to technical deficiencies, dropout events often result in zero gene expression values in the gene expression matrix. In this paper, we propose a new imputation method called scCAN, based on adaptive neighborhood clustering, to estimate the values lost to dropout zeros. Our method continuously updates cell-cell similarity information by simultaneously learning similarity relationships and clustering structures and by imposing new rank constraints on the Laplacian matrix of the similarity matrix, improving the imputation of dropout zero values. To evaluate the performance of this method, we used four simulated and eight real scRNA-seq datasets for downstream analyses, including cell clustering, gene expression recovery, and cell trajectory reconstruction. Our method improves the performance of the downstream analyses and outperforms other imputation methods.
|
20
|
Tao W, Liu Y, Lin X, Song B, Zeng X. Prediction of multi-relational drug-gene interaction via Dynamic hyperGraph Contrastive Learning. Brief Bioinform 2023; 24:bbad371. [PMID: 37864294 DOI: 10.1093/bib/bbad371] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/28/2023] [Revised: 09/11/2023] [Accepted: 09/29/2023] [Indexed: 10/22/2023] Open
Abstract
Drug-gene interaction prediction occupies a crucial position in various areas of drug discovery, such as drug repurposing, lead discovery and off-target detection. Previous studies show good performance, but they are limited to exploring binding interactions and ignore other interaction relationships. Graph neural networks have emerged as promising approaches owing to their powerful capability of modeling correlations in drug-gene bipartite graphs. Despite the widespread adoption of graph neural network-based methods, many of them experience performance degradation when high-quality and sufficient training data are unavailable. Unfortunately, in practical drug discovery scenarios, interaction data are often sparse and noisy, which may lead to unsatisfactory results. To address these challenges, we propose a novel Dynamic hyperGraph Contrastive Learning (DGCL) framework that exploits local and global relationships between drugs and genes. Specifically, graph convolutions are adopted to extract explicit local relations among drugs and genes. Meanwhile, the cooperation of dynamic hypergraph structure learning and hypergraph message passing enables the model to aggregate information in a global region. With flexible global-level messages, a self-augmented contrastive learning component is designed to constrain hypergraph structure learning and enhance the discrimination of drug/gene representations. Experiments conducted on three datasets show that DGCL is superior to eight state-of-the-art methods and notably gains a 7.6% performance improvement on the DGIdb dataset. Further analyses verify the robustness of DGCL in alleviating data sparsity and over-smoothing issues.
Affiliation(s)
- Wen Tao
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082 Hunan, China
| | - Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082 Hunan, China
| | - Xuan Lin
- School of Computer Science, Xiangtan University, Xiangtan, 411105 Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing, Ministry of Education (Xiangtan University), Xiangtan, 411105 Hunan, China
| | - Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082 Hunan, China
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082 Hunan, China
|