1
Bai H, Wang J, Jiang X, Guo Z, Yang W, Yang Z, Li J, Liu C. TetraRNA, a tetra-class machine learning model for deciphering the coding potential derivation of RNA world. Comput Struct Biotechnol J 2025; 27:1305-1317. [PMID: 40230410 PMCID: PMC11994946 DOI: 10.1016/j.csbj.2025.03.039] [Received: 11/25/2024] [Revised: 03/20/2025] [Accepted: 03/24/2025] [Indexed: 04/16/2025] Open
Abstract
CncRNAs (coding and noncoding RNAs) are a class of bifunctional RNAs that have both coding and noncoding biological activity. An increasing number of cncRNAs are being identified, prompting reassessment of our knowledge of RNA. However, most existing RNA classification tools are based on binary classification models, which are not effective in distinguishing cncRNAs from mRNAs or long noncoding RNAs (lncRNAs). Our statistical analysis demonstrated that mRNA-derived cncRNAs (untranslated mRNAs, untr-mRNAs) and lncRNA-derived cncRNAs (translated ncRNAs, tr-ncRNAs) do not fall in the same cluster. Therefore, in this study, we devised a novel tetra-class RNA classification model that is systematically optimized for RNA feature extraction. According to our model, all human RNAs can be reclassified into one of four categories (mRNA, untr-mRNA, lncRNA, and tr-ncRNA), representing a novel RNA classification system and allowing the discovery of more potential cncRNAs. Further analysis revealed significant differences among the four types of RNAs in tissue-specific expression, functional annotation, sequence composition, and other factors, providing insights into their divergent evolutionary trajectories. Moreover, investigation of the small tr-ncRNA peptides demonstrated that their evolution is coordinated with that of the conserved functional small RNAs associated with them. All analysis results have been integrated into a database, TetraRNADB, accessible online (http://tetrarnadb.liu-lab.com/).
Affiliation(s)
- Hanrui Bai
- College of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China
- CAS Key Laboratory of Tropical Plant Resources and Sustainable Use, Yunnan Key Laboratory of Crop Wild Relatives Omics, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming 650223, China
- Jie Wang
- Department of Chromosome Biology, Max Planck Institute for Plant Breeding Research, Carl-von-Linne-Weg 10, Cologne 50829, Germany
- Xiaoke Jiang
- CAS Key Laboratory of Tropical Plant Resources and Sustainable Use, Yunnan Key Laboratory of Crop Wild Relatives Omics, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming 650223, China
- Zhen Guo
- College of Science and Engineering, Saint Louis University, St. Louis, MO 63103, USA
- Wenjing Yang
- CAS Key Laboratory of Tropical Plant Resources and Sustainable Use, Yunnan Key Laboratory of Crop Wild Relatives Omics, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming 650223, China
- Zitian Yang
- CAS Key Laboratory of Tropical Plant Resources and Sustainable Use, Yunnan Key Laboratory of Crop Wild Relatives Omics, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming 650223, China
- Jing Li
- CAS Key Laboratory of Tropical Plant Resources and Sustainable Use, Yunnan Key Laboratory of Crop Wild Relatives Omics, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming 650223, China
- Changning Liu
- College of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230026, China
- CAS Key Laboratory of Tropical Plant Resources and Sustainable Use, Yunnan Key Laboratory of Crop Wild Relatives Omics, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming 650223, China
2
Lee KW, Pham NT, Min HJ, Park HW, Lee JW, Lo HE, Kwon NY, Seo J, Shaginyan I, Cho H, Wei L, Manavalan B, Jeon YJ. DOGpred: A Novel Deep Learning Framework for Accurate Identification of Human O-linked Threonine Glycosylation Sites. J Mol Biol 2025; 437:168977. [PMID: 39900285 DOI: 10.1016/j.jmb.2025.168977] [Received: 07/31/2024] [Revised: 01/06/2025] [Accepted: 01/28/2025] [Indexed: 02/05/2025]
Abstract
O-linked glycosylation is a crucial post-translational modification that regulates protein function and biological processes. Dysregulation of this process is associated with various diseases, underscoring the need to accurately identify O-linked glycosylation sites on proteins. Current experimental methods for identifying O-linked threonine glycosylation (OTG) sites are often complex and costly. Consequently, developing computational tools that predict these sites based on protein features is crucial. Such tools can complement experimental approaches, enhancing our understanding of the role of OTG dysregulation in diseases and uncovering potential therapeutic targets. In this study, we developed DOGpred, a deep learning-based predictor for precisely identifying human OTGs using high-latent feature representations. Initially, we extracted nine different conventional feature descriptors (CFDs) and nine pre-trained protein language model (PLM)-based embeddings. Notably, each feature was encoded as a 2D tensor, capturing both the sequential and inherent feature characteristics. Subsequently, we designed a stacked convolutional neural network (CNN) module to learn spatial feature representations from CFDs and a stacked recurrent neural network (RNN) module to learn temporal feature representations from PLM-based embeddings. These features were integrated using attention-based fusion mechanisms to generate high-level feature representations for final classification. Ablation analysis and independent tests demonstrated that the optimal model (DOGpred), employing stacked 1D CNN and stacked attention-based RNN modules with cross-attention feature fusion, achieved the best performance on the training dataset and significantly outperformed machine learning-based single-feature models and state-of-the-art methods on independent datasets. Furthermore, DOGpred is publicly available at https://github.com/JeonRPM/DOGpred/ for free access and usage.
Affiliation(s)
- Ki Wook Lee
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Hye Jung Min
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Hyun Woo Park
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Ji Won Lee
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Han-En Lo
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Na Young Kwon
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Jimin Seo
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Illia Shaginyan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Heeje Cho
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Leyi Wei
- Centre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macau
- Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Young-Jun Jeon
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea.
3
Su Q, Phan LT, Pham NT, Wei L, Manavalan B. MST-m6A: A Novel Multi-Scale Transformer-based Framework for Accurate Prediction of m6A Modification Sites Across Diverse Cellular Contexts. J Mol Biol 2025; 437:168856. [PMID: 39510345 DOI: 10.1016/j.jmb.2024.168856] [Received: 07/16/2024] [Revised: 10/23/2024] [Accepted: 11/02/2024] [Indexed: 11/15/2024]
Abstract
N6-methyladenosine (m6A) modification, a prevalent epigenetic mark in eukaryotic cells, is crucial in regulating gene expression and RNA metabolism. Accurately identifying m6A modification sites is essential for understanding their functions within biological processes and the intricate mechanisms that regulate them. Recent advances in high-throughput sequencing technologies have enabled the generation of extensive datasets characterizing m6A modification sites at single-nucleotide resolution, leading to the development of computational methods for identifying m6A RNA modification sites. However, most current methods focus on specific cell lines, limiting their generalizability and practical application across diverse biological contexts. To address the limitation, we propose MST-m6A, a novel approach for identifying m6A modification sites with higher accuracy across various cell lines and tissues. MST-m6A utilizes a multi-scale transformer-based architecture, employing dual k-mer tokenization to capture rich feature representations and global contextual information from RNA sequences at multiple levels of granularity. These representations are then effectively combined using a channel fusion mechanism and further processed by a convolutional neural network to enhance prediction accuracy. Rigorous validation demonstrates that MST-m6A significantly outperforms conventional machine learning models, deep learning models, and state-of-the-art predictors. We anticipate that the high precision and cross-cell-type adaptability of MST-m6A will provide valuable insights into m6A biology and facilitate advancements in related fields. The proposed approach is available at https://github.com/cbbl-skku-org/MST-m6A/ for prediction and reproducibility purposes.
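The dual k-mer tokenization mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state the k values, the stride, or how tokens are mapped to embeddings, so the function names, `k_small=3`, `k_large=5`, the stride of 1, and the T-to-U normalization below are all assumptions made for illustration, not the authors' implementation.

```python
def kmer_tokens(seq: str, k: int, stride: int = 1) -> list[str]:
    """Split a sequence into k-mers taken every `stride` positions."""
    # Assumption: DNA-style input is normalized to RNA letters.
    seq = seq.upper().replace("T", "U")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def dual_kmer_tokenize(seq: str, k_small: int = 3, k_large: int = 5):
    """Return two token streams for the same sequence at two granularities.

    The k values are hypothetical; the abstract only says "dual k-mer".
    """
    return kmer_tokens(seq, k_small), kmer_tokens(seq, k_large)


fine, coarse = dual_kmer_tokenize("GGACUAGC")
print(fine)    # 3-mers: finer local context
print(coarse)  # 5-mers: wider context per token
```

Each stream would then feed its own encoder branch before fusion; that part depends on the model details in the paper and repository and is not sketched here.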
Affiliation(s)
- Qiaosen Su
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Le Thi Phan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Leyi Wei
- Faculty of Applied Sciences, Macao Polytechnic University, Macau
- Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea.
4
Ho CH, Chu YW, Huang LY, Chen CW. SUMO-LMNet: Lossless mapping network for predicting SUMOylation sites in SUMO1 and SUMO2 using high-dimensional features. Comput Struct Biotechnol J 2025; 27:1048-1059. [PMID: 40143924 PMCID: PMC11937687 DOI: 10.1016/j.csbj.2025.03.005] [Received: 12/02/2024] [Revised: 03/02/2025] [Accepted: 03/04/2025] [Indexed: 03/28/2025] Open
Abstract
Accurate SUMOylation site prediction is crucial for deciphering gene regulation and disease mechanisms. However, distinguishing SUMO1 and SUMO2 modifications remains a major challenge due to their structural similarities. Conventional prediction models often struggle to differentiate between these paralogues, limiting their applicability in biological research. To address this, we introduce SUMO-LMNet, a deep learning-based framework for the precise prediction of SUMO1 and SUMO2 sites. Unlike previous models, SUMO-LMNet integrates a lossless mapping strategy and deep learning architectures to enhance both prediction accuracy and interpretability. Our model extracts high-dimensional features from sequences and transforms them into two-dimensional feature maps, enabling convolutional neural networks (CNNs) to effectively capture both local and global dependencies within the data. By leveraging a Lossless Mapping Network (LM-Net), this approach preserves the original feature space, ensuring that feature integrity is retained without loss of spatial information. While Grad-CAM highlights key features in individual predictions, it lacks consistency across samples and does not provide a dataset-wide evaluation of feature importance. To address this, we introduce Combined Heatmap Feature Analysis (CHFA), which systematically aggregates feature importance across multiple samples, providing a more reliable and interpretable dataset-wide assessment. Experimental results reveal distinct feature dependencies between SUMO1 and SUMO2, underscoring the necessity of paralogue-specific predictive models. Through a systematic comparison of multiple neural network architectures, we demonstrate that our model achieves over 80 % accuracy in distinguishing SUMO1 and SUMO2 modification sites. By prioritizing candidate sites for further study, our model aids experimental design and accelerates the discovery of biologically relevant SUMOylation targets. 
SUMO-LMNet is publicly available at https://predictor.isu.edu.tw/sumo-lmnet.
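The idea behind Combined Heatmap Feature Analysis, aggregating per-sample importance maps into one dataset-wide map, can be sketched as follows. The abstract does not specify the aggregation rule, so the min-max normalization and element-wise mean used here, along with the function names `normalize` and `combine_heatmaps`, are assumptions for illustration only.

```python
def normalize(heatmap):
    """Min-max scale one 2D heatmap to [0, 1] (all zeros if it is flat)."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    return [[(v - lo) / span if span else 0.0 for v in row] for row in heatmap]


def combine_heatmaps(heatmaps):
    """Average normalized heatmaps element-wise across samples.

    Assumption: all heatmaps share the same shape, e.g. Grad-CAM maps
    computed on identically sized 2D feature maps.
    """
    normed = [normalize(h) for h in heatmaps]
    n = len(normed)
    rows, cols = len(normed[0]), len(normed[0][0])
    return [[sum(h[r][c] for h in normed) / n for c in range(cols)]
            for r in range(rows)]


# Two made-up per-sample heatmaps, not data from the paper.
sample_maps = [
    [[0, 2], [4, 8]],
    [[1, 1], [3, 5]],
]
combined = combine_heatmaps(sample_maps)
print(combined)
```

Cells that stay high after averaging are important consistently across samples, which is the dataset-wide view a single Grad-CAM map cannot give.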
Affiliation(s)
- Cheng-Hsun Ho
- Department of Medical Laboratory Science, College of Medical Science and Technology, I-Shou University, Kaohsiung City, Taiwan
- Yen-Wei Chu
- Graduate Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung City, Taiwan
- Doctoral Program in Medical Biotechnology, National Chung Hsing University, Taichung City, Taiwan
- Institute of Molecular Biology, National Chung Hsing University, Taichung City, Taiwan
- Smart Sustainable New Agriculture Research Center (SMARTer), Taichung City, Taiwan
- Lan-Ying Huang
- Doctoral Program in Medical Biotechnology, National Chung Hsing University, Taichung City, Taiwan
- Chi-Wei Chen
- Graduate Degree Program of Smart Healthcare & Bioinformatics, I-Shou University, Kaohsiung City, Taiwan
- Department of Biomedical Engineering, I-Shou University, Kaohsiung City, Taiwan
5
Basith S, Manavalan B, Lee G. AntiT2DMP-Pred: Leveraging feature fusion and optimization for superior machine learning prediction of type 2 diabetes mellitus. Methods 2025; 234:264-274. [PMID: 39798942 DOI: 10.1016/j.ymeth.2025.01.003] [Received: 10/31/2024] [Revised: 12/26/2024] [Accepted: 01/04/2025] [Indexed: 01/15/2025] Open
Abstract
Pancreatic α-amylase breaks down starch into isomaltose and maltose, which are further hydrolyzed by α-glucosidase in the intestine into monosaccharides, rapidly raising blood sugar levels and contributing to type 2 diabetes mellitus (T2DM). Synthetic inhibitors of carbohydrate-digesting enzymes are used to manage T2DM but may harm organ function over time. Bioactive peptides offer a safer alternative, avoiding such adverse effects. Computational methods for predicting antidiabetic peptides (ADPs) can significantly reduce the time and cost of experimental testing. While machine learning (ML) has been applied to identify ADPs, advancements in data analysis and algorithms continue to drive progress in the field. To address this, we developed AntiT2DMP-Pred, the first ML-based tool specifically designed for predicting type 2 antidiabetic peptides (T2ADPs). This tool employs a feature fusion strategy, combining ten highly discriminative feature descriptors chosen from a pool of 32 descriptors and eight ML algorithms, tested across a range of baseline models. AntiT2DMP-Pred demonstrated excellent performance, surpassing both baseline and feature-optimized models, with an accuracy (ACC) and Matthews' correlation coefficient (MCC) of 0.976 and 0.953 on the training dataset, and an ACC and MCC of 0.957 and 0.851 on the independent dataset. The web server (https://balalab-skku.org/AntiT2DMP-Pred) is freely accessible, enabling researchers worldwide to utilize it in their experimental workflows and contribute to the discovery and understanding of T2ADPs, ultimately supporting peptide-based therapeutic development for diabetes management.
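For readers comparing the reported ACC and MCC values, the two metrics follow from a binary confusion matrix as sketched below. These are the standard definitions; the counts in the example are made up for illustration and are not taken from the paper.

```python
import math


def acc_mcc(tp: int, tn: int, fp: int, fn: int):
    """Accuracy and Matthews correlation coefficient from confusion counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


# Hypothetical counts for a balanced test set of 100 peptides.
acc, mcc = acc_mcc(tp=45, tn=45, fp=5, fn=5)
print(f"ACC={acc:.3f} MCC={mcc:.3f}")
```

MCC uses all four cells of the confusion matrix, so it drops faster than ACC on imbalanced data, which is one reason an independent-set MCC can sit well below the accompanying ACC.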
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon 16499 Republic of Korea.
- Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419 Republic of Korea.
- Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon 16499 Republic of Korea; Department of Molecular Science and Technology, Ajou University, Suwon 16499 Republic of Korea.
6
Guan J, Dong D, Xie P, Zhao Z, Guo Y, Lee TY, Yao L, Chiang YC. StackDILI: Enhancing Drug-Induced Liver Injury Prediction through Stacking Strategy with Effective Molecular Representations. J Chem Inf Model 2025; 65:1027-1039. [PMID: 39786982 DOI: 10.1021/acs.jcim.4c02079] [Indexed: 01/12/2025]
Abstract
Drug-induced liver injury (DILI) is a major challenge in drug development, often leading to clinical trial failures and market withdrawals due to liver toxicity. This study presents StackDILI, a computational framework designed to accelerate toxicity assessment by predicting DILI risk. StackDILI integrates multiple molecular descriptors to extract structural and physicochemical features, including the constitution, pharmacophore, MACCS, and E-state descriptors. Additionally, a genetic algorithm is employed for feature selection and optimization, ensuring that the most relevant features are used. These optimized features are processed through a stacking ensemble model comprising multiple tree-based machine learning models, improving prediction accuracy and interpretability. Notably, StackDILI demonstrates a strong performance on the DILIrank test set and maintains robustness across cross-validation. Moreover, interpretability analysis reveals key molecular features associated with DILI risks, providing valuable insights into toxicity prediction. To further improve accessibility, a user-friendly web interface is developed, allowing users to input SMILES strings and receive rapid predictions with ease. The StackDILI model provides a powerful tool for efficient DILI assessment, supporting safer drug development. The web interface is accessible at https://awi.cuhk.edu.cn/biosequence/StackDILI/.
Affiliation(s)
- Jiahui Guan
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Danhong Dong
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Peilin Xie
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Zhihao Zhao
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Yilin Guo
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Tzong-Yi Lee
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300093, Taiwan
- Center for Intelligent Drug Systems and Smart Bio-Devices (IDS2B), National Yang Ming Chiao Tung University, Hsinchu 300093, Taiwan
- Lantian Yao
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Ying-Chih Chiang
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
7
Luo J, Zhao K, Chen J, Yang C, Qu F, Liu Y, Jin X, Yan K, Zhang Y, Liu B. iMFP-LG: Identify Novel Multi-functional Peptides Using Protein Language Models and Graph-based Deep Learning. GENOMICS, PROTEOMICS & BIOINFORMATICS 2025; 22:qzae084. [PMID: 39585308 PMCID: PMC12011362 DOI: 10.1093/gpbjnl/qzae084] [Received: 07/10/2023] [Revised: 10/25/2024] [Accepted: 11/21/2024] [Indexed: 11/26/2024]
Abstract
Functional peptides are short amino acid fragments that have a wide range of beneficial functions for living organisms. The majority of previous studies have focused on mono-functional peptides, but an increasing number of multi-functional peptides have been discovered. Although there have been enormous experimental efforts to assay multi-functional peptides, only a small portion of millions of known peptides has been explored. The development of effective and accurate techniques for identifying multi-functional peptides can facilitate their discovery and mechanistic understanding. In this study, we presented iMFP-LG, a method for multi-functional peptide identification based on protein language models (pLMs) and graph attention networks (GATs). Our comparative analyses demonstrated that iMFP-LG outperformed the state-of-the-art methods in identifying both multi-functional bioactive peptides and multi-functional therapeutic peptides. The interpretability of iMFP-LG was also illustrated by visualizing attention patterns in pLMs and GATs. Given the outstanding performance of iMFP-LG in identifying multi-functional peptides, we employed iMFP-LG to screen novel peptides with both anti-microbial and anti-cancer functions from millions of known peptides in the UniRef90 database. As a result, eight candidate peptides were identified, among which one candidate was validated to possess both anti-bacterial and anti-cancer properties through molecular structure alignment and biological experiments. We anticipate that iMFP-LG can assist in the discovery of multi-functional peptides and contribute to the advancement of peptide drug design.
Affiliation(s)
- Jiawei Luo
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Kejuan Zhao
- School of Science, Harbin Institute of Technology, Shenzhen 518055, China
- Junjie Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Caihua Yang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Fuchuan Qu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Yumeng Liu
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518055, China
- Xiaopeng Jin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518055, China
- Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 10081, China
- Yang Zhang
- School of Science, Harbin Institute of Technology, Shenzhen 518055, China
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 10081, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 10081, China
8
Ding S, Zheng J, Jia C. DeepMEns: an ensemble model for predicting sgRNA on-target activity based on multiple features. Brief Funct Genomics 2025; 24:elae043. [PMID: 39528429 PMCID: PMC11735754 DOI: 10.1093/bfgp/elae043] [Received: 06/12/2024] [Revised: 10/12/2024] [Accepted: 10/21/2024] [Indexed: 11/16/2024] Open
Abstract
The CRISPR/Cas9 system developed from Streptococcus pyogenes (SpCas9) has high potential in gene editing. However, its successful application is hindered by the considerable variability in target efficiencies across different single guide RNAs (sgRNAs). Although several deep learning models have been created to predict sgRNA on-target activity, the intrinsic mechanisms of these models are difficult to explain, and there is still scope for improvement in prediction performance. To overcome these issues, we propose an ensemble interpretable model termed DeepMEns based on deep learning to predict sgRNA on-target activity. By using five different training and validation datasets, we constructed five sub-regressors, each comprising three parts. The first part uses one-hot encoding, wherein 0-1 representation of the secondary structure is used as the input to the convolutional neural network (CNN) with Transformer encoder. The second part uses the DNA shape feature matrix as the input to the CNN with Transformer encoder. The third part uses positional encoding feature matrices as the proposed input into a long short-term memory network with an attention mechanism. These three parts are concatenated through the flattened layer, and the final prediction result is the average of the five sub-regressors. Extensive benchmarking experiments indicated that DeepMEns achieved the highest Spearman correlation coefficient for 6 of 10 independent test datasets as compared to previous predictors, confirming that DeepMEns can achieve state-of-the-art performance. Moreover, the ablation analysis also indicated that the ensemble strategy may improve the performance of the prediction model.
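The final steps described in this abstract, averaging the five sub-regressors and scoring with the Spearman correlation coefficient, can be sketched in plain Python. The sub-regressor outputs and measured activities below are made-up numbers, and the helper names (`ensemble_mean`, `average_ranks`, `spearman`) are hypothetical; the networks themselves are not reproduced.

```python
def ensemble_mean(predictions):
    """Average predictions across sub-regressors, one value per sgRNA."""
    return [sum(col) / len(col) for col in zip(*predictions)]


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical outputs of five sub-regressors for four sgRNAs.
sub_preds = [
    [0.2, 0.5, 0.3, 0.9],
    [0.1, 0.4, 0.4, 0.7],
    [0.3, 0.6, 0.2, 0.8],
    [0.2, 0.5, 0.3, 0.9],
    [0.2, 0.5, 0.3, 0.7],
]
measured = [0.10, 0.40, 0.35, 0.80]  # made-up on-target activities

final = ensemble_mean(sub_preds)
rho = spearman(final, measured)
print(final, rho)
```

Spearman compares rankings only, so a predictor is rewarded for ordering guides correctly even when its activity scale differs from the measured one.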
Affiliation(s)
- Shumei Ding
- School of Science, Dalian Maritime University, Dalian 116026, China
- Jia Zheng
- School of Science, Dalian Maritime University, Dalian 116026, China
- Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, China
9
Pimtawong T, Ren J, Lee J, Lee HM, Na D. A review on computational models for predicting protein solubility. J Microbiol 2025; 63:e.2408001. [PMID: 39895070 DOI: 10.71150/jm.2408001] [Received: 08/27/2024] [Accepted: 10/29/2024] [Indexed: 02/04/2025]
Abstract
Protein solubility is a critical factor in the production of recombinant proteins, which are widely used in various industries, including pharmaceuticals, diagnostics, and biotechnology. Predicting protein solubility remains a challenging task due to the complexity of protein structures and the multitude of factors influencing solubility. Recent advances in computational methods, particularly those based on machine learning, have provided powerful tools for predicting protein solubility, thereby reducing the need for extensive experimental trials. This review provides an overview of current computational approaches to predict protein solubility. We discuss the datasets, features, and algorithms employed in these models. The review aims to bridge the gap between computational predictions and experimental validations, fostering the development of more accurate and reliable solubility prediction models that can significantly enhance recombinant protein production.
Affiliation(s)
- Teerapat Pimtawong
- Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
- Jun Ren
- Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
- Jingyu Lee
- Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
- Hyang-Mi Lee
- Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
- Dokyun Na
- Department of Biomedical Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
10
Lai H, Zhu T, Xie S, Luo X, Hong F, Luo D, Dao F, Lin H, Shu K, Lv H. Empirical Comparison and Analysis of Artificial Intelligence-Based Methods for Identifying Phosphorylation Sites of SARS-CoV-2 Infection. Int J Mol Sci 2024; 25:13674. [PMID: 39769436 PMCID: PMC11678915 DOI: 10.3390/ijms252413674] [Received: 12/05/2024] [Revised: 12/18/2024] [Accepted: 12/19/2024] [Indexed: 01/11/2025] Open
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a member of the large coronavirus family with high infectivity and pathogenicity and is the primary pathogen causing the global pandemic of coronavirus disease 2019 (COVID-19). Phosphorylation is a major type of protein post-translational modification that plays an essential role in the process of SARS-CoV-2-host interactions. The precise identification of phosphorylation sites in host cells infected with SARS-CoV-2 will be of great importance to investigate potential antiviral responses and mechanisms and exploit novel targets for therapeutic development. Numerous computational tools have been developed on the basis of phosphoproteomic data generated by mass spectrometry-based experimental techniques, with which phosphorylation sites can be accurately ascertained across the whole SARS-CoV-2-infected proteomes. In this work, we have comprehensively reviewed several major aspects of the construction strategies and availability of these predictors, including benchmark dataset preparation, feature extraction and refinement methods, machine learning algorithms and deep learning architectures, model evaluation approaches and metrics, and publicly available web servers and packages. We have highlighted and compared the prediction performance of each tool on the independent serine/threonine (S/T) and tyrosine (Y) phosphorylation datasets and discussed the overall limitations of existing predictors. In summary, this review should provide pertinent insights into the development of new, powerful phosphorylation site identification tools, facilitate the localization of more suitable target molecules for experimental verification, and contribute to the development of antiviral therapies.
Affiliation(s)
- Hongyan Lai
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (H.L.); (T.Z.); (D.L.)
| | - Tao Zhu
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (H.L.); (T.Z.); (D.L.)
| | - Sijia Xie
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; (S.X.); (X.L.); (F.H.); (H.L.)
| | - Xinwei Luo
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; (S.X.); (X.L.); (F.H.); (H.L.)
| | - Feitong Hong
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; (S.X.); (X.L.); (F.H.); (H.L.)
| | - Diyu Luo
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (H.L.); (T.Z.); (D.L.)
| | - Fuying Dao
- School of Biological Sciences, Nanyang Technological University, Singapore 639798, Singapore;
| | - Hao Lin
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; (S.X.); (X.L.); (F.H.); (H.L.)
| | - Kunxian Shu
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (H.L.); (T.Z.); (D.L.)
| | - Hao Lv
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; (S.X.); (X.L.); (F.H.); (H.L.)
|
11
|
Miao R, Xu G, Ding Y, Ding Z, Woodard J, Tu T, Luo H, Wu N, Yao B, Guan F, Tian J. Engineering dual-functional and thermophilic BMHETase for efficient degradation of polyethylene terephthalate. Bioresour Technol 2024; 414:131556. [PMID: 39357610 DOI: 10.1016/j.biortech.2024.131556] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Revised: 09/15/2024] [Accepted: 09/29/2024] [Indexed: 10/04/2024]
Abstract
Polyethylene terephthalate (PET) biodegradation is hindered by the intermediates bis(2-hydroxyethyl) terephthalate (BHET) and mono(2-hydroxyethyl) terephthalate (MHET). BMHETase, a thermophilic hydrolase identified from the UniParc database, exhibits degradation activity towards both BHET and MHET. BMHETase showed higher activity on BHET than LCC-ICCG and FAST-PETase at temperatures ranging from 50 to 70℃. To enhance its activity in degrading MHET, BMHETase was engineered to mimic Ideonella sakaiensis MHETase. The resulting six-point mutant, BMHETase6M, showed activities on MHET and BHET that were 8 and 2 times those of the wild type (WT), respectively, and both optimal temperatures increased by 5℃. This enhancement may be attributed to BMHETase6M's stronger binding to MHET and its enlarged binding pocket. When combined with LCC-ICCG, BMHETase6M achieved complete degradation of MHET in PET films to terephthalic acid, indicating broad application potential. These findings suggest that BMHETase6M is a promising candidate for enhancing PET biodegradation efficiency and plastic waste management.
Affiliation(s)
- Ruiju Miao
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Guoshun Xu
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Yekun Ding
- National Key Laboratory of Agricultural Microbiology, Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Zundan Ding
- National Key Laboratory of Agricultural Microbiology, Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China.
| | - Jaie Woodard
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109, USA.
| | - Tao Tu
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China.
| | - Huiying Luo
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China.
| | - Ningfeng Wu
- National Key Laboratory of Agricultural Microbiology, Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China.
| | - Bin Yao
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China.
| | - Feifei Guan
- National Key Laboratory of Agricultural Microbiology, Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China.
| | - Jian Tian
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China; National Key Laboratory of Agricultural Microbiology, Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China.
|
12
|
Guan J, Xie P, Dong D, Liu Q, Zhao Z, Guo Y, Zhang Y, Lee TY, Yao L, Chiang YC. DeepKlapred: A deep learning framework for identifying protein lysine lactylation sites via multi-view feature fusion. Int J Biol Macromol 2024; 283:137668. [PMID: 39566793 DOI: 10.1016/j.ijbiomac.2024.137668] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2024] [Revised: 11/10/2024] [Accepted: 11/13/2024] [Indexed: 11/22/2024]
Abstract
Lysine lactylation (Kla) is a post-translational modification (PTM) that holds significant importance in the regulation of various biological processes. While traditional experimental methods are highly accurate for identifying Kla sites, they are both time-consuming and labor-intensive. Recent machine learning advances have enabled computational models for Kla site prediction. In this study, we propose a novel framework that integrates sequence embedding with sequence descriptors to enhance the representation of protein sequence features. Our framework employs a BiGRU-Transformer architecture to capture both local and global dependencies within the sequence, while incorporating six sequence descriptors to extract biochemical properties and evolutionary patterns. Additionally, we apply a cross-attention fusion mechanism to combine sequence embeddings with descriptor-based features, enabling the model to capture complex interactions between different feature representations. Our model demonstrated excellent performance in predicting Kla sites, achieving an accuracy of 0.998 on the training set and 0.969 on the independent set. Additionally, through attention analysis and motif discovery, our model provided valuable insights into key sequence patterns and regions that are crucial for Kla modification. This work not only deepens the understanding of Kla's functional roles but also holds the potential to positively impact future research in protein modification prediction and functional annotation.
Affiliation(s)
- Jiahui Guan
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China; School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Peilin Xie
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Danhong Dong
- School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Qianchen Liu
- School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Zhihao Zhao
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Yilin Guo
- School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Yilun Zhang
- School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Tzong-Yi Lee
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, 1001 Daxue Road, Hsinchu 300093, Taiwan; Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, 1001 Daxue Road, Hsinchu 300093, Taiwan.
| | - Lantian Yao
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China.
| | - Ying-Chih Chiang
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China; School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China.
|
13
|
Basith S, Sangaraju VK, Manavalan B, Lee G. mHPpred: Accurate identification of peptide hormones using multi-view feature learning. Comput Biol Med 2024; 183:109297. [PMID: 39442438 DOI: 10.1016/j.compbiomed.2024.109297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2024] [Revised: 10/04/2024] [Accepted: 10/15/2024] [Indexed: 10/25/2024]
Abstract
Peptide hormones were first used in medicine in the early 20th century, with the pivotal event being the isolation and purification of insulin in 1921. These hormones are integral to a sophisticated system that emerged early in evolution to regulate growth, development, and homeostasis. They serve as targeted signaling molecules that transfer specific information between cells and organs, ensuring coordinated and precise physiological responses. While experimental methods for identifying peptide hormones present challenges such as low abundance, stability issues, and complexity, computational methods offer promising alternatives. Advances in machine learning and bioinformatics have facilitated the prediction of peptide hormones, further enhancing their therapeutic potential. In this study, we explored three different computational frameworks for peptide hormone identification and determined that the meta-approach was the most suitable. Firstly, we evaluated the discriminative power of 26 feature descriptors using a series of baseline models and identified seven feature descriptors with high predictive potential. Through a systematic approach, we then selected the top 20 performing baseline models and integrated their predicted probabilities to train a meta-model, leveraging the strengths of multiple prediction strategies. Our final light gradient boosting-based meta-model, mHPpred, significantly outperformed the existing method, HOPPred, on both benchmarking and independent datasets. Notably, mHPpred also demonstrated superior performance compared to the hybrid and integrative framework approaches employed in this study. This superiority demonstrates the effectiveness of our multi-view feature learning strategy in capturing discriminative features and providing a more accurate prediction model for peptide hormones. mHPpred is publicly accessible at: https://balalab-skku.org/mHPpred.
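The meta-approach described in this abstract (feeding the predicted probabilities of baseline models into a second-stage model) can be sketched as follows. This is a minimal illustration, not the mHPpred implementation: a small logistic regression trained by gradient descent stands in for the light gradient boosting meta-model, only two hypothetical baseline models are used instead of twenty, and the toy probabilities and the names `train_meta_model` and `predict_proba` are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_model(meta_features, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-model by batch gradient descent.

    meta_features[i][j] is the probability that baseline model j assigned
    to sample i; labels[i] is the true class (1 or 0).
    """
    n_samples = len(labels)
    n_features = len(meta_features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, y in zip(meta_features, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, row))) - y
            for j, v in enumerate(row):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict_proba(row, weights, bias):
    """Meta-model probability for one sample's baseline probabilities."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))


# Hypothetical toy data: two baseline models, four samples.
meta_features = [[0.9, 0.8], [0.8, 0.7], [0.2, 0.3], [0.1, 0.2]]
labels = [1, 1, 0, 0]
weights, bias = train_meta_model(meta_features, labels)
meta_probs = [predict_proba(row, weights, bias) for row in meta_features]
```

In practice the baseline probabilities used to train the meta-model would come from cross-validated (out-of-fold) predictions, so that the second stage does not simply learn the baselines' training-set overfitting.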
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon, 16499, Republic of Korea.
| | - Vinoth Kumar Sangaraju
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea.
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon, 16499, Republic of Korea; Department of Molecular Science and Technology, Ajou University, Suwon, 16499, Republic of Korea.
|
14
|
Xu L, Zheng J, Zhou Y, Jia C. dsRNAPredictor-II: An improved predictor of identifying dsRNA and its silencing efficiency for Tribolium castaneum based on sequence length distribution. Methods 2024; 232:129-138. [PMID: 39528092 DOI: 10.1016/j.ymeth.2024.11.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2024] [Revised: 10/29/2024] [Accepted: 11/08/2024] [Indexed: 11/16/2024] Open
Abstract
RNA interference (RNAi) has been widely utilized to investigate gene functions and has significant potential for the control of pest insects. However, recent studies have revealed that the target insect species, dsRNA molecule length, target genes, and other experimental factors can affect the efficiency of RNAi-mediated control, restricting the further development and application of this technology. Therefore, the aim of this study was to establish a deep learning model that helps researchers identify dsRNA fragments with the highest RNAi efficiency. We optimized an existing model, dsRNAPredictor, by designing sub-models based on different sequence lengths. Accordingly, the data were divided into two groups: sequences of 130-399 bp and sequences of 400-616 bp. One-hot encoding was then employed to extract sequence information. A convolutional neural network comprising three convolutional layers, three average pooling layers, a flatten layer, and three dense layers was employed as the classifier. By adjusting the parameters, we established two sub-models for the two sequence-length distributions. Using multiple independent test datasets and hypothesis testing, we demonstrated that our model outperforms dsRNAPredictor and is more robust. Therefore, our model may serve as a pre-screening tool for dsRNA design and facilitate further research and applications.
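The two preprocessing steps named in this abstract, splitting sequences by length and one-hot encoding them, can be sketched as below. This is an illustrative sketch rather than the dsRNAPredictor-II code: the function names, the `ACGU` alphabet, the T-to-U mapping, and the all-zero encoding of unknown symbols are assumptions; only the two length ranges come from the abstract.

```python
NUCLEOTIDES = "ACGU"  # assumed alphabet; DNA-style T is mapped to U below


def one_hot_encode(seq):
    """Return one row of four 0/1 values per nucleotide.

    Symbols outside the assumed alphabet (e.g. N) become an all-zero row.
    """
    seq = seq.upper().replace("T", "U")
    return [[1 if base == nt else 0 for nt in NUCLEOTIDES] for base in seq]


def select_submodel(length):
    """Choose the length group from the abstract: 130-399 bp or 400-616 bp."""
    if 130 <= length <= 399:
        return "short"
    if 400 <= length <= 616:
        return "long"
    raise ValueError(f"length {length} bp is outside the 130-616 bp range")


encoded = one_hot_encode("ACGT")
group = select_submodel(250)
```

Each encoded sequence is a length-by-4 matrix, which is the usual input shape for a one-dimensional convolutional network; sequences in the same group would typically be padded to a common length before batching.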
Affiliation(s)
- Liping Xu
- School of Science, Dalian Maritime University, Dalian 116026, PR China
| | - Jia Zheng
- School of Science, Dalian Maritime University, Dalian 116026, PR China
| | - Yetong Zhou
- School of Science, Dalian Maritime University, Dalian 116026, PR China
| | - Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, PR China.
|
15
|
Shaon MSH, Karim T, Ali MM, Ahmed K, Bui FM, Chen L, Moni MA. A robust deep learning approach for identification of RNA 5-methyluridine sites. Sci Rep 2024; 14:25688. [PMID: 39465261 PMCID: PMC11514282 DOI: 10.1038/s41598-024-76148-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2024] [Accepted: 10/10/2024] [Indexed: 10/29/2024] Open
Abstract
RNA 5-methyluridine (m5U) sites play a significant role in understanding RNA modifications, which influence numerous biological processes such as gene expression and cellular functioning. Consequently, identifying m5U sites is vital to understanding the integrity, structure, and function of RNA molecules. This study therefore introduces GRUpred-m5U, a novel deep learning framework built on a gated recurrent unit and evaluated on mature RNA and full-transcript RNA datasets. We used three descriptor groups (nucleic acid composition, pseudo nucleic acid composition, and physicochemical properties) comprising five feature extraction methods: ENAC, Kmer, DPCP, DPCP type 2, and PseDNC. Initially, we aggregated the outputs of all the feature extraction methods into a new merged set. Three hybrid models were developed using deep learning methods and evaluated through 10-fold cross-validation with seven evaluation metrics. After a comprehensive evaluation, the GRUpred-m5U model outperformed the other models, obtaining 98.41% and 96.70% accuracy on the two datasets, respectively. To our knowledge, the proposed model outperformed all existing state-of-the-art methods. The proposed supervised machine learning model was also examined using unsupervised machine learning techniques such as principal component analysis (PCA), and it was observed that the proposed method provided valid performance for identifying m5U. Considering its multi-layered construction, the GRUpred-m5U model has considerable potential for future applications in the biological industry. The model, which consists of neurons processing complex input, excelled at pattern recognition and produced reliable results. Despite its greater size, the model obtained accurate results, which is essential in detecting m5U.
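Of the five feature extraction methods listed in this abstract, Kmer is the simplest to illustrate: it represents a sequence by the normalized frequency of every possible k-length subsequence. The sketch below is a generic version under stated assumptions (the function name `kmer_frequencies`, the `ACGU` alphabet, the T-to-U mapping, and skipping windows that contain unknown symbols are choices made for this example, not details taken from GRUpred-m5U).

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return normalized k-mer frequencies in lexicographic alphabet order.

    The vector has len(alphabet) ** k entries. Windows containing symbols
    outside the alphabet are skipped and do not count toward the total.
    """
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]


mono = kmer_frequencies("ACGU", k=1)
di = kmer_frequencies("AAAA", k=2)
```

Because the output length depends only on `k` and the alphabet, vectors from several descriptors can be concatenated into one merged feature set regardless of sequence length.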
Affiliation(s)
| | - Tasmin Karim
- Department of Computer Science and Informatics, Oakland University, Rochester, MI, 48309, USA
| | - Md Mamun Ali
- Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
- Department of Software Engineering, Daffodil Smart City (DSC), Daffodil International University, Birulia, Savar, Dhaka, 1216, Bangladesh
| | - Kawsar Ahmed
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada.
- Group of Bio-photomatiχ, Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, 1902, Tangail, Bangladesh.
- Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Dhaka, 1216, Birulia, Bangladesh.
| | - Francis M Bui
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
| | - Li Chen
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
| | - Mohammad Ali Moni
- AI & Digital Health Technology, Artificial Intelligence & Cyber Future Institute, Charles Sturt University, Bathurst, NSW, 2795, Australia.
- AI & Digital Health Technology, Rural Health Research Institute, Charles Sturt University, Orange, NSW, 2800, Australia.
|
16
|
Feng C, Wei H, Xu C, Feng B, Zhu X, Liu J, Zou Q. iProps: A Comprehensive Software Tool for Protein Classification and Analysis With Automatic Machine Learning Capabilities and Model Interpretation Capabilities. IEEE J Biomed Health Inform 2024; 28:6237-6247. [PMID: 39008396 DOI: 10.1109/jbhi.2024.3425716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
Protein classification is a crucial field in bioinformatics. The development of a comprehensive tool that can perform feature evaluation, visualization, automated machine learning, and model interpretation would significantly advance research in protein classification. However, there is a significant gap in the literature regarding tools that integrate all these essential functionalities. This paper presents iProps, a novel Python-based software package designed to fulfill these requirements. iProps is distinguished by its proficiency in feature extraction, evaluation, automated machine learning, and interpretation of classification models. Firstly, iProps leverages evolutionary information and amino acid reduction information to propose or extend several numerical protein features that are independent of sequence length, including SC-PSSM, ORDip, TRC, CTDC-E, and CKSAAGP-E; it also implements the calculation of 17 other numerical features. iProps additionally provides feature combination operations to generate hybrid features, along with data balancing by sampling and built-in classifier settings, among other functionalities. Thus, it can identify the most effective features for protein class recognition from a multitude of candidates, using three automated machine learning algorithms to find the optimal classifiers and parameter settings. Furthermore, iProps generates a detailed explanatory report that includes 23 informative graphs derived from three interpretable models. To assess the performance of iProps, a series of numerical experiments were conducted using two well-established datasets. The results demonstrated that our software achieved superior recognition performance in every case.
Beyond its contributions to bioinformatics, iProps broadens its applicability by offering robust data analysis tools that are beneficial across various disciplines, capitalizing on its automated machine learning and model interpretation capabilities. As an open-source platform, iProps is readily accessible and features an intuitive user interface, ensuring ease of use for individuals, even those without a background in programming.
|
17
|
Qin Z, Ren H, Zhao P, Wang K, Liu H, Miao C, Du Y, Li J, Wu L, Chen Z. Current computational tools for protein lysine acylation site prediction. Brief Bioinform 2024; 25:bbae469. [PMID: 39316944 PMCID: PMC11421846 DOI: 10.1093/bib/bbae469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2024] [Revised: 08/20/2024] [Accepted: 09/07/2024] [Indexed: 09/26/2024] Open
Abstract
As a main subtype of post-translational modification (PTM), protein lysine acylations (PLAs) play crucial roles in regulating diverse functions of proteins. With recent advancements in proteomics technology, the identification of PTMs is becoming a data-rich field. The large amount of experimentally verified data urgently needs to be translated into valuable biological insights. With computational approaches, PLAs can be accurately detected across the whole proteome, even for organisms with small-scale datasets. Herein, a comprehensive summary of 166 in silico PLA prediction methods is presented, covering methods for a single type of PLA site and methods for multiple types of PLA sites. This summary covers aspects that are critical for the development of a robust predictor, including data collection and preparation, sample selection, feature representation, classification algorithm design, model evaluation, and method availability. Notably, we discuss the application of protein language models and transfer learning to solve the small-sample learning issue. We also highlight the prediction methods developed for functionally relevant PLA sites and species/substrate/cell-type-specific PLA sites. In conclusion, this systematic review could facilitate the development of novel PLA predictors and offer useful insights to researchers from various disciplines.
Affiliation(s)
- Zhaohui Qin
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Haoran Ren
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Kaiyuan Wang
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Huixia Liu
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Chunbo Miao
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Yanxiu Du
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Junzhou Li
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Liuji Wu
- National Key Laboratory of Wheat and Maize Crop Science, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
|
18
|
Chung CR, Chien CY, Tang Y, Wu LC, Hsu JBK, Lu JJ, Lee TY, Bai C, Horng JT. An ensemble deep learning model for predicting minimum inhibitory concentrations of antimicrobial peptides against pathogenic bacteria. iScience 2024; 27:110718. [PMID: 39262770 PMCID: PMC11388163 DOI: 10.1016/j.isci.2024.110718] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2023] [Revised: 07/09/2024] [Accepted: 08/08/2024] [Indexed: 09/13/2024] Open
Abstract
The rise of antibiotic resistance necessitates effective alternative therapies. Antimicrobial peptides (AMPs) are promising due to their broad inhibitory effects. This study focuses on predicting the minimum inhibitory concentration (MIC) of AMPs against WHO-priority pathogens: Staphylococcus aureus ATCC 25923, Escherichia coli ATCC 25922, and Pseudomonas aeruginosa ATCC 27853. We developed a comprehensive regression model integrating AMP sequence-based and genomic features. Using eight AI-based architectures, including deep learning with protein language model embeddings, we created an ensemble model combining a bi-directional long short-term memory (BiLSTM) network, a convolutional neural network (CNN), and a multi-branch model (MBM). The ensemble model showed superior performance, with Pearson correlation coefficients of 0.756, 0.781, and 0.802 for the three bacterial strains, demonstrating its accuracy in predicting MIC values. This work sets a foundation for future studies to enhance model performance and advance AMP applications in combating antibiotic resistance.
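The evaluation reported in this abstract rests on the Pearson correlation coefficient between predicted and measured MIC values. The sketch below computes it from its definition and shows one simple way to combine several regressors. The abstract does not say how the BiLSTM, CNN, and MBM outputs are combined, so the unweighted mean in `ensemble_mean` is an assumption, as are both function names and the toy numbers.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def ensemble_mean(predictions):
    """Average per-sample predictions across models (assumed unweighted)."""
    return [sum(column) / len(column) for column in zip(*predictions)]


# Hypothetical toy values: two models, two peptides.
combined = ensemble_mean([[1.0, 2.0], [3.0, 4.0]])
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])
```

MIC values span orders of magnitude, so correlations of this kind are often computed on log-transformed concentrations; whether that applies here is not stated in the abstract.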
Affiliation(s)
- Chia-Ru Chung
- Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
| | - Chung-Yu Chien
- Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
| | - Yun Tang
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Li-Ching Wu
- Department of Biomedical Sciences and Engineering, National Central University, Taoyuan, Taiwan
| | - Justin Bo-Kai Hsu
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, Taiwan
| | - Jang-Jih Lu
- Department of Laboratory Medicine, Chang Gung Memorial Hospital at Linkou, Taoyuan City, Taiwan
- School of Medicine, Chang Gung University, Taoyuan City, Taiwan
- Department of Medical Biotechnology and Laboratory Science, Chang Gung University, Taoyuan City, Taiwan
| | - Tzong-Yi Lee
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Center for Intelligent Drug Systems and Smart Biodevices (IDS2B), National Yang Ming Chiao Tung University, Hsinchu City, Taiwan
| | - Chen Bai
- Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong (Shenzhen), Shenzhen 518172, China
| | - Jorng-Tzong Horng
- Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Department of Laboratory Medicine, Chang Gung Memorial Hospital at Linkou, Taoyuan City, Taiwan
|
19
|
Qin Z, Liu H, Zhao P, Wang K, Ren H, Miao C, Li J, Chen YZ, Chen Z. SLAM: Structure-aware lysine β-hydroxybutyrylation prediction with protein language model. Int J Biol Macromol 2024; 280:135741. [PMID: 39293623 DOI: 10.1016/j.ijbiomac.2024.135741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2024] [Revised: 09/13/2024] [Accepted: 09/15/2024] [Indexed: 09/20/2024]
Abstract
Post-translational modifications (PTMs) diversify protein functions by adding chemical groups to, or removing them from, certain amino acids. As a newly reported PTM, lysine β-hydroxybutyrylation (Kbhb) presents a new avenue to functional proteomics. Therefore, accurate and efficient prediction of Kbhb sites is imperative. However, current experimental methods for identifying PTM sites are often expensive and time-consuming. To date, no computational method has been proposed for Kbhb site detection. To this end, we present the first deep learning-based method, termed SLAM, to identify lysine β-hydroxybutyrylation sites in silico. The performance of SLAM is evaluated by both 5-fold cross-validation and independent testing, achieving AUROC values of 0.890, 0.899, 0.907 and 0.923 on the general and species-specific independent test sets, respectively. As one example, we predicted the potential Kbhb sites in human S-adenosyl-L-homocysteine hydrolase, and the predictions agree with experimentally verified Kbhb sites. In summary, our method could enable accurate and efficient characterization of novel Kbhb sites that are crucial for the function and stability of proteins and could be applied in the structure-guided identification of other important PTM sites. The SLAM online service and source code are available at https://ai4bio.online/SLAM and https://github.com/Gabriel-QIN/SLAM, respectively.
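The AUROC values quoted in this abstract have a direct probabilistic reading: AUROC is the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site, with ties counted as one half. The sketch below computes it that way by pairwise comparison. It is a generic illustration, not code from SLAM; the function name `auroc` and the toy labels and scores are assumptions.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise positive/negative comparison.

    labels are 1 (positive) or 0 (negative); scores are model outputs where
    larger means more likely positive. Runs in O(P * N) time, which is fine
    for small examples; rank-based formulas scale better.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy scores: one positive is ranked below one negative.
auc_example = auroc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

Because AUROC depends only on the ranking of scores, it is unaffected by the classification threshold, which makes it convenient for comparing predictors across test sets with different class balances.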
Affiliation(s)
- Zhaohui Qin
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Huixia Liu
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Kaiyuan Wang
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Haoran Ren
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Chunbo Miao
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
| | - Junzhou Li
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China.
| | - Yong-Zi Chen
- Key Laboratory of Cancer Prevention and Therapy, Tianjin 300060, China; Laboratory of Tumor Cell Biology, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Tianjin's Clinical Research Center for Cancer, Tianjin 300060, China.
| | - Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China.
| |
|
20
|
Sabir MJ, Kamli MR, Atef A, Alhibshi AM, Edris S, Hajarah NH, Bahieldin A, Manavalan B, Sabir JSM. Computational prediction of phosphorylation sites of SARS-CoV-2 infection using feature fusion and optimization strategies. Methods 2024; 229:1-8. [PMID: 38768932 DOI: 10.1016/j.ymeth.2024.04.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 03/15/2024] [Accepted: 04/30/2024] [Indexed: 05/22/2024] Open
Abstract
SARS-CoV-2's global spread has instigated a critical health and economic emergency, impacting countless individuals. Understanding the virus's phosphorylation sites is vital to unravel the molecular intricacies of the infection and subsequent changes in host cellular processes. Several computational methods have been proposed to identify phosphorylation sites, typically focusing on specific residues, either S/T or Y phosphorylation sites. Unfortunately, current predictive tools perform best on these specific residues and may not extend their efficacy to other residues, emphasizing the urgent need for enhanced methodologies. In this study, we developed a novel predictor that integrates phosphorylation site information for all three residues (S, T, and Y). We extracted ten different feature descriptors, primarily derived from composition, evolutionary, and position-specific information, and assessed their discriminative power through five classifiers. Our results indicated that Light Gradient Boosting (LGB) showed superior performance, and five descriptors displayed excellent discriminative capabilities. Subsequently, we identified the top two integrated features with high discriminative capability and trained them with LGB to develop the final prediction model, LGB-IPs. The proposed approach shows excellent performance on 10-fold cross-validation, with ACC, MCC, and AUC values of 0.831, 0.662, and 0.907, respectively. Notably, this performance is replicated in the independent evaluation. Consequently, our approach may provide valuable insights into the phosphorylation mechanisms in SARS-CoV-2 infection for biomedical researchers.
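For readers less familiar with the metrics quoted in this abstract, the following minimal Python sketch shows how ACC and MCC are derived from the four confusion-matrix counts. The counts are invented for illustration and are not taken from the paper; the function names are likewise illustrative.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient in [-1, 1].

    Returns 0.0 when the denominator is zero (a common convention
    for the case where a whole row or column of the matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a binary phosphorylation-site classifier.
tp, tn, fp, fn = 80, 85, 15, 20
print(f"ACC = {accuracy(tp, tn, fp, fn):.3f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike ACC, MCC uses all four cells of the confusion matrix in both numerator and denominator, which is why it is often reported alongside ACC when classes may be imbalanced.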
Affiliation(s)
- Mumdooh J Sabir
- Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Majid Rasool Kamli
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ahmed Atef
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Alawiah M Alhibshi
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Sherif Edris
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Nahid H Hajarah
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ahmed Bahieldin
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea.
| | - Jamal S M Sabir
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia.
| |
|
21
|
Guevara-Barrientos D, Kaundal R. Malivhu: A Comprehensive Bioinformatics Resource for Filtering SARS and MERS Virus Proteins by Their Classification, Family and Species, and Prediction of Their Interactions Against Human Proteins. Bioinform Biol Insights 2024; 18:11779322241263671. [PMID: 39148721 PMCID: PMC11325310 DOI: 10.1177/11779322241263671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2024] [Accepted: 06/04/2024] [Indexed: 08/17/2024] Open
Abstract
The COVID-19 pandemic is still ongoing, having taken more than 6 million human lives, and it seems that the world will have to learn how to live with the virus. Consequently, there is a need to develop different treatments against it, not only vaccines but also new medicines. Human-virus protein-protein interactions (PPIs) play a key part in drug-target discovery, but finding them experimentally can be costly and sometimes unreliable. Therefore, computational methods have arisen as a powerful alternative for predicting these interactions, reducing costs and helping researchers confirm only certain interactions instead of trying all possible combinations in the laboratory. Malivhu is a tool that predicts human-virus PPIs through a 4-phase process using machine learning models, where phase 1 filters ssRNA(+) class virus proteins, phase 2 filters Coronaviridae family proteins, phase 3 filters severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) species proteins, and phase 4 predicts human-SARS-CoV/SARS-CoV-2/MERS protein-protein interactions. The performance of the models was measured with Matthews correlation coefficient, F1-score, specificity, sensitivity, and accuracy scores, yielding accuracies of 99.07%, 99.83%, and 100% for the first 3 phases, respectively, and 94.24% for human-SARS-CoV PPI, 94.50% for human-SARS-CoV-2 PPI, and 95.45% for human-MERS PPI on independent testing. All the prediction models developed for each of the 4 phases were implemented as a web server, which is freely available at https://kaabil.net/malivhu/.
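The phased filtering described in this abstract can be pictured as a cascade in which each phase only sees the proteins that passed the previous one. The Python sketch below is a toy illustration under stated assumptions: the three filter functions are hypothetical stand-ins that read precomputed labels, whereas the actual tool uses trained machine learning models on sequence-derived input; the record fields and species labels are also invented.

```python
from typing import Callable, Dict, List

Protein = Dict[str, str]
Phase = Callable[[Protein], bool]


# Hypothetical stand-ins for the three trained filter models.
def is_ssrna_positive(p: Protein) -> bool:
    return p["virus_class"] == "ssRNA(+)"


def is_coronaviridae(p: Protein) -> bool:
    return p["family"] == "Coronaviridae"


def is_sars_or_mers(p: Protein) -> bool:
    return p["species"] in {"SARS-CoV", "SARS-CoV-2", "MERS-CoV"}


def cascade(proteins: List[Protein], phases: List[Phase]) -> List[Protein]:
    """Apply the phases in order; a protein must pass every phase to survive."""
    survivors = list(proteins)
    for phase in phases:
        survivors = [p for p in survivors if phase(p)]
    return survivors


# Invented example records.
proteins = [
    {"id": "P1", "virus_class": "ssRNA(+)", "family": "Coronaviridae", "species": "SARS-CoV-2"},
    {"id": "P2", "virus_class": "ssRNA(+)", "family": "Flaviviridae", "species": "HCV"},
    {"id": "P3", "virus_class": "dsDNA", "family": "Herpesviridae", "species": "HSV1"},
]

candidates = cascade(proteins, [is_ssrna_positive, is_coronaviridae, is_sars_or_mers])
print([p["id"] for p in candidates])  # only these would reach the phase 4 PPI predictor
```

Ordering the phases from broadest to narrowest means the interaction predictor in the final phase is only asked about proteins it was designed for.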
Affiliation(s)
- David Guevara-Barrientos
- Department of Computer Science, College of Science, Utah State University, Logan, UT, USA
- Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT, USA
| | - Rakesh Kaundal
- Department of Computer Science, College of Science, Utah State University, Logan, UT, USA
- Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT, USA
- Department of Plants, Soils & Climate, College of Agriculture and Applied Sciences, Utah State University, Logan, UT, USA
| |
|
22
|
Basith S, Pham NT, Manavalan B, Lee G. SEP-AlgPro: An efficient allergen prediction tool utilizing traditional machine learning and deep learning techniques with protein language model features. Int J Biol Macromol 2024; 273:133085. [PMID: 38871100 DOI: 10.1016/j.ijbiomac.2024.133085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2023] [Revised: 05/20/2024] [Accepted: 06/09/2024] [Indexed: 06/15/2024]
Abstract
Allergy is a hypersensitive condition in which individuals develop objective symptoms when exposed to harmless substances at a dose that would cause no harm to a "normal" person. Most current computational methods for allergen identification rely on homology or conventional machine learning using a limited set of feature descriptors or validation on specific datasets, making them inefficient and inaccurate. Here, we propose SEP-AlgPro for the accurate identification of allergen proteins from sequence information. We analyzed 10 conventional protein-based features and 14 different features derived from protein language models to gauge their effectiveness in differentiating allergens from non-allergens using 15 different classifiers. However, the final optimized model employs the top 10 feature descriptors with the top seven machine learning classifiers. Results show that the features derived from protein language models exhibit superior discriminative capabilities compared to traditional feature sets. This enabled us to select the most discriminatory baseline models, whose predicted outputs were aggregated and used as input to a deep neural network for the final allergen prediction. Extensive case studies showed that SEP-AlgPro outperforms state-of-the-art predictors in accurately identifying allergens. A user-friendly web server was developed and made freely available at https://balalab-skku.org/SEP-AlgPro/, making it a powerful tool for identifying potential allergens.
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea.
| | - Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Republic of Korea
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Republic of Korea.
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea; Department of Molecular Science and Technology, Ajou University, Suwon 16499, Republic of Korea.
| |
|
23
|
Pham NT, Terrance AT, Jeon YJ, Rakkiyappan R, Manavalan B. ac4C-AFL: A high-precision identification of human mRNA N4-acetylcytidine sites based on adaptive feature representation learning. Mol Ther Nucleic Acids 2024; 35:102192. [PMID: 38779332 PMCID: PMC11108997 DOI: 10.1016/j.omtn.2024.102192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 04/18/2024] [Indexed: 05/25/2024]
Abstract
RNA N4-acetylcytidine (ac4C) is a highly conserved RNA modification that plays a crucial role in controlling mRNA stability, processing, and translation. Consequently, accurate identification of ac4C sites across the genome is critical for understanding gene expression regulation mechanisms. In this study, we have developed ac4C-AFL, a bioinformatics tool that precisely identifies ac4C sites from primary RNA sequences. In ac4C-AFL, we identified the optimal sequence length for model building and implemented an adaptive feature representation strategy that is capable of extracting the most representative features from RNA. To identify the most relevant features, we proposed a novel ensemble feature importance scoring strategy to rank features effectively. We then used this information to conduct a sequential forward search, which individually determines the optimal feature set from the 16 sequence-derived feature descriptors. Utilizing these optimal feature descriptors, we constructed 176 baseline models using 11 popular classifiers. The most efficient baseline models were identified using a two-step feature selection approach, and their predicted scores were integrated and trained with the appropriate classifier to develop the final prediction model. Our rigorous cross-validations and independent tests demonstrate that ac4C-AFL surpasses contemporary tools in predicting ac4C sites. Moreover, we have developed a publicly accessible web server at https://balalab-skku.org/ac4C-AFL/.
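A sequential forward search over a ranked feature list, as mentioned in this abstract, can be sketched as follows. This is a generic illustration, not the authors' implementation: the feature names, the per-feature gains, and the penalised toy scoring function are all assumptions, and in practice the score would typically be a cross-validated performance estimate.

```python
from typing import Callable, List, Sequence, Tuple


def sequential_forward_search(
    ranked: Sequence[str],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Add features in ranked order and return the best-scoring prefix."""
    best_subset: List[str] = []
    best_score = float("-inf")
    current: List[str] = []
    for feature in ranked:
        current.append(feature)
        s = score(current)
        if s > best_score:
            best_subset, best_score = list(current), s
    return best_subset, best_score


# Toy stand-in for a cross-validated score: each feature contributes a
# made-up gain, and every added feature costs a small complexity penalty.
GAINS = {"f1": 0.30, "f2": 0.20, "f3": 0.02, "f4": 0.01}


def toy_score(features: List[str]) -> float:
    return sum(GAINS[f] for f in features) - 0.05 * len(features)


# Features are assumed to be pre-sorted by importance (most important first).
subset, value = sequential_forward_search(["f1", "f2", "f3", "f4"], toy_score)
print(subset, round(value, 3))
```

Because the candidates are evaluated in ranked order, the search needs only one score evaluation per feature rather than one per possible subset.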
Affiliation(s)
- Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do 16419, Republic of Korea
| | - Annie Terrina Terrance
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do 16419, Republic of Korea
| | - Young-Jun Jeon
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do 16419, Republic of Korea
| | - Rajan Rakkiyappan
- Department of Mathematics, Bharathiar University, Coimbatore, Tamil Nadu 641046, India
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Gyeonggi-do 16419, Republic of Korea
| |
|
24
|
Gao M, Zhang D, Chen Y, Zhang Y, Wang Z, Wang X, Li S, Guo Y, Webb GI, Nguyen ATN, May L, Song J. GraphormerDTI: A graph transformer-based approach for drug-target interaction prediction. Comput Biol Med 2024; 173:108339. [PMID: 38547658 DOI: 10.1016/j.compbiomed.2024.108339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2023] [Revised: 03/05/2024] [Accepted: 03/17/2024] [Indexed: 04/17/2024]
Abstract
The application of Artificial Intelligence (AI) to screen drug molecules with potential therapeutic effects has revolutionized the drug discovery process, with significantly lower economic cost and time consumption than the traditional drug discovery pipeline. With the great power of AI, it is possible to rapidly search the vast chemical space for potential drug-target interactions (DTIs) between candidate drug molecules and disease protein targets. However, only a small proportion of molecules have labelled DTIs, consequently limiting the performance of AI-based drug screening. To solve this problem, a machine learning-based approach with great ability to generalize DTI prediction across molecules is desirable. Many existing machine learning approaches for DTI identification failed to exploit the full information with respect to the topological structures of candidate molecules. To develop a better approach for DTI prediction, we propose GraphormerDTI, which employs the powerful Graph Transformer neural network to model molecular structures. GraphormerDTI embeds molecular graphs into vector-format representations through iterative Transformer-based message passing, which encodes molecules' structural characteristics by node centrality encoding, node spatial encoding and edge encoding. With a strong structural inductive bias, the proposed GraphormerDTI approach can effectively infer informative representations for out-of-sample molecules and as such, it is capable of predicting DTIs across molecules with an exceptional performance. GraphormerDTI integrates the Graph Transformer neural network with a 1-dimensional Convolutional Neural Network (1D-CNN) to extract the drugs' and target proteins' representations and leverages an attention mechanism to model the interactions between them. 
To examine GraphormerDTI's performance for DTI prediction, we conduct experiments on three benchmark datasets, where GraphormerDTI outperforms five state-of-the-art baselines for out-of-molecule DTI prediction, namely GNN-CPI, GNN-PT, DeepEmbedding-DTI, MolTrans and HyperAttentionDTI, and is on a par with the best baseline for transductive DTI prediction. The source code and datasets are publicly accessible at https://github.com/mengmeng34/GraphormerDTI.
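To make the structural encodings named in this abstract more concrete, the Python sketch below computes the two graph quantities that the original Graphormer architecture uses as lookup indices: node degree (for centrality encoding) and pairwise shortest-path distance (for spatial encoding). It is assumed here, not confirmed by the abstract, that GraphormerDTI follows the same convention; the three-atom example graph and the function names are illustrative only.

```python
from collections import deque
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def degrees(n: int, edges: Sequence[Edge]) -> List[int]:
    """Degree of every node in an undirected graph with n nodes."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def shortest_path_matrix(n: int, edges: Sequence[Edge]) -> List[List[int]]:
    """All-pairs shortest-path lengths via BFS; -1 marks unreachable pairs."""
    adj: List[List[int]] = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[-1] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if dist[source][w] == -1:
                    dist[source][w] = dist[source][u] + 1
                    queue.append(w)
    return dist


# Heavy-atom graph of ethanol: C(0) - C(1) - O(2).
edges = [(0, 1), (1, 2)]
print(degrees(3, edges))
print(shortest_path_matrix(3, edges))
```

In a model, these integers would index learned embedding tables whose outputs are added to node features and attention scores; that learned part is omitted here.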
Affiliation(s)
- Mengmeng Gao
- School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
| | - Daokun Zhang
- Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Melbourne, Australia.
| | - Yi Chen
- School of Biological Science and Medical Engineering, Southeast University, Nanjing, China.
| | - Yiwen Zhang
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC, 3004, Australia
| | - Zhikang Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Xiaoyu Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Shanshan Li
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC, 3004, Australia
| | - Yuming Guo
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC, 3004, Australia
| | - Geoffrey I Webb
- Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Melbourne, Australia
| | - Anh T N Nguyen
- Drug Discovery Biology Theme, Monash Institute of Pharmaceutical Sciences, Monash University, Melbourne, Australia
| | - Lauren May
- Drug Discovery Biology Theme, Monash Institute of Pharmaceutical Sciences, Monash University, Melbourne, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia.
| |
|
25
|
Guan J, Yao L, Xie P, Chung CR, Huang Y, Chiang YC, Lee TY. A two-stage computational framework for identifying antiviral peptides and their functional types based on contrastive learning and multi-feature fusion strategy. Brief Bioinform 2024; 25:bbae208. [PMID: 38706321 PMCID: PMC11070730 DOI: 10.1093/bib/bbae208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 03/14/2024] [Accepted: 04/17/2024] [Indexed: 05/07/2024] Open
Abstract
Antiviral peptides (AVPs) have shown potential in inhibiting viral attachment, preventing viral fusion with host cells and disrupting viral replication due to their unique action mechanisms. They have now become a broad-spectrum, promising antiviral therapy. However, identifying effective AVPs is traditionally slow and costly. This study proposed a new two-stage computational framework for AVP identification. The first stage identifies AVPs from a wide range of peptides, and the second stage recognizes AVPs targeting specific families or viruses. This method integrates contrastive learning and a multi-feature fusion strategy, focusing on sequence information and peptide characteristics, significantly enhancing predictive ability and interpretability. The evaluation results of the model show excellent performance, with an accuracy of 0.9240 and a Matthews correlation coefficient (MCC) score of 0.8482 on the non-AVP independent dataset, and an accuracy of 0.9934 and an MCC score of 0.9869 on the non-AMP independent dataset. Furthermore, our model can predict antiviral activities of AVPs against six key viral families (Coronaviridae, Retroviridae, Herpesviridae, Paramyxoviridae, Orthomyxoviridae, Flaviviridae) and eight viruses (FIV, HCV, HIV, HPIV3, HSV1, INFVA, RSV, SARS-CoV). Finally, to facilitate user accessibility, we built a user-friendly web interface deployed at https://awi.cuhk.edu.cn/~dbAMP/AVP/.
Affiliation(s)
- Jiahui Guan
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Lantian Yao
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- School of Science and Engineering, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Peilin Xie
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Chia-Ru Chung
- Department of Computer Science and Information Engineering, National Central University, 320317 Taoyuan, Taiwan
| | - Yixian Huang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Ying-Chih Chiang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
| | - Tzong-Yi Lee
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, 300093 Hsinchu, Taiwan
- Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, 300093 Hsinchu, Taiwan
| |
|
26
|
Biró B, Gál Z, Fekete Z, Klecska E, Hoffmann OI. Mitochondrial genome plasticity of mammalian species. BMC Genomics 2024; 25:278. [PMID: 38486136 PMCID: PMC10941376 DOI: 10.1186/s12864-024-10201-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 03/08/2024] [Indexed: 03/17/2024] Open
Abstract
There is an ongoing process in which mitochondrial sequences are being integrated into the nuclear genome. The importance of these sequences has already been revealed in cancer biology, forensics, phylogenetic studies and the evolution of eukaryotic genetic information. The genomes of humans and numerous model organisms have been described from the point of view of these sequences. Furthermore, recent studies have been published on the patterns of these nuclear localised mitochondrial sequences in different taxa. However, the results of previously released studies are difficult to compare due to the lack of standardised methods and/or the use of small numbers of genomes. Therefore, in this paper our primary goal is to establish a uniform mining pipeline to explore these nuclear localised mitochondrial sequences. Our results show that the frequency of several repetitive elements is higher in the flanking regions of these sequences than expected. A machine learning model reveals that the flanking regions' repetitive elements and different structural characteristics are highly influential during the integration process. In this paper, we introduce a general mining pipeline for all mammalian genomes. The workflow is publicly available and is believed to serve as a validated baseline for future research in this field. On what is, to our current knowledge, the largest dataset, we confirm the widespread opinion that structural circumstances and events corresponding to repetitive elements are highly significant. An accurate model has also been trained to predict these sequences and their corresponding flanking regions.
Affiliation(s)
- Bálint Biró
- Agribiotechnology and Precision Breeding for Food Security National Laboratory, Department of Animal Biotechnology, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert str. 4, 2100, Gödöllő, Hungary.
- Group BM, Data Insights Team, _VOIS, Kerepesi str. 35, 1087, Budapest, Hungary.
| | - Zoltán Gál
- Agribiotechnology and Precision Breeding for Food Security National Laboratory, Department of Animal Biotechnology, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert str. 4, 2100, Gödöllő, Hungary
| | - Zsófia Fekete
- Department of Genetics and Genomics, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert str. 4, 2100, Gödöllő, Hungary
| | - Eszter Klecska
- FamiCord Group, Krio Institute, Kelemen László str, 1026, Budapest, Hungary
| | - Orsolya Ivett Hoffmann
- Agribiotechnology and Precision Breeding for Food Security National Laboratory, Department of Animal Biotechnology, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert str. 4, 2100, Gödöllő, Hungary.
| |
|
27
|
Meng J, Liu J, Song W, Li H, Wang J, Zhang L, Peng Y, Wu A, Jiang T. PREDAC-CNN: predicting antigenic clusters of seasonal influenza A viruses with convolutional neural network. Brief Bioinform 2024; 25:bbae033. [PMID: 38343322 PMCID: PMC10859661 DOI: 10.1093/bib/bbae033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2023] [Revised: 01/13/2024] [Accepted: 01/18/2024] [Indexed: 02/15/2024] Open
Abstract
Vaccination stands as the most effective and economical strategy for prevention and control of influenza. The primary target of neutralizing antibodies is the surface antigen hemagglutinin (HA). However, ongoing mutations in the HA sequence result in antigenic drift. The success of a vaccine is contingent on its antigenic congruence with circulating strains. Thus, predicting antigenic variants and deducing antigenic clusters of influenza viruses are pivotal for recommendation of vaccine strains. The antigenicity of influenza A viruses is determined by the interplay of amino acids in the HA1 sequence. In this study, we exploit the ability of convolutional neural networks (CNNs) to extract spatial feature representations in the convolutional layers, which can discern interactions between amino acid sites. We introduce PREDAC-CNN, a model designed to track antigenic evolution of seasonal influenza A viruses. Accessible at http://predac-cnn.cloudna.cn, PREDAC-CNN formulates a spatially oriented representation of the HA1 sequence, optimized for the convolutional framework. It effectively probes interactions among amino acid sites in the HA1 sequence. Also, PREDAC-CNN focuses exclusively on physicochemical attributes crucial for the antigenicity of influenza viruses, thereby eliminating unnecessary amino acid embeddings. Together, PREDAC-CNN is adept at capturing interactions of amino acid sites within the HA1 sequence and examining the collective impact of point mutations on antigenic variation. Through 5-fold cross-validation and retrospective testing, PREDAC-CNN has shown superior performance in predicting antigenic variants compared to its counterparts. Additionally, PREDAC-CNN has been instrumental in identifying predominant antigenic clusters for A/H3N2 (1968-2023) and A/H1N1 (1977-2023) viruses, significantly aiding in vaccine strain recommendation.
Affiliation(s)
- Jing Meng
- State Key Laboratory of Common Mechanism Research for Major Diseases, Suzhou Institute of Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Suzhou 215123, Jiangsu, China
| | - Jingze Liu
- State Key Laboratory of Common Mechanism Research for Major Diseases, Suzhou Institute of Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Suzhou 215123, Jiangsu, China
| | - Wenkai Song
- College of Computer Science, Sichuan University, Chengdu 610065, China
| | - Honglei Li
- Beijing Cloudna Technology Company, Limited, Beijing 100029, China
| | | | - Le Zhang
- College of Computer Science, Sichuan University, Chengdu 610065, China
| | - Yousong Peng
- College of Biology, Hunan University, Changsha 410082, China
| | - Aiping Wu
- State Key Laboratory of Common Mechanism Research for Major Diseases, Suzhou Institute of Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Suzhou 215123, Jiangsu, China
| | - Taijiao Jiang
- State Key Laboratory of Common Mechanism Research for Major Diseases, Suzhou Institute of Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Suzhou 215123, Jiangsu, China
- Guangzhou National Laboratory, Guangzhou 510005, China
- State Key Laboratory of Respiratory Disease, the First Affiliated Hospital of Guangzhou Medical University, Guangzhou Medical University, Guangzhou, 510120, China
| |
|
28
|
Du Z, Ding X, Hsu W, Munir A, Xu Y, Li Y. pLM4ACE: A protein language model based predictor for antihypertensive peptide screening. Food Chem 2024; 431:137162. [PMID: 37604011 DOI: 10.1016/j.foodchem.2023.137162] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/13/2023] [Revised: 08/09/2023] [Accepted: 08/13/2023] [Indexed: 08/23/2023]
Abstract
Angiotensin-I converting enzyme (ACE) regulates the renin-angiotensin system and is a drug target in clinical treatment for hypertension. This study aims to develop a protein language model (pLM) with evolutionary scale modeling (ESM-2) embeddings that is trained on experimental data to screen peptides with strong ACE inhibitory activity. Twelve conventional peptide embedding approaches and five machine learning (ML) modeling methods were also tested for performance comparison. Among the 65 classifiers tested, logistic regression with ESM-2 embeddings showed the best performance, with balanced accuracy (BACC), Matthews correlation coefficient (MCC), and area under the curve of 0.883 ± 0.017, 0.77 ± 0.032, and 0.96 ± 0.009, respectively. Multilayer perceptron and support vector machine also exhibited great compatibility with ESM-2 embeddings. The ESM-2 embeddings showed superior performance in enhancing the prediction model compared to the 12 traditional embedding methods. A user-friendly webserver (https://sqzujiduce.us-east-1.awsapprunner.com) with the top three models is now freely available.
Affiliation(s)
- Zhenjiao Du
- Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA
| | - Xingjian Ding
- Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA
| | - William Hsu
- Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA
| | - Arslan Munir
- Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA
| | - Yixiang Xu
- Healthy Processed Foods Research Unit, Western Regional Research Center, USDA-ARS, 800 Buchanan Street, Albany, CA 94710, USA
| | - Yonghui Li
- Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA.
| |
|
29
|
Haselbeck F, John M, Zhang Y, Pirnay J, Fuenzalida-Werner J, Costa R, Grimm D. Superior protein thermophilicity prediction with protein language model embeddings. NAR Genom Bioinform 2023; 5:lqad087. [PMID: 37829176 PMCID: PMC10566323 DOI: 10.1093/nargab/lqad087] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 04/14/2023] [Revised: 07/14/2023] [Accepted: 09/18/2023] [Indexed: 10/14/2023] Open
Abstract
Protein thermostability is important in many areas of biotechnology, including enzyme engineering and protein-hybrid optoelectronics. Ever-growing protein databases and information on stability at different temperatures allow the training of machine learning models to predict whether proteins are thermophilic. In silico predictions could reduce costs and accelerate the development process by guiding researchers to more promising candidates. Existing models for predicting protein thermophilicity rely mainly on features derived from physicochemical properties. Recently, modern protein language models that directly use sequence information have demonstrated superior performance in several tasks. In this study, we evaluate the usefulness of protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. ProLaTherm significantly outperforms all feature-, sequence- and literature-based comparison partners on multiple evaluation metrics. In terms of the Matthews correlation coefficient, ProLaTherm outperforms the second-best competitor by 18.1% in a nested cross-validation setup. Using proteins from species not overlapping with species from the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one nonthermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
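The nested cross-validation setup mentioned in this abstract separates model selection (inner folds) from performance estimation (outer folds). The following is a minimal sketch of the index bookkeeping only, under stated assumptions: the function names are hypothetical, folds are assigned round-robin without shuffling or stratification, and nothing here reflects the authors' actual fold counts or their species-aware splitting.

```python
def kfold_indices(items, k):
    """Split a list into k disjoint, roughly equal folds (round-robin)."""
    return [items[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only from
    outer_train, so the outer test fold never influences model selection.
    """
    outer_folds = kfold_indices(list(range(n_samples)), outer_k)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [idx for j, fold in enumerate(outer_folds)
                       if j != i for idx in fold]
        inner_folds = kfold_indices(outer_train, inner_k)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [idx for j, fold in enumerate(inner_folds)
                           if j != m for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# Usage: tune hyperparameters on the inner splits, then refit on outer_train
# and score once on outer_test; the outer scores are averaged at the end.
for outer_train, outer_test, inner_splits in nested_cv_splits(20):
    print(len(outer_train), len(outer_test), len(inner_splits))
```

In practice the indices would normally be shuffled or stratified by class before splitting; that step is left out to keep the sketch short.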
Affiliation(s)
- Florian Haselbeck
- Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Bioinformatics, 94315 Straubing, Germany
- Weihenstephan-Triesdorf University of Applied Sciences, Bioinformatics, 94315 Straubing, Germany
- Maura John
- Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Bioinformatics, 94315 Straubing, Germany
- Weihenstephan-Triesdorf University of Applied Sciences, Bioinformatics, 94315 Straubing, Germany
- Yuqi Zhang
- Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Bioinformatics, 94315 Straubing, Germany
- Jonathan Pirnay
- Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Bioinformatics, 94315 Straubing, Germany
- Weihenstephan-Triesdorf University of Applied Sciences, Bioinformatics, 94315 Straubing, Germany
- Juan Pablo Fuenzalida-Werner
- Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Chair of Biogenic Functional Materials, 94315 Straubing, Germany
- Rubén D Costa
- Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Chair of Biogenic Functional Materials, 94315 Straubing, Germany
- Dominik G Grimm
- Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Bioinformatics, 94315 Straubing, Germany
- Weihenstephan-Triesdorf University of Applied Sciences, Bioinformatics, 94315 Straubing, Germany
- Technical University of Munich, TUM School of Computation, Information and Technology (CIT), 85748 Garching, Germany
30
Pham NT, Phan LT, Seo J, Kim Y, Song M, Lee S, Jeon YJ, Manavalan B. Advancing the accuracy of SARS-CoV-2 phosphorylation site detection via meta-learning approach. Brief Bioinform 2023; 25:bbad433. [PMID: 38058187 PMCID: PMC10753650 DOI: 10.1093/bib/bbad433] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 05/10/2023] [Revised: 10/30/2023] [Accepted: 11/05/2023] [Indexed: 12/08/2023] Open
Abstract
The worldwide appearance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has generated significant concern and posed a considerable challenge to global health. Phosphorylation is a common post-translational modification that affects many vital cellular functions and is closely associated with SARS-CoV-2 infection. Precise identification of phosphorylation sites could provide more in-depth insight into the processes underlying SARS-CoV-2 infection and help alleviate the continuing COVID-19 crisis. Currently, available computational tools for predicting these sites lack accuracy and effectiveness. In this study, we designed an innovative meta-learning model, Meta-Learning for Serine/Threonine Phosphorylation (MeL-STPhos), to precisely identify protein phosphorylation sites. We initially performed a comprehensive assessment of 29 unique sequence-derived features, establishing prediction models for each using 14 renowned machine learning methods, ranging from traditional classifiers to advanced deep learning algorithms. We then selected the most effective model for each feature by integrating the predicted values. Rigorous feature selection strategies were employed to identify the optimal base models and classifier(s) for each cell-specific dataset. To the best of our knowledge, this is the first study to report two cell-specific models and a generic model for phosphorylation site prediction by utilizing an extensive range of sequence-derived features and machine learning algorithms. Extensive cross-validation and independent testing revealed that MeL-STPhos surpasses existing state-of-the-art tools for phosphorylation site prediction. We also developed a publicly accessible platform at https://balalab-skku.org/MeL-STPhos. We believe that MeL-STPhos will serve as a valuable tool for accelerating the discovery of serine/threonine phosphorylation sites and elucidating their role in post-translational regulation.
Affiliation(s)
- Nhat Truong Pham
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Le Thi Phan
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Jimin Seo
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Yeonwoo Kim
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Minkyung Song
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Sukchan Lee
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Young-Jun Jeon
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
- Balachandran Manavalan
- Department of Integrative Biotechnology and of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
31
Basith S, Pham NT, Song M, Lee G, Manavalan B. ADP-Fuse: A novel two-layer machine learning predictor to identify antidiabetic peptides and diabetes types using multiview information. Comput Biol Med 2023; 165:107386. [PMID: 37619323 DOI: 10.1016/j.compbiomed.2023.107386] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 06/22/2023] [Revised: 08/03/2023] [Accepted: 08/14/2023] [Indexed: 08/26/2023]
Abstract
Diabetes mellitus has become a major public health concern associated with high mortality and reduced life expectancy and can cause blindness, heart attacks, kidney failure, lower limb amputations, and strokes. A new generation of antidiabetic peptides (ADPs) that act on β-cells or T-cells to regulate insulin production is being developed to alleviate the effects of diabetes. However, the lack of effective peptide-mining tools has hampered the discovery of these promising drugs. Hence, novel computational tools need to be developed urgently. In this study, we present ADP-Fuse, a novel two-layer prediction framework capable of accurately identifying ADPs or non-ADPs and categorizing them into type 1 and type 2 ADPs. First, we comprehensively evaluated 22 peptide sequence-derived features coupled with eight notable machine learning algorithms. Subsequently, the most suitable feature descriptors and classifiers for both layers were identified. The output of these single-feature models, embedded with multiview information, was trained with an appropriate classifier to provide the final prediction. Comprehensive cross-validation and independent tests substantiate that ADP-Fuse surpasses single-feature models and the feature fusion approach for the prediction of ADPs and their types. In addition, the SHapley Additive exPlanation method was used to elucidate the contributions of individual features to the prediction of ADPs and their types. Finally, a user-friendly web server for ADP-Fuse was developed and made publicly accessible (https://balalab-skku.org/ADP-Fuse), enabling the swift screening and identification of novel ADPs and their types. This framework is expected to contribute significantly to antidiabetic peptide identification.
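The two-layer idea described in this abstract, in which the outputs of single-feature models become the inputs of a final classifier, is a form of stacking. The sketch below illustrates that general pattern in plain Python and is not the ADP-Fuse implementation: the centroid base learner, the logistic-regression meta-learner, the fold scheme, and the toy data are all assumptions chosen for brevity.

```python
import math


def fit_centroid(xs, ys):
    """Base learner for one feature view: store the mean value of each class."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def predict_centroid(model, x):
    """Score in (0, 1); above 0.5 when x is closer to the positive-class mean."""
    mu_pos, mu_neg = model
    return 1.0 / (1.0 + math.exp(abs(x - mu_pos) - abs(x - mu_neg)))


def out_of_fold_scores(xs, ys, k=3):
    """Score each sample with a base model trained on the other folds only."""
    n = len(xs)
    scores = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        model = fit_centroid([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, n, k):
            scores[i] = predict_centroid(model, xs[i])
    return scores


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, ys, lr=0.5, epochs=2000):
    """Meta-learner: logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for row, y in zip(rows, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, row)]
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b


def predict_logistic(model, row):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)


# Toy data: two hypothetical feature views of 12 peptides; labels alternate 1/0.
labels = [1, 0] * 6
view_a = [2.0, 0.1, 1.8, 0.3, 2.2, 0.2, 1.9, 0.0, 2.1, 0.4, 1.7, 0.2]
view_b = [5.1, 3.0, 4.8, 3.2, 5.3, 2.9, 4.9, 3.1, 5.0, 2.8, 5.2, 3.3]

# Layer 1: one base model per view, scored out-of-fold.
meta_features = list(zip(out_of_fold_scores(view_a, labels),
                         out_of_fold_scores(view_b, labels)))
# Layer 2: the meta-learner is trained on the stacked base-model scores.
meta_model = fit_logistic(meta_features, labels)
final_scores = [predict_logistic(meta_model, row) for row in meta_features]
```

Scoring the base models out-of-fold keeps the meta-learner from being trained on predictions that the base models made for their own training samples, which would otherwise overstate how reliable those predictions are.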
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon, 16499, Republic of Korea
- Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea
- Minkyung Song
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea; Department of Biopharmaceutical Convergence, Sungkyunkwan University, Suwon, 16419, Republic of Korea.
- Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon, 16499, Republic of Korea; Department of Molecular Science and Technology, Ajou University, Suwon, 16499, Republic of Korea.
- Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea.
32
Liang Z, Liu T, Li Q, Zhang G, Zhang B, Du X, Liu J, Chen Z, Ding H, Hu G, Lin H, Zhu F, Luo C. Deciphering the functional landscape of phosphosites with deep neural network. Cell Rep 2023; 42:113048. [PMID: 37659078 DOI: 10.1016/j.celrep.2023.113048] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 05/19/2023] [Revised: 07/11/2023] [Accepted: 08/11/2023] [Indexed: 09/04/2023] Open
Abstract
Current biochemical approaches have only identified the most well-characterized kinases for a tiny fraction of the phosphoproteome, and the functional assignments of phosphosites are almost negligible. Herein, we analyze the substrate preference catalyzed by a specific kinase and present a novel integrated deep neural network model named FuncPhos-SEQ for functional assignment of human proteome-level phosphosites. FuncPhos-SEQ incorporates phosphosite motif information from a protein sequence using multiple convolutional neural network (CNN) channels and network features from protein-protein interactions (PPIs) using network embedding and deep neural network (DNN) channels. These concatenated features are jointly fed into a heterogeneous feature network to prioritize functional phosphosites. Combined with a series of in vitro and cellular biochemical assays, we confirm that NADK-S48/50 phosphorylation could activate its enzymatic activity. In addition, ERK1/2 are discovered as the primary kinases responsible for NADK-S48/50 phosphorylation. Moreover, FuncPhos-SEQ is developed as an online server.
Affiliation(s)
- Zhongjie Liang
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
- Tonghai Liu
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Qi Li
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Guangyu Zhang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Bei Zhang
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Xikun Du
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China
- Jingqiu Liu
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Zhifeng Chen
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Hong Ding
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Guang Hu
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
- Hao Lin
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Fei Zhu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
- Cheng Luo
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China; School of Life Science and Technology, Shanghai Tech University, 100 Haike Road, Shanghai 201210, China; School of Pharmacy, Fujian Medical University, Fuzhou 350122, China.
33
Sun J, Kulandaisamy A, Ru J, Gromiha MM, Cribbs AP. TMKit: a Python interface for computational analysis of transmembrane proteins. Brief Bioinform 2023; 24:bbad288. [PMID: 37594311 PMCID: PMC10516361 DOI: 10.1093/bib/bbad288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 07/07/2023] [Accepted: 07/18/2023] [Indexed: 08/19/2023] Open
Abstract
Transmembrane proteins are receptors, enzymes, transporters and ion channels that are instrumental in regulating a variety of cellular activities, such as signal transduction and cell communication. Despite tremendous progress in computational capacities to support protein research, there is still a significant gap in the availability of specialized computational analysis toolkits for transmembrane protein research. Here, we introduce TMKit, an open-source Python programming interface that is modular, scalable and specifically designed for processing transmembrane protein data. TMKit is a one-stop computational analysis tool for transmembrane proteins, enabling users to perform database wrangling, engineer features at the mutational, domain and topological levels, and visualize protein-protein interaction interfaces. In addition, TMKit includes seqNetRR, a high-performance computing library that allows customized construction of a large number of residue connections. This library is particularly well suited for assigning correlation matrix-based features at a fast speed. TMKit should serve as a useful tool for researchers in assisting the study of transmembrane protein sequences and structures. TMKit is publicly available through https://github.com/2003100127/tmkit and https://tmkit-guide.herokuapp.com/doc/overview.
Affiliation(s)
- Jianfeng Sun
- Nuffield Department of Orthopedics, Rheumatology, and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford, Headington, Oxford OX3 7LD, UK
- Arulsamy Kulandaisamy
- Department of Biotechnology, Bhupat and Jyoti Mehta School of BioSciences, Indian Institute of Technology Madras, Chennai 600036, Tamil Nadu, India
- Jinlong Ru
- Chair of Prevention of Microbial Diseases, School of Life Sciences Weihenstephan, Technical University of Munich, 85354 Freising, Germany
- M Michael Gromiha
- Department of Biotechnology, Bhupat and Jyoti Mehta School of BioSciences, Indian Institute of Technology Madras, Chennai 600036, Tamil Nadu, India
- Adam P Cribbs
- Nuffield Department of Orthopedics, Rheumatology, and Musculoskeletal Sciences, Botnar Research Centre, University of Oxford, Headington, Oxford OX3 7LD, UK
34
Sun J, Xu M, Ru J, James-Bott A, Xiong D, Wang X, Cribbs AP. Small molecule-mediated targeting of microRNAs for drug discovery: Experiments, computational techniques, and disease implications. Eur J Med Chem 2023; 257:115500. [PMID: 37262996 PMCID: PMC11554572 DOI: 10.1016/j.ejmech.2023.115500] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2023] [Revised: 05/05/2023] [Accepted: 05/15/2023] [Indexed: 06/03/2023]
Abstract
Small molecules have been providing medical breakthroughs for human diseases for more than a century. Recently, identifying small molecule inhibitors that target microRNAs (miRNAs) has gained importance, despite the challenges posed by labour-intensive screening experiments and the significant efforts required for medicinal chemistry optimization. Numerous experimentally verified cases have demonstrated the potential of miRNA-targeted small molecule inhibitors for disease treatment. This new approach is grounded in their posttranscriptional regulation of the expression of disease-associated genes. Reversing dysregulated gene expression using this mechanism may help control dysfunctional pathways. Furthermore, the ongoing improvement of algorithms has allowed for the integration of computational strategies built on top of laboratory-based data, facilitating a more precise and rational design and discovery of lead compounds. To complement the use of extensive pharmacogenomics data in prioritising potential drugs, our previous work introduced a computational approach based on only molecular sequences. Moreover, various computational tools for predicting molecular interactions in biological networks using similarity-based inference techniques have been accumulated in established studies. However, there are a limited number of comprehensive reviews covering both computational and experimental drug discovery processes. In this review, we outline a cohesive overview of both biological and computational applications in miRNA-targeted drug discovery, along with their disease implications and clinical significance. Finally, utilizing drug-target interaction (DTI) data from DrugBank, we showcase the effectiveness of deep learning for obtaining the physicochemical characterization of DTIs.
Affiliation(s)
- Jianfeng Sun
- Botnar Research Centre, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, OX3 7LD, UK.
- Miaoer Xu
- Department of Biology, Emory University, Atlanta, GA, 30322, USA
- Jinlong Ru
- Chair of Prevention of Microbial Diseases, School of Life Sciences Weihenstephan, Technical University of Munich, Freising, 85354, Germany
- Anna James-Bott
- Botnar Research Centre, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, OX3 7LD, UK
- Dapeng Xiong
- Department of Computational Biology, Cornell University, Ithaca, NY, 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY, 14853, USA
- Xia Wang
- College of Animal Science and Technology, Northwest A&F University, Yangling, 712100, China.
- Adam P Cribbs
- Botnar Research Centre, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, OX3 7LD, UK.
35
Chen R, Li F, Guo X, Bi Y, Li C, Pan S, Coin LJM, Song J. ATTIC is an integrated approach for predicting A-to-I RNA editing sites in three species. Brief Bioinform 2023; 24:bbad170. [PMID: 37150785 PMCID: PMC10565902 DOI: 10.1093/bib/bbad170] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 01/31/2023] [Revised: 04/12/2023] [Accepted: 04/14/2023] [Indexed: 05/09/2023] Open
Abstract
A-to-I editing is the most prevalent RNA editing event, which refers to the change of adenosine (A) bases to inosine (I) bases in double-stranded RNAs. Several studies have revealed that A-to-I editing can regulate cellular processes and is associated with various human diseases. Therefore, accurate identification of A-to-I editing sites is crucial for understanding RNA-level (i.e. transcriptional) modifications and their potential roles in molecular functions. To date, various computational approaches for A-to-I editing site identification have been developed; however, their performance is still unsatisfactory and needs further improvement. In this study, we developed a novel stacked-ensemble learning model, ATTIC (A-To-I ediTing predICtor), to accurately identify A-to-I editing sites across three species, including Homo sapiens, Mus musculus and Drosophila melanogaster. We first comprehensively evaluated 37 RNA sequence-derived features combined with 14 popular machine learning algorithms. Then, we selected the optimal base models to build a series of stacked ensemble models. The final ATTIC framework was developed based on the optimal models improved by the feature selection strategy for specific species. Extensive cross-validation and independent tests illustrate that ATTIC outperforms state-of-the-art tools for predicting A-to-I editing sites. We also developed a web server for ATTIC, which is publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/ATTIC/. We anticipate that ATTIC can be utilized as a useful tool to accelerate the identification of A-to-I RNA editing events and help characterize their roles in post-transcriptional regulation.
Affiliation(s)
- Ruyi Chen
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
- Fuyi Li
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
- Xudong Guo
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
- Yue Bi
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
- Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
- Shirui Pan
- School of Information and Communication Technology, Griffith University, QLD 4222, Australia
- Lachlan J M Coin
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
- Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, VIC 3800, Australia
36
Malik A, Shoombuatong W, Kim CB, Manavalan B. GPApred: The first computational predictor for identifying proteins with LPXTG-like motif using sequence-based optimal features. Int J Biol Macromol 2023; 229:529-538. [PMID: 36596370 DOI: 10.1016/j.ijbiomac.2022.12.315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2022] [Revised: 12/19/2022] [Accepted: 12/28/2022] [Indexed: 01/02/2023]
Abstract
The cell surface proteins of gram-positive bacteria are involved in many important biological functions, including the infection of host cells. Owing to their virulent nature, these proteins are also considered strong candidates for potential drug or vaccine targets. Among the various cell surface proteins of gram-positive bacteria, LPXTG-like proteins form a major class. These proteins have a highly conserved C-terminal cell wall sorting signal, which consists of an LPXTG sequence motif, a hydrophobic domain, and a positively charged tail. These surface proteins are targeted to the cell envelope by a sortase enzyme via transpeptidation. A variety of LPXTG-like proteins have been experimentally characterized; however, their number in public databases has increased owing to extensive bacterial genome sequencing without proper annotation. In the absence of experimental characterization, identifying and annotating these sequences is extremely challenging. Therefore, in this study, we developed the first machine learning-based predictor called GPApred, which can identify LPXTG-like proteins from their primary sequences. Using a newly constructed benchmark dataset, we explored different classifiers and five feature encodings and their hybrids. Optimal features were derived using the recursive feature elimination method, and these features were then trained using a support vector machine algorithm. The performance of different models was evaluated using independent datasets, and a final model (GPApred) was selected based on consistency during cross-validation and independent assessment. GPApred can be an effective tool for predicting LPXTG-like sequences and can be further employed for functional characterization or drug targeting. Availability: https://procarb.org/gpapred/.
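To make the LPXTG sequence motif mentioned in this abstract concrete, the sketch below scans the C-terminal region of a protein sequence for the pattern L-P-any residue-T-G. It is a plain pattern match for illustration, not the GPApred predictor, which is a trained machine learning model; the function name, the 50-residue window, and the example sequence are assumptions.

```python
import re

# L, P, any single residue (X), T, G
LPXTG = re.compile(r"LP[A-Z]TG")


def find_lpxtg(sequence, c_terminal_window=50):
    """Return 0-based start positions of LPXTG matches in the C-terminal window."""
    seq = sequence.upper()
    offset = max(0, len(seq) - c_terminal_window)
    return [offset + m.start() for m in LPXTG.finditer(seq[offset:])]


# Hypothetical sequence: the motif is followed by a spacer and a charged tail.
example = "M" + "A" * 60 + "LPETG" + "A" * 20 + "KRRK"
print(find_lpxtg(example))
```

A motif hit alone does not establish a sorting signal, since the abstract also describes a hydrophobic domain and a positively charged tail downstream of the motif; a regular expression such as this one would serve only as a coarse pre-filter.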
Affiliation(s)
- Adeel Malik
- Institute of Intelligence Informatics Technology, Sangmyung University, Seoul 03016, Republic of Korea
- Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Chang-Bae Kim
- Department of Biotechnology, Sangmyung University, Seoul 03016, Republic of Korea.
- Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea.
37
Pande A, Patiyal S, Lathwal A, Arora C, Kaur D, Dhall A, Mishra G, Kaur H, Sharma N, Jain S, Usmani SS, Agrawal P, Kumar R, Kumar V, Raghava GPS. Pfeature: A Tool for Computing Wide Range of Protein Features and Building Prediction Models. J Comput Biol 2023; 30:204-222. [PMID: 36251780 DOI: 10.1089/cmb.2022.0241] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Indexed: 01/24/2023] Open
Abstract
In the last three decades, a wide range of protein features have been discovered to annotate a protein. Numerous attempts have been made to integrate these features in a software package/platform so that the user may compute a wide range of features from a single source. To complement the existing methods, we developed a method, Pfeature, for computing a wide range of protein features. Pfeature allows users to compute more than 200,000 features required for predicting the overall function of a protein, residue-level annotation of a protein, and function of chemically modified peptides. It has six major modules, namely, composition, binary profiles, evolutionary information, structural features, patterns, and model building. The composition module computes most of the existing compositional features, plus novel features. The binary profile of amino acid sequences allows users to compute the fraction of each type of residue as well as its position. The evolutionary information module allows users to compute evolutionary information of a protein in the form of a position-specific scoring matrix profile generated using Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), suitable for annotation of a protein and its residues. A structural module was developed for computing structural features/descriptors from the tertiary structure of a protein. These features are suitable to predict the therapeutic potential of a protein containing non-natural or chemically modified residues. The model-building module allows users to implement various machine learning techniques for developing classification and regression models as well as feature selection. Pfeature also allows the generation of overlapping patterns and features from a protein. A user-friendly Pfeature is available as a web server, Python library, and stand-alone package.
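Two of the feature families named in this abstract, composition and binary profiles, are simple enough to sketch in a few lines. The code below is an illustration written from the general definitions, not the Pfeature API: the function names are hypothetical, composition is returned as fractions (some tools report percentages), and residues outside the 20 standard amino acids are ignored in the composition and encoded as all-zero rows in the profile.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Fraction of each standard amino acid in the sequence (20 values)."""
    seq = sequence.upper()
    if not seq:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def binary_profile(sequence):
    """One-hot row per residue, so residue identity and position are both kept."""
    return [[1 if residue == aa else 0 for aa in AMINO_ACIDS]
            for residue in sequence.upper()]


composition = amino_acid_composition("AACD")
profile = binary_profile("AC")
print(composition["A"], len(profile), len(profile[0]))
```

Composition discards residue order and always yields a fixed-length vector, whereas the binary profile grows with sequence length, which is why it is typically applied to fixed-size windows or terminal segments.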
Affiliation(s)
- Akshara Pande
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Sumeet Patiyal
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Anjali Lathwal
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Chakit Arora
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Dilraj Kaur
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Gaurav Mishra
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India; Department of Electrical Engineering, Shiv Nadar University, Greater Noida, India
- Harpreet Kaur
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India; Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Neelam Sharma
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Shipra Jain
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
| | - Salman Sadullah Usmani
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.,Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Piyush Agrawal
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.,Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Rajesh Kumar
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Vinod Kumar
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
|
38
|
Preeti P, Nath SK, Arambam N, Sharma T, Choudhury PR, Choudhury A, Khanna V, Strych U, Hotez PJ, Bottazzi ME, Rawal K. Vaxi-DL: An Artificial Intelligence-Enabled Platform for Vaccine Development. Methods Mol Biol 2023; 2673:305-316. [PMID: 37258923 DOI: 10.1007/978-1-0716-3239-0_21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
Vaccine development is a long and complex process. It involves several steps, including computational studies, experimental analyses, animal model system studies, and clinical trials. This process can be accelerated by using in silico antigen screening to identify potential vaccine candidates. In this chapter, we describe a deep learning-based technique that uses 18 biological and 9154 physicochemical properties of proteins to find potential vaccine candidates. Using this technique, a new web-based system named Vaxi-DL was developed, which helped in finding new vaccine candidates from bacteria, protozoa, viruses, and fungi. Vaxi-DL is available at https://vac.kamalrawal.in/vaxidl/.
Affiliation(s)
- P Preeti
- Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
| | - Swarsat Kaushik Nath
- Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
| | - Nevidita Arambam
- Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
| | - Trapti Sharma
- Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
| | - Priyanka Ray Choudhury
- Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
| | - Alakto Choudhury
- Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
| | - Vrinda Khanna
- Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
| | - Ulrich Strych
- Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
- Texas Children's Hospital Center for Vaccine Development, Houston, TX, USA
| | - Peter J Hotez
- Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
- Texas Children's Hospital Center for Vaccine Development, Houston, TX, USA
- Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, USA
- Department of Biology, Baylor University, Waco, TX, USA
| | - Maria Elena Bottazzi
- Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
- Texas Children's Hospital Center for Vaccine Development, Houston, TX, USA
- Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, USA
- Department of Biology, Baylor University, Waco, TX, USA
| | - Kamal Rawal
- Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India.
|
39
|
Ebrahimi Tarki F, Zarrabi M, Abdiali A, Sharbatdar M. Integration of Machine Learning and Structural Analysis for Predicting Peptide Antibiofilm Effects: Advancements in Drug Discovery for Biofilm-Related Infections. Iranian Journal of Pharmaceutical Research: IJPR 2023; 22:e138704. [PMID: 38450220 PMCID: PMC10916117 DOI: 10.5812/ijpr-138704] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/04/2023] [Revised: 08/22/2023] [Accepted: 08/26/2023] [Indexed: 03/08/2024]
Abstract
Background: The rise of antibiotic resistance has become a major concern, signaling the end of the golden age of antibiotics. Bacterial biofilms, which exhibit high resistance to antibiotics, significantly contribute to the emergence of antibiotic resistance. Therefore, there is an urgent need to discover new therapeutic agents with specific characteristics to effectively combat biofilm-related infections. Studies have shown the promising potential of peptides as antimicrobial agents.
Objectives: This study aimed to establish a cost-effective and streamlined computational method for predicting the antibiofilm effects of peptides. This method can assist in designing peptides with strong antibiofilm properties, a task that can be both challenging and costly.
Methods: A positive library, consisting of peptide sequences with antibiofilm activity exceeding 50%, was assembled, along with a negative library containing quorum-sensing peptides. For each peptide sequence, feature vectors were calculated, considering the primary structure, the order of amino acids, their physicochemical properties, and their distributions. Multiple supervised learning algorithms were used to classify peptides with significant antibiofilm effects for subsequent experimental evaluations.
Results: The computational approach exhibited high accuracy in predicting the antibiofilm effects of peptides, with accuracy, precision, Matthews correlation coefficient (MCC), and F1 score of 99%, 99%, 0.97, and 0.99, respectively. The performance level of this computational approach was comparable to that of previous methods. This study introduced a novel approach by combining the feature space with high antibiofilm activity.
Conclusions: In this study, a reliable and cost-effective method was developed for predicting the antibiofilm effects of peptides using a computational approach. This approach allows for the identification of peptide sequences with substantial antibiofilm activities for further experimental investigations. Accessible source codes and raw data of this study can be found online (hiABF), providing easy access and enabling future updates.
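The four metrics reported in this abstract (accuracy, precision, MCC, and F1 score) can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, F1, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it stays informative on imbalanced data.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "f1": f1, "mcc": mcc}


# Hypothetical counts chosen for illustration only.
m = binary_metrics(tp=45, fp=5, tn=40, fn=10)
perfect = binary_metrics(tp=10, fp=0, tn=10, fn=0)
```

Accuracy, precision, and F1 range from 0 to 1 (often reported as percentages), while MCC ranges from -1 to 1, which is why the abstract reports it on a different scale.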
Affiliation(s)
- Fatemeh Ebrahimi Tarki
- Department of Biotechnology, Faculty of Biological Sciences, Alzahra University, Tehran, Iran
| | - Mahboobeh Zarrabi
- Department of Biotechnology, Faculty of Biological Sciences, Alzahra University, Tehran, Iran
| | - Ahya Abdiali
- Department of Microbiology, Faculty of Biological Sciences, Alzahra University, Tehran, Iran
| | - Mahkame Sharbatdar
- Department of Mechanical Engineering, Khajeh Nasir Toosi University of Technology, Tehran, Iran
|
40
|
Guevara-Barrientos D, Kaundal R. ProFeatX: A parallelized protein feature extraction suite for machine learning. Comput Struct Biotechnol J 2022; 21:796-801. [PMID: 36698978 PMCID: PMC9842958 DOI: 10.1016/j.csbj.2022.12.044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2022] [Revised: 12/26/2022] [Accepted: 12/27/2022] [Indexed: 12/31/2022] Open
Abstract
Machine learning algorithms have been successfully applied in proteomics, genomics, and transcriptomics, and have helped the biological community to answer complex questions. However, most machine learning methods require large amounts of data, with every data point represented by a vector of the same size. Biological sequence data such as proteins are amino acid sequences of variable length, so a fixed number of features must be extracted from every protein before it can be used as input to machine learning models. There are numerous methods to achieve this, but only a few tools let researchers encode their proteins using multiple schemes without having to use different programs or, in many cases, code these algorithms themselves, or even come up with new algorithms. In this work, we created ProFeatX, a tool that contains 50 encodings to extract protein features quickly and efficiently, supporting desktop as well as high-performance computing environments. It can also encode concatenated features for protein-protein interactions. The tool has an easy-to-use web interface, allowing non-experts to use feature extraction techniques, as well as a stand-alone version for advanced users. ProFeatX is implemented in C++ and available on GitHub at https://github.com/usubioinfo/profeatx. The web server is available at http://bioinfo.usu.edu/profeatx/.
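The fixed-length requirement described in this abstract can be illustrated with dipeptide composition, one common encoding that maps a protein of any length to a 400-dimensional vector. This is a sketch in Python rather than ProFeatX's C++ code, and it is not claimed to match any of the tool's 50 encodings exactly; `dipeptide_composition` and `encode_pair` are hypothetical names, and skipping windows with non-standard residues is an assumption of this sketch.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(seq):
    """Map a sequence of any length to a 400-dimensional frequency vector."""
    seq = seq.upper()
    windows = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counted = [w for w in windows if w in INDEX]  # skip non-standard residues
    vec = [0.0] * len(DIPEPTIDES)
    for w in counted:
        vec[INDEX[w]] += 1.0
    if counted:
        vec = [v / len(counted) for v in vec]
    return vec


def encode_pair(seq_a, seq_b):
    """Concatenate two encodings, e.g. for a protein-protein interaction pair."""
    return dipeptide_composition(seq_a) + dipeptide_composition(seq_b)


# Hypothetical toy sequences of different lengths.
short = dipeptide_composition("ACAC")
longer = dipeptide_composition("MKTAYIAKQRQISFVKSHFSRQ")
pair = encode_pair("ACAC", "MKTAYIAKQR")
```

Both sequences yield vectors of identical length despite differing in size, and the concatenated pair encoding is simply twice as long, so either can be fed to a model that expects fixed-size input.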
Affiliation(s)
- David Guevara-Barrientos
- Department of Computer Science, College of Science, Utah State University, Logan, UT, USA
- Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT, USA
| | - Rakesh Kaundal
- Department of Computer Science, College of Science, Utah State University, Logan, UT, USA
- Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT, USA
- Department of Plants, Soils, and Climate, College of Agriculture and Applied Sciences, Utah State University, Logan, UT, USA
|
41
|
Bi Y, Li F, Guo X, Wang Z, Pan T, Guo Y, Webb GI, Yao J, Jia C, Song J. Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations. Brief Bioinform 2022; 23:bbac467. [PMID: 36341591 PMCID: PMC10148739 DOI: 10.1093/bib/bbac467] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Revised: 09/09/2022] [Accepted: 09/29/2022] [Indexed: 11/09/2022] Open
Abstract
Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Although several computational methods have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated that Clarion achieved competitive predictive performance and that the weighted series method plays a crucial role in securing the superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can secure accuracy of 81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23 and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus and ribosome, respectively. The webserver and local stand-alone tool of Clarion are freely available at http://monash.bioweb.cloud.edu.au/Clarion/.
Affiliation(s)
- Yue Bi
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
| | - Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia
| | - Xudong Guo
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Zhikang Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Tong Pan
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Yuming Guo
- Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria 3004, Australia
| | - Geoffrey I Webb
- Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
| | | | - Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, China
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
|
42
|
RNADSN: Transfer-Learning 5-Methyluridine (m5U) Modification on mRNAs from Common Features of tRNA. Int J Mol Sci 2022; 23:ijms232113493. [PMID: 36362279 PMCID: PMC9655583 DOI: 10.3390/ijms232113493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2022] [Revised: 09/24/2022] [Accepted: 09/29/2022] [Indexed: 11/06/2022] Open
Abstract
One of the most abundant non-canonical bases widely occurring on various RNA molecules is 5-methyluridine (m5U). Recent studies have revealed its influence on the development of breast cancer, systemic lupus erythematosus, and the regulation of stress responses. The accurate identification of m5U sites is crucial for understanding their biological functions. We propose RNADSN, the first transfer learning deep neural network that learns common features between tRNA m5U and mRNA m5U to enhance the prediction of mRNA m5U. Without seeing the experimentally detected mRNA m5U sites, RNADSN already outperformed the state-of-the-art method, m5UPred. Using mRNA m5U classification as an additional layer of supervision, our model achieved another distinct improvement and presented an average area under the receiver operating characteristic curve (AUC) of 0.9422 and an average precision (AP) of 0.7855. The robust performance of RNADSN was also verified by cross-technical and cross-cellular validation. The interpretation of RNADSN also revealed the sequence motif of common features. Therefore, RNADSN should be a useful tool for studying m5U modification.
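The two ranking metrics quoted in this abstract, AUC and average precision (AP), can be computed directly from labels and predicted scores. Below is a minimal pure-Python sketch under standard definitions; the function names and the toy data are hypothetical and unrelated to the RNADSN evaluation, and the AP function assumes no tied scores.

```python
def roc_auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum((1.0 if p > n else 0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at each rank where a positive appears."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Hypothetical toy predictions for six candidate sites.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

AUC is insensitive to class balance, whereas AP depends on the proportion of positives, so an AP that is noticeably lower than the AUC, as in the abstract, is common when modified sites are rare.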
|