1
Hong S, Chattaraj KG, Guo J, Trout BL, Braatz RD. Enhanced O-glycosylation site prediction using explainable machine learning technique with spatial local environment. Bioinformatics 2025; 41:btaf034. [PMID: 39878910 PMCID: PMC11814488 DOI: 10.1093/bioinformatics/btaf034] [Received: 07/03/2024] [Revised: 11/29/2024] [Accepted: 01/26/2025] [Indexed: 01/31/2025] [Open access]
Abstract
MOTIVATION: The accurate prediction of O-GlcNAcylation sites is crucial for understanding disease mechanisms and developing effective treatments. Previous machine learning (ML) models primarily relied on primary or secondary protein structural and related properties, which have limitations in capturing the spatial interactions of neighboring amino acids. This study introduces local environmental features as a novel approach that incorporates three-dimensional spatial information, significantly improving model performance by considering the spatial context around the target site. Additionally, we utilize sparse recurrent neural networks to effectively capture the sequential nature of proteins and to identify key factors influencing O-GlcNAcylation as an explainable ML model. RESULTS: Our findings demonstrate the effectiveness of our proposed features, with the model achieving an F1 score of 28.3%, as well as its feature selection capability, with the model using only the top 20% of features achieving the highest F1 score of 32.02%, a 1.4-fold improvement over existing PTM models. Statistical analysis of the top 20 features confirmed their consistency with the literature. This method not only boosts prediction accuracy but also paves the way for further research in understanding and targeting O-GlcNAcylation. AVAILABILITY AND IMPLEMENTATION: The code, data, and features used in this study are available in the GitHub repository: https://github.com/pseokyoung/o-glcnac-prediction.
Affiliation(s)
- Seokyoung Hong
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
- Krishna Gopal Chattaraj
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
- Jing Guo
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
- Bernhardt L Trout
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
- Richard D Braatz
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
2
Hou C, Li W, Li Y, Ma J. O-GlcNAc informatics: advances and trends. Anal Bioanal Chem 2025; 417:895-905. [PMID: 39294469 DOI: 10.1007/s00216-024-05531-2] [Received: 07/24/2024] [Revised: 08/29/2024] [Accepted: 09/03/2024] [Indexed: 09/20/2024]
Abstract
As a post-translational modification, protein glycosylation is critical in health and disease. O-Linked β-N-acetylglucosamine (O-GlcNAc) modification (O-GlcNAcylation), as an intracellular monosaccharide modification on proteins, was discovered 40 years ago. Thanks to technological advances, the physiological and pathological significance of O-GlcNAcylation has been gradually revealed and widely appreciated, especially in recent years. O-GlcNAc informatics has been quickly evolving. Clearly, O-GlcNAc informatics tools have not only facilitated O-GlcNAc functional studies, but also provided us with a unique perspective on protein O-GlcNAcylation. In this article, we review O-GlcNAc-focused software tools and servers that have been developed for O-GlcNAc research over the past four decades. Specifically, we will (1) survey bioinformatics tools that have facilitated O-GlcNAc proteomics data analysis, (2) introduce databases/servers for O-GlcNAc proteins/sites that have been experimentally identified by individual research labs, (3) describe software tools that have been developed to predict O-GlcNAc sites, and (4) introduce platforms cataloging proteins that interact with the O-GlcNAc cycling enzymes (i.e., O-GlcNAc transferase and O-GlcNAcase). We hope these resources will provide useful information to both experienced researchers and newcomers to the O-GlcNAc field. We anticipate that this review provides a framework to stimulate the future development of more sophisticated informatic tools for O-GlcNAc research.
Affiliation(s)
- Chunyan Hou
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC, 20007, USA
- Weiyu Li
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC, 20007, USA
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA
- Yaoxiang Li
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC, 20007, USA
- Junfeng Ma
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC, 20007, USA
3
Khalid A, Kaleem A, Qazi W, Abdullah R, Iqtedar M, Naz S. Site-specific prediction of O-GlcNAc modification in proteins using evolutionary scale model. PLoS One 2024; 19:e0316215. [PMID: 39739642 DOI: 10.1371/journal.pone.0316215] [Received: 04/22/2024] [Accepted: 12/07/2024] [Indexed: 01/02/2025] [Open access]
Abstract
Protein glycosylation, a vital post-translational modification, is pivotal in various biological processes and disease pathogenesis. Computational approaches, including protein language models and machine learning algorithms, have emerged as valuable tools for predicting O-GlcNAc sites, reducing experimental costs, and enhancing efficiency. However, the literature has not reported the prediction of O-GlcNAc sites through the evolutionary scale model (ESM). Therefore, this study employed the ESM-2 model for O-GlcNAc site prediction in humans. Approximately 1100 O-linked glycoprotein sequences retrieved from the O-GlcNAc database were utilized for model training. The ESM-2 model exhibited consistent improvement over epochs, achieving an accuracy of 78.30%, recall of 78.30%, precision of 61.31%, and F1-score of 68.74%. In contrast to traditional models, which showed overfitting on the same data (up to 99%), the ESM-2 model performed better in terms of balanced training and testing predictions. These findings underscore the effectiveness of the ESM-2 model in accurately predicting O-GlcNAc sites within human proteins. Accurately predicting O-GlcNAc sites within human proteins can significantly advance glycoproteomic research by enhancing our understanding of protein function and disease mechanisms, aiding in developing targeted therapies, and facilitating biomarker discovery for improved diagnosis and treatment. Furthermore, future studies should focus on more diverse data types, longer protein sequence lengths, and higher computational resources to evaluate various parameters. Accurate prediction of O-GlcNAc sites might enhance the investigation of the site-specific functions of proteins in physiology and diseases.
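As a reading aid for the metrics quoted in this abstract, the sketch below computes an F1-score as the plain harmonic mean of precision and recall. The function name `f1_score` is illustrative and not taken from the paper. Applied to the reported precision (61.31%) and recall (78.30%), it gives about 68.77%, close to but not identical to the reported 68.74%. The abstract does not say how the metrics were averaged, so the small gap is left unexplained here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract, converted from percentages to fractions.
f1 = f1_score(precision=0.6131, recall=0.7830)
print(f"{f1:.4f}")  # prints 0.6877
```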
Affiliation(s)
- Ayesha Khalid
  - Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan
- Afshan Kaleem
  - Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan
- Wajahat Qazi
  - Department of Computer Science, COMSATS University, Islamabad, Pakistan
- Roheena Abdullah
  - Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan
- Mehwish Iqtedar
  - Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan
- Shagufta Naz
  - Department of Zoology, Lahore College for Women University, Lahore, Pakistan
4
Yom A, Chiang A, Lewis NE. Boltzmann Model Predicts Glycan Structures from Lectin Binding. Anal Chem 2024; 96:8332-8341. [PMID: 38720429 PMCID: PMC11162346 DOI: 10.1021/acs.analchem.3c04992] [Indexed: 05/20/2024]
Abstract
Glycans are complex oligosaccharides that are involved in many diseases and biological processes. Unfortunately, current methods for determining glycan composition and structure (glycan sequencing) are laborious and require a high level of expertise. Here, we assess the feasibility of sequencing glycans based on their lectin binding fingerprints. By training a Boltzmann model on lectin binding data, we predict the approximate structures of 88 ± 7% of N-glycans and 87 ± 13% of O-glycans in our test set. We show that our model generalizes well to the pharmaceutically relevant case of Chinese hamster ovary (CHO) cell glycans. We also analyze the motif specificity of a wide array of lectins and identify the most and least predictive lectins and glycan features. These results could help streamline glycoprotein research and be of use to anyone using lectins for glycobiology.
Affiliation(s)
- Aria Yom
  - Department of Physics, University of California, San Diego, California 92093, United States
- Austin Chiang
  - Department of Pediatrics, University of California, San Diego, California 92093, United States
  - Immunology Center of Georgia, Augusta University, Augusta, Georgia 30912, United States
  - Department of Medicine, Augusta University, Augusta, Georgia 30912, United States
- Nathan E Lewis
  - Department of Pediatrics, University of California, San Diego, California 92093, United States
  - Department of Bioengineering, University of California, San Diego, California 92093, United States
5
Wang SH, Zhao Y, Wang CC, Chu F, Miao LY, Zhang L, Zhuo L, Chen X. RFEM: A framework for essential microRNA identification in mice based on rotation forest and multiple feature fusion. Comput Biol Med 2024; 171:108177. [PMID: 38422957 DOI: 10.1016/j.compbiomed.2024.108177] [Received: 11/20/2023] [Revised: 01/21/2024] [Accepted: 02/18/2024] [Indexed: 03/02/2024]
Abstract
With the increasing number of microRNAs (miRNAs), identifying essential miRNAs has become an important task that needs to be solved urgently. However, there are few computational methods for essential miRNA identification. Here, we proposed a novel framework called Rotation Forest for Essential MicroRNA identification (RFEM) to predict the essentiality of miRNAs in mice. We first constructed 1,264 miRNA features of all miRNA samples by fusing 38 miRNA features obtained from the PESM paper and 1,226 miRNA functional features calculated based on miRNA-target gene interactions. Then, we employed 182 training samples with 1,264 features to train the rotation forest model, which was applied to compute the essentiality scores of the candidate samples. The main innovations of RFEM were as follows: 1) miRNA functional features were introduced to enrich the diversity of miRNA features; 2) the rotation forest model used a decision tree as the base classifier and could increase the difference among base classifiers through feature transformation to achieve better ensemble results. Experimental results show that RFEM significantly outperformed two previous models with an AUC (AUPR) of 0.942 (0.944) in three comparison experiments under 5-fold cross validation, which proved the model's reliable performance. Moreover, an ablation study was further conducted to demonstrate the effectiveness of the novel miRNA functional features. Additionally, in the case studies of assessing the essentiality of unlabeled miRNAs, experimental literature confirmed that 7 of the top 10 predicted miRNAs have crucial biological functions in mice. Therefore, RFEM would be a reliable tool for identifying essential miRNAs.
Affiliation(s)
- Shu-Hao Wang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China; Artificial Intelligence Research Institute, China University of Mining and Technology, Xuzhou, 221116, China
- Yan Zhao
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
- Chun-Chun Wang
  - School of Science, Jiangnan University, Wuxi, 214122, China
- Fei Chu
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China; Artificial Intelligence Research Institute, China University of Mining and Technology, Xuzhou, 221116, China
- Lian-Ying Miao
  - School of Mathematics, China University of Mining and Technology, Xuzhou, 221116, China
- Li Zhang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 221116, China
- Linlin Zhuo
  - School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou, 325000, China
- Xing Chen
  - School of Science, Jiangnan University, Wuxi, 214122, China
6
Zuo Y, Zhang J, He W, Liu X, Deng Z. CarSitePred: an integrated algorithm for identifying carbonylated sites based on KNDUA-LNDOT resampling technique. J Biomol Struct Dyn 2024:1-13. [PMID: 38334134 DOI: 10.1080/07391102.2024.2313712] [Received: 10/25/2023] [Accepted: 01/27/2024] [Indexed: 02/10/2024]
Abstract
Carbonylated sites are the determining factors for functional changes or deletions in carbonylated proteins, so identifying carbonylated sites is essential for understanding the process of protein carbonylation and exploring the pathogenesis of related diseases. The current wet experimental methods for predicting carbonylated modification sites are not only expensive and time-consuming, but also have limited protein processing capabilities and cannot meet the needs of researchers. The identification of carbonylated sites using computational methods not only improves the functional characterization of proteins, but also provides researchers with free tools for predicting carbonylated sites. Therefore, it is essential to establish a model using computational methods that can accurately predict protein carbonylated sites. In this study, a prediction model, CarSitePred, is proposed to identify carbonylation sites. In CarSitePred, specific location amino acid hydrophobic hydrophilic, one-to-one numerical conversion of amino acids, and AlexNet convolutional neural networks convert preprocessed carbonylated sequences into valid numerical features. The K-means Normal Distribution-based Undersampling Algorithm (KNDUA) and Localized Normal Distribution Oversampling Technology (LNDOT) are proposed for the first time and employed to balance the K, P, R and T carbonylation training datasets. Also for the first time, carbonylation modification sites were transformed into images and directly input into the AlexNet convolutional neural network to extract features for fitting SVM classifiers. The 10-fold cross-validation and independent testing results show that CarSitePred achieves better prediction performance than the best currently available prediction models. Availability: https://github.com/zuoyun123/CarSitePred.
Affiliation(s)
- Yun Zuo
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Jingrun Zhang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Wenying He
  - School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
- Xiangrong Liu
  - Department of Computer Science, Xiamen University, Xiamen, China
- Zhaohong Deng
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
7
Hu F, Li W, Li Y, Hou C, Ma J, Jia C. O-GlcNAcPRED-DL: Prediction of Protein O-GlcNAcylation Sites Based on an Ensemble Model of Deep Learning. J Proteome Res 2024; 23:95-106. [PMID: 38054441 DOI: 10.1021/acs.jproteome.3c00458] [Indexed: 12/07/2023]
Abstract
O-linked β-N-acetylglucosamine (O-GlcNAc) is a post-translational modification (i.e., O-GlcNAcylation) on serine/threonine residues of proteins, regulating a plethora of physiological and pathological events. As a dynamic process, O-GlcNAc functions in a site-specific manner. However, the experimental identification of the O-GlcNAc sites remains challenging in many scenarios. Herein, by leveraging the recent progress in cataloguing experimentally identified O-GlcNAc sites and advanced deep learning approaches, we establish an ensemble model, O-GlcNAcPRED-DL, a deep learning-based tool, for the prediction of O-GlcNAc sites. In brief, to make a benchmark O-GlcNAc data set, we extracted the information on O-GlcNAc from the recently constructed database O-GlcNAcAtlas, which contains thousands of experimentally identified and curated O-GlcNAc sites on proteins from multiple species. To overcome the imbalance between positive and negative data sets, we selected five groups of negative data sets in humans and mice to construct an ensemble predictor based on connection of a convolutional neural network and bidirectional long short-term memory. By taking into account three types of sequence information, we constructed four network frameworks, with the systematically optimized parameters used for the models. The thorough comparison analysis on two independent data sets of humans and mice and six independent data sets from other species demonstrated remarkably increased sensitivity and accuracy of the O-GlcNAcPRED-DL models, outperforming other existing tools. Moreover, a user-friendly Web server for O-GlcNAcPRED-DL has been constructed, which is freely available at http://oglcnac.org/pred_dl.
Affiliation(s)
- Fengzhu Hu
  - School of Science, Dalian Maritime University, Dalian 116026, China
- Weiyu Li
  - Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
- Yaoxiang Li
  - Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
- Chunyan Hou
  - Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
- Junfeng Ma
  - Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
- Cangzhi Jia
  - School of Science, Dalian Maritime University, Dalian 116026, China
8
Pokharel S, Pratyush P, Ismail HD, Ma J, KC DB. Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction. Int J Mol Sci 2023; 24:16000. [PMID: 37958983 PMCID: PMC10650050 DOI: 10.3390/ijms242116000] [Received: 08/17/2023] [Revised: 11/02/2023] [Accepted: 11/04/2023] [Indexed: 11/15/2023] [Open access]
Abstract
O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite the great progress in experimentally mapping O-GlcNAc sites, there is an unmet need to develop robust prediction tools that can effectively locate the presence of O-GlcNAc sites in protein sequences of interest. In this work, we performed a comprehensive evaluation of a framework for prediction of protein O-GlcNAc sites using embeddings from pre-trained protein language models. In particular, we compared the performance of three protein sequence-based large protein language models (pLMs), Ankh, ESM-2, and ProtT5, for prediction of O-GlcNAc sites and also evaluated various ensemble strategies to integrate embeddings from these protein language models. Upon investigation, the decision-level fusion approach that integrates the decisions of the three embedding models, which we call LM-OGlcNAc-Site, outperformed the models trained on these individual language models as well as other fusion approaches and other existing predictors in almost all of the parameters evaluated. The precise prediction of O-GlcNAc sites will facilitate the probing of O-GlcNAc site-specific functions of proteins in physiology and diseases. Moreover, these findings also indicate the effectiveness of combined uses of multiple protein language models in post-translational modification prediction and open exciting avenues for further research and exploration in other protein downstream tasks. LM-OGlcNAc-Site's web server and source code are publicly available to the community.
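To illustrate what "decision-level fusion" means in general, the sketch below combines the per-site probabilities produced by three separate classifiers (for example, one per language-model embedding) into a single decision. The abstract does not state the fusion rule used by LM-OGlcNAc-Site, so the simple averaging, the 0.5 threshold, and the name `fuse_decisions` are all assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def fuse_decisions(probabilities: Sequence[float], threshold: float = 0.5) -> Tuple[float, bool]:
    """Average per-model probabilities for one candidate site and threshold the mean.

    `probabilities` holds one positive-class probability per model.
    Averaging is an assumed fusion rule, not necessarily the one used in the paper.
    """
    if not probabilities:
        raise ValueError("at least one model probability is required")
    mean_probability = sum(probabilities) / len(probabilities)
    return mean_probability, mean_probability >= threshold


# Hypothetical outputs of three models for the same serine/threonine site.
mean_p, is_site = fuse_decisions([0.9, 0.6, 0.3])
print(round(mean_p, 3), is_site)
```

Other decision-level rules, such as majority voting over the individual predictions, fit the same interface.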
Affiliation(s)
- Suresh Pokharel
  - Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA
- Pawel Pratyush
  - Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA
- Hamid D. Ismail
  - Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA
- Junfeng Ma
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Georgetown University, Washington, DC 20057, USA
- Dukka B. KC
  - Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA
9
Kumari S, Gupta R, Ambasta RK, Kumar P. Emerging trends in post-translational modification: Shedding light on Glioblastoma multiforme. Biochim Biophys Acta Rev Cancer 2023; 1878:188999. [PMID: 37858622 DOI: 10.1016/j.bbcan.2023.188999] [Received: 06/30/2023] [Revised: 10/06/2023] [Accepted: 10/06/2023] [Indexed: 10/21/2023]
Abstract
Recent multi-omics studies, including proteomics, transcriptomics, genomics, and metabolomics have revealed the critical role of post-translational modifications (PTMs) in the progression and pathogenesis of Glioblastoma multiforme (GBM). Further, PTMs alter the oncogenic signaling events and offer a novel avenue in GBM therapeutics research through PTM enzymes as potential biomarkers for drug targeting. In addition, PTMs are critical regulators of chromatin architecture, gene expression, and tumor microenvironment (TME), that play a crucial function in tumorigenesis. Moreover, the implementation of artificial intelligence and machine learning algorithms enhances GBM therapeutics research through the identification of novel PTM enzymes and residues. Herein, we briefly explain the mechanism of protein modifications in GBM etiology, and in altering the biologics of GBM cells through chromatin remodeling, modulation of the TME, and signaling pathways. In addition, we highlighted the importance of PTM enzymes as therapeutic biomarkers and the role of artificial intelligence and machine learning in protein PTM prediction.
Affiliation(s)
- Smita Kumari
  - Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India
- Rohan Gupta
  - Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India; School of Medicine, University of South Carolina, Columbia, SC, United States of America
- Rashmi K Ambasta
  - Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India; Department of Biotechnology and Microbiology, SRM University, Sonepat, Haryana, India
- Pravir Kumar
  - Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University, India
10
Chen XG, Yang X, Li C, Lin X, Zhang W. Non-coding RNA identification with pseudo RNA sequences and feature representation learning. Comput Biol Med 2023; 165:107355. [PMID: 37639767 DOI: 10.1016/j.compbiomed.2023.107355] [Received: 05/29/2023] [Revised: 07/16/2023] [Accepted: 08/12/2023] [Indexed: 08/31/2023]
Abstract
Distinguishing non-coding RNAs (ncRNAs) from coding RNAs is very important in bioinformatics. Although many methods have been proposed for solving this task, it remains highly challenging to further improve the accuracy of ncRNA identification. In this paper, we propose a coding potential predictor using feature representation learning based on pseudo RNA sequences named CPPFLPS. In this method, we use the pseudo RNA sequences generated by simulating RNA sequence mutations as new samples for data augmentation, and six string operations simulating RNA sequence mutations are considered: base replacement, base insertion, base deletion, subsequence reversion, subsequence repetition and subsequence deletion. In the feature representation learning framework, different types of pseudo RNA sequences are added to the training set to form new training sets that can be used to train baseline classifiers, thus obtaining baseline models. The resulting labels of these baseline models are used as feature vectors to represent RNA sequences, and the resulting feature vectors acquired after feature selection are used to train a predictive model for distinguishing ncRNAs from coding RNAs. Our method achieves better performance compared with that of existing state-of-the-art methods. The implementation of the proposed method is available at https://github.com/chenxgscuec/CPPFLPS.
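The six string operations listed in this abstract can be sketched as follows. This is a minimal illustration, not the CPPFLPS implementation: the function name `mutate`, the operation labels, the random choice of position, the maximum subsequence length, and the reading of "reversion" as reversing a subsequence are all assumptions.

```python
import random

BASES = "ACGU"
OPERATIONS = (
    "base_replacement", "base_insertion", "base_deletion",
    "subsequence_reversion", "subsequence_repetition", "subsequence_deletion",
)


def mutate(seq: str, op: str, rng: random.Random, max_len: int = 3) -> str:
    """Return a pseudo RNA sequence obtained by applying one operation to `seq`."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    i = rng.randrange(len(seq))                      # start position
    j = min(len(seq), i + rng.randint(1, max_len))   # end of subsequence seq[i:j]
    if op == "base_replacement":
        new_base = rng.choice([b for b in BASES if b != seq[i]])
        return seq[:i] + new_base + seq[i + 1:]
    if op == "base_insertion":
        return seq[:i] + rng.choice(BASES) + seq[i:]
    if op == "base_deletion":
        return seq[:i] + seq[i + 1:]
    if op == "subsequence_reversion":
        return seq[:i] + seq[i:j][::-1] + seq[j:]
    if op == "subsequence_repetition":
        return seq[:j] + seq[i:j] + seq[j:]
    if op == "subsequence_deletion":
        return seq[:i] + seq[j:]
    raise ValueError(f"unknown operation: {op}")


# Build one pseudo sequence per operation from a toy RNA sequence.
rng = random.Random(0)
original = "AUGGCUACGUAA"
pseudo = {op: mutate(original, op, rng) for op in OPERATIONS}
for op, new_seq in pseudo.items():
    print(f"{op:24s} {new_seq}")
```

In a data-augmentation setting, each pseudo sequence would inherit the label of the sequence it was derived from before being added to a training set.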
Affiliation(s)
- Xian-Gan Chen
  - School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science (South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China
- Xiaofei Yang
  - School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science (South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China
- Chenhong Li
  - School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science (South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China
- Xianguang Lin
  - School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science (South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China
- Wen Zhang
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
11
Wang C, Yang Q. ScerePhoSite: An interpretable method for identifying fungal phosphorylation sites in proteins using sequence-based features. Comput Biol Med 2023; 158:106798. [PMID: 36966555 DOI: 10.1016/j.compbiomed.2023.106798] [Received: 02/03/2023] [Revised: 03/03/2023] [Accepted: 03/20/2023] [Indexed: 03/31/2023]
Abstract
Protein phosphorylation plays a vital role in signal transduction pathways and diverse cellular processes. To date, a tremendous number of in silico tools have been designed for phosphorylation site identification, but few of them are suitable for the identification of fungal phosphorylation sites. This largely hampers the functional investigation of fungal phosphorylation. In this paper, we present ScerePhoSite, a machine learning method for fungal phosphorylation site identification. The sequence fragments are represented by hybrid physicochemical features, and then LGB-based feature importance combined with the sequential forward search method is used to choose the optimal feature subset. As a result, ScerePhoSite surpasses currently available tools and shows a more robust and balanced performance. Furthermore, the impact and contribution of specific features on the model performance were investigated by SHAP values. We expect ScerePhoSite to be a useful bioinformatics tool that complements hands-on experiments for the pre-screening of possible phosphorylation sites and facilitates our functional understanding of phosphorylation modification in fungi. The source code and datasets are accessible at https://github.com/wangchao-malab/ScerePhoSite/.
12
Ding Y, He W, Tang J, Zou Q, Guo F. Laplacian Regularized Sparse Representation Based Classifier for Identifying DNA N4-Methylcytosine Sites via L2,1/2-Matrix Norm. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:500-511. [PMID: 34882559 DOI: 10.1109/tcbb.2021.3133309] [Indexed: 05/04/2023]
Abstract
N4-methylcytosine (4mC) is one of the important epigenetic modifications in DNA sequences. Detecting 4mC sites is time-consuming. Computational methods based on machine learning have provided effective help for identifying 4mC. To further improve the performance of prediction, we propose a Laplacian Regularized Sparse Representation based Classifier with L2,1/2-matrix norm (LapRSRC). We also utilize the kernel trick to derive the kernel LapRSRC for nonlinear modeling. Matrix factorization technology is employed to solve the sparse representation coefficients of all test samples in the training set, and an efficient iterative algorithm is proposed to solve the objective function. We implement our model on six benchmark datasets of 4mC and eight UCI datasets to evaluate performance. The results show that the performance of our method is better or comparable.
13
He Z, Lin Y, Wei R, Liu C, Jiang D. Repulsion and attraction in searching: A hybrid algorithm based on gravitational kernel and vital few for cancer driver gene prediction. Comput Biol Med 2022; 151:106236. [PMID: 36370584 DOI: 10.1016/j.compbiomed.2022.106236] [Received: 07/26/2022] [Revised: 10/15/2022] [Accepted: 10/22/2022] [Indexed: 12/27/2022]
Abstract
By taking a new perspective that combines a machine learning method with an evolutionary algorithm, a new hybrid algorithm is developed to predict cancer driver genes. Firstly, inspired by the global search capability of evolutionary algorithms, a gravitational kernel is proposed to act on the full range of gene features. Constructed by fusing PPI and mutation features, the gravitational kernel is capable of producing repulsion effects. The candidate genes with greater mutation effects and PPI have higher similarity scores. Owing to repulsion, the similarity score of these promising genes is larger than that of ordinary genes, which makes them easier to find. Secondly, inspired by the idea of elite populations in evolutionary algorithms, the concept of the vital few is proposed. Targeted at a local scale, it acts on the candidate genes associated with vital few genes. Under the attraction effect, these vital few driver genes attract those with similar mutational effects, which leads to greater similarity scores. Lastly, the model and parameters are optimized by using an evolutionary algorithm, so as to obtain the optimal model and parameters for cancer driver gene prediction. A comparison is performed with six other advanced methods of cancer driver gene prediction. According to the experimental results, the method proposed in this study outperforms these six state-of-the-art algorithms on the pan-oncogene dataset.
Affiliation(s)
- Zhihui He
- Department of Computer Science, Shantou University, 515063, China
| | - Yingqing Lin
- Department of Computer Science, Shantou University, 515063, China
| | - Runguo Wei
- Department of Computer Science, Shantou University, 515063, China
| | - Cheng Liu
- Department of Computer Science, Shantou University, 515063, China
| | - Dazhi Jiang
- Department of Computer Science, Shantou University, 515063, China; Guangdong Provincial Key Laboratory of Information Security Technology, Sun Yat-sen University, Guangzhou 510399, China.
| |
|
14
|
Li J, Guo B, Zhang W, Yue S, Huang S, Gao S, Ma J, Cipollo JF, Yang S. Recent advances in demystifying O-glycosylation in health and disease. Proteomics 2022; 22:e2200156. [PMID: 36088641 DOI: 10.1002/pmic.202200156] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 04/23/2022] [Revised: 08/30/2022] [Accepted: 09/02/2022] [Indexed: 11/09/2022]
Abstract
O-Glycosylation is one of the most common protein post-translational modifications (PTM) and plays an essential role in the pathophysiology of diseases. However, the complexity of O-glycosylation and the lack of specific enzymes for the processing of O-glycans and their O-glycopeptides make O-glycosylation analysis challenging. Recently, research on O-glycosylation has received attention owing to technological innovation and emerging O-glycoproteases. Several serine/threonine endoproteases have been found to specifically cleave O-glycosylated serine or threonine, allowing for the systematic analysis of O-glycoproteins. In this review, we first assessed the field of protein O-glycosylation over the past decade and used bibliometric analysis to identify keywords and emerging trends. We then summarized recent advances in O-glycosylation, covering several aspects: O-glycan release, site-specific elucidation of intact O-glycopeptides, identification of O-glycosites, characterization of different O-glycoproteases, mass spectrometry (MS) fragmentation methods for site-specific O-glycosylation assignment, and O-glycosylation data analysis. Finally, the role of O-glycosylation in health and disease was discussed.
Affiliation(s)
- Jiajia Li
- Center for Clinical Mass Spectrometry, College of Pharmaceutical Sciences, Soochow University, Suzhou, Jiangsu, China
| | - Bo Guo
- Jiangsu Key Laboratory of Marine Pharmaceutical Compound Screening, Jiangsu Key Laboratory of Marine Biological Resources and Environment, Co-Innovation Center of Jiangsu Marine Bio-industry Technology, School of Pharmacy, Jiangsu Ocean University, Lianyungang, China
| | - Wenqi Zhang
- Center for Clinical Mass Spectrometry, College of Pharmaceutical Sciences, Soochow University, Suzhou, Jiangsu, China
| | - Shuang Yue
- Center for Clinical Mass Spectrometry, College of Pharmaceutical Sciences, Soochow University, Suzhou, Jiangsu, China
| | - Shan Huang
- Center for Clinical Mass Spectrometry, College of Pharmaceutical Sciences, Soochow University, Suzhou, Jiangsu, China
| | - Song Gao
- Jiangsu Key Laboratory of Marine Pharmaceutical Compound Screening, Jiangsu Key Laboratory of Marine Biological Resources and Environment, Co-Innovation Center of Jiangsu Marine Bio-industry Technology, School of Pharmacy, Jiangsu Ocean University, Lianyungang, China
| | - Junfeng Ma
- Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Georgetown University, Washington, DC, USA
| | - John F Cipollo
- Laboratory of Bacterial Polysaccharides, Division of Bacterial, Parasitic and Allergenic Products, Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, Maryland, USA
| | - Shuang Yang
- Center for Clinical Mass Spectrometry, College of Pharmaceutical Sciences, Soochow University, Suzhou, Jiangsu, China
| |
|
15
|
Gu X, Ding Y, Xiao P, He T. A GHKNN model based on the physicochemical property extraction method to identify SNARE proteins. Front Genet 2022; 13:935717. [PMID: 36506312 PMCID: PMC9727185 DOI: 10.3389/fgene.2022.935717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Accepted: 11/02/2022] [Indexed: 11/24/2022] Open
Abstract
SNARE proteins are of great importance, and their loss of function can lead to a variety of diseases. SNARE proteins are known as membrane fusion proteins and are crucial for mediating vesicle fusion. An accurate method for identifying SNARE proteins is therefore needed. Through extensive experiments, we have developed a binary classification model based on the graph-regularized k-local hyperplane distance nearest neighbor model (GHKNN). The model uses the physicochemical property extraction method to extract protein sequence features and the SMOTE method to upsample them. The combination achieves the most accurate performance for identifying all protein sequences. Finally, we compare the GHKNN-based binary classification model with other classifiers and measure them using four different metrics: SN, SP, ACC, and MCC. In experiments, the model performs significantly better than other classifiers.
Affiliation(s)
- Xingyue Gu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Pengfeng Xiao
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
| | - Tao He
- Beidahuang Industry Group General Hospital, Harbin, China
| |
|
16
|
Yang Q, Li B, Wang P, Xie J, Feng Y, Liu Z, Zhu F. LargeMetabo: an out-of-the-box tool for processing and analyzing large-scale metabolomic data. Brief Bioinform 2022; 23:bbac455. [PMID: 36274234 DOI: 10.1093/bib/bbac455] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 06/04/2022] [Revised: 09/06/2022] [Accepted: 09/24/2022] [Indexed: 12/14/2022] Open
Abstract
Large-scale metabolomics is a powerful technique that has attracted widespread attention in biomedical studies focused on identifying biomarkers and interpreting the mechanisms of complex diseases. Despite a rapid increase in the number of large-scale metabolomic studies, the analysis of metabolomic data remains a key challenge. Specifically, diverse unwanted variations and batch effects in processing many samples have a substantial impact on identifying true biological markers, and it is a daunting challenge to annotate a plethora of peaks as metabolites in untargeted mass spectrometry-based metabolomics. Therefore, the development of an out-of-the-box tool is urgently needed to realize data integration and to accurately annotate metabolites with enhanced functions. In this study, the LargeMetabo package based on R code was developed for processing and analyzing large-scale metabolomic data. This package is unique because it is capable of (1) integrating multiple analytical experiments to effectively boost the power of statistical analysis; (2) selecting the appropriate biomarker identification method by intelligent assessment for large-scale metabolic data and (3) providing metabolite annotation and enrichment analysis based on an enhanced metabolite database. The LargeMetabo package can facilitate flexibility and reproducibility in large-scale metabolomics. The package is freely available from https://github.com/LargeMetabo/LargeMetabo.
Affiliation(s)
- Qingxia Yang
- Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Bo Li
- College of Life Sciences, Chongqing Normal University, Chongqing, Chongqing 401331, China
| | - Panpan Wang
- College of Chemistry and Pharmaceutical Engineering, Huanghuai University, Zhumadian 463000, China
| | - Jicheng Xie
- Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
| | - Yuhao Feng
- Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
| | - Ziqiang Liu
- Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| |
|
17
|
Abstract
Artificial intelligence (AI) methods have been and are now being increasingly integrated in prediction software implemented in bioinformatics and its glycoscience branch known as glycoinformatics. AI techniques have evolved in the past decades, but their applications in glycoscience are not yet widespread. This limited use is partly explained by the peculiarities of glyco-data that are notoriously hard to produce and analyze. Nonetheless, as time goes on, the accumulation of glycomics, glycoproteomics, and glycan-binding data has reached a point where even the most recent deep learning methods can provide predictors with good performance. We discuss the historical development of the application of various AI methods in the broader field of glycoinformatics. A particular focus is placed on shining a light on challenges in glyco-data handling, contextualized by lessons learnt from related disciplines. Ending on the discussion of state-of-the-art deep learning approaches in glycoinformatics, we also envision the future of glycoinformatics, including developments that need to occur in order to truly unleash the capabilities of glycoscience in the systems biology era.
Affiliation(s)
- Daniel Bojar
- Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg 41390, Sweden
- Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg 41390, Sweden
| | - Frederique Lisacek
- Proteome Informatics Group, Swiss Institute of Bioinformatics, CH-1227 Geneva, Switzerland
- Computer Science Department & Section of Biology, University of Geneva, route de Drize 7, CH-1227, Geneva, Switzerland
| |
|
18
|
Akmal MA, Hassan MA, Muhammad S, Khurshid KS, Mohamed A. An analytical study on the identification of N-linked glycosylation sites using machine learning model. PeerJ Comput Sci 2022; 8:e1069. [PMID: 36262138 PMCID: PMC9575850 DOI: 10.7717/peerj-cs.1069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Accepted: 07/25/2022] [Indexed: 06/16/2023]
Abstract
N-linked glycosylation is the most common type of glycosylation; it plays a significant role in identifying various diseases such as type I diabetes and cancer and helps in drug development. Most proteins cannot perform their biological and psychological functionalities without undergoing such modification. Therefore, it is essential to identify such sites by computational techniques because of experimental limitations. This study aims to analyze and synthesize the progress made in discovering N-linked sites using machine learning methods. It also explores the performance of currently available tools to predict such sites. Almost seventy research articles published in recognized journals of the N-linked glycosylation field have been shortlisted after a rigorous filtering process. The findings of the studies have been reported based on multiple aspects: publication channel, feature set construction method, training algorithm, and performance evaluation. Moreover, the literature survey has produced a taxonomy of N-linked sequence identification. Our study focuses on the performance evaluation criteria, and the importance of N-linked glycosylation motivates us to discover resources that use computational methods instead of the experimental method due to its limitations.
Affiliation(s)
- Muhammad Aizaz Akmal
- Department of Computer Science, University of Engineering and Technology, KSK, Lahore, Punjab, Pakistan
| | - Muhammad Awais Hassan
- Department of Computer Science, University of Engineering and Technology, Lahore, Punjab, Pakistan
| | - Shoaib Muhammad
- Department of Computer Science, University of Engineering and Technology, Lahore, Punjab, Pakistan
| | - Khaldoon S. Khurshid
- Department of Computer Science, University of Engineering and Technology, Lahore, Punjab, Pakistan
| | | |
|
19
|
Dupas T, Betus C, Blangy-Letheule A, Pelé T, Persello A, Denis M, Lauzier B. An overview of tools to decipher O-GlcNAcylation from historical approaches to new insights. Int J Biochem Cell Biol 2022; 151:106289. [PMID: 36031106 DOI: 10.1016/j.biocel.2022.106289] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 06/15/2022] [Revised: 08/21/2022] [Accepted: 08/23/2022] [Indexed: 11/19/2022]
Abstract
O-GlcNAcylation is a post-translational modification which affects approximately 5000 human proteins. Its involvement has been shown in many if not all biological processes. Variations in O-GlcNAcylation levels can be associated with the development of diseases. Deciphering the role of O-GlcNAcylation is an important issue to (i) understand its involvement in pathophysiological development and (ii) develop new therapeutic strategies to modulate O-GlcNAc levels. Over the past 30 years, despite the development of several approaches, knowledge of its role and regulation has remained limited. This review proposes an overview of the currently available tools to study O-GlcNAcylation and identify O-GlcNAcylated proteins. Briefly, we discuss pharmacological modulators, methods to study O-GlcNAcylation levels and approaches for O-GlcNAcylomic profiling. This review aims to contribute to a better understanding of the methods used to study O-GlcNAcylation and to promote efforts in the development of new strategies to explore this promising modification.
Affiliation(s)
- Thomas Dupas
- Nantes Université, CHU Nantes, CNRS, INSERM, l'institut du thorax, F-44000 Nantes, France.
| | - Charlotte Betus
- Nantes Université, CHU Nantes, CNRS, INSERM, l'institut du thorax, F-44000 Nantes, France; Department of Pharmacology and Physiology, University of Montreal, Montreal, QC H3T 1C5, Canada; CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
| | | | - Thomas Pelé
- Nantes Université, CHU Nantes, CNRS, INSERM, l'institut du thorax, F-44000 Nantes, France
| | - Antoine Persello
- Nantes Université, CHU Nantes, CNRS, INSERM, l'institut du thorax, F-44000 Nantes, France
| | - Manon Denis
- Nantes Université, CHU Nantes, CNRS, INSERM, l'institut du thorax, F-44000 Nantes, France; Department of Pharmacology and Physiology, University of Montreal, Montreal, QC H3T 1C5, Canada; CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
| | - Benjamin Lauzier
- Nantes Université, CHU Nantes, CNRS, INSERM, l'institut du thorax, F-44000 Nantes, France
| |
|
20
|
Zuo Y, Hong Y, Zeng X, Zhang Q, Liu X. MLysPRED: graph-based multi-view clustering and multi-dimensional normal distribution resampling techniques to predict multiple lysine sites. Brief Bioinform 2022; 23:6661182. [PMID: 35953081 DOI: 10.1093/bib/bbac277] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/17/2022] [Revised: 06/11/2022] [Accepted: 06/14/2022] [Indexed: 11/13/2022] Open
Abstract
Posttranslational modification of lysine residues, K-PTM, is one of the most popular PTMs. Some lysine residues in proteins can undergo continuous or cascaded covalent modification, such as acetylation, crotonylation, methylation and succinylation. The covalent modification of lysine residues may have some special functions in basic research and drug development. Although many computational methods have been developed to predict lysine PTMs, up to now each K-PTM prediction method has modeled and learned only a single class of K-PTM modification. In view of this, this study aims to fill this gap by building a multi-label computational model that can be directly used to predict multiple K-PTMs in proteins. In this study, a multi-label prediction model, MLysPRED, is proposed to identify multiple lysine sites using features generated from human protein sequences. In MLysPRED, three kinds of multi-label sequence encoding algorithms (MLDBPB, MLPSDAAP, MLPSTAAP) are proposed and combined with three encoding strategies (CHHAA, DR and Kmer) to convert preprocessed lysine sequences into effective numerical features. A multidimensional normal distribution oversampling technique and a graph-based multi-view clustering under-sampling algorithm were first proposed and incorporated to reduce the proportion of the original training samples, and a multi-label nearest neighbor algorithm is used for classification. MLysPRED achieved an Aiming of 92.21%, Coverage of 94.98%, Accuracy of 89.63%, Absolute-True of 81.46% and Absolute-False of 0.0682 on the independent datasets. Additionally, a comparison with five existing predictors also indicated that MLysPRED is very promising for predicting multiple K-PTMs in proteins. For the convenience of experimental scientists, 'MLysPRED' has been deployed as a user-friendly web-server at http://47.100.136.41:8181.
Affiliation(s)
- Yun Zuo
- Department of Computer Science, Xiamen University, Xiamen 361005, China
| | - Yue Hong
- Department of Computer Science, Xiamen University, Xiamen 361005, China
| | - Xiangxiang Zeng
- School of Information Science and Engineering, Hunan University, Changsha, China
| | - Qiang Zhang
- School of Computer Science and Technology, Dalian University of Technology (DLUT), China
| | - Xiangrong Liu
- Department of Computer Science, Xiamen University, Xiamen 361005, China
| |
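The five multi-label scores quoted in the MLysPRED abstract above (Aiming, Coverage, Accuracy, Absolute-True, Absolute-False) are not defined there. The following is a minimal sketch that assumes the set-based definitions commonly used for multi-label PTM predictors; the function name `multilabel_metrics`, the label strings, and the `n_labels` parameter are illustrative and do not come from the paper.

```python
from typing import Dict, List, Set


def multilabel_metrics(
    y_true: List[Set[str]], y_pred: List[Set[str]], n_labels: int
) -> Dict[str, float]:
    """Set-based multi-label metrics (assumed definitions, not taken from the paper).

    y_true / y_pred hold one label set per sample; n_labels is the total
    number of possible labels (e.g. 4 for four K-PTM types).
    """
    n = len(y_true)
    totals = {"aiming": 0.0, "coverage": 0.0, "accuracy": 0.0,
              "absolute_true": 0.0, "absolute_false": 0.0}
    for true, pred in zip(y_true, y_pred):
        inter = len(true & pred)   # labels predicted correctly
        union = len(true | pred)   # labels that are true or predicted
        # Aiming: share of predicted labels that are correct (precision-like).
        totals["aiming"] += inter / len(pred) if pred else 0.0
        # Coverage: share of true labels that were recovered (recall-like).
        totals["coverage"] += inter / len(true) if true else 0.0
        # Accuracy: Jaccard overlap between the two label sets.
        totals["accuracy"] += inter / union if union else 1.0
        # Absolute-True: 1 only when the predicted set matches exactly.
        totals["absolute_true"] += 1.0 if true == pred else 0.0
        # Absolute-False: wrongly assigned or missed labels, per possible label.
        totals["absolute_false"] += (union - inter) / n_labels
    return {name: value / n for name, value in totals.items()}
```

Under these assumed definitions the first four scores are better when higher and Absolute-False is better when lower, which is consistent with the abstract reporting it as a small fraction rather than a percentage.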
|
21
|
MLapSVM-LBS: Predicting DNA-binding proteins via a multiple Laplacian regularized support vector machine with local behavior similarity. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
22
|
|
23
|
Li Z, Pan X, Cai YD. Identification of Type 2 Diabetes Biomarkers From Mixed Single-Cell Sequencing Data With Feature Selection Methods. Front Bioeng Biotechnol 2022; 10:890901. [PMID: 35721855 PMCID: PMC9201257 DOI: 10.3389/fbioe.2022.890901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Accepted: 04/04/2022] [Indexed: 11/18/2022] Open
Abstract
Diabetes is the most common disease and a major threat to human health. Type 2 diabetes (T2D) makes up about 90% of all cases. With the development of high-throughput sequencing technologies, more and more of the fundamental pathogenesis of T2D at genetic and transcriptomic levels has been revealed. Recent single-cell sequencing can further reveal the cellular heterogeneity of complex diseases in an unprecedented way. To explore the molecular essence of T2D across multiple cell types, we investigated the expression profiling of more than 1,600 single cells (949 cells from T2D patients and 651 cells from normal controls) and identified the differential expression profiling and characteristics at the transcriptomics level that can distinguish these two groups of cells at the single-cell level. The expression profile was analyzed by several machine learning algorithms, including Monte Carlo feature selection, support vector machine, and repeated incremental pruning to produce error reduction (RIPPER). On one hand, some T2D-associated genes (MTND4P24, MTND2P28, and LOC100128906) were discovered. On the other hand, we revealed novel potential pathogenic mechanisms in a rule manner. They are induced by newly recognized genes and neglected by traditional bulk sequencing techniques. Particularly, the newly identified T2D genes were shown to follow specific quantitative rules with potential for diabetes prediction, and such rules further indicated several potential functional crosstalks involved in T2D.
Affiliation(s)
- Zhandong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Xiaoyong Pan
- Key Laboratory of System Control and Information Processing, Institute of Image Processing and Pattern Recognition, Ministry of Education of China, Shanghai Jiao Tong University, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
- *Correspondence: Yu-Dong Cai,
| |
|
24
|
Zhang Y, Bao W, Cao Y, Cong H, Chen B, Chen Y. A survey on protein–DNA-binding sites in computational biology. Brief Funct Genomics 2022; 21:357-375. [DOI: 10.1093/bfgp/elac009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/02/2022] [Revised: 04/07/2022] [Accepted: 04/22/2022] [Indexed: 01/08/2023] Open
Abstract
Transcription factors are important cellular components of the process of gene expression control. Transcription factor binding sites are locations where transcription factors specifically recognize DNA sequences, targeting gene-specific regions and recruiting transcription factors or chromatin regulators to fine-tune spatiotemporal gene regulation. As common proteins, transcription factors play a meaningful role in life-related activities. With the growing number of protein sequences, effectively predicting the structure and function of proteins has become urgent. At present, protein–DNA-binding site prediction methods are based on traditional machine learning algorithms and deep learning algorithms. In the early stage, methods based on traditional machine learning algorithms were usually used to predict protein–DNA-binding sites. In recent years, methods based on deep learning to predict protein–DNA-binding sites from sequence data have achieved remarkable success. Various statistical and machine learning methods used to predict the function of DNA-binding proteins have been proposed and continuously improved. Existing deep learning methods for predicting protein–DNA-binding sites can be roughly divided into three categories: convolutional neural network (CNN), recursive neural network (RNN) and hybrid neural network based on CNN–RNN. The purpose of this review is to provide an overview of the computational and experimental methods applied in the field of protein–DNA-binding site prediction today. This paper introduces the methods of traditional machine learning and deep learning in protein–DNA-binding site prediction from the aspects of data processing characteristics of existing learning frameworks and differences between basic learning model frameworks. Existing methods are relatively simple compared with those in natural language processing, computer vision, computer graphics and other fields. Therefore, a summary of existing protein–DNA-binding site prediction methods will help researchers better understand this field.
|
25
|
Chen XG, Liu S, Zhang W. Predicting Coding Potential of RNA Sequences by Solving Local Data Imbalance. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1075-1083. [PMID: 32886613 DOI: 10.1109/tcbb.2020.3021800] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 06/11/2023]
Abstract
Non-coding RNAs (ncRNAs) play an important role in various biological processes and are associated with diseases. Distinguishing between coding RNAs and ncRNAs, also known as predicting the coding potential of RNA sequences, is critical for downstream biological function analysis. Many machine learning-based methods have been proposed for predicting the coding potential of RNA sequences. Recent studies reveal that most existing methods perform poorly on RNA sequences with short Open Reading Frames (sORF, ORF length < 303 nt). In this work, we analyze the distribution of ORF length of RNA sequences, and observe that the number of coding RNAs with sORF is inadequate and that coding RNAs with sORF are far fewer than ncRNAs with sORF. Thus, there is a problem of local data imbalance in RNA sequences with sORF. We propose a coding potential prediction method, CPE-SLDI, which uses data oversampling techniques to augment samples for coding RNAs with sORF so as to alleviate local data imbalance. Compared with existing methods, CPE-SLDI achieves better performance, and studies reveal that data augmentation by various data oversampling techniques can enhance the performance of coding potential prediction, especially for RNA sequences with sORF. The implementation of the proposed method is available at https://github.com/chenxgscuec/CPESLDI.
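The idea of correcting a local imbalance, as opposed to rebalancing the whole dataset, can be illustrated with a small sketch. It is not the CPE-SLDI implementation: plain random duplication stands in for the oversampling techniques the abstract mentions, and the sample layout, function name, and `seed` parameter are assumptions made for illustration. Only the 303 nt threshold comes from the abstract.

```python
import random
from typing import List, Tuple

SORF_MAX_LEN = 303  # from the abstract: a short ORF has length < 303 nt

# Hypothetical sample layout: (orf_length_nt, label), label 1 = coding, 0 = non-coding.
Sample = Tuple[int, int]


def oversample_short_orf_coding(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Duplicate coding sORF samples until they match the non-coding sORF count.

    Long-ORF samples are left untouched, so only the local (sORF) imbalance
    is corrected. Random duplication is a stand-in for a real oversampler.
    """
    rng = random.Random(seed)
    short_coding = [s for s in samples if s[0] < SORF_MAX_LEN and s[1] == 1]
    short_noncoding = [s for s in samples if s[0] < SORF_MAX_LEN and s[1] == 0]
    deficit = len(short_noncoding) - len(short_coding)
    if deficit <= 0 or not short_coding:
        return list(samples)  # nothing to balance, or nothing to copy from
    extra = [rng.choice(short_coding) for _ in range(deficit)]
    return list(samples) + extra
```

In practice the duplicated rows would be replaced by synthetic samples from whichever oversampler is chosen; the point of the sketch is only that the resampling is restricted to the sORF subset.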
|
26
|
Tools, tactics and objectives to interrogate cellular roles of O-GlcNAc in disease. Nat Chem Biol 2022; 18:8-17. [PMID: 34934185 PMCID: PMC8712397 DOI: 10.1038/s41589-021-00903-6] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 05/26/2020] [Accepted: 09/21/2021] [Indexed: 12/15/2022]
Abstract
The vast array of cell types of multicellular organisms must individually fine-tune their internal metabolism. One important metabolic and stress regulatory mechanism is the dynamic attachment/removal of glucose-derived sugar N-acetylglucosamine on proteins (O-GlcNAcylation). The number of proteins modified by O-GlcNAc is bewildering, with at least 7,000 sites in human cells. The outstanding challenge is determining how key O-GlcNAc sites regulate a target pathway amidst thousands of potential global sites. Innovative solutions are required to address this challenge in cell models and disease therapy. This Perspective shares critical suggestions for the O-GlcNAc field gleaned from the international O-GlcNAc community. Further, we summarize critical tools and tactics to enable newcomers to O-GlcNAc biology to drive innovation at the interface of metabolism and disease. The growing pace of O-GlcNAc research makes this a timely juncture to involve a wide array of scientists and new toolmakers to selectively approach the regulatory roles of O-GlcNAc in disease.
|
27
|
Guo X, Zhou W, Yu Y, Cai Y, Zhang Y, Du A, Lu Q, Ding Y, Li C. Multiple Laplacian Regularized RBF Neural Network for Assessing Dry Weight of Patients With End-Stage Renal Disease. Front Physiol 2021; 12:790086. [PMID: 34966294 PMCID: PMC8711098 DOI: 10.3389/fphys.2021.790086] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/25/2021] [Accepted: 11/17/2021] [Indexed: 11/28/2022] Open
Abstract
Dry weight (DW) is an important dialysis index for patients with end-stage renal disease. It can guide clinical hemodialysis. Brain natriuretic peptide, chest computed tomography image, ultrasound, and bioelectrical impedance analysis are key indicators (multisource information) for assessing DW. With these approaches, a trial-and-error method (the traditional measurement method) is employed to assess DW, and assessment by clinicians is time-consuming. In this study, we developed a method based on artificial intelligence technology to estimate patient DW. Based on the conventional radial basis function neural network (RBFN), we propose a multiple Laplacian-regularized RBFN (MLapRBFN) model to predict the DW of patients. Compared with other models and the body composition monitor, our method achieves the lowest root mean square error (1.3226). In the Bland-Altman analysis of MLapRBFN, the number of samples outside the agreement interval is the smallest (17 samples). MLapRBFN integrates multiple Laplace regularization terms and employs an efficient iterative algorithm to solve the model. The proportion of samples outside the agreement interval is 3.57%, which is lower than 5%. Therefore, our method can be tentatively applied for clinical evaluation of DW in hemodialysis patients.
Affiliation(s)
- Xiaoyi Guo
- Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Wei Zhou
- Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Yan Yu
- Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Yinghua Cai
- Department of Nursing, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Yuan Zhang
- Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Aiyan Du
- Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Qun Lu
- Department of Nursing, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Chao Li
- General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
| |
Collapse
|
28
|
Liu Y, Jin S, Gao H, Wang X, Wang C, Zhou W, Yu B. Predicting the multi-label protein subcellular localization through multi-information fusion and MLSI dimensionality reduction based on MLFE classifier. Bioinformatics 2021; 38:1223-1230. [PMID: 34864897 PMCID: PMC8690230 DOI: 10.1093/bioinformatics/btab811] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/19/2021] [Revised: 11/17/2021] [Accepted: 11/30/2021] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION Multi-label (ML) protein subcellular localization (SCL) is an indispensable way to study protein function. It can locate a certain protein (such as the human transmembrane protein that promotes the invasion of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)) or expression product at a specific location in a cell, which can provide a reference for clinical treatment of diseases such as coronavirus disease 2019 (COVID-19). RESULTS The article proposes a novel method named ML-locMLFE. First, six feature extraction methods are adopted to obtain effective protein information: pseudo amino acid composition, encoding based on grouped weight, gene ontology, multi-scale continuous and discontinuous, residue probing transformation and evolutionary distance transformation. Next, the ML information latent semantic index method is used to avoid the interference of redundant information. Finally, ML learning with feature-induced labeling information enrichment is adopted to predict the ML protein SCL. The Gram-positive bacteria dataset is chosen as the training set, while the Gram-negative bacteria, virus, newPlant and SARS-CoV-2 datasets serve as the test sets. The overall actual accuracies on the first four datasets are 99.23%, 93.82%, 93.24% and 96.72% by leave-one-out cross-validation. The overall actual accuracy of our predictor on the SARS-CoV-2 dataset is 72.73%. The results indicate that the ML-locMLFE method has clear advantages in predicting the SCL of ML proteins, which provides new ideas for further research on the SCL of ML proteins. AVAILABILITY AND IMPLEMENTATION The source codes and datasets are publicly available at https://github.com/QUST-AIBBDRC/ML-locMLFE/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yushuang Liu: College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Shuping Jin: College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Hongli Gao: College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Xue Wang: College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Congjing Wang: College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Weifeng Zhou: College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Bin Yu (corresponding author): School of Data Science, Qingdao University of Science and Technology, Qingdao 266061, China; College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China
29
Li J, He S, Guo F, Zou Q. HSM6AP: a high-precision predictor for the Homo sapiens N6-methyladenosine (m6A) based on multiple weights and feature stitching. RNA Biol 2021; 18:1882-1892. [PMID: 33446014 PMCID: PMC8583144 DOI: 10.1080/15476286.2021.1875180] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 07/24/2020] [Revised: 12/02/2020] [Accepted: 01/08/2021] [Indexed: 01/21/2023] Open
Abstract
Recent studies have shown that RNA methylation modification can affect RNA transcription, metabolism, splicing and stability. In addition, RNA methylation modification has been associated with cancer, obesity and other diseases. Based on information about the human genome and machine learning, this paper discusses the effect of fusing sequence and gene-level feature extraction on the accuracy of methylation site recognition. The discovery of new features exposed significant limitations of existing computing tools: (1) most prediction models are based solely on sequence features and use SVM or random forest as classification methods; (2) limited by the number of samples, the models may not achieve good performance. To establish a better prediction model for methylation sites, we must set specific weighting strategies for training samples and find more powerful and informative feature matrices to establish a comprehensive model. In this paper, we present HSM6AP, a high-precision predictor for Homo sapiens N6-methyladenosine (m6A) based on multiple weights and feature stitching. Unlike existing methods, HSM6AP weights the samples during training and explores a wide range of features. Max-Relevance-Max-Distance (MRMD) is employed for feature selection, and the feature matrix is generated by fusing the individual features. Extreme gradient boosting (XGBoost), an ensemble machine learning algorithm based on decision trees, is used for model training, and model performance is improved through parameter adjustment. Two rigorous independent data sets demonstrated the superiority of HSM6AP in identifying methylation sites. HSM6AP is an advanced predictor that can be directly employed by users (especially non-professional users) to predict methylation sites. Users can access our related tools and data sets at http://lab.malab.cn/~lijing/HSM6AP.html. The code of our tool is publicly accessible at https://github.com/lijingtju/HSm6AP.git.
Affiliation(s)
- Jing Li: Institute of computational biology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Shida He: Institute of computational biology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Fei Guo: Institute of computational biology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou: Bioinformatics Laboratory, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
30
Sohrawordi M, Hossain MA. Prediction of lysine formylation sites using support vector machine based on the sample selection from majority classes and synthetic minority over-sampling techniques. Biochimie 2021; 192:125-135. [PMID: 34627982 DOI: 10.1016/j.biochi.2021.10.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/15/2021] [Revised: 10/03/2021] [Accepted: 10/05/2021] [Indexed: 12/22/2022]
Abstract
Lysine formylation is a newly discovered post-translational modification (PTM) of considerable interest that is generally found on core and linker histone proteins of prokaryotes and eukaryotes and plays important roles in the regulation of various cellular mechanisms. Hence, it is urgent to properly identify formylation sites in proteins in order to understand the molecular mechanism of formylation and to design drugs for relevant diseases. As experimental identification of formylation sites using traditional processes is expensive and time-consuming, a simple and fast mathematical model for accurately predicting lysine formylation sites is highly desired. A computational model named PLF_SVM is designed and proposed in this study, using binary encoding (BE), amino acid composition (AAC), reverse position relative incidence matrix (RPRIM), position relative incidence matrix (PRIM), and position specific amino acid propensity (PSAAP) feature generation methods to predict formylated and non-formylated lysine sites. In addition, the Synthetic Minority Oversampling Technique (SMOTE) and a proposed sample selection strategy named EnSVM are applied to handle the imbalanced training dataset. The optimal number of features is then selected by the F-score method to train the model. PLF_SVM outperforms the state-of-the-art approaches in validation and independent testing, with accuracies of 98.61% and 98.77%, respectively. A user-friendly web tool for identifying formylation sites is available at https://plf-svm.herokuapp.com/. The proposed method may therefore be a helpful guide for the analysis and prediction of formylated lysine and for understanding the process of cellular regulation.
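The oversampling step named in this abstract can be illustrated with the core idea of plain SMOTE: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The sketch below is an illustration under assumptions, not the PLF_SVM implementation: the function name `smote`, the parameters `k` and `seed`, and the toy feature vectors are all hypothetical, and the paper's EnSVM sample selection is not shown.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic samples from a list of minority feature vectors.

    Each synthetic point is base + gap * (neighbour - base), with gap drawn
    uniformly from [0, 1) and the neighbour chosen among the k nearest
    minority samples (squared Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy two-dimensional minority class, invented for illustration only.
formylated = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
for point in smote(formylated, n_new=3):
    print(point)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class.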
Affiliation(s)
- Md Sohrawordi: Dept. of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh; Dept. of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh
- Md Ali Hossain: Dept. of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
31
Liu T, Chen J, Zhang Q, Hippe K, Hunt C, Le T, Cao R, Tang H. The Development of Machine Learning Methods in discriminating Secretory Proteins of Malaria Parasite. Curr Med Chem 2021; 29:807-821. [PMID: 34636289 DOI: 10.2174/0929867328666211005140625] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/24/2021] [Revised: 07/28/2021] [Accepted: 08/15/2021] [Indexed: 11/22/2022]
Abstract
Malaria caused by Plasmodium falciparum is one of the major infectious diseases in the world. It is essential to exploit an effective method to predict secretory proteins of malaria parasites to develop effective cures and treatment. Biochemical assays can provide details for accurate identification of the secretory proteins, but these methods are expensive and time-consuming. In this paper, we summarized the machine learning-based identification algorithms and compared the construction strategies between different computational methods. Also, we discussed the use of machine learning to improve the ability of algorithms to identify proteins secreted by malaria parasites.
Affiliation(s)
- Ting Liu: School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Jiamao Chen: School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Qian Zhang: School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Kyle Hippe: Department of Computer Science, Pacific Lutheran University, United States
- Cassandra Hunt: Department of Computer Science, Pacific Lutheran University, United States
- Thu Le: Department of Computer Science, Pacific Lutheran University, United States
- Renzhi Cao: Department of Computer Science, Pacific Lutheran University, United States
- Hua Tang: School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
32
Zhang S, Shi H. iR5hmcSC: Identifying RNA 5-hydroxymethylcytosine with multiple features based on stacking learning. Comput Biol Chem 2021; 95:107583. [PMID: 34562726 DOI: 10.1016/j.compbiolchem.2021.107583] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/24/2021] [Revised: 09/02/2021] [Accepted: 09/12/2021] [Indexed: 01/27/2023]
Abstract
RNA 5-hydroxymethylcytosine (5hmC) modification is the basis of the translation of genetic information and the biological evolution. The study of its distribution in transcriptome is fundamentally crucial to reveal the biological significance of 5hmC. Biochemical experiments can use a variety of sequencing-based technologies to achieve high-throughput identification of 5hmC; however, they are labor-intensive, time-consuming, as well as expensive. Therefore, it is urgent to develop more effective and feasible computational methods. In this paper, a novel and powerful model called iR5hmcSC is designed for identifying 5hmC. Firstly, we extract the different features by K-mer, Pseudo Structure Status Composition and One-Hot encoding. Subsequently, the combination of chi-square test and logistic regression is utilized as the feature selection method to select the optimal feature sets. And then stacking learning, an ensemble learning method including random forest (RF), extra trees (EX), AdaBoost (Ada), gradient boosting decision tree (GBDT), and support vector machine (SVM), is used to recognize 5hmC and non-5hmC. Finally, 10-fold cross-validation test is performed to evaluate the model. The accuracy reaches 85.27% and 79.92% on benchmark dataset and independent dataset, respectively. The result is better than the state-of-the-art methods, which indicates that our model is a feasible tool to identify 5hmC. The datasets and source code are freely available at https://github.com/HongyanShi026/iR5hmcSC.
Affiliation(s)
- Shengli Zhang: School of Mathematics and Statistics, Xidian University, Xi'an 710071, PR China
- Hongyan Shi: School of Mathematics and Statistics, Xidian University, Xi'an 710071, PR China
33
Jia C, Zhang M, Fan C, Li F, Song J. Formator: Predicting Lysine Formylation Sites Based on the Most Distant Undersampling and Safe-Level Synthetic Minority Oversampling. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1937-1945. [PMID: 31804942 DOI: 10.1109/tcbb.2019.2957758] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 06/10/2023]
Abstract
Lysine formylation is a reversible type of protein post-translational modification and has been found to be involved in a myriad of biological processes, including modulation of chromatin conformation and gene expression in histones and other nuclear proteins. Accurate identification of lysine formylation sites is essential for elucidating the underlying molecular mechanisms of formylation. Traditional experimental methods are time-consuming and expensive. As such, it is desirable and necessary to develop computational methods for accurate prediction of formylation sites. In this study, we propose a novel predictor, termed Formator, for identifying lysine formylation sites from sequence information. Formator is developed using the ensemble learning (EL) strategy based on four individual support vector machine classifiers via a voting system. Moreover, the most distant undersampling and Safe-Level-SMOTE oversampling techniques were integrated to deal with the data imbalance problem of the training dataset. Four effective feature extraction methods, namely bi-profile Bayes (BPB), k-nearest neighbor (KNN), amino acid physicochemical properties (AAindex), and composition and transition (CTD), were employed to encode the surrounding sequence features of potential formylation sites. Extensive empirical studies show that Formator achieved accuracies of 87.24% and 74.96% on the jackknife test and the independent test, respectively. Performance comparison on the independent test indicates that Formator outperforms the existing prediction tool, LFPred, suggesting that it has great potential to serve as a useful tool for identifying novel lysine formylation sites and facilitating hypothesis-driven experimental efforts.
34
Islam MKB, Rahman J, Hasan MAM, Ahmad S. predForm-Site: Formylation site prediction by incorporating multiple features and resolving data imbalance. Comput Biol Chem 2021; 94:107553. [PMID: 34384997 DOI: 10.1016/j.compbiolchem.2021.107553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2020] [Revised: 06/22/2021] [Accepted: 07/28/2021] [Indexed: 10/20/2022]
Abstract
Formylation is one of the newly discovered post-translational modifications of lysine residues and is responsible for different kinds of diseases. In this work, a novel predictor, named predForm-Site, has been developed to predict formylation sites with higher accuracy. We have integrated multiple sequence features to develop a more informative representation of formylation sites. Moreover, the decision function of the underlying classifier has been optimized on the skewed formylation dataset during model training to improve prediction quality. On the dataset used by the LFPred and Formator predictors, predForm-Site achieved 99.5% sensitivity, 99.8% specificity and 99.8% overall accuracy with an AUC of 0.999 in the jackknife test. In the independent test, it also achieved more than 97% sensitivity and 99% specificity. Similarly, in benchmarking against the recent method CKSAAP_FormSite, the proposed predictor performed significantly better on all measures, particularly sensitivity by around 20%, specificity by nearly 30% and overall accuracy by more than 22%. These experimental results show that the proposed predForm-Site can be used as a complementary tool for the fast exploration of formylation sites. For the convenience of the scientific community, predForm-Site has been deployed as an online tool, accessible at http://103.99.176.239:8080/predForm-Site.
Affiliation(s)
- Md Khaled Ben Islam: Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia; Department of Computer Science & Engineering, Pabna University of Science and Technology, Pabna, Bangladesh
- Julia Rahman: Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia; Department of Computer Science & Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
- Md Al Mehedi Hasan: Department of Computer Science & Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
- Shamim Ahmad: Department of Computer Science & Engineering, Rajshahi University, Rajshahi, Bangladesh
35
Mauri T, Menu-Bouaouiche L, Bardor M, Lefebvre T, Lensink MF, Brysbaert G. O-GlcNAcylation Prediction: An Unattained Objective. Adv Appl Bioinform Chem 2021; 14:87-102. [PMID: 34135600 PMCID: PMC8197665 DOI: 10.2147/aabc.s294867] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/04/2020] [Accepted: 04/28/2021] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND O-GlcNAcylation is an essential post-translational modification (PTM) in mammalian cells. It consists in the addition of a N-acetylglucosamine (GlcNAc) residue onto serines or threonines by an O-GlcNAc transferase (OGT). Inhibition of OGT is lethal, and misregulation of this PTM can lead to diverse pathologies including diabetes, Alzheimer's disease and cancers. Knowing the location of O-GlcNAcylation sites and the ability to accurately predict them is therefore of prime importance to a better understanding of this process and its related pathologies. PURPOSE Here, we present an evaluation of the current predictors of O-GlcNAcylation sites based on a newly built dataset and an investigation to improve predictions. METHODS Several datasets of experimentally proven O-GlcNAcylated sites were combined, and the resulting meta-dataset was used to evaluate three prediction tools. We further defined a set of new features following the analysis of the primary to tertiary structures of experimentally proven O-GlcNAcylated sites in order to improve predictions by the use of different types of machine learning techniques. RESULTS Our results show the failure of currently available algorithms to predict O-GlcNAcylated sites with a precision exceeding 9%. Our efforts to improve the precision with new features using machine learning techniques do succeed for equal proportions of O-GlcNAcylated and non-O-GlcNAcylated sites but fail like the other tools for real-life proportions where ~1.4% of S/T are O-GlcNAcylated. CONCLUSION Present-day algorithms for O-GlcNAcylation prediction narrowly outperform random prediction. The inclusion of additional features, in combination with machine learning algorithms, does not enhance these predictions, emphasizing a pressing need for further development. We hypothesize that the improvement of prediction algorithms requires characterization of OGT's partners.
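The contrast between balanced and real-life proportions reported here follows from how precision depends on class prevalence. A minimal sketch, assuming a hypothetical classifier with 80% sensitivity and 80% specificity (illustrative values, not figures from the paper) and the ~1.4% prevalence quoted in the abstract:

```python
def precision(sensitivity, specificity, prevalence):
    """Expected precision (positive predictive value) at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier; sensitivity and specificity are assumptions.
SENS, SPEC = 0.8, 0.8

balanced = precision(SENS, SPEC, prevalence=0.5)     # equal class proportions
real_life = precision(SENS, SPEC, prevalence=0.014)  # ~1.4% of S/T modified
chance = precision(0.5, 0.5, prevalence=0.014)       # random guessing

print(f"balanced:  {balanced:.3f}")
print(f"real-life: {real_life:.3f}")
print(f"chance:    {chance:.3f}")
```

Random guessing yields a precision equal to the prevalence, so at ~1.4% even a classifier that looks strong on a balanced set lands in single-digit precision, consistent with the 9% ceiling the authors report for existing tools.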
Affiliation(s)
- Theo Mauri: Univ. Lille, CNRS; UMR8576 - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Lille, F-59000, France
- Muriel Bardor: Normandy University, UNIROUEN, Laboratoire Glyco-MEV EA4358, Rouen, 76000, France
- Tony Lefebvre: Univ. Lille, CNRS; UMR8576 - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Lille, F-59000, France
- Marc F Lensink: Univ. Lille, CNRS; UMR8576 - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Lille, F-59000, France
- Guillaume Brysbaert: Univ. Lille, CNRS; UMR8576 - UGSF - Unité de Glycobiologie Structurale et Fonctionnelle, Lille, F-59000, France
36
Qian Y, Jiang L, Ding Y, Tang J, Guo F. A sequence-based multiple kernel model for identifying DNA-binding proteins. BMC Bioinformatics 2021; 22:291. [PMID: 34058979 PMCID: PMC8167993 DOI: 10.1186/s12859-020-03875-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/01/2020] [Accepted: 11/13/2020] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND DNA-binding proteins (DBPs) play a pivotal role in biological systems. A growing number of researchers are studying their mechanisms and detection methods. Traditional experimental methods for detecting DBPs are time-consuming and resource-consuming. In recent years, machine learning methods have been used to detect DBPs. However, it is difficult to adequately describe the information of proteins when predicting DNA-binding proteins. In this study, we extract six features from the protein sequence and use Multiple Kernel Learning based on Centered Kernel Alignment to integrate these features. The integrated feature is fed into a Support Vector Machine to build a predictive model and detect new DBPs. RESULTS In our work, the PDB1075 and PDB186 data sets are employed to test our method. Our model obtains better accuracy than other existing methods on PDB1075 ([Formula: see text]) and PDB186 ([Formula: see text]), respectively. CONCLUSION Multiple kernel learning can fuse the complementary information between different features. Compared with existing methods, our method achieves comparable or the best results on benchmark data sets.
Affiliation(s)
- Yuqing Qian: School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, People's Republic of China
- Limin Jiang: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, People's Republic of China
- Yijie Ding: School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, People's Republic of China
- Jijun Tang: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, People's Republic of China
- Fei Guo: School of Computer Science and Engineering, Central South University, Changsha, People's Republic of China
37
Hu J, Zheng LL, Bai YS, Zhang KW, Yu DJ, Zhang GJ. Accurate prediction of protein-ATP binding residues using position-specific frequency matrix. Anal Biochem 2021; 626:114241. [PMID: 33971164 DOI: 10.1016/j.ab.2021.114241] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/05/2021] [Revised: 04/27/2021] [Accepted: 05/01/2021] [Indexed: 10/21/2022]
Abstract
Knowledge of protein-ATP interactions can help with protein functional annotation and drug discovery. Accurately identifying protein-ATP binding residues is an important but challenging step toward understanding these interactions, especially when only protein sequence information is given. In this study, we propose a novel method, named DeepATPseq, to predict protein-ATP binding residues without using any information about protein three-dimensional structure or sequence-derived structural information. In DeepATPseq, the HHBlits-generated position-specific frequency matrix (PSFM) profile is first employed to extract the feature information of each residue. Then, for each residue, the PSFM-based feature is fed into two prediction models, which are generated by a deep convolutional neural network (DCNN) and a support vector machine (SVM), respectively. The final ATP-binding probability of the residue is calculated as the weighted sum of the outputs of the DCNN-based and SVM-based models. Experimental results on the independent validation data set demonstrate that DeepATPseq achieves an accuracy of 77.71%, covering 57.42% of all ATP-binding residues, while achieving a Matthew's correlation coefficient (0.655) that is significantly higher than that of existing sequence-based methods and comparable to that of the state-of-the-art structure-based predictors. Detailed data analysis shows that the major advantage of DeepATPseq lies in the combined use of DCNN and SVM, which helps extract more discriminative information from the PSFM profiles. The online server and standalone package of DeepATPseq are freely available for academic use at https://jun-csbio.github.io/DeepATPseq/.
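The final combination step described in this abstract, a weighted sum of the two models' per-residue outputs, can be sketched as below. This is an illustration under assumptions rather than the DeepATPseq code: the function names, the convention that the two weights sum to 1, the default weight, the 0.5 decision threshold, and the toy probabilities are all hypothetical.

```python
def combine(p_dcnn, p_svm, w_dcnn=0.5):
    """Weighted sum of per-residue probabilities from two models.

    The weights are assumed to sum to 1 (w_dcnn and 1 - w_dcnn); the
    abstract only states that a weighted sum is used.
    """
    if not 0.0 <= w_dcnn <= 1.0:
        raise ValueError("w_dcnn must lie in [0, 1]")
    if len(p_dcnn) != len(p_svm):
        raise ValueError("both models must score the same residues")
    return [w_dcnn * a + (1.0 - w_dcnn) * b for a, b in zip(p_dcnn, p_svm)]


def call_binding(probabilities, threshold=0.5):
    """Label a residue as ATP-binding when its combined score reaches the threshold."""
    return [p >= threshold for p in probabilities]


# Toy per-residue scores, invented for illustration only.
dcnn_scores = [0.9, 0.2, 0.6]
svm_scores = [0.7, 0.4, 0.3]

combined = combine(dcnn_scores, svm_scores, w_dcnn=0.5)
print(combined)
print(call_binding(combined))
```

In practice the weight and threshold would be tuned on validation data; the values here only show the mechanics.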
Affiliation(s)
- Jun Hu: College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
- Lin-Lin Zheng: College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
- Yan-Song Bai: College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
- Ke-Wen Zhang: College of Mechanical Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
- Dong-Jun Yu: School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing, 210094, China
- Gui-Jun Zhang: College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
38
Zou Y, Wu H, Guo X, Peng L, Ding Y, Tang J, Guo F. MK-FSVM-SVDD: A Multiple Kernel-based Fuzzy SVM Model for Predicting DNA-binding Proteins via Support Vector Data Description. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200607173829] [Citation(s) in RCA: 67] [Impact Index Per Article: 16.8] [Indexed: 11/22/2022]
Abstract
Background: Detecting DNA-binding proteins (DBPs) based on biological and chemical methods is time-consuming and expensive.
Objective: In recent years, the rise of computational biology methods based on Machine Learning (ML) has greatly improved the detection efficiency of DBPs.
Method: In this study, the Multiple Kernel-based Fuzzy SVM Model with Support Vector Data Description (MK-FSVM-SVDD) is proposed to predict DBPs. Firstly, six features are extracted from the protein sequence. Secondly, multiple kernels are constructed from these sequence features. Then, the kernels are integrated by Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL). Next, fuzzy membership scores of training samples are calculated with Support Vector Data Description (SVDD). Finally, an FSVM is trained and employed to detect new DBPs.
Results: Our model is evaluated on several benchmark datasets. Compared with other methods, MK-FSVM-SVDD achieves the best Matthew's Correlation Coefficient (MCC) on PDB186 (0.7250) and PDB2272 (0.5476).
Conclusion: We conclude that MK-FSVM-SVDD is more suitable than a common SVM as the classifier for DNA-binding protein identification.
Affiliation(s)
- Yi Zou: School of Internet of Things Engineering, Jiangnan University, Wuxi, 214122, China
- Hongjie Wu: School of Electronic and Information Engineering, Suzhou University of Science and Technology, No. 1 Kerui Road, 215009, Suzhou, China
- Xiaoyi Guo: Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, 214000, Wuxi, China
- Li Peng: School of Internet of Things Engineering, Jiangnan University, Wuxi, 214122, China
- Yijie Ding: School of Electronic and Information Engineering, Suzhou University of Science and Technology, No. 1 Kerui Road, 215009, Suzhou, China
- Jijun Tang: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, 300350, Tianjin, China
- Fei Guo: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, 300350, Tianjin, China
39
Guo X, Zhou W, Shi B, Wang X, Du A, Ding Y, Tang J, Guo F. An Efficient Multiple Kernel Support Vector Regression Model for Assessing Dry Weight of Hemodialysis Patients. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200614172536] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/22/2022]
Abstract
Background:
Dry Weight (DW) is the lowest weight after dialysis, and patients with
lower weight usually have symptoms of hypotension and shock. Several clinical-based approaches
have been presented to assess the dry weight of hemodialysis patients. However, these traditional
methods all depend on special instruments and professional technicians.
Objective:
To avoid this limitation, a machine-independent way to assess dry weight is needed.
We therefore collected clinical characteristics that influence dry weight and constructed a
Machine Learning (ML)-based model to predict the dry weight of hemodialysis patients.
Methods:
Demographic data, anthropometric measurements, and Bioimpedance Spectroscopy (BIS) data
were collected from 476 hemodialysis patients. Of these, age, sex, Body
Mass Index (BMI), Blood Pressure (BP), Heart Rate (HR), and Years of Dialysis (YD) were
closely related to dry weight, and all of these variables were entered into the regression equation.
A Multiple Kernel Support Vector Regression model based on Maximizing the Average Similarity (MKSVRMAS)
was proposed to predict the dry weight of hemodialysis patients.
Result:
The experimental results show that dry weight is positively correlated with BMI and HR,
and negatively correlated with age, sex, systolic blood pressure, diastolic blood pressure, and
hemodialysis time. Moreover, the Root Mean Square Error (RMSE) of our model was
1.3817.
Conclusion:
Our proposed model could serve as a viable alternative for dry weight estimation of
hemodialysis patients, thus providing a new way for clinical practice.
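The RMSE quoted in this abstract is the square root of the mean squared difference between predicted and reference values. The following is a minimal sketch of that calculation; the function name `rmse` and the sample weights are hypothetical and are not data from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical dry weights in kg, for illustration only.
    reference = [60.0, 70.0]
    predicted = [61.0, 68.0]
    print(round(rmse(reference, predicted), 4))
```

RMSE is expressed in the same unit as the target variable, so an RMSE of about 1.38 on dry weight corresponds to a typical error on the order of 1.4 units of weight.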
Affiliation(s)
- Xiaoyi Guo
- Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, 214000, Wuxi, China
- Wei Zhou
- Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, 214000, Wuxi, China
- Bin Shi
- Hemodialysis Center, Northern Jiangsu People's Hospital, 225001, Yangzhou, China
- Xiaohua Wang
- Department of Urology, the First Affiliated Hospital of Soochow University, 215006, Suzhou, China
- Aiyan Du
- Hemodialysis Center, The Affiliated Wuxi People's Hospital of Nanjing Medical University, 214000, Wuxi, China
- Yijie Ding
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, 215009, Suzhou, China
- Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, 300350, Tianjin, China
- Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, 300350, Tianjin, China
|
40
|
Dou L, Yang F, Xu L, Zou Q. A comprehensive review of the imbalance classification of protein post-translational modifications. Brief Bioinform 2021; 22:6217722. [PMID: 33834199 DOI: 10.1093/bib/bbab089] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 01/10/2021] [Revised: 02/17/2021] [Accepted: 02/24/2021] [Indexed: 12/13/2022]
Abstract
Post-translational modifications (PTMs) play significant roles in regulating protein structure, activity and function, and they are closely involved in various pathologies. Therefore, the identification of associated PTMs is the foundation of in-depth research on related biological mechanisms, disease treatments and drug design. Due to the high cost and time consumption of high-throughput sequencing techniques, developing machine learning-based predictors has been considered an effective approach to rapidly recognize potential modified sites. However, the imbalanced distribution of true and false PTM sites, namely, the data imbalance problem, largely affects the reliability and application of prediction tools. In this article, we conduct a systematic survey of the research progress in imbalanced PTM classification. First, we describe the modeling process in detail and outline useful data imbalance solutions. Then, we summarize the recently proposed bioinformatics tools based on imbalanced PTM data and simultaneously build a convenient website, ImClassi_PTMs (available at lab.malab.cn/∼dlj/ImbClassi_PTMs/), for researchers to browse. Moreover, we analyze the challenges of current computational predictors and propose some suggestions to improve the efficiency of imbalance learning. We hope that this work will provide comprehensive knowledge of imbalanced PTM recognition and contribute to advanced predictors in the future.
Affiliation(s)
- Lijun Dou
- University of Electronic Science and Technology of China and the Shenzhen Polytechnic, China
- Fenglong Yang
- University of Electronic Science and Technology of China and the Shenzhen Polytechnic, China
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
41
|
Auslander N, Gussow AB, Koonin EV. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci 2021; 22:2903. [PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 02/15/2021] [Revised: 03/08/2021] [Accepted: 03/10/2021] [Indexed: 12/23/2022]
Abstract
The exponential growth of biomedical data in recent years has urged the application of numerous machine learning techniques to address emerging problems in biology and clinical research. By enabling the automatic feature extraction, selection, and generation of predictive models, these methods can be used to efficiently study complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as curated databases and biological networks, to enhance training and validation, identify the best interpretable features, and enable feature and model investigation. Here, we review recently developed methods that incorporate machine learning within the same framework with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges posed for machine learning, and, in particular, deep learning in biomedicine, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.
Affiliation(s)
- Eugene V. Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
42
|
He S, Guo F, Zou Q, Ding H. MRMD2.0: A Python Tool for Machine Learning with Feature Ranking and Reduction. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200503030350] [Citation(s) in RCA: 101] [Impact Index Per Article: 25.3] [Indexed: 11/22/2022]
Abstract
Aims:
The study aims to find a way to reduce the dimensionality of the dataset.
Background:
Dimensionality reduction is a key issue in the machine learning process. It
not only improves prediction performance but can also highlight intrinsic features and
help to explore the biological interpretation of the machine learning “black box”.
Objective:
A variety of feature selection algorithms are used to select data features to achieve
dimensionality reduction.
Methods:
First, MRMD2.0 integrates seven popular feature ranking algorithms using a
PageRank strategy. Second, the optimal dimensionality is determined with a forward adding strategy.
Result:
We have achieved good results in our experiments.
Conclusion:
Several works have been tested with MRMD2.0, and it performed well.
It can also draw performance curves as a function of feature dimensionality. If
users want to trade some accuracy for fewer features, they can select the dimensionality from the
performance curves.
Other:
We developed user-friendly Python tools together with a web server. Users can upload
files in csv, arff or libsvm format, and the web server will rank the features and find the
optimal dimensionality.
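The forward adding strategy mentioned in the Methods can be pictured as evaluating ever-longer prefixes of a ranked feature list and keeping the best-scoring prefix. The sketch below is a generic illustration under that assumption and is not MRMD2.0's actual implementation; `forward_adding`, `score_fn` and the toy scoring rule are hypothetical names introduced here.

```python
def forward_adding(ranked_features, score_fn):
    """Add features in rank order and keep the best-scoring prefix.

    ranked_features: feature names sorted from most to least important.
    score_fn: callable taking a list of features and returning a score
              (for example, cross-validated accuracy); supplied by the caller.
    Returns the selected prefix and the (dimensionality, score) curve.
    """
    best_k, best_score = 0, float("-inf")
    curve = []
    for k in range(1, len(ranked_features) + 1):
        score = score_fn(ranked_features[:k])
        curve.append((k, score))
        # Strict comparison keeps the smallest prefix among ties.
        if score > best_score:
            best_k, best_score = k, score
    return ranked_features[:best_k], curve


if __name__ == "__main__":
    informative = {"f1", "f2"}  # hypothetical ground truth for the toy score

    def toy_score(features):
        # Reward informative features, lightly penalize dimensionality.
        return sum(1 for f in features if f in informative) - 0.1 * len(features)

    selected, curve = forward_adding(["f1", "f2", "f3", "f4"], toy_score)
    print(selected)
    print(curve)
```

The returned curve plays the role of the performance curve described in the abstract: a user willing to give up some score for fewer features can pick a smaller dimensionality from it.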
Affiliation(s)
- Shida He
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Hui Ding
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
43
|
Maynard JC, Chalkley RJ. Methods for Enrichment and Assignment of N-Acetylglucosamine Modification Sites. Mol Cell Proteomics 2021; 20:100031. [PMID: 32938750 PMCID: PMC8724609 DOI: 10.1074/mcp.r120.002206] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 06/30/2020] [Revised: 08/27/2020] [Accepted: 09/16/2020] [Indexed: 12/21/2022]
Abstract
O-GlcNAcylation, the addition of a single N-acetylglucosamine residue to serine and threonine residues of cytoplasmic, nuclear, or mitochondrial proteins, is a widespread regulatory posttranslational modification. It is involved in the response to nutritional status and stress, and its dysregulation is associated with diseases ranging from Alzheimer's to diabetes. Although the modification was first detected over 35 years ago, research into the function of O-GlcNAcylation has accelerated dramatically in the last 10 years owing to the development of new enrichment and mass spectrometry techniques that facilitate its analysis. This article summarizes methods for O-GlcNAc enrichment, key mass spectrometry instrumentation advancements, particularly those that allow modification site localization, and software tools that allow analysis of data from O-GlcNAc-modified peptides.
Affiliation(s)
- Jason C Maynard
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, USA
- Robert J Chalkley
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, USA
|
44
|
Assessing Dry Weight of Hemodialysis Patients via Sparse Laplacian Regularized RVFL Neural Network with L2,1-Norm. BIOMED RESEARCH INTERNATIONAL 2021; 2021:6627650. [PMID: 33628794 PMCID: PMC7880720 DOI: 10.1155/2021/6627650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2021] [Revised: 01/21/2021] [Accepted: 01/25/2021] [Indexed: 11/28/2022]
Abstract
Dry weight is the normal weight of hemodialysis patients after hemodialysis. If the amount of water in diabetes is too much (during hemodialysis), the patient will experience hypotension and shock symptoms. Therefore, the correct assessment of the patient's dry weight is clinically important. Existing assessment methods all rely on professional instruments and technicians, which makes them time-consuming and labor-intensive. To avoid this limitation, we apply machine learning methods to patient data. This study collected demographic and anthropometric data of 476 hemodialysis patients, including age, gender, blood pressure (BP), body mass index (BMI), years of dialysis (YD), and heart rate (HR). Building on previous work, we propose a Sparse Laplacian regularized Random Vector Functional Link (SLapRVFL) neural network model. To evaluate the prediction performance of the model, we compare SLapRVFL with the Body Composition Monitor (BCM) instrument and other models. The Root Mean Square Error (RMSE) of SLapRVFL is 1.3136, which is better than that of the other methods. The SLapRVFL neural network model could be a viable alternative for dry weight assessment.
|
45
|
Ma J, Wu C, Hart GW. Analytical and Biochemical Perspectives of Protein O-GlcNAcylation. Chem Rev 2021; 121:1513-1581. [DOI: 10.1021/acs.chemrev.0c00884] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Indexed: 12/13/2022]
Affiliation(s)
- Junfeng Ma
- Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Georgetown University, Washington D.C. 20057, United States
- Ci Wu
- Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Georgetown University, Washington D.C. 20057, United States
- Gerald W. Hart
- Department of Biochemistry and Molecular Biology, Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia 30602, United States
|
46
|
Ning J, Yang H. O-GlcNAcylation in Hyperglycemic Pregnancies: Impact on Placental Function. Front Endocrinol (Lausanne) 2021; 12:659733. [PMID: 34140929 PMCID: PMC8204080 DOI: 10.3389/fendo.2021.659733] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/16/2021] [Accepted: 05/17/2021] [Indexed: 01/16/2023]
Abstract
The dynamic cycling of N-acetylglucosamine, termed O-GlcNAcylation, is a post-translational modification of proteins and is involved in the regulation of fundamental cellular processes. It is controlled by two essential enzymes, O-GlcNAc transferase and O-GlcNAcase. O-GlcNAcylation serves as a modulator in placental tissue; furthermore, increased levels of protein O-GlcNAcylation have been observed in women with hyperglycemia during pregnancy, which may affect the short- and long-term development of offspring. In this review, we focus on the impact of O-GlcNAcylation on placental functions in hyperglycemia-associated pregnancies. We discuss the following topics: effect of O-GlcNAcylation on placental development and its association with hyperglycemia; maternal-fetal nutrition transport, particularly glucose transport, via the mammalian target of rapamycin and AMP-activated protein kinase pathways; and the two-sided regulatory effect of O-GlcNAcylation on inflammation. As O-GlcNAcylation in the placental tissues of pregnant women with hyperglycemia influences near- and long-term development of offspring, research in this field has significant therapeutic relevance.
Affiliation(s)
- Jie Ning
- Department of Obstetrics and Gynaecology, Peking University First Hospital, Beijing, China
- Beijing Key Laboratory of Maternal Foetal Medicine of Gestational Diabetes Mellitus, Beijing, China
- Peking University, Beijing, China
- Huixia Yang
- Department of Obstetrics and Gynaecology, Peking University First Hospital, Beijing, China
- Beijing Key Laboratory of Maternal Foetal Medicine of Gestational Diabetes Mellitus, Beijing, China
- Peking University, Beijing, China
- *Correspondence: Huixia Yang
|
47
|
Ding Y, Jiang L, Tang J, Guo F. Identification of human microRNA-disease association via hypergraph embedded bipartite local model. Comput Biol Chem 2020; 89:107369. [DOI: 10.1016/j.compbiolchem.2020.107369] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/07/2020] [Revised: 08/03/2020] [Accepted: 08/31/2020] [Indexed: 12/16/2022]
|
48
|
Yang L, Gao H, Wu K, Zhang H, Li C, Tang L. Identification of Cancerlectins By Using Cascade Linear Discriminant Analysis and Optimal g-gap Tripeptide Composition. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190730103156] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 01/01/2023]
Abstract
Background:
Lectins are a diverse group of glycoproteins or glycoconjugate proteins
that can be extracted from plants, invertebrates and higher animals. Cancerlectins, a kind of lectin,
play a key role in the interaction of tumor cells with each other and are being employed
as therapeutic agents. A full understanding of cancerlectins is significant because it provides
a tool for the future direction of cancer therapy.
Objective:
To develop an accurate, practical and time-saving tool to identify cancerlectins.
A novel sequence-based method is proposed, along with an accompanying web server that provides
access to the tool.
Methods:
First, protein features were extracted using a new feature construction scheme termed g-gap
tripeptide composition. Then, a proposed cascade linear discriminant analysis (Cascade
LDA) was used to alleviate the difficulties of high dimensionality, with Analysis Of Variance (ANOVA)
as the feature importance criterion. Finally, a Support Vector Machine (SVM) was used as the classifier
to identify cancerlectins.
Results:
The proposed method achieved an accuracy of 91.34%, a sensitivity of 89.89%, a specificity
of 92.48% and a Matthews correlation coefficient of 0.8318 based on only 13 fusion features
in jackknife cross-validation, which is superior to other published methods in this domain.
Conclusion:
In this study, a new method based only on the primary structure of proteins is proposed,
and experimental results show that it could be a promising tool to identify cancerlectins. An open-access
web server is made available in this work to facilitate other related works.
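The abstract does not define g-gap tripeptide composition, so the following is only a sketch under one common reading: each tripeptide is formed from three residues spaced g positions apart (g = 0 gives ordinary contiguous tripeptides), and the feature vector holds the normalized frequency of each of the 20^3 = 8000 possible tripeptides. The function name and the choice to skip windows containing non-standard residues are assumptions made here, not details from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_tripeptide_composition(sequence, g=0):
    """Return normalized g-gap tripeptide frequencies as a dict of 8000 entries.

    Assumed definition: residues at positions i, i + (g + 1) and
    i + 2 * (g + 1) form one tripeptide.
    """
    step = g + 1
    counts = {"".join(t): 0 for t in product(AMINO_ACIDS, repeat=3)}
    total = 0
    for i in range(len(sequence) - 2 * step):
        tri = sequence[i] + sequence[i + step] + sequence[i + 2 * step]
        if tri in counts:  # skip windows with non-standard residues
            counts[tri] += 1
            total += 1
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}


if __name__ == "__main__":
    features = g_gap_tripeptide_composition("ACDEF", g=1)
    print(len(features), features["ADF"])
```

A vector of this size is why the abstract follows feature extraction with a dimensionality-reduction step before classification.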
Affiliation(s)
- Liangwei Yang
- Center for Informational Biology, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
- Hui Gao
- Center for Informational Biology, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
- Keyu Wu
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Haotian Zhang
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Changyu Li
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Lixia Tang
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
|
49
|
Wang C, Zhang H, Li Z, Zhou X, Cheng Y, Chen R. White Blood Cell Image Segmentation Based on Color Component Combination and Contour Fitting. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017102310] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022]
Abstract
Background:
White Blood Cell (WBC) image segmentation plays a key role in cell
morphology analysis. However, WBC segmentation is still a challenging task due to the diversity
of WBCs under different staining conditions.
Objective:
In this paper, we propose a novel WBC segmentation method based on color component
combination and contour fitting to segment WBC images accurately.
Methods:
Specifically, the proposed method first uses color component combination and image
thresholding to achieve nucleus segmentation, then uses a color prior to remove image background,
and extracts the initial WBC contour via Canny edge detection, and finally judges and
closes the unclosed WBC contour by contour fitting. Accordingly, cytoplasm segmentation is
achieved by subtracting the nucleus region from the WBC region.
Results:
Experimental results on 100 WBC images under rapid staining condition and 50 WBC
images under standard staining condition showed that the proposed method improved segmentation
accuracy of white blood cells under rapid and standard staining conditions.
Conclusion:
The proposed color component combination and contour fitting method is effective for the WBC
segmentation task.
Affiliation(s)
- Chuansheng Wang
- School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, China
- Hong Zhang
- School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, China
- Zuoyong Li
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, Minjiang University, Fuzhou, China
- Xiaogen Zhou
- College of Mathematics and Computer Science, Fuzhou University, Fuzhou, China
- Yong Cheng
- School of Information Mechanical & Electrical Engineering, Jiangsu Open University, Nanjing, China
- Rongyan Chen
- Department of Clinical Laboratory, the People's Hospital Affiliated to Fujian University of Traditional Chinese Medicine, Fuzhou, China
|
50
|
Zhang Y, Yu S, Xie R, Li J, Leier A, Marquez-Lago TT, Akutsu T, Smith AI, Ge Z, Wang J, Lithgow T, Song J. PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins. Bioinformatics 2020; 36:704-712. [PMID: 31393553 DOI: 10.1093/bioinformatics/btz629] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 04/03/2019] [Revised: 07/17/2019] [Accepted: 08/07/2019] [Indexed: 12/17/2022]
Abstract
MOTIVATION Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, 'non-classical' secreted proteins are difficult to identify as they lack discernable signal peptide sequences and can make use of diverse secretion pathways. Currently, several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data. RESULTS In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, was used as a machine learning approach and the necessary parameter optimization was performed by a particle swarm optimization strategy. 
All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. Consequently, the final ensemble model achieved superior performance, with an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve value of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors. AVAILABILITY AND IMPLEMENTATION http://pengaroo.erc.monash.edu/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yanju Zhang
- Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
- Sha Yu
- Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China; Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia
- Ruopeng Xie
- Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China; Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia
- Jiahui Li
- Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China; Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- André Leier
- Department of Genetics, AL, USA; Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
- Tatiana T Marquez-Lago
- Department of Genetics, AL, USA; Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
- Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
- A Ian Smith
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia; ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, VIC 3800, Australia
- Zongyuan Ge
- Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
- Jiawei Wang
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Trevor Lithgow
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Jiangning Song
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia; ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, VIC 3800, Australia
|