1
Nafi MMI. Predicting C- and S-linked Glycosylation sites from protein sequences using protein language models. Comput Biol Med 2025; 189:109956. [PMID: 40073495 DOI: 10.1016/j.compbiomed.2025.109956] [Received: 11/18/2024] [Revised: 02/25/2025] [Accepted: 02/27/2025] [Indexed: 03/14/2025]
Abstract
Among various post-translational modifications (PTMs), predicting C-linked and S-linked glycosites is an essential task, yet experimental techniques such as Capillary Electrophoresis (CE), Enzymatic Deglycosylation, and Mass Spectrometry (MS) are expensive. Therefore, computational techniques are required to predict these glycosites. Here, different language model embeddings and sequential features were explored. Two separate feature selection methods: Recursive Feature Elimination (RFE) and Particle Swarm Optimization (PSO) were employed and utilized for identifying the optimal feature set. Cross-validation results were generated for choosing the final models. Three sampling strategies to handle imbalanced datasets were examined: Random undersampling, Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN). In this study, two models: DeepCSEmbed-C and DeepCSEmbed-S are proposed for C-linked and S-linked glycosylation prediction respectively. DeepCSEmbed-C is a dual-branch deep learning model comprising a Feedforward Neural Network (FNN) branch and an Inception branch, coupled with a Random undersampling strategy. DeepCSEmbed-S is a Categorical Boosting (CAT) model with the SMOTE oversampling strategy. DeepCSEmbed-C outperformed available state-of-the-art (SOTA) methods, achieving 92.9% sensitivity, 95.1% F1-score and 90.6% MCC on the Independent dataset. Datasets and python scripts for training and testing the models are provided and made freely accessible at https://github.com/nafcoder/DeepCSEmbed.
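The abstract above names random undersampling as one of three strategies for handling class imbalance. As a minimal sketch of that idea only (not the authors' implementation; the function name, toy data, and seed are assumptions made for illustration), undersampling can be written as:

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    # Every class is reduced to the size of the rarest class.
    target = min(len(items) for items in by_class.values())
    kept_samples, kept_labels = [], []
    for label, items in by_class.items():
        kept_samples.extend(rng.sample(items, target))
        kept_labels.extend([label] * target)
    return kept_samples, kept_labels


# Hypothetical toy data: 3 positive sites and 12 negative sites.
toy_samples = [f"window_{i}" for i in range(15)]
toy_labels = [1] * 3 + [0] * 12
balanced_x, balanced_y = random_undersample(toy_samples, toy_labels)
class_counts = Counter(balanced_y)
```

Undersampling discards majority-class data, whereas SMOTE and ADASYN synthesize new minority-class points instead; which works better is an empirical question that the paper addresses by comparing all three.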
2
Wu C, Xie X, Yang X, Du M, Lin H, Huang J. Applications of gene pair methods in clinical research: advancing precision medicine. Molecular Biomedicine 2025; 6:22. [PMID: 40202606 PMCID: PMC11982013 DOI: 10.1186/s43556-025-00263-w] [Received: 09/06/2024] [Revised: 03/18/2025] [Accepted: 03/21/2025] [Indexed: 04/10/2025]
Abstract
The rapid evolution of high-throughput sequencing technologies has revolutionized biomedical research, producing vast amounts of gene expression data that hold immense potential for biological discovery and clinical applications. Effectively mining these large-scale, high-dimensional data is crucial for facilitating disease detection, subtype differentiation, and understanding the molecular mechanisms underlying disease progression. However, the conventional paradigm of single-gene profiling, measuring absolute expression levels of individual genes, faces critical limitations in clinical implementation. These include vulnerability to batch effects and platform-dependent normalization requirements. In contrast, emerging approaches analyzing relative expression relationships between gene pairs demonstrate unique advantages. By focusing on binary comparisons of two genes' expression magnitudes, these methods inherently normalize experimental variations while capturing biologically stable interaction patterns. In this review, we systematically evaluate gene pair-based analytical frameworks. We classify eleven computational approaches into two fundamental categories: expression value-based methods quantifying differential expression patterns, and rank-based methods exploiting transcriptional ordering relationships. To bridge methodological development with practical implementation, we establish a reproducible analytical pipeline incorporating feature selection, classifier construction, and model evaluation modules using real-world benchmark datasets from pulmonary tuberculosis studies. These findings position gene pair analysis as a transformative paradigm for mining high-dimensional omics data, with direct implications for precision biomarker discovery and mechanistic studies of disease progression.
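The claim that binary comparisons within a gene pair "inherently normalize experimental variations" can be illustrated with a small sketch. The gene names, expression values, and the scale-and-shift transformation below are hypothetical, and this is not any of the eleven methods the review classifies; it only shows that the ordering of two genes within a sample survives a monotone increasing transformation of that sample.

```python
def pair_features(expression, gene_pairs):
    """Return 1 for each (a, b) pair where gene a is expressed above gene b, else 0."""
    return [1 if expression[a] > expression[b] else 0 for a, b in gene_pairs]


# Hypothetical expression values for one sample.
sample = {"GENE_A": 5.2, "GENE_B": 3.1, "GENE_C": 7.8}
pairs = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")]

# Model a platform or batch effect as a monotone increasing transformation.
shifted = {gene: 2.5 * value + 10.0 for gene, value in sample.items()}

original_features = pair_features(sample, pairs)
shifted_features = pair_features(shifted, pairs)
```

Both feature vectors are identical, whereas the absolute values differ by an order of magnitude. Batch effects that are not monotone within a sample would not be removed this way.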
Affiliation(s)
- Changchun Wu
  - The Clinical Hospital of Chengdu Brain Science Institute, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Xueqin Xie
  - The Clinical Hospital of Chengdu Brain Science Institute, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Xin Yang
  - The Clinical Hospital of Chengdu Brain Science Institute, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Mengze Du
  - School of Healthcare Technology, Chengdu Neusoft University, Chengdu, 611844, China
- Hao Lin
  - The Clinical Hospital of Chengdu Brain Science Institute, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Jian Huang
  - The Clinical Hospital of Chengdu Brain Science Institute, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 611731, China
3
Kim K, Shim K, Wang YW, Yang D. Synthetic Biology Strategies for the Production of Natural Colorants and Their Non-Natural Derivatives. ACS Synth Biol 2025; 14:662-676. [PMID: 40066730 DOI: 10.1021/acssynbio.4c00799] [Indexed: 03/22/2025]
Abstract
Colorants are widely used in our daily lives to give colors to diverse chemicals and materials, including clothes, food, drugs, cosmetics, and paints. Although synthetic colorants derived from fossil fuels have been predominantly used due to their low cost, there is a growing need to replace them with natural alternatives. This shift is driven by increasing concerns over the climate crisis caused by excessive fossil fuel use, as well as health issues associated with the consumption of foods, beverages, and cosmetics containing petroleum-derived chemicals. In addition, many natural colorants show health-promoting properties such as antioxidant and antimicrobial activities. Despite such advantages, natural colorants could not be readily commercialized and distributed in the market due to their low stability, limited color spectrum, and low yields from natural resources. To this end, synthetic biology approaches have been developed to efficiently produce natural colorants from renewable resources with high yields. Strategies to diversify natural colorants to produce non-natural derivatives with enhanced properties and an expanded color spectrum have been also developed. In this Review, we discuss the recent synthetic biology strategies developed for enhancing the production of natural colorants and their non-natural derivatives, together with accompanying examples. Challenges ahead and future perspectives are also discussed.
Affiliation(s)
- Kyoungwon Kim
  - Synthetic Biology and Enzyme Engineering Laboratory, Department of Chemical and Biological Engineering, Korea University, Seoul 02841, Republic of Korea
- Kyubin Shim
  - Synthetic Biology and Enzyme Engineering Laboratory, Department of Chemical and Biological Engineering, Korea University, Seoul 02841, Republic of Korea
- Ying Wei Wang
  - Synthetic Biology and Enzyme Engineering Laboratory, Department of Chemical and Biological Engineering, Korea University, Seoul 02841, Republic of Korea
- Dongsoo Yang
  - Synthetic Biology and Enzyme Engineering Laboratory, Department of Chemical and Biological Engineering, Korea University, Seoul 02841, Republic of Korea
4
Lee KW, Pham NT, Min HJ, Park HW, Lee JW, Lo HE, Kwon NY, Seo J, Shaginyan I, Cho H, Wei L, Manavalan B, Jeon YJ. DOGpred: A Novel Deep Learning Framework for Accurate Identification of Human O-linked Threonine Glycosylation Sites. J Mol Biol 2025; 437:168977. [PMID: 39900285 DOI: 10.1016/j.jmb.2025.168977] [Received: 07/31/2024] [Revised: 01/06/2025] [Accepted: 01/28/2025] [Indexed: 02/05/2025]
Abstract
O-linked glycosylation is a crucial post-translational modification that regulates protein function and biological processes. Dysregulation of this process is associated with various diseases, underscoring the need to accurately identify O-linked glycosylation sites on proteins. Current experimental methods for identifying O-linked threonine glycosylation (OTG) sites are often complex and costly. Consequently, developing computational tools that predict these sites based on protein features is crucial. Such tools can complement experimental approaches, enhancing our understanding of the role of OTG dysregulation in diseases and uncovering potential therapeutic targets. In this study, we developed DOGpred, a deep learning-based predictor for precisely identifying human OTGs using high-latent feature representations. Initially, we extracted nine different conventional feature descriptors (CFDs) and nine pre-trained protein language model (PLM)-based embeddings. Notably, each feature was encoded as a 2D tensor, capturing both the sequential and inherent feature characteristics. Subsequently, we designed a stacked convolutional neural network (CNN) module to learn spatial feature representations from CFDs and a stacked recurrent neural network (RNN) module to learn temporal feature representations from PLM-based embeddings. These features were integrated using attention-based fusion mechanisms to generate high-level feature representations for final classification. Ablation analysis and independent tests demonstrated that the optimal model (DOGpred), employing a stacked 1D CNN and a stacked attention-based RNN modules with cross-attention feature fusion, achieved the best performance on the training dataset and significantly outperformed machine learning-based single-feature models and state-of-the-art methods on independent datasets. Furthermore, DOGpred is publicly available at https://github.com/JeonRPM/DOGpred/ for free access and usage.
Affiliation(s)
- Ki Wook Lee
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Nhat Truong Pham
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Hye Jung Min
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Hyun Woo Park
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Ji Won Lee
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Han-En Lo
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Na Young Kwon
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Jimin Seo
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Illia Shaginyan
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Heeje Cho
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Leyi Wei
  - Centre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macau
- Balachandran Manavalan
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
- Young-Jun Jeon
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
5
Wang H, Zhao L, Yu Z, Zeng X, Shi S. CoNglyPred: Accurate Prediction of N-Linked Glycosylation Sites Using ESM-2 and Structural Features With Graph Network and Co-Attention. Proteomics 2025; 25:e202400210. [PMID: 39361250 DOI: 10.1002/pmic.202400210] [Received: 06/15/2024] [Revised: 08/17/2024] [Accepted: 09/20/2024] [Indexed: 03/18/2025]
Abstract
N-Linked glycosylation is crucial for various biological processes such as protein folding, immune response, and cellular transport. Traditional experimental methods for determining N-linked glycosylation sites entail substantial time and labor investment, which has led to the development of computational approaches as a more efficient alternative. However, due to the limited availability of 3D structural data, existing prediction methods often struggle to fully utilize structural information and fall short in integrating sequence and structural information effectively. Motivated by the progress of protein pretrained language models (pLMs) and the breakthrough in protein structure prediction, we introduced a high-accuracy model called CoNglyPred. Having compared various pLMs, we opt for the large-scale pLM ESM-2 to extract sequence embeddings, thus mitigating certain limitations associated with manual feature extraction. Meanwhile, our approach employs a graph transformer network to process the 3D protein structures predicted by AlphaFold2. The final graph output and ESM-2 embedding are intricately integrated through a co-attention mechanism. Among a series of comprehensive experiments on the independent test dataset, CoNglyPred outperforms state-of-the-art models and demonstrates exceptional performance in case study. In addition, we are the first to report the uncertainty of N-linked glycosylation predictors using expected calibration error and expected uncertainty calibration error.
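The abstract above reports predictor uncertainty using expected calibration error (ECE). A minimal sketch of the commonly used binned ECE follows; the number of bins, the equal-width binning, and the toy predictions are assumptions, and the paper's exact procedure may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece


# Hypothetical predictions: 95% confident each time, but only 3 of 4 correct.
overconfident_ece = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, True, False]
)
# Hypothetical perfectly calibrated case: fully confident and always correct.
calibrated_ece = expected_calibration_error([1.0, 1.0], [True, True])
```

A lower value means the predicted probabilities track the observed accuracy more closely; ECE says nothing about discrimination, so it complements rather than replaces metrics such as AUC or MCC.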
Affiliation(s)
- Hongmei Wang
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, China
- Long Zhao
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, China
- Ziyuan Yu
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, China
- Ximin Zeng
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, China
- Shaoping Shi
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, China
6
Pakhrin SC, Chauhan N, Khan S, Upadhyaya J, Beck MR, Blanco E. Prediction of human O-linked glycosylation sites using stacked generalization and embeddings from pre-trained protein language model. Bioinformatics 2024; 40:btae643. [PMID: 39447059 PMCID: PMC11552629 DOI: 10.1093/bioinformatics/btae643] [Received: 05/13/2024] [Revised: 10/02/2024] [Accepted: 10/23/2024] [Indexed: 10/26/2024]
Abstract
MOTIVATION: O-linked glycosylation, an essential post-translational modification process in Homo sapiens, involves attaching sugar moieties to the oxygen atoms of serine and/or threonine residues. It influences various biological and cellular functions. While threonine or serine residues within protein sequences are potential sites for O-linked glycosylation, not all serine and/or threonine residues undergo this modification, underscoring the importance of characterizing its occurrence. This study presents a novel approach for predicting intracellular and extracellular O-linked glycosylation events on proteins, which are crucial for comprehending cellular processes. Two base multi-layer perceptron models were trained by leveraging a stacked generalization framework. These base models respectively use ProtT5 and Ankh O-linked glycosylation site-specific embeddings whose combined predictions are used to train the meta-multi-layer perceptron model. Trained on extensive O-linked glycosylation datasets, the stacked-generalization model demonstrated high predictive performance on independent test datasets. Furthermore, the study emphasizes the distinction between nucleocytoplasmic and extracellular O-linked glycosylation, offering insights into their functional implications that were overlooked in previous studies. By integrating the protein language model's embedding with stacked generalization techniques, this approach enhances predictive accuracy of O-linked glycosylation events and illuminates the intricate roles of O-linked glycosylation in proteomics, potentially accelerating the discovery of novel glycosylation sites.
RESULTS: Stack-OglyPred-PLM produces Sensitivity, Specificity, Matthews Correlation Coefficient, and Accuracy of 90.50%, 89.60%, 0.464, and 89.70%, respectively on a benchmark NetOGlyc-4.0 independent test dataset. These results demonstrate that Stack-OglyPred-PLM is a robust computational tool to predict O-linked glycosylation sites in proteins.
AVAILABILITY AND IMPLEMENTATION: The developed tool, programs, training, and test dataset are available at https://github.com/PakhrinLab/Stack-OglyPred-PLM.
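The reported metrics combine sensitivity and specificity near 90% with a Matthews Correlation Coefficient of 0.464. A sketch of the standard confusion-matrix formulas shows how this pattern can arise when positives are rare. The counts below are hypothetical and chosen only to illustrate the effect; they are not the paper's test-set counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical imbalanced test set: 100 positive sites, 3000 negative sites.
metrics = binary_metrics(tp=90, fp=300, tn=2700, fn=10)
```

With these assumed counts, sensitivity, specificity, and accuracy are all 0.9, yet MCC is well below 0.5, because the 300 false positives outnumber the 90 true positives. MCC therefore carries information about imbalanced data that the other three metrics do not.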
Affiliation(s)
- Subash Chandra Pakhrin
  - Department of Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, United States
- Neha Chauhan
  - School of Computing, Wichita State University, Wichita, KS 67260, United States
- Salman Khan
  - Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, United States
- Jamie Upadhyaya
  - Department of Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, United States
- Moriah Rene Beck
  - Department of Chemistry and Biochemistry, Wichita State University, Wichita, KS 67260, United States
- Eduardo Blanco
  - Department of Computer Science, University of Arizona, Tucson, AZ 85721, United States
7
Zhang L, Deng T, Pan S, Zhang M, Zhang Y, Yang C, Yang X, Tian G, Mi J. DeepO-GlcNAc: a web server for prediction of protein O-GlcNAcylation sites using deep learning combined with attention mechanism. Front Cell Dev Biol 2024; 12:1456728. [PMID: 39450274 PMCID: PMC11500328 DOI: 10.3389/fcell.2024.1456728] [Received: 06/29/2024] [Accepted: 09/26/2024] [Indexed: 10/26/2024]
Abstract
Introduction: Protein O-GlcNAcylation is a dynamic post-translational modification involved in major cellular processes and associated with many human diseases. Bioinformatic prediction of O-GlcNAc sites before experimental validation is a challenging task in O-GlcNAc research. Recent advancements in deep learning algorithms and the availability of O-GlcNAc proteomics data present an opportunity to improve O-GlcNAc site prediction. Objectives: This study aims to develop a deep learning-based tool to improve O-GlcNAcylation site prediction. Methods: We construct an annotated unbalanced O-GlcNAcylation data set and propose a new deep learning framework, DeepO-GlcNAc, using Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN) combined with an attention mechanism. Results: The ablation study confirms that the additional model components in DeepO-GlcNAc, such as attention mechanisms and LSTM, contribute positively to improving prediction performance. Our model demonstrates strong robustness across five cross-species datasets, excluding humans. We also compare our model with three external predictors using an independent dataset. Our results demonstrated that DeepO-GlcNAc outperforms the external predictors, achieving an accuracy of 92%, an average precision of 72%, an MCC of 0.60, and an AUC of 92% in ROC analysis. Moreover, we have implemented DeepO-GlcNAc as a web server to facilitate further investigation and usage by the scientific community. Conclusion: Our work demonstrates the feasibility of utilizing deep learning for O-GlcNAc site prediction and provides a novel tool for O-GlcNAc investigation.
Affiliation(s)
- Liyuan Zhang
  - Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- Tingzhi Deng
  - Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian, China
- Shuijing Pan
  - Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- Minghui Zhang
  - Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- Yusen Zhang
  - School of Mathematics and Statistics, Shandong University, Weihai, Shandong, China
- Chunhua Yang
  - Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- Xiaoyong Yang
  - Department of Comparative Medicine, Department of Cellular and Molecular Physiology, Yale University, New Haven, CT, United States
- Geng Tian
  - Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- Jia Mi
  - Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
8
Pham NT, Zhang Y, Rakkiyappan R, Manavalan B. HOTGpred: Enhancing human O-linked threonine glycosylation prediction using integrated pretrained protein language model-based features and multi-stage feature selection approach. Comput Biol Med 2024; 179:108859. [PMID: 39029431 DOI: 10.1016/j.compbiomed.2024.108859] [Received: 03/20/2024] [Revised: 06/19/2024] [Accepted: 07/06/2024] [Indexed: 07/21/2024]
Abstract
O-linked glycosylation is a complex post-translational modification (PTM) in human proteins that plays a critical role in regulating various cellular metabolic and signaling pathways. In contrast to N-linked glycosylation, O-linked glycosylation lacks specific sequence features and maintains an unstable core structure. Identifying O-linked threonine glycosylation sites (OTGs) remains challenging, requiring extensive experimental tests. While bioinformatics tools have emerged for predicting OTGs, their reliance on limited conventional features and absence of well-defined feature selection strategies limit their effectiveness. To address these limitations, we introduced HOTGpred (Human O-linked Threonine Glycosylation predictor), employing a multi-stage feature selection process to identify the optimal feature set for accurately identifying OTGs. Initially, we assessed 25 different feature sets derived from various pretrained protein language model (PLM)-based embeddings and conventional feature descriptors using nine classifiers. Subsequently, we integrated the top five embeddings linearly and determined the most effective scoring function for ranking hybrid features, identifying the optimal feature set through a process of sequential forward search. Among the classifiers, the extreme gradient boosting (XGBT)-based model, using the optimal feature set (HOTGpred), achieved 92.03 % accuracy on the training dataset and 88.25 % on the balanced independent dataset. Notably, HOTGpred significantly outperformed the current state-of-the-art methods on both the balanced and imbalanced independent datasets, demonstrating its superior prediction capabilities. Additionally, SHapley Additive exPlanations (SHAP) and ablation analyses were conducted to identify the features contributing most significantly to HOTGpred. 
Finally, we developed an easy-to-navigate web server, accessible at https://balalab-skku.org/HOTGpred/, to support glycobiologists in their research on glycosylation structure and function.
Affiliation(s)
- Nhat Truong Pham
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea
- Ying Zhang
  - Beidahuang Industry Group General Hospital, Harbin, China
- Rajan Rakkiyappan
  - Department of Mathematics, Bharathiar University, Coimbatore, 641046, Tamil Nadu, India
- Balachandran Manavalan
  - Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea
9
Jang L, Kim A, Park CS, Moon C, Kim M, Kim J, Yang S, Jang JY, Jeong CM, Lee HS, Park J, Kim K, Byeon H, Kim HH. Fucosylation and galactosylation in N-glycans of bovine intestinal alkaline phosphatase and their role in its enzymatic activity. Arch Biochem Biophys 2024; 758:110069. [PMID: 38914216 DOI: 10.1016/j.abb.2024.110069] [Received: 01/03/2024] [Revised: 06/20/2024] [Accepted: 06/20/2024] [Indexed: 06/26/2024]
Abstract
Bovine intestinal alkaline phosphatase (biALP), a membrane-bound plasma metalloenzyme, maintains intestinal homeostasis, regulates duodenal surface pH, and protects against infections caused by pathogenic bacteria. The N-glycans of biALP regulate its enzymatic activity, protein folding, and thermostability, but their structures are not fully reported. In this study, the structures and quantities of the N-glycans of biALP were analyzed by liquid chromatography-electrospray ionization-high energy collision dissociation-tandem mass spectrometry. In total, 48 N-glycans were identified and quantified, comprising high-mannose [6 N-glycans, 33.1 % (sum of relative quantities of each N-glycan)], hybrid (6, 11.9 %), and complex (36, 55.0 %) structures [bi- (13, 26.1 %), tri- (16, 21.5 %), and tetra-antennary (7, 7.4 %)]. These included bisecting N-acetylglucosamine (33, 56.6 %), mono- to tri-fucosylation (32, 53.3 %), mono- to tri-α-galactosylation (16, 20.7 %), and mono- to tetra-β-galactosylation (36, 58.5 %). No sialylation was identified. N-glycans with non-bisecting GlcNAc (9, 10.3 %), non-fucosylation (10, 13.6 %), non-α-galactosylation (26, 46.2 %), and non-β-galactosylation (6, 8.4 %) were also identified. The activity (100 %) of biALP was reduced to 37.3 ± 0.2 % (by de-fucosylation), 32.7 ± 2.9 % (by de-α-galactosylation), and 0.2 ± 0.2 % (by de-β-galactosylation), comparable to inhibition by 10⁻⁴ to 10¹ mM EDTA, a biALP inhibitor. These results indicate that fucosylated and galactosylated N-glycans, especially β-galactosylation, affected the activity of biALP. This study is the first to identify 48 diverse N-glycan structures and quantities of bovine as well as human intestinal ALP and to demonstrate the importance of the role of fucosylation and galactosylation for maintaining the activity of biALP.
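The counts and relative quantities quoted in the abstract above can be cross-checked with simple arithmetic. The sketch below uses only the figures stated in the abstract; the variable names are illustrative, and `Decimal` is used to avoid floating-point rounding in the sums.

```python
from decimal import Decimal

# (number of N-glycans, relative quantity in %) as stated in the abstract.
glycan_classes = {
    "high-mannose": (6, Decimal("33.1")),
    "hybrid": (6, Decimal("11.9")),
    "complex": (36, Decimal("55.0")),
}
complex_antennae = {
    "bi": (13, Decimal("26.1")),
    "tri": (16, Decimal("21.5")),
    "tetra": (7, Decimal("7.4")),
}
# Each modification paired with its "non-" counterpart.
modification_pairs = {
    "bisecting GlcNAc": ((33, Decimal("56.6")), (9, Decimal("10.3"))),
    "fucosylation": ((32, Decimal("53.3")), (10, Decimal("13.6"))),
    "alpha-galactosylation": ((16, Decimal("20.7")), (26, Decimal("46.2"))),
    "beta-galactosylation": ((36, Decimal("58.5")), (6, Decimal("8.4"))),
}


def totals(groups):
    """Sum counts and percentages over an iterable of (count, percent) pairs."""
    groups = list(groups)
    count = sum(c for c, _ in groups)
    percent = sum((p for _, p in groups), Decimal("0"))
    return count, percent


class_total = totals(glycan_classes.values())
antennae_total = totals(complex_antennae.values())
non_high_mannose = totals([glycan_classes["hybrid"], glycan_classes["complex"]])
pair_totals = {name: totals(pair) for name, pair in modification_pairs.items()}
```

The three classes sum to 48 N-glycans and 100.0 %, and the antennary subclasses sum to the complex total (36, 55.0 %). Each modification pair sums to 42 N-glycans and 66.9 %, which equals the hybrid plus complex total; this appears consistent with those modifications being tallied over the non-high-mannose structures only, although the abstract does not state this explicitly.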
Affiliation(s)
- Leeseul Jang
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Ahyeon Kim
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Chi Soo Park
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Chulmin Moon
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Mirae Kim
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Jieun Kim
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Subin Yang
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Ji Yeon Jang
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Chang Myeong Jeong
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Han Seul Lee
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Juhee Park
  - Department of Pharmaceutical Regulatory Sciences, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Kyuran Kim
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Haeun Byeon
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
- Ha Hyung Kim
  - Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
  - Department of Pharmaceutical Regulatory Sciences, Graduate School of Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul, 06974, Republic of Korea
10
Hu F, Gao J, Zheng J, Kwoh C, Jia C. N-GlycoPred: A hybrid deep learning model for accurate identification of N-glycosylation sites. Methods 2024; 227:48-57. [PMID: 38734394 DOI: 10.1016/j.ymeth.2024.05.002] [Received: 02/28/2024] [Revised: 04/16/2024] [Accepted: 05/03/2024] [Indexed: 05/13/2024]
Abstract
Studies have shown that protein glycosylation in cells reflects the real-time dynamics of biological processes, and the occurrence and development of many diseases are closely related to protein glycosylation. Abnormal protein glycosylation can be used as a potential diagnostic and prognostic marker of a disease, as well as a therapeutic target and a new starting point for exploring pathogenesis. To address the significant differences in the prediction results of previous models across species, we constructed a hybrid deep learning model, N-GlycoPred, based on dual-layer convolution, a paired attention mechanism and BiLSTM for accurate identification of N-glycosylation sites. Using one-hot encoding or the AAindex, we selected the optimal combination of features and deep learning frameworks separately for human and mouse to refine the models. On six independent test datasets, our N-GlycoPred model achieved an average AUC of 0.9553, which is 0.23% higher than that of MusiteDeep. The comparison results indicate that our model can serve as a powerful tool for N-glycosylation site prescreening for biological researchers.
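The one-hot encoding named in this abstract can be illustrated with a minimal sketch. It is not the N-GlycoPred implementation: the window size (`flank`), the padding character `X`, and the function names `extract_windows` and `one_hot` are assumptions chosen for illustration only.

```python
# Illustrative sketch only; names and window size are hypothetical,
# not taken from the N-GlycoPred paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def extract_windows(sequence, residue="N", flank=3):
    """Return (position, window) pairs centered on each `residue`.

    The sequence is padded with 'X' so that sites near the termini
    still yield windows of length 2 * flank + 1.
    """
    padded = "X" * flank + sequence + "X" * flank
    windows = []
    for pos, aa in enumerate(sequence):
        if aa == residue:
            windows.append((pos, padded[pos:pos + 2 * flank + 1]))
    return windows


def one_hot(window):
    """Encode a window as a list of 20-dimensional rows.

    Padding or unknown characters are left as all-zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in window]
    for row, aa in enumerate(window):
        col = AA_INDEX.get(aa)
        if col is not None:
            matrix[row][col] = 1
    return matrix
```

Each candidate asparagine then yields a fixed-size matrix that a convolutional or recurrent model could consume; physicochemical encodings such as the AAindex would replace the 0/1 columns with property values.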
Affiliation(s)
- Fengzhu Hu
- School of Science, Dalian Maritime University, Dalian 116026, China
| | - Jie Gao
- School of Science, Dalian Maritime University, Dalian 116026, China
| | - Jia Zheng
- School of Science, Dalian Maritime University, Dalian 116026, China
| | - Cheekeong Kwoh
- School of Computer Science and Engineering, Nanyang Technological University, Singapore
| | - Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, China.
|
11
|
Yom A, Chiang A, Lewis NE. Boltzmann Model Predicts Glycan Structures from Lectin Binding. Anal Chem 2024; 96:8332-8341. [PMID: 38720429 PMCID: PMC11162346 DOI: 10.1021/acs.analchem.3c04992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/20/2024]
Abstract
Glycans are complex oligosaccharides that are involved in many diseases and biological processes. Unfortunately, current methods for determining glycan composition and structure (glycan sequencing) are laborious and require a high level of expertise. Here, we assess the feasibility of sequencing glycans based on their lectin binding fingerprints. By training a Boltzmann model on lectin binding data, we predict the approximate structures of 88 ± 7% of N-glycans and 87 ± 13% of O-glycans in our test set. We show that our model generalizes well to the pharmaceutically relevant case of Chinese hamster ovary (CHO) cell glycans. We also analyze the motif specificity of a wide array of lectins and identify the most and least predictive lectins and glycan features. These results could help streamline glycoprotein research and be of use to anyone using lectins for glycobiology.
Affiliation(s)
- Aria Yom
- Department of Physics, University of California, San Diego, California 92093, United States
| | - Austin Chiang
- Department of Pediatrics, University of California, San Diego, California 92093, United States
- Immunology Center of Georgia, Augusta University, Augusta, Georgia 30912, United States
- Department of Medicine, Augusta University, Augusta, Georgia 30912, United States
| | - Nathan E Lewis
- Department of Pediatrics, University of California, San Diego, California 92093, United States
- Department of Bioengineering, University of California, San Diego, California 92093, United States
|
12
|
Xin R, Zhang F, Zheng J, Zhang Y, Yu C, Feng X. SDBA: Score Domain-Based Attention for DNA N4-Methylcytosine Site Prediction from Multiperspectives. J Chem Inf Model 2024; 64:2839-2853. [PMID: 37646411 DOI: 10.1021/acs.jcim.3c00688] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 09/01/2023]
Abstract
In tasks related to DNA sequence classification, choosing appropriate encoding methods is challenging. Some methods encode sequences based on prior knowledge, which limits the ability of the model to obtain multiperspective information from the sequences. We introduce a new trainable ensemble method based on the attention mechanism, SDBA (Score Domain-Based Attention). Unlike other methods, we fed task-independent encoding results into the models and dynamically ensembled features from different perspectives using the SDBA mechanism. This approach allows the model to acquire and weight sequence features on its own. SDBA is conceptually general and empirically powerful. It has achieved new state-of-the-art results on the benchmark data sets associated with DNA N4-methylcytosine site prediction.
Affiliation(s)
- Ruihao Xin
- College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China
| | - Fan Zhang
- College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
| | - Jiaxin Zheng
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China
| | - Yangyi Zhang
- University of Melbourne Centre for Cancer Research, Victorian Comprehensive Cancer Centre, University of Melbourne, Parkville, Victoria 3050, Australia
| | - Cuinan Yu
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China
| | - Xin Feng
- School of Science, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
- State Key Laboratory of Inorganic Synthesis and Preparative Chemistry, College of Chemistry, Jilin University, Changchun 130012, P.R. China
|
13
|
Gao J, Zhao Y, Chen C, Ning Q. MVNN-HNHC:A multi-view neural network for identification of human non-histone crotonylation sites. Anal Biochem 2024; 687:115426. [PMID: 38141798 DOI: 10.1016/j.ab.2023.115426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Revised: 11/21/2023] [Accepted: 12/06/2023] [Indexed: 12/25/2023]
Abstract
Crotonylation on lysine sites in human non-histone proteins plays a crucial role in biological activities. However, because traditional experimental methods for crotonylation site identification are time-consuming and labor-intensive, computational prediction methods have become increasingly popular in recent years. Despite its significance, crotonylation site prediction has received less attention in non-histone proteins than in histones. In this study, we propose a Multi-View Neural Network for identification of Human Non-Histone Crotonylation sites, named MVNN-HNHC. MVNN-HNHC integrates multi-view encoding features and adaptive encoding features through a multi-channel neural network to learn the attribute differences between crotonylation sites and non-crotonylation sites from various aspects. In MVNN-HNHC, convolutional neural networks obtain local information from these features, and bidirectional long short-term memory networks extract sequence information. An attention mechanism then fuses the outputs of the various feature extraction modules. Finally, a fully connected network acts as the classifier to predict whether a lysine site is a crotonylation site or a non-crotonylation site. On the independent test set, sensitivity, specificity, accuracy, Matthews correlation coefficient, and area under the curve (AUC) reach 80.06%, 75.77%, 77.06%, 0.5203, and 0.7792, respectively. To verify the effectiveness of this method, we carried out a series of experiments, and the results show that MVNN-HNHC is an effective tool for predicting crotonylation sites in non-histone proteins. The data and code are available at https://github.com/xbbxhbc/junjun0612.git.
Affiliation(s)
- Jun Gao
- Department of Information Science and Technology, Dalian Maritime University, Dalian, 116026, China
| | - Yaomiao Zhao
- Department of Information Science and Technology, Dalian Maritime University, Dalian, 116026, China
| | - Chen Chen
- Naval Architecture and Ocean Engineering College, Dalian Maritime University, Dalian, 116026, China.
| | - Qiao Ning
- Department of Information Science and Technology, Dalian Maritime University, Dalian, 116026, China.
|
14
|
He K, Baniasad M, Kwon H, Caval T, Xu G, Lebrilla C, Hommes DW, Bertozzi C. Decoding the glycoproteome: a new frontier for biomarker discovery in cancer. J Hematol Oncol 2024; 17:12. [PMID: 38515194 PMCID: PMC10958865 DOI: 10.1186/s13045-024-01532-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2023] [Accepted: 03/04/2024] [Indexed: 03/23/2024]
Abstract
Early detection of cancer and prediction of treatment response continue to pose significant challenges. Cancer liquid biopsies focusing on detecting circulating tumor cells (CTCs) and circulating tumor DNA (ctDNA) have shown enormous potential due to their non-invasive nature and their implications for precision cancer management. Recently, liquid biopsy has been further expanded to profile glycoproteins, which are the products of post-translational modifications of proteins and play key roles in both normal and pathological processes, including cancers. Advances in chemical and mass spectrometry-based technologies and artificial intelligence-based platforms have enabled extensive studies of cancer- and organ-specific changes in glycans and glycoproteins through glycomics and glycoproteomics. Glycoproteomic analysis has emerged as a promising tool for biomarker discovery and development in the early detection of cancers and the prediction of treatment efficacy, including response to immunotherapies. These biomarkers could play a crucial role in aiding early intervention and personalized therapy decisions. In this review, we summarize the significant advances in cancer glycoproteomic biomarker studies and the promise and challenges of integrating them into clinical practice to improve cancer patient care.
Affiliation(s)
- Kai He
- James Comprehensive Cancer Center, The Ohio State University, Columbus, USA.
| | | | - Hyunwoo Kwon
- James Comprehensive Cancer Center, The Ohio State University, Columbus, USA
| | | | - Gege Xu
- InterVenn Biosciences, South San Francisco, USA
| | - Carlito Lebrilla
- Department of Biochemistry and Molecular Medicine, UC Davis Health, Sacramento, USA
|
15
|
Hu F, Li W, Li Y, Hou C, Ma J, Jia C. O-GlcNAcPRED-DL: Prediction of Protein O-GlcNAcylation Sites Based on an Ensemble Model of Deep Learning. J Proteome Res 2024; 23:95-106. [PMID: 38054441 DOI: 10.1021/acs.jproteome.3c00458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]
Abstract
O-linked β-N-acetylglucosamine (O-GlcNAc) is a post-translational modification (i.e., O-GlcNAcylation) on serine/threonine residues of proteins, regulating a plethora of physiological and pathological events. As a dynamic process, O-GlcNAc functions in a site-specific manner. However, the experimental identification of the O-GlcNAc sites remains challenging in many scenarios. Herein, by leveraging the recent progress in cataloguing experimentally identified O-GlcNAc sites and advanced deep learning approaches, we establish an ensemble model, O-GlcNAcPRED-DL, a deep learning-based tool, for the prediction of O-GlcNAc sites. In brief, to make a benchmark O-GlcNAc data set, we extracted the information on O-GlcNAc from the recently constructed database O-GlcNAcAtlas, which contains thousands of experimentally identified and curated O-GlcNAc sites on proteins from multiple species. To overcome the imbalance between positive and negative data sets, we selected five groups of negative data sets in humans and mice to construct an ensemble predictor based on connection of a convolutional neural network and bidirectional long short-term memory. By taking into account three types of sequence information, we constructed four network frameworks, with the systematically optimized parameters used for the models. The thorough comparison analysis on two independent data sets of humans and mice and six independent data sets from other species demonstrated remarkably increased sensitivity and accuracy of the O-GlcNAcPRED-DL models, outperforming other existing tools. Moreover, a user-friendly Web server for O-GlcNAcPRED-DL has been constructed, which is freely available at http://oglcnac.org/pred_dl.
Affiliation(s)
- Fengzhu Hu
- School of Science, Dalian Maritime University, Dalian 116026, China
| | - Weiyu Li
- Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
| | - Yaoxiang Li
- Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
| | - Chunyan Hou
- Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
| | - Junfeng Ma
- Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
| | - Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, China
|
16
|
Yu F, Zhang Z, Leng Y, Chen AF. O-GlcNAc modification of GSDMD attenuates LPS-induced endothelial cells pyroptosis. Inflamm Res 2024; 73:5-17. [PMID: 37962578 PMCID: PMC10776498 DOI: 10.1007/s00011-023-01812-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Revised: 10/21/2023] [Accepted: 10/25/2023] [Indexed: 11/15/2023]
Abstract
OBJECTIVE Increased O-linked β-N-acetylglucosamine (O-GlcNAc) stimulation has been reported to protect against sepsis-associated mortality and cardiovascular derangement. Previous studies, including our own research, have indicated that gasdermin-D (GSDMD)-mediated endothelial cell pyroptosis contributes to sepsis-associated endothelial injury. This study explored the functions and mechanisms of O-GlcNAc modification in lipopolysaccharide (LPS)-induced pyroptosis and its effects on the function of GSDMD. METHODS An LPS-induced septic mouse model administered the O-GlcNAcase (OGA) inhibitor thiamet-G (TMG) was used to assess the effects of O-GlcNAcylation on sepsis-associated vascular dysfunction and pyroptosis. We conducted experiments on human umbilical vein endothelial cells (HUVECs) by challenging them with LPS and TMG to investigate the impact of O-GlcNAcylation on endothelial cell pyroptosis and the implications for GSDMD. Additionally, we identified potential O-GlcNAcylation sites in GSDMD using four public O-GlcNAcylation site prediction databases, and these sites were ultimately established through gene mutation. RESULTS Septic mice with increased O-GlcNAc stimulation exhibited reduced endothelial injury and reduced GSDMD cleavage (a marker of pyroptosis). O-GlcNAc modification of GSDMD mitigates LPS-induced pyroptosis in endothelial cells by preventing its interaction with caspase-11 (the murine homolog of human caspases-4/5). We also identified GSDMD Serine 338 (S338) as a novel site of O-GlcNAc modification, leading to decreased association with caspase-4 in HEK293T cells. CONCLUSIONS Our findings identified a novel post-translational modification of GSDMD and showed that O-GlcNAcylation of GSDMD inhibits LPS-induced endothelial injury, suggesting that O-GlcNAc modification-based treatments could serve as potential interventions for sepsis-associated vascular endothelial injury.
Affiliation(s)
- Fan Yu
- Department of Cardiology, The Third Xiangya Hospital of Central South University, Changsha, China
- Research Center for Life Science and Human Health, Binjiang Institute of Zhejiang University, Hangzhou, Zhejiang, China
| | - Zhen Zhang
- Department of Cardiology, The Third Xiangya Hospital of Central South University, Changsha, China
| | - Yiping Leng
- The Affiliated Changsha Central Hospital, Research Center for Phase I Clinical Trials, Hengyang Medical School, University of South China, Changsha, Hunan, China
| | - Alex F Chen
- Department of Cardiology, The Third Xiangya Hospital of Central South University, Changsha, China.
- Department of Cardiology, Institute for Cardiovascular Development and Regenerative Medicine, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, 1665 Kongjiang Road, Shanghai, 200092, China.
|
17
|
Li X, Wang GA, Wei Z, Wang H, Zhu X. Protein-DNA interface hotspots prediction based on fusion features of embeddings of protein language model and handcrafted features. Comput Biol Chem 2023; 107:107970. [PMID: 37866116 DOI: 10.1016/j.compbiolchem.2023.107970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 10/06/2023] [Accepted: 10/07/2023] [Indexed: 10/24/2023]
Abstract
The identification of hotspot residues at protein-DNA binding interfaces plays a crucial role in areas such as drug discovery and disease treatment. Although experimental methods such as alanine scanning mutagenesis have been developed to determine the hotspot residues on protein-DNA interfaces, they are both inefficient and costly. Therefore, efficient and accurate computational methods for predicting hotspot residues are needed. Several computational methods have been developed; however, they are mainly based on handcrafted features, which may not represent all the information in proteins. In this regard, we propose a model called PDH-EH, which utilizes fused features of embeddings extracted from a protein language model (PLM) and handcrafted features. After extracting a total of 1141 feature dimensions, we used mRMR to select the optimal feature subset. Based on the optimal feature subset, several learning algorithms, such as Random Forest, Support Vector Machine, and XGBoost, were used to build the models. The cross-validation results on the training dataset show that the model built with Random Forest achieves the highest AUROC. Further evaluation on the independent test set shows that our model outperforms the existing state-of-the-art models. Moreover, the effectiveness and interpretability of the embeddings extracted from the PLM were demonstrated in our analysis. The codes and datasets used in this study are available at: https://github.com/lixiangli01/PDH-EH.
Affiliation(s)
- Xiang Li
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Gang-Ao Wang
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Zhuoyu Wei
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Hong Wang
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Xiaolei Zhu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China.
|
18
|
Hou X, Wang Y, Bu D, Wang Y, Sun S. EMNGly: predicting N-linked glycosylation sites using the language models for feature extraction. Bioinformatics 2023; 39:btad650. [PMID: 37930896 PMCID: PMC10627407 DOI: 10.1093/bioinformatics/btad650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 09/14/2023] [Indexed: 11/08/2023]
Abstract
MOTIVATION N-linked glycosylation is a frequently occurring post-translational protein modification that serves critical functions in protein folding, stability, trafficking, and recognition. It is involved in multiple biological processes, and alterations to it can result in various diseases. Therefore, identifying N-linked glycosylation sites is imperative for comprehending the mechanisms and systems underlying glycosylation. Due to the inherent experimental complexities, machine learning and deep learning have become indispensable tools for predicting these sites. RESULTS In this context, a new approach called EMNGly has been proposed. The EMNGly approach utilizes a pretrained protein language model (Evolutionary Scale Modeling) and a pretrained protein structure model (Inverse Folding Model) for feature extraction and a support vector machine for classification. Ten-fold cross-validation and independent tests show that this approach outperforms existing techniques. It achieves a Matthews Correlation Coefficient, sensitivity, specificity, and accuracy of 0.8282, 0.9343, 0.8934, and 0.9143, respectively, on a benchmark independent test set.
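The four metrics reported in this abstract are all derived from the binary confusion matrix. A minimal sketch of the standard definitions follows; the function name `binary_metrics` and the example counts in the usage note are hypothetical and do not come from the EMNGly paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Ratios with a zero denominator are returned as 0.0 to avoid errors.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)          # true positive rate
    specificity = ratio(tn, tn + fp)          # true negative rate
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)   # Matthews correlation coefficient
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For example, `binary_metrics(tp=90, fp=20, tn=80, fn=10)` gives a sensitivity of 0.9 and a specificity of 0.8 on made-up counts. Because MCC uses all four cells, it remains informative when positive and negative sites are imbalanced.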
Affiliation(s)
- Xiaoyang Hou
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Yu Wang
- Syneron Technology, Guangzhou 510000, China
| | - Dongbo Bu
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Yaojun Wang
- College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
| | - Shiwei Sun
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100049, China
|
19
|
Zeng Y, Yuan Z, Chen Y, Hu Y. CBDT-Oglyc: Prediction of O-glycosylation sites using ChiMIC-based balanced decision table and feature selection. J Bioinform Comput Biol 2023; 21:2350024. [PMID: 37899352 DOI: 10.1142/s0219720023500245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/31/2023]
Abstract
O-glycosylation (Oglyc) plays an important role in various biological processes. The key to understanding the mechanisms of Oglyc is identifying the corresponding glycosylation sites. Two critical steps, feature selection and classifier design, greatly affect the accuracy of computational methods for predicting Oglyc sites. Based on an efficient feature selection algorithm and a classifier capable of handling imbalanced datasets, a new computational method, ChiMIC-based balanced decision table for O-glycosylation (CBDT-Oglyc), is proposed to predict Oglyc sites in proteins. Sequence characterization is performed by combining amino acid composition (AAC), undirected composition of k-spaced amino acid pairs (undirected-CKSAAP) and pseudo-position-specific scoring matrix (PsePSSM). The Chi-MIC-share algorithm is used for feature selection, which simplifies the model and improves predictive accuracy. For imbalanced classification, a backtracking method based on a local chi-square test is designed, and cost-sensitive learning is then incorporated to construct a novel classifier named ChiMIC-based balanced decision table (CBDT). Based on a 1:49 (positives:negatives) training set, the CBDT classifier achieves significantly better prediction performance than traditional classifiers. Moreover, the independent test results on separate human and mouse glycoproteins show that CBDT-Oglyc outperforms previous methods in global accuracy. CBDT-Oglyc shows great promise in predicting Oglyc sites and is expected to facilitate further experimental studies on protein glycosylation.
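Two of the sequence descriptors named in this abstract can be sketched briefly. The code below is not the CBDT-Oglyc implementation. It assumes that "undirected" means a residue pair is counted regardless of order (so AC and CA share one feature, giving 210 features per gap), and the function names `aac` and `undirected_cksaap` are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical, so pair keys are sorted


def aac(seq):
    """Amino acid composition: fraction of each of the 20 residues."""
    counts = Counter(seq)
    n = len(seq)
    return {aa: (counts.get(aa, 0) / n if n else 0.0) for aa in AMINO_ACIDS}


def undirected_cksaap(seq, k):
    """Composition of residue pairs separated by k positions, order ignored.

    Assumption: pairs (a, b) and (b, a) are merged into one feature.
    Pairs containing non-standard residues are skipped.
    """
    keys = ["".join(p) for p in combinations_with_replacement(AMINO_ACIDS, 2)]
    counts = dict.fromkeys(keys, 0)
    total = len(seq) - k - 1  # number of k-spaced pairs in the sequence
    for i in range(total):
        key = "".join(sorted((seq[i], seq[i + k + 1])))
        if key in counts:
            counts[key] += 1
    if total <= 0:
        return {key: 0.0 for key in keys}
    return {key: c / total for key, c in counts.items()}
```

Concatenating `aac(seq)` with `undirected_cksaap(seq, k)` for several values of `k` gives a fixed-length vector that a feature selection step can then prune.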
Affiliation(s)
- Ying Zeng
- School of Computer and Communication, Hunan Institute of Engineering, Xiangtan 411104, Hunan, P. R. China
| | - Zheming Yuan
- Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha 410128, Hunan, P. R. China
| | - Yuan Chen
- Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha 410128, Hunan, P. R. China
| | - Ying Hu
- School of Computer and Communication, Hunan Institute of Engineering, Xiangtan 411104, Hunan, P. R. China
|
20
|
Li F, Wang C, Guo X, Akutsu T, Webb GI, Coin LJM, Kurgan L, Song J. ProsperousPlus: a one-stop and comprehensive platform for accurate protease-specific substrate cleavage prediction and machine-learning model construction. Brief Bioinform 2023; 24:bbad372. [PMID: 37874948 DOI: 10.1093/bib/bbad372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2023] [Revised: 08/30/2023] [Accepted: 09/29/2023] [Indexed: 10/26/2023]
Abstract
Proteases contribute to a broad spectrum of cellular functions. Given a relatively limited amount of experimental data, developing accurate sequence-based predictors of substrate cleavage sites facilitates a better understanding of protease functions and substrate specificity. While many protease-specific predictors of substrate cleavage sites were developed, these efforts are outpaced by the growth of the protease substrate cleavage data. In particular, since data for 100+ protease types are available and this number continues to grow, it becomes impractical to publish predictors for new protease types, and instead it might be better to provide a computational platform that helps users to quickly and efficiently build predictors that address their specific needs. To this end, we conceptualized, developed, tested and released a versatile bioinformatics platform, ProsperousPlus, that empowers users, even those with no programming or little bioinformatics background, to build fast and accurate predictors of substrate cleavage sites. ProsperousPlus facilitates the use of the rapidly accumulating substrate cleavage data to train, empirically assess and deploy predictive models for user-selected substrate types. Benchmarking tests on test datasets show that our platform produces predictors that on average exceed the predictive performance of current state-of-the-art approaches. ProsperousPlus is available as a webserver and a stand-alone software package at http://prosperousplus.unimelb-biotools.cloud.edu.au/.
Affiliation(s)
- Fuyi Li
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
- South Australian immunoGENomics Cancer Institute (SAiGENCI), Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
| | - Cong Wang
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
| | - Xudong Guo
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Geoffrey I Webb
- Monash Data Futures Institute, Monash University, VIC 3800, Australia
| | - Lachlan J M Coin
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Jiangning Song
- Monash Data Futures Institute, Monash University, VIC 3800, Australia
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
|
21
|
Waury K, de Wit R, Verberk IMW, Teunissen CE, Abeln S. Deciphering Protein Secretion from the Brain to Cerebrospinal Fluid for Biomarker Discovery. J Proteome Res 2023; 22:3068-3080. [PMID: 37606934 PMCID: PMC10476268 DOI: 10.1021/acs.jproteome.3c00366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Indexed: 08/23/2023]
Abstract
Cerebrospinal fluid (CSF) is an essential matrix for the discovery of neurological disease biomarkers. However, the high dynamic range of protein concentrations in CSF hinders the detection of the least abundant protein biomarkers by untargeted mass spectrometry. It is thus beneficial to gain a deeper understanding of the secretion processes within the brain. Here, we aim to explore if and how the secretion of brain proteins to the CSF can be predicted. By combining a curated CSF proteome and the brain elevated proteome of the Human Protein Atlas, brain proteins were classified as CSF or non-CSF secreted. A machine learning model was trained on a range of sequence-based features to differentiate between CSF and non-CSF groups and effectively predict the brain origin of proteins. The classification model achieves an area under the curve of 0.89 if using high confidence CSF proteins. The most important prediction features include the subcellular localization, signal peptides, and transmembrane regions. The classifier generalized well to the larger brain detected proteome and is able to correctly predict novel CSF proteins identified by affinity proteomics. In addition to elucidating the underlying mechanisms of protein secretion, the trained classification model can support biomarker candidate selection.
Affiliation(s)
- Katharina Waury
- Department of Computer Science, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
| | - Renske de Wit
- Department of Computer Science, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
| | - Inge M. W. Verberk
- Neurochemistry Laboratory, Department of Clinical Chemistry, Amsterdam Neuroscience, VU University Medical Center, Amsterdam UMC, 1081 HV Amsterdam, The Netherlands
| | - Charlotte E. Teunissen
- Neurochemistry Laboratory, Department of Clinical Chemistry, Amsterdam Neuroscience, VU University Medical Center, Amsterdam UMC, 1081 HV Amsterdam, The Netherlands
| | - Sanne Abeln
- Department of Computer Science, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
|
22
|
Li F, Guo X, Bi Y, Jia R, Pitt ME, Pan S, Li S, Gasser RB, Coin LJ, Song J. Digerati - A multipath parallel hybrid deep learning framework for the identification of mycobacterial PE/PPE proteins. Comput Biol Med 2023; 163:107155. [PMID: 37356289 DOI: 10.1016/j.compbiomed.2023.107155] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/21/2023] [Revised: 06/05/2023] [Accepted: 06/07/2023] [Indexed: 06/27/2023]
Abstract
The genome of Mycobacterium tuberculosis contains a relatively high percentage (10%) of genes that are poorly characterised because of their highly repetitive nature and high GC content. Some of these genes encode proteins of the PE/PPE family, which are thought to be involved in host-pathogen interactions, virulence, and disease pathogenicity. Members of this family are genetically divergent and challenging to both identify and classify using conventional computational tools. Thus, advanced in silico methods are needed to efficiently identify proteins of this family for subsequent functional annotation. In this study, we developed the first deep learning-based approach, termed Digerati, for the rapid and accurate identification of PE and PPE family proteins. Digerati was built upon a multipath parallel hybrid deep learning framework, which combines multi-layer convolutional neural networks with bidirectional long short-term memory and a self-attention module to effectively learn the higher-order feature representations of PE/PPE proteins. Empirical studies demonstrated that Digerati achieved significantly better performance (∼18-20%) than alignment-based approaches, including BLASTP, PHMMER, and HHsuite, in both prediction accuracy and speed. Digerati is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE/PPE family members. The webserver and source codes of Digerati are publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/Digerati/.
Affiliation(s)
- Fuyi Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China; Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria, 3000, Australia.
| | - Xudong Guo
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Yue Bi
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, 3800, Australia
| | - Runchang Jia
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Miranda E Pitt
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria, 3000, Australia
| | - Shirui Pan
- School of Information and Communication Technology, Griffith University, QLD, 4222, Australia
| | - Shuqin Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Robin B Gasser
- Melbourne Veterinary School, Faculty of Science, The University of Melbourne, VIC, 3010, Australia
| | - Lachlan Jm Coin
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria, 3000, Australia.
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, 3800, Australia.
| |
|
23
|
Zhang Z, Li F, Zhao J, Zheng C. CapsNetYY1: identifying YY1-mediated chromatin loops based on a capsule network architecture. BMC Genomics 2023; 24:448. [PMID: 37559017 PMCID: PMC10410878 DOI: 10.1186/s12864-023-09217-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2022] [Accepted: 02/28/2023] [Indexed: 08/11/2023] Open
Abstract
BACKGROUND Previous studies have identified that chromosome structure plays a very important role in gene control. The transcription factor Yin Yang 1 (YY1), a multifunctional DNA binding protein, could form a dimer to mediate chromatin loops and active enhancer-promoter interactions. The deletion of YY1 or point mutations at the YY1 binding sites significantly inhibit the enhancer-promoter interactions and affect gene expression. To date, only a few computational methods are available for identifying YY1-mediated chromatin loops. RESULTS We proposed a novel model named CapsNetYY1, which is based on a capsule network architecture, to identify whether a pair of YY1 motifs can form a chromatin loop. First, the DNA sequence is encoded using the one-hot encoding method. Second, a multi-scale convolution layer is used to extract local features of the sequence, and a bidirectional gated recurrent unit is used to learn features across time steps. Finally, capsule networks (a convolution capsule layer and a digital capsule layer) are used to extract higher-level features and recognize YY1-mediated chromatin loops. Compared with DeepYY1, the only existing predictor of YY1-mediated chromatin loops, CapsNetYY1 achieved better performance on the independent datasets (AUC [Formula: see text]). CONCLUSION The results indicate that CapsNetYY1 is an excellent method for identifying YY1-mediated chromatin loops. We believe that the CapsNetYY1 method will be used for predictive classification of other DNA sequences.
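As an illustration of the one-hot encoding step mentioned in this abstract, the sketch below maps each DNA base to a 4-element indicator vector. It is a minimal example, not the CapsNetYY1 implementation: the function name `one_hot_encode`, the A/C/G/T column order, and the all-zero treatment of ambiguous bases such as N are assumptions made here for illustration.

```python
# Illustrative sketch only; names and conventions are assumptions,
# not taken from the CapsNetYY1 source code.
ALPHABET = "ACGT"  # assumed column order


def one_hot_encode(sequence):
    """Return one 4-element row per base, in A, C, G, T column order."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:  # ambiguous bases (e.g. 'N') stay all-zero (assumption)
            row[index[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

The resulting matrix has one row per sequence position, which is the shape a convolution layer would typically slide over.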
Affiliation(s)
- Zhimin Zhang
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
| | - Fenglin Li
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
| | - Jianping Zhao
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China.
| | - Chunhou Zheng
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Information Materials and Intelligent Sensing Laboratory of Anhui Province, and School of Artificial Intelligence, Anhui University, Hefei, China.
| |
|
24
|
Grass GD, Ercan D, Obermayer AN, Shaw T, Stewart PA, Chahoud J, Dhillon J, Lopez A, Johnstone PAS, Rogatto SR, Spiess PE, Eschrich SA. An Assessment of the Penile Squamous Cell Carcinoma Surfaceome for Biomarker and Therapeutic Target Discovery. Cancers (Basel) 2023; 15:3636. [PMID: 37509297 PMCID: PMC10377392 DOI: 10.3390/cancers15143636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 07/01/2023] [Accepted: 07/11/2023] [Indexed: 07/30/2023] Open
Abstract
Penile squamous cell carcinoma (PSCC) is a rare malignancy in most parts of the world and the underlying mechanisms of this disease have not been fully investigated. About 30-50% of cases are associated with high-risk human papillomavirus (HPV) infection, which may have prognostic value. When PSCC becomes resistant to upfront therapies there are limited options, thus further research is needed in this venue. The extracellular domain-facing protein profile on the cell surface (i.e., the surfaceome) is a key area for biomarker and drug target discovery. This research employs computational methods combined with cell line translatomic (n = 5) and RNA-seq transcriptomic data from patient-derived tumors (n = 18) to characterize the PSCC surfaceome, evaluate the composition dependency on HPV infection, and explore the prognostic impact of identified surfaceome candidates. Immunohistochemistry (IHC) was used to validate the localization of select surfaceome markers. This analysis characterized a diverse surfaceome within patient tumors with 25% and 18% of the surfaceome represented by the functional classes of receptors and transporters, respectively. Significant differences in protein classes were noted by HPV status, with the most change being seen in transporter proteins (25%). IHC confirmed the robust surface expression of select surfaceome targets in the top 85% of expression and a superfamily immunoglobulin protein called BSG/CD147 was prognostic of survival. This study provides the first description of the PSCC surfaceome and its relation to HPV infection and sets a foundation for novel biomarker and drug target discovery in this rare cancer.
Affiliation(s)
- George Daniel Grass
- Department of Radiation Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Dalia Ercan
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Alyssa N Obermayer
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Timothy Shaw
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Paul A Stewart
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Jad Chahoud
- Department of Genitourinary Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Jasreman Dhillon
- Department of Anatomic Pathology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Alex Lopez
- Department of Anatomic Pathology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Peter A S Johnstone
- Department of Radiation Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Silvia Regina Rogatto
- Department of Clinical Genetics, University Hospital of Southern Denmark-Vejle, Beriderbakken 4, 7100 Vejle, Denmark
| | - Philippe E Spiess
- Department of Genitourinary Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| | - Steven A Eschrich
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL 33612, USA
| |
|
25
|
Tang H, Tang Q, Zhang Q, Feng P. O-GlyThr: Prediction of human O-linked threonine glycosites using multi-feature fusion. Int J Biol Macromol 2023; 242:124761. [PMID: 37156312 DOI: 10.1016/j.ijbiomac.2023.124761] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 05/01/2023] [Accepted: 05/02/2023] [Indexed: 05/10/2023]
Abstract
O-linked glycosylation is one of the most complex post-translational modifications (PTM) of human proteins modulating various cellular metabolic and signaling pathways. Unlike N-glycosylation, the O-glycosylation has nonspecific sequence features and nonstable glycan core structure, which makes identification of O-glycosites more challenging either by experimental or computational methods. Biochemical experiments to identify O-glycosites in batches are technically and economically demanding. Therefore, development of computation-based methods is greatly warranted. This study constructed a prediction model based on feature fusion for O-glycosites linked to the threonine residues in Homo sapiens. In the training model, we collected and sorted out high-quality human protein data with O-linked threonine glycosites. Seven feature coding methods were fused to represent the sample sequence. By comparison of different algorithms, random forest was selected as the final classifier to construct the classification model. Through 5-fold cross-validation, the proposed model, namely O-GlyThr, performed satisfactorily on both training set (AUC: 0.9308) and independent validation dataset (AUC: 0.9323). Compared with previously published predictors, O-GlyThr achieved the highest ACC of 0.8475 on the independent test dataset. These results demonstrated the high competency of our predictor in identifying O-glycosites on threonine residues. Furthermore, a user-friendly webserver named O-GlyThr (http://cbcb.cdutcm.edu.cn/O-GlyThr/) was developed to assist glycobiologists in the research associated with glycosylation structure and function.
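The 5-fold cross-validation mentioned in this abstract can be sketched as a partition of sample indices into five folds, each used once as the held-out set. The sketch below is illustrative only: the abstract does not say whether the folds were stratified by class, so stratification, the function name `stratified_k_fold`, and the fixed seed are assumptions.

```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs with similar class balance per fold.

    Stratification is an assumption made for this sketch; the abstract only
    states that 5-fold cross-validation was used.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    # Deal the shuffled indices of each class round-robin across the folds.
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    for test_fold in range(k):
        test = sorted(folds[test_fold])
        train = sorted(
            i for f, fold in enumerate(folds) if f != test_fold for i in fold
        )
        yield train, test


if __name__ == "__main__":
    toy_labels = [1] * 10 + [0] * 20  # hypothetical glycosite / non-glycosite labels
    for train, test in stratified_k_fold(toy_labels):
        print(len(train), len(test))
```

A classifier would be fitted on each `train` split and scored on the matching `test` split, with the per-fold scores averaged to give the cross-validation estimate.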
Affiliation(s)
- Hua Tang
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China; School of Basic Medical Sciences, Southwest Medical University, Luzhou 646000, China
| | - Qiang Tang
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Qian Zhang
- School of Basic Medical Sciences, Southwest Medical University, Luzhou 646000, China
| | - Pengmian Feng
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China.
| |
|
26
|
Liang C, Chiang AWT, Lewis NE. GlycoMME, a Markov modeling platform for studying N-glycosylation biosynthesis from glycomics data. STAR Protoc 2023; 4:102244. [PMID: 37086409 PMCID: PMC10160804 DOI: 10.1016/j.xpro.2023.102244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Revised: 03/07/2023] [Accepted: 03/24/2023] [Indexed: 04/23/2023] Open
Abstract
Variations in N-glycosylation, which is crucial to glycoprotein functions, impact many diseases and the safety and efficacy of biotherapeutic drugs. Here, we present a protocol for using GlycoMME (Glycosylation Markov Model Evaluator) to study N-glycosylation biosynthesis from glycomics data. We describe steps for annotating glycomics data and quantifying perturbations to N-glycan biosynthesis with interpretable models. We then detail procedures to predict the impact of mutations in disease or potential glycoengineering strategies in drug development. For complete details on the use and execution of this protocol, please refer to Liang et al. (2020).
Affiliation(s)
- Chenguang Liang
- Department of Pediatrics, University of California, San Diego, La Jolla, San Diego, CA 92130, USA; Department of Bioengineering, University of California, San Diego, La Jolla, San Diego, CA 92130, USA
| | - Austin W T Chiang
- Department of Pediatrics, University of California, San Diego, La Jolla, San Diego, CA 92130, USA.
| | - Nathan E Lewis
- Department of Pediatrics, University of California, San Diego, La Jolla, San Diego, CA 92130, USA; Department of Bioengineering, University of California, San Diego, La Jolla, San Diego, CA 92130, USA.
| |
|
27
|
Kim A, Kim J, Park CS, Jin M, Kang M, Moon C, Kim M, Kim J, Yang S, Jang L, Jang JY, Kim HH. Peptide-N-glycosidase F or A treatment and procainamide-labeling for identification and quantification of N-glycans in two types of mammalian glycoproteins using UPLC and LC-MS/MS. J Chromatogr B Analyt Technol Biomed Life Sci 2023; 1214:123538. [PMID: 36493594 DOI: 10.1016/j.jchromb.2022.123538] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 08/19/2022] [Revised: 11/08/2022] [Accepted: 11/11/2022] [Indexed: 11/18/2022]
Abstract
BACKGROUND N-glycans in glycoproteins can affect physicochemical properties of proteins; however, some reported N-glycan structures are inconsistent depending on the type of glycoprotein or the preparation methods. OBJECTIVE To obtain consistent results for qualitative and quantitative analyses of N-glycans, N-glycans obtained by different preparation methods were compared for two types of mammalian glycoproteins. METHODS N-glycans are released by peptide-N-glycosidase F (PF) or A (PA) from two model mammalian glycoproteins, bovine fetuin (with three glycosylation sites) and human IgG (with a single glycosylation site), and labeled with a fluorescent tag [2-aminobenzamide (AB) or procainamide (ProA)]. The structure and quantity of each N-glycan were determined using UPLC and LC-MS/MS. RESULTS The 21 N-glycans in fetuin and another 21 N-glycans in IgG by either PF-ProA or PA-ProA were identified using LC-MS/MS. The N-glycans in fetuin (8-13 N-glycans were previously reported) and in IgG (19 N-glycans were previously reported), which could not be identified by using the widely used PF-AB, were all identified by using PF-ProA or PA-ProA. The quantities (%) of the N-glycans (>0.1 %) relative to the total amount of N-glycans (100 %) obtained by AB- and ProA-labeling using LC-MS/MS had a similar tendency. However, the absolute quantities (pmol) of the N-glycans estimated using UPLC and LC-MS/MS were more efficiently determined with ProA-labeling than with AB-labeling. Thus, PF-ProA or PA-ProA allows for more effective identification and quantification of N-glycans than PF-AB in glycoprotein, particularly bovine fetuin. This study is the first comparative analysis for the identification and relative and absolute quantification of N-glycans in glycoproteins with PF-ProA and PA-ProA using UPLC and LC-MS/MS.
Affiliation(s)
- Ahyeon Kim
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Jeongeun Kim
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Chi Soo Park
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Mijung Jin
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Minju Kang
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Chulmin Moon
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Mirae Kim
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Jieun Kim
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Subin Yang
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Leeseul Jang
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Ji Yeon Jang
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
| | - Ha Hyung Kim
- Biotherapeutics and Glycomics Laboratory, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea; Department of Global Innovative Drugs, Graduate School of Chung-Ang University, 84, Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea.
| |
|
28
|
Abstract
Artificial intelligence (AI) methods have been and are now being increasingly integrated in prediction software implemented in bioinformatics and its glycoscience branch known as glycoinformatics. AI techniques have evolved in the past decades, and their applications in glycoscience are not yet widespread. This limited use is partly explained by the peculiarities of glyco-data that are notoriously hard to produce and analyze. Nonetheless, as time goes on, the accumulation of glycomics, glycoproteomics, and glycan-binding data has reached a point where even the most recent deep learning methods can provide predictors with good performance. We discuss the historical development of the application of various AI methods in the broader field of glycoinformatics. A particular focus is placed on shining a light on challenges in glyco-data handling, contextualized by lessons learnt from related disciplines. Ending on the discussion of state-of-the-art deep learning approaches in glycoinformatics, we also envision the future of glycoinformatics, including developments that need to occur in order to truly unleash the capabilities of glycoscience in the systems biology era.
Affiliation(s)
- Daniel Bojar
- Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg 41390, Sweden
- Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg 41390, Sweden
| | - Frederique Lisacek
- Proteome Informatics Group, Swiss Institute of Bioinformatics, CH-1227 Geneva, Switzerland
- Computer Science Department & Section of Biology, University of Geneva, route de Drize 7, CH-1227, Geneva, Switzerland
| |
|
29
|
Abstract
Bats are recognized as important reservoirs of viruses deadly to other mammals, including humans. These infections are typically nonpathogenic in bats, raising questions about host response differences that might exist between bats and other mammals. Tetherin is a restriction factor which inhibits the release of a diverse range of viruses from host cells, including retroviruses, coronaviruses, filoviruses, and paramyxoviruses, some of which are deadly to humans and transmitted by bats. Here, we characterize the tetherin genes from 27 bat species, revealing that they have evolved under strong selective pressure, and that fruit bats and vesper bats express unique structural variants of the tetherin protein. Tetherin was widely and variably expressed across fruit bat tissue types and upregulated in spleen tissue when stimulated with Toll-like receptor agonists. The expression of two computationally predicted splice isoforms of fruit bat tetherin was verified. We identified an additional third unique splice isoform which includes a C-terminal region that is not homologous to known mammalian tetherin variants but was functionally capable of restricting the release of filoviral virus-like particles. We also report that vesper bats possess and express at least five tetherin genes, including structural variants, more than any other mammal reported to date. These findings support the hypothesis of differential antiviral gene evolution in bats relative to other mammals. IMPORTANCE Bats are an important host of various viruses which are deadly to humans and other mammals but do not cause outward signs of illness in bats. Furthering our understanding of the unique features of the immune system of bats will shed light on how they tolerate viral infections, potentially informing novel antiviral strategies in humans and other animals. This study examines the antiviral protein tetherin, which prevents viral particles from escaping their host cell. 
Analysis of tetherin from 27 bat species reveals that it is under strong evolutionary pressure, and we show that multiple bat species have evolved to possess more tetherin genes than other mammals, some of which encode structurally unique tetherins capable of activity against different viral particles. These data suggest that bat tetherin plays a potentially broad and important role in the management of viral infections in bats.
|
30
|
Avery C, Patterson J, Grear T, Frater T, Jacobs DJ. Protein Function Analysis through Machine Learning. Biomolecules 2022; 12:1246. [PMID: 36139085 PMCID: PMC9496392 DOI: 10.3390/biom12091246] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/16/2022] [Revised: 08/22/2022] [Accepted: 08/31/2022] [Indexed: 11/16/2022] Open
Abstract
Machine learning (ML) has been an important arsenal in computational biology used to elucidate protein function for decades. With the recent burgeoning of novel ML methods and applications, new ML approaches have been incorporated into many areas of computational biology dealing with protein function. We examine how ML has been integrated into a wide range of computational models to improve prediction accuracy and gain a better understanding of protein function. The applications discussed are protein structure prediction, protein engineering using sequence modifications to achieve stability and druggability characteristics, molecular docking in terms of protein-ligand binding, including allosteric effects, protein-protein interactions and protein-centric drug discovery. To quantify the mechanisms underlying protein function, a holistic approach that takes structure, flexibility, stability, and dynamics into account is required, as these aspects become inseparable through their interdependence. Another key component of protein function is conformational dynamics, which often manifest as protein kinetics. Computational methods that use ML to generate representative conformational ensembles and quantify differences in conformational ensembles important for function are included in this review. Future opportunities are highlighted for each of these topics.
Affiliation(s)
- Chris Avery
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - John Patterson
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - Tyler Grear
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Department of Physics and Optical Science, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - Theodore Frater
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - Donald J. Jacobs
- Department of Physics and Optical Science, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| |
|
31
|
Wang M, Li F, Wu H, Liu Q, Li S. PredPromoter-MF(2L): A Novel Approach of Promoter Prediction Based on Multi-source Feature Fusion and Deep Forest. Interdiscip Sci 2022; 14:697-711. [PMID: 35488998 DOI: 10.1007/s12539-022-00520-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/19/2021] [Revised: 04/05/2022] [Accepted: 04/05/2022] [Indexed: 12/12/2022]
Abstract
Promoters, short DNA sequences, play vital roles in initiating gene transcription. However, it remains a challenge to identify promoters in a high-throughput manner using conventional experimental techniques. To this end, several computational predictors based on machine learning models have been developed, but their performance is unsatisfactory. In this study, we proposed a novel two-layer predictor, called PredPromoter-MF(2L), based on multi-source feature fusion and ensemble learning. PredPromoter-MF(2L) was developed based on various deep features learned by a pre-trained deep learning network model and sequence-derived features. Feature selection based on XGBoost was applied to reduce the dimensionality of the fused features, and a cascade deep forest model was trained on the selected feature subset for promoter prediction. The results of both fivefold cross-validation and an independent test demonstrated that PredPromoter-MF(2L) outperformed state-of-the-art methods.
Affiliation(s)
- Miao Wang
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shanxi, China
| | - Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC, 3000, Australia
| | - Hao Wu
- School of Software, Shandong University, Jinan, 250100, Shandong, China
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shanxi, China.
| | - Shuqin Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shanxi, China.
| |
|
32
|
Puranik A, Dandekar P, Jain R. Exploring the potential of machine learning for more efficient development and production of biopharmaceuticals. Biotechnol Prog 2022; 38:e3291. [PMID: 35918873 DOI: 10.1002/btpr.3291] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 03/26/2022] [Revised: 06/20/2022] [Accepted: 07/31/2022] [Indexed: 11/10/2022]
Abstract
Principles of Industry 4.0 direct us to predict how pharmaceutical operations and regulations may exist with automation, digitization, artificial intelligence (AI), and real time data acquisition. Machine learning (ML), a sub-discipline of AI, involves the use of statistical tools to extract the desired information either through understanding the underlying patterns in the information or by development of mathematical relationships among the critical process parameters (CPPs) and critical quality attributes (CQAs) of biopharmaceuticals. ML is still in its infancy for directly supporting the quality-by-design based development and manufacturing of biopharmaceuticals. However, adoption of ML-based models in place of conventional multi-variate-data-analysis (MVDA) is increasing with the accumulation of large-scale data. This increase has been driven largely by the real-time monitoring of process variables and quality attributes of products through the implementation of process analytical technology in biopharmaceutical manufacturing. All aspects of healthcare, from drug design to product distribution, are complex and multidimensional. Thus, ML-based approaches are being applied to achieve sophistication, accuracy, flexibility and agility in all these areas. This review discusses the potential of ML for addressing the complex issues in diverse areas of biopharmaceutical development, such as biopharmaceutical design and assessment of early stage development, upstream and downstream process development, analysis, characterization and prediction of post translational modifications (PTMs), formulation and stability studies. Moreover, the challenges in acquiring, cleaning and structuring the bioprocess data, which is one of the major hurdles in the implementation of ML in the biopharma industry, have also been discussed. Regulatory perspectives on implementation of AI/ML in the biopharma sector have also been briefly discussed.
This article is a bird's eye view on the recent developments and applications of ML in overcoming the challenges for adopting "Industry - 4.0" in the biopharma industry.
Affiliation(s)
- Amita Puranik
- Department of Chemical Engineering, Institute of Chemical Technology, Matunga, Mumbai, India
- Prajakta Dandekar
- Department of Pharmaceutical Sciences and Technology, Institute of Chemical Technology, Matunga, Mumbai, India
- Ratnesh Jain
- Department of Chemical Engineering, Institute of Chemical Technology, Matunga, Mumbai, India
|
33
|
Ghasabi F, Hashempour A, Khodadad N, Bemani S, Keshani P, Shekiba MJ, Hasanshahi Z. First report of computational protein-ligand docking to evaluate susceptibility to HIV integrase inhibitors in HIV-infected Iranian patients. Biochem Biophys Rep 2022; 30:101254. [PMID: 35368742 PMCID: PMC8968007 DOI: 10.1016/j.bbrep.2022.101254] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/15/2022] [Revised: 03/16/2022] [Accepted: 03/17/2022] [Indexed: 12/04/2022]
Abstract
Background: Iran has recently included integrase (INT) inhibitors (INTIs) in the first-line treatment regimen for human immunodeficiency virus (HIV)-infected patients. However, there are no bioinformatics data elaborating the impact of resistance-associated mutations (RAMs) and naturally occurring polymorphisms (NOPs) on INTI treatment outcome in Iranian patients. Method: In this cross-sectional survey, 850 HIV-1-infected patients were enrolled; of these, 78 samples had successful sequencing results for the INT gene. Several analyses were performed, including docking screening, genotypic resistance, secondary/tertiary structures, post-translational modification (PTM), and immune epitopes. Result: The average docking energy (E value) of different samples with elvitegravir (EVG) and raltegravir (RAL) was higher than with other INTIs. Phylogenetic tree analysis and the Stanford HIV Subtyping program revealed HIV-1 CRF35-AD as the predominant subtype (94.9%) in our cases; online subtyping tools, however, identified A1 as the most frequent subtype. For the first time, CRF-01B and BF were identified as new subtypes in Iran. Decreased CD4 count was associated with several factors: poor or unstable adherence, naïve treatment, and drug user status. Conclusion: As the first bioinformatic report on HIV integrase from Iran, this study indicates that EVG and RAL are the optimal INTIs in first-line antiretroviral therapy (ART) in Iranian patients. Mutations in some conserved motifs and specific amino acids in INT-protein binding sites may disrupt INT-drug interaction and cause a significant loss in susceptibility to INTIs. Good adherence, treatment of naïve patients, and monitoring of injection drug users are fundamental factors in controlling HIV infection in Iran effectively.
Key Words
- Antiretroviral therapy, ART
- Behavioral Diseases Consultation Center, BDCC
- Bictegravir, BIC
- C-terminal domain, CTD
- CRF35-AD
- Cabotegravir, CBT
- Catalytic core domain, CCD
- Dolutegravir, DTG
- Drug resistance
- Elvitegravir, EVG
- Grand average hydropathy, GRAVY
- HIV
- Human immunodeficiency virus, HIV
- INT, Integrase
- INTIs, Integrase inhibitors (INTIs)
- Injecting drug users, IDUs
- Integrase
- Integrase inhibitors
- Molecular docking
- N-terminal domain, NTD
- Naturally occurring polymorphisms, NOPs
- Post-translational modification, PTM
- Raltegravir, RAL
- Resistance-associated mutations, RAMs
Affiliation(s)
- Farzane Ghasabi
- Shiraz HIV/AIDS Research Center, Institute of Health, Shiraz University of Medical Sciences, Shiraz, Iran
- Ava Hashempour
- Shiraz HIV/AIDS Research Center, Institute of Health, Shiraz University of Medical Sciences, Shiraz, Iran
- Nastaran Khodadad
- Shiraz HIV/AIDS Research Center, Institute of Health, Shiraz University of Medical Sciences, Shiraz, Iran
- Soudabeh Bemani
- Shiraz HIV/AIDS Research Center, Institute of Health, Shiraz University of Medical Sciences, Shiraz, Iran
- Parisa Keshani
- Shiraz HIV/AIDS Research Center, Institute of Health, Shiraz University of Medical Sciences, Shiraz, Iran
- Mohamad Javad Shekiba
- Shiraz HIV/AIDS Research Center, Institute of Health, Shiraz University of Medical Sciences, Shiraz, Iran
- Zahra Hasanshahi
- Shiraz HIV/AIDS Research Center, Institute of Health, Shiraz University of Medical Sciences, Shiraz, Iran
|
34
|
Puranik A, Saldanha M, Chirmule N, Dandekar P, Jain R. Advanced strategies in glycosylation prediction and control during biopharmaceutical development: Avenues toward Industry 4.0. Biotechnol Prog 2022; 38:e3283. [PMID: 35752935 DOI: 10.1002/btpr.3283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2022] [Revised: 05/31/2022] [Accepted: 06/17/2022] [Indexed: 11/09/2022]
Abstract
Glycosylation has been shown to define the safety and efficacy of biopharmaceuticals and is thus classified as a critical quality attribute. However, controlling glycan heterogeneity has always been a major challenge owing to the multivariate factors that govern the glycosylation process. Conventional approaches for controlling glycosylation, such as gene editing and metabolic control, have succeeded in obtaining desired glycan profiles in accordance with the Quality by Design paradigm. Nonetheless, the development of smart algorithms and omics-enabled complete cell characterization has made it possible to predict glycan profiles beforehand and manipulate process variables accordingly. This review thus discusses the various approaches available for control and prediction of glycosylation in biopharmaceuticals. Further, the futuristic goal of integrating such technologies is discussed in order to attain an automated and digitized continuous bioprocess for control of glycosylation. Given that control of a process as complex as glycosylation requires intensive monitoring and intervention, we examine the current technologies that enable automation. Finally, we discuss the challenges and the technological gap that currently limit incorporation of an automated process in routine bio-manufacturing, with a glimpse into the economic bearing.
Affiliation(s)
- Amita Puranik
- Department of Chemical Engineering, Institute of Chemical Technology, Matunga, Mumbai, India
- Marianne Saldanha
- Department of Chemical Engineering, Institute of Chemical Technology, Matunga, Mumbai, India
- Prajakta Dandekar
- Department of Pharmaceutical Sciences and Technology, Institute of Chemical Technology, Matunga, Mumbai, India
- Ratnesh Jain
- Department of Chemical Engineering, Institute of Chemical Technology, Matunga, Mumbai, India
|
35
|
An Effective Deep Learning-Based Architecture for Prediction of N7-Methylguanosine Sites in Health Systems. Electronics 2022. [DOI: 10.3390/electronics11121917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
N7-methylguanosine (m7G) is one of the most important epigenetic modifications found in rRNA, mRNA, and tRNA, and plays an important role in the regulation of gene expression. Owing to its significance, traditional laboratory-based techniques have been used for the identification of m7G sites. However, these approaches were found to be time-consuming and costly. To move on from these traditional approaches and predict m7G sites with high precision, artificial intelligence has been adopted. In this study, an intelligent computational model called N7-methylguanosine-Long short-term memory (m7G-LSTM) is introduced for the prediction of N7-methylguanosine sites. One-hot encoding and word2vec feature schemes are used to represent the biological sequences, while LSTM and CNN algorithms are employed for classification. The proposed “m7G-LSTM” model obtained an accuracy of 95.95%, a specificity of 95.94%, a sensitivity of 95.97%, and a Matthews correlation coefficient (MCC) of 0.919. The proposed m7G-LSTM model achieved significantly better outcomes than previous models on all evaluation parameters. The m7G-LSTM computational system aims to support the drug industry and help researchers in bioinformatics to improve the prediction of N7-methylguanosine sites.
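The one-hot feature scheme mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the RNA alphabet order `ACGU`, the conversion of `T` to `U`, and the all-zero row for unknown bases are assumptions made here for illustration.

```python
# Hypothetical illustration of one-hot encoding for an RNA sequence.
ALPHABET = "ACGU"  # assumed base order; the source does not specify one


def one_hot(sequence):
    """Encode each base as a 4-element 0/1 row; unknown bases become all zeros."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    # Treat DNA-style T as U so both notations map to the same column.
    for base in sequence.upper().replace("T", "U"):
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


matrix = one_hot("GAUN")
```

Each row of the resulting matrix could then be fed, position by position, to a sequence model such as an LSTM.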
|
36
|
Khalili E, Ramazi S, Ghanati F, Kouchaki S. Predicting protein phosphorylation sites in soybean using interpretable deep tabular learning network. Brief Bioinform 2022; 23:bbac015. [PMID: 35152280 DOI: 10.1093/bib/bbac015] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 09/28/2021] [Revised: 12/17/2021] [Accepted: 01/12/2022] [Indexed: 12/17/2023]
Abstract
Phosphorylation of proteins is one of the most significant post-translational modifications (PTMs) and plays a crucial role in plant functionality due to its impact on signaling, gene expression, enzyme kinetics, protein stability and interactions. Accurate prediction of plant phosphorylation sites (p-sites) is vital, as abnormal regulation of phosphorylation usually leads to plant diseases. However, current experimental methods for PTM prediction suffer from high computational cost and are error-prone. The present study develops machine learning-based prediction techniques, including a high-performance interpretable deep tabular learning network (TabNet), to improve the prediction of protein p-sites in soybean. Moreover, we use a hybrid feature set of sequence-based features, physicochemical properties and position-specific scoring matrices to predict serine (Ser/S), threonine (Thr/T) and tyrosine (Tyr/Y) p-sites in soybean for the first time. The experimentally verified p-sites data of soybean proteins are collected from the eukaryotic phosphorylation sites database and database post-translational modification. We then remove redundant positive and negative samples by dropping protein sequences with >40% similarity. The developed techniques achieve >70% accuracy. The results demonstrate that the TabNet model is the best-performing classifier using hybrid features with a window size of 13, yielding 78.96% sensitivity and 77.24% specificity. The results indicate that the TabNet method has advantages in terms of performance and interpretability. The proposed technique can automatically analyze the data without any measurement errors or human intervention. Furthermore, it can be used to predict putative protein p-sites in plants effectively. The collected dataset and source code are publicly deposited at https://github.com/Elham-khalili/Soybean-P-sites-Prediction.
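To make the window size of 13 concrete, here is a minimal sketch of how fixed-length windows centred on candidate S/T/Y residues might be cut from a protein sequence. It is an illustration only, not the authors' pipeline; the function name `extract_windows` and the padding character `X` at the termini are assumptions.

```python
# Hypothetical sketch: cut 13-residue windows centred on S/T/Y candidates.
def extract_windows(sequence, window_size=13, targets="STY", pad="X"):
    """Return (1-based position, residue, window) for every target residue."""
    half = window_size // 2
    # Pad both termini so residues near the ends still get full-length windows.
    padded = pad * half + sequence + pad * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # In the padded string the candidate sits at index i + half,
            # so the slice starting at i is centred on it.
            windows.append((i + 1, residue, padded[i:i + window_size]))
    return windows


for position, residue, window in extract_windows("MSAT"):
    print(position, residue, window)
```

With a window size of 13, each candidate residue sits at index 6 of its window, flanked by six residues (or padding) on each side.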
Affiliation(s)
- Elham Khalili
- Department of Plant Science, Faculty of Science, Tarbiat Modarres University, Tehran, Iran
- Shahin Ramazi
- Department of Biophysics, Faculty of Biological Science, Tarbiat Modares University, Tehran, Iran
- Faezeh Ghanati
- Department of Plant Science, Faculty of Science, Tarbiat Modarres University, Tehran, Iran
- Samaneh Kouchaki
- Department of Electrical and Electronic Engineering, Faculty of Engineering and Physical Sciences, Centre for Vision, Speech, and Signal Processing, University of Surrey, Guildford, UK
|
37
|
Wang X, Li F, Xu J, Rong J, Webb GI, Ge Z, Li J, Song J. ASPIRER: a new computational approach for identifying non-classical secreted proteins based on deep learning. Brief Bioinform 2022; 23:bbac031. [PMID: 35176756 PMCID: PMC8921646 DOI: 10.1093/bib/bbac031] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/20/2021] [Revised: 01/10/2022] [Accepted: 01/22/2022] [Indexed: 12/15/2022]
Abstract
Protein secretion has a pivotal role in many biological processes and is particularly important for intercellular communication, from the cytoplasm to the host or external environment. Gram-positive bacteria can secrete proteins through multiple secretion pathways. Among these, the non-classical secretion pathway has recently received increasing attention, but its exact mechanism remains unclear. Non-classical secreted proteins (NCSPs) are a class of secreted proteins lacking signal peptides and motifs. Several NCSP predictors have been proposed, most of which use the whole amino acid sequence of NCSPs to construct the model. However, the sequence length of different proteins varies greatly. In addition, not all regions of the protein are equally important, and some local regions are not relevant to secretion. The functional regions of the protein, particularly the N- and C-terminal regions, contain important determinants of secretion. In this study, we propose a new hybrid deep learning-based framework, referred to as ASPIRER, which improves the prediction of NCSPs from amino acid sequences. More specifically, it combines a whole sequence-based XGBoost model and an N-terminal sequence-based convolutional neural network model; 5-fold cross-validation and independent tests demonstrate that ASPIRER outperforms existing state-of-the-art approaches. The source code and curated datasets of ASPIRER are publicly available at https://github.com/yanwu20/ASPIRER/. ASPIRER is anticipated to be a useful tool for improved prediction of novel putative NCSPs from sequence information and for prioritization of candidate proteins for follow-up experimental validation.
Affiliation(s)
- Xiaoyu Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria, Australia
- Jing Xu
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Jia Rong
- Department of Data Science and AI, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Geoffrey I Webb
- Department of Data Science and AI, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Zongyuan Ge
- Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
- Jian Li
- Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Jiangning Song
- Department of Data Science and AI, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
|
38
|
Wei PJ, Pang ZZ, Jiang LJ, Tan D, Su Y, Zheng CH. Promoter Prediction in Nannochloropsis Based on Densely Connected Convolutional Neural Networks. Methods 2022; 204:38-46. [DOI: 10.1016/j.ymeth.2022.03.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Revised: 03/03/2022] [Accepted: 03/28/2022] [Indexed: 10/18/2022]
|
39
|
Structure of infective Getah virus at 2.8 Å resolution determined by cryo-electron microscopy. Cell Discov 2022; 8:12. [PMID: 35149682 PMCID: PMC8832435 DOI: 10.1038/s41421-022-00374-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 09/14/2021] [Accepted: 01/03/2022] [Indexed: 11/30/2022]
Abstract
Getah virus (GETV), a member of the genus alphavirus, is a mosquito-borne pathogen that can cause pyrexia and reproductive losses in animals. Although antibodies to GETV have been found in over 10% of healthy people, there are no reports of clinical symptoms associated with GETV. The biological and pathological properties of GETV are largely unknown and antiviral or vaccine treatments against GETV are still unavailable due to a lack of knowledge of the structure of the GETV virion. Here, we present the structure of infective GETV at a resolution of 2.8 Å with the atomic models of the capsid protein and the envelope glycoproteins E1 and E2. We have identified numerous glycosylation and S-acylation sites in E1 and E2. The surface-exposed glycans indicate a possible impact on viral immune evasion and host cell invasion. The S-acylation sites might be involved in stabilizing the transmembrane assembly of E1 and E2. In addition, a cholesterol and a phospholipid molecule are observed in a transmembrane hydrophobic pocket, together with two more cholesterols surrounding the pocket. The cholesterol and phospholipid stabilize the hydrophobic pocket in the viral envelope membrane. The structural information will assist structure-based antiviral and vaccine screening, design, and optimization.
|
40
|
Li F, Guo X, Xiang D, Pitt ME, Bainomugisa A, Coin LJ. Computational analysis and prediction of PE_PGRS proteins using machine learning. Comput Struct Biotechnol J 2022; 20:662-674. [PMID: 35140886 PMCID: PMC8804200 DOI: 10.1016/j.csbj.2022.01.019] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 10/21/2021] [Revised: 01/09/2022] [Accepted: 01/18/2022] [Indexed: 12/18/2022]
Abstract
Approximately 10% of the Mycobacterium tuberculosis genome consists of two families of genes that remain poorly characterised owing to their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Due to their high genetic variability and the complexity of their analysis, these genes are typically excluded from further research in genomic studies. There are currently limited online resources and homology-based computational tools that can identify and analyse PE_PGRS proteins. In addition, these tools are computationally intensive and time-consuming, and lack sensitivity. Therefore, computational methods that can rapidly and accurately identify PE_PGRS proteins are valuable for facilitating the functional elucidation of the PE_PGRS family proteins. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, to allow users to identify PE_PGRS proteins rapidly and accurately. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various sequence and physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than the alignment-based approaches BLASTP and PHMMER in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.
Affiliation(s)
- Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
- Xudong Guo
- School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China
- Dongxu Xiang
- Faculty of Engineering and Information Technology, The University of Melbourne, VIC 3000, Australia
- Miranda E. Pitt
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
- Lachlan J.M. Coin
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
|
41
|
Schultz CJ, Wu Y, Baumann U. A targeted bioinformatics approach identifies highly variable cell surface proteins that are unique to Glomeromycotina. Mycorrhiza 2022; 32:45-66. [PMID: 35031894 PMCID: PMC8786786 DOI: 10.1007/s00572-021-01066-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/07/2021] [Accepted: 12/24/2021] [Indexed: 06/14/2023]
Abstract
Diversity in arbuscular mycorrhizal fungi (AMF) contributes to biodiversity and resilience in natural environments and healthy agricultural systems. Functional complementarity exists among species of AMF in symbiosis with their plant hosts, but the molecular basis of this is not known. We hypothesise this is in part due to the difficulties that current sequence assembly methodologies have assembling sequences for intrinsically disordered proteins (IDPs) due to their low sequence complexity. IDPs are potential candidates for functional complementarity because they often exist as extended (non-globular) proteins providing additional amino acids for molecular interactions. Rhizophagus irregularis arabinogalactan-protein-like proteins (AGLs) are small secreted IDPs with no known orthologues in AMF or other fungi. We developed a targeted bioinformatics approach to identify highly variable AGLs/IDPs in RNA-sequence datasets. The approach includes a modified multiple k-mer assembly approach (Oases) to identify candidate sequences, followed by targeted sequence capture and assembly (mirabait-mira). All AMF species analysed, including the ancestral family Paraglomeraceae, have small families of proteins rich in disorder promoting amino acids such as proline and glycine, or glycine and asparagine. Glycine- and asparagine-rich proteins also were found in Geosiphon pyriformis (an obligate symbiont of a cyanobacterium), from the same subphylum (Glomeromycotina) as AMF. The sequence diversity of AGLs likely translates to functional diversity, based on predicted physical properties of tandem repeats (elastic, amyloid, or interchangeable) and their broad pI ranges. We envisage that AGLs/IDPs could contribute to functional complementarity in AMF through processes such as self-recognition, retention of nutrients, soil stability, and water movement.
Affiliation(s)
- Carolyn J Schultz
- School of Agriculture, Food, and Wine, Waite Research Institute, University of Adelaide, Adelaide, SA, Australia
- Yue Wu
- School of Agriculture, Food, and Wine, Waite Research Institute, University of Adelaide, Adelaide, SA, Australia
- Ute Baumann
- School of Agriculture, Food, and Wine, Waite Research Institute, University of Adelaide, Adelaide, SA, Australia
|
42
|
Taherzadeh G, Campbell M, Zhou Y. Computational Prediction of N- and O-Linked Glycosylation Sites for Human and Mouse Proteins. Methods Mol Biol 2022; 2499:177-186. [PMID: 35696081 DOI: 10.1007/978-1-0716-2317-6_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Protein glycosylation is one of the most complex posttranslational modifications (PTMs) and plays a fundamental role in protein function. Identification and annotation of glycosylation sites using experimental approaches are challenging and time-consuming. Hence, there is a demand for fast and efficient computational methods to address this problem. Here, we present the SPRINT-Gly framework containing the largest dataset and a prediction model of glycosylation sites for a given protein sequence. In this framework, we construct a large dataset containing N- and O-linked glycosylation sites of human and mouse proteins, collected from different sources. We then introduce the SPRINT-Gly method to predict putative N- and O-linked sites. SPRINT-Gly is a machine learning-based approach consisting of a number of predictive models trained separately for glycosylation sites in human and mouse proteins. The method is built by incorporating sequence-based, predicted structural, and physicochemical information of the neighboring residues of each N- and O-linked glycosylation site and by training a deep neural network and a support vector machine as classifiers. SPRINT-Gly outperformed other existing methods by achieving 18% and 50% higher Matthews correlation coefficient for N- and O-linked glycosylation site prediction, respectively. SPRINT-Gly is publicly available as an online and stand-alone predictor at https://sparks-lab.org/server/sprint-gly/.
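Since this abstract compares methods by Matthews correlation coefficient (MCC), the following minimal sketch shows how MCC, sensitivity and specificity are computed from confusion-matrix counts. The function names and the convention of returning 0.0 when the MCC denominator is zero are choices made here for illustration, not taken from the source.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true sites that are predicted as sites."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-sites that are predicted as non-sites."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the matrix is empty.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

MCC ranges from -1 to 1 and uses all four counts, which is why it is commonly reported for imbalanced site-prediction datasets.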
Affiliation(s)
- Ghazaleh Taherzadeh
- Department of Mathematics and Computer Science, Wilkes University, Wilkes-Barre, PA, USA
- Matthew Campbell
- Institute for Glycomics, Griffith University, Southport, QLD, Australia
- Yaoqi Zhou
- Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, China
|
43
|
Aoki-Kinoshita KF. Functions of Glycosylation and Related Web Resources for Its Prediction. Methods Mol Biol 2022; 2499:135-144. [PMID: 35696078 DOI: 10.1007/978-1-0716-2317-6_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Glycosylation involves the attachment of carbohydrate sugar chains, or glycans, onto an amino acid residue of a protein. These glycans are often branched structures and serve to modulate the function of proteins. Glycans are synthesized through a complex process of enzymatic reactions that occur in the Golgi apparatus in mammalian systems. Because there is currently no sequencer for glycans, technologies such as mass spectrometry are used to characterize the glycans in a biological sample to ascertain its glycome. This is a tedious process that requires high levels of expertise and equipment. Thus, the enzymes that work on glycans, called glycogenes or glycoenzymes, have been studied to better understand glycan function. With the development of glycan-related databases and a glycan repository, bioinformatics approaches have attempted to predict the glycosylation pathway and the glycosylation sites on proteins. This chapter introduces these methods and related Web resources for understanding glycan function.
|
44
|
Pakhrin SC, Aoki-Kinoshita KF, Caragea D, KC DB. DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction. Molecules 2021; 26:molecules26237314. [PMID: 34885895 PMCID: PMC8658957 DOI: 10.3390/molecules26237314] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 11/01/2021] [Revised: 11/22/2021] [Accepted: 11/26/2021] [Indexed: 12/21/2022]
Abstract
Protein N-linked glycosylation is a post-translational modification that plays an important role in a myriad of biological processes. Computational prediction approaches serve as complementary methods for the characterization of glycosylation sites. Most of the existing predictors for N-linked glycosylation utilize the information that the glycosylation site occurs at the N-X-[S/T] sequon, where X is any amino acid except proline. Not all N-X-[S/T] sequons are glycosylated; thus the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation. In that regard, computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is an important problem. Here, we report DeepNGlyPred, a deep learning-based approach that encodes the positive and negative sequences in the human proteome dataset (extracted from N-GlycositeAtlas) using sequence-based features (gapped-dipeptide), predicted structural features, and evolutionary information. DeepNGlyPred produces SN, SP, MCC, and ACC of 88.62%, 73.92%, 0.60, and 79.41%, respectively, on the N-GlyDE independent test set, which is better than the compared approaches. These results demonstrate that DeepNGlyPred is a robust computational technique for predicting N-linked glycosylation sites confined to the N-X-[S/T] sequon. DeepNGlyPred will be a useful resource for the glycobiology community.
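The N-X-[S/T] sequon rule described in this abstract (X being any residue except proline) can be expressed as a short pattern scan. The sketch below is an illustration rather than DeepNGlyPred's code; the function name `find_sequons` and the example sequence are made up. A lookahead is used so that overlapping sequons are all reported.

```python
import re

# N followed by any residue other than P, then S or T.
# The lookahead keeps the match at the N only, so overlapping sequons are found.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of the asparagine in each N-X-[S/T] sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Hypothetical sequence: N-G-T, N-N-S and N-S-T match; N-P-S is excluded.
candidates = find_sequons("MNGTANPSNNST")
```

As the abstract notes, a sequon is necessary but not sufficient, so such a scan only yields candidate sites for a classifier to score.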
Affiliation(s)
- Subash C. Pakhrin
- School of Computing, Wichita State University, 1845 Fairmount St., Wichita, KS 67260, USA
- Doina Caragea
- Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA
- Dukka B. KC
- Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA
- Correspondence: ; Tel.: +1-906-487-1657
|
45
|
Zhu Y, Yin S, Zheng J, Shi Y, Jia C. O-glycosylation site prediction for Homo sapiens by combining properties and sequence features with support vector machine. J Bioinform Comput Biol 2021; 20:2150029. [PMID: 34806952 DOI: 10.1142/s0219720021500293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
O-glycosylation is a protein posttranslational modification important in regulating almost all cells. It is related to a large number of physiological and pathological phenomena. Recognizing O-glycosylation sites is the key to further investigating the molecular mechanism of protein posttranslational modification. This study aimed to collect a reliable dataset on Homo sapiens and develop an O-glycosylation predictor for Homo sapiens, named Captor, through multiple features. A random undersampling method and a synthetic minority oversampling technique were employed to deal with imbalanced data. In addition, the Kruskal-Wallis (K-W) test was adopted to optimize feature vectors and improve the performance of the model. A support vector machine, due to its optimal performance, was used to train and optimize the final prediction model after a comprehensive comparison of various classifiers in traditional machine learning methods and deep learning. On the independent test set, Captor outperformed the existing O-glycosylation tool, suggesting that Captor could provide more instructive guidance for further experimental research on O-glycosylation. The source code and datasets are available at https://github.com/YanZhu06/Captor/.
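Random undersampling, one of the two imbalance-handling strategies named in this abstract, can be sketched in a few lines. This is a generic illustration, not Captor's implementation; the function name `random_undersample`, the fixed seed, and the toy data are assumptions.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class is as small as the rarest class."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    smallest = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Sample without replacement; the minority class is kept whole.
        for sample in rng.sample(items, smallest):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


# Toy data: 3 positive and 7 negative samples.
balanced = random_undersample(list(range(10)), [1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
```

Undersampling discards majority-class data, which is the trade-off that oversampling methods such as SMOTE avoid by synthesizing minority samples instead.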
Affiliation(s)
- Yan Zhu
- School of Science, Dalian Maritime University, Dalian 116026, P. R. China
- Shuwan Yin
- School of Science, Dalian Maritime University, Dalian 116026, P. R. China
- Jia Zheng
- School of Science, Dalian Maritime University, Dalian 116026, P. R. China
- Yixia Shi
- School of Mathematics and Statistics, Lingnan Normal University, Zhanjiang 524048, P. R. China
- Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, P. R. China
|
46
|
Li F, Dong S, Leier A, Han M, Guo X, Xu J, Wang X, Pan S, Jia C, Zhang Y, Webb GI, Coin LJM, Li C, Song J. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform 2021; 23:6415313. [PMID: 34729589 DOI: 10.1093/bib/bbab461] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 08/30/2021] [Revised: 09/27/2021] [Accepted: 10/07/2021] [Indexed: 12/14/2022]
Abstract
Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be mislabeled due to the limited sensitivity of the experimental equipment. The positive-unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive and negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning applications in bioinformatics that address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work will serve as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and for further developing next-generation PU learning frameworks for critical biological applications.
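The abstract does not single out any one PU algorithm. As an illustration only, the sketch below shows the well-known correction of Elkan and Noto, which assumes labeled positives are selected completely at random from all positives. Under that assumption, a classifier trained to separate labeled from unlabeled samples outputs g(x) ≈ p(s=1|x), and the true posterior is p(y=1|x) = g(x)/c, where c = p(s=1|y=1) is estimated as the mean of g over held-out labeled positives. The function names are hypothetical, and the scores are assumed to come from any probabilistic classifier.

```python
def estimate_label_frequency(scores_on_labeled_positives):
    """Estimate c = p(s=1 | y=1) as the mean score on held-out labeled positives."""
    return sum(scores_on_labeled_positives) / len(scores_on_labeled_positives)


def corrected_posterior(score, c):
    """Convert g(x) = p(s=1 | x) into an estimate of p(y=1 | x), capped at 1."""
    return min(1.0, score / c)


# Toy usage with made-up classifier scores (illustrative values only).
c_hat = estimate_label_frequency([0.4, 0.6, 0.5])
unlabeled_scores = [0.25, 0.8, 0.05]
posteriors = [corrected_posterior(s, c_hat) for s in unlabeled_scores]
```

The cap at 1.0 is a practical guard: an estimated c can be smaller than an individual score, which would otherwise yield a probability above one.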
Affiliation(s)
- Fuyi Li
  - Monash University, Australia
- André Leier
  - Department of Genetics, UAB School of Medicine, USA
- Meiya Han
  - Department of Biochemistry and Molecular Biology, Monash University, Australia
- Jing Xu
  - Computer Science and Technology, Nankai University, China
- Xiaoyu Wang
  - Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia
- Shirui Pan
  - University of Technology Sydney (UTS), Ultimo, NSW, Australia
- Cangzhi Jia
  - College of Science, Dalian Maritime University, China
- Yang Zhang
  - Northwestern Polytechnical University, China
- Geoffrey I Webb
  - Faculty of Information Technology, Monash University, Australia
- Lachlan J M Coin
  - Department of Clinical Pathology, University of Melbourne, Australia
- Chen Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Australia
- Jiangning Song
  - Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia
|
47
|
Fan Y, Wang W. Using multi-layer perceptron to identify origins of replication in eukaryotes via informative features. BMC Bioinformatics 2021; 22:516. [PMID: 34688247 PMCID: PMC8542328 DOI: 10.1186/s12859-021-04431-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2021] [Accepted: 10/04/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND The origin is the starting site of DNA replication, an extremely vital part of the informational inheritance between parents and children. More importantly, accurately identifying the origin of replication has great application value in the diagnosis and treatment of diseases related to genetic information errors, while the traditional biological experimental methods are time-consuming and laborious. RESULTS We carried out research on the origin of replication in a variety of eukaryotes and proposed a unique prediction method for each species. Throughout the experiment, we collected data from 7 species, including Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, Kluyveromyces lactis, Pichia pastoris and Schizosaccharomyces pombe. In addition to the commonly used sequence feature extraction methods PseKNC-II and Base-content, we designed a feature extraction method based on TF-IDF. A two-step method was then used for feature selection. After comparing a variety of traditional machine learning classification models, the multi-layer perceptron was employed as the classification algorithm. The data and code used in the experiments are available at https://github.com/Sarahyouzi/EukOriginPredict. CONCLUSIONS After 100 repetitions of fivefold cross-validation, the prediction accuracies on the training sets of the seven species reached 92.60%, 90.80%, 91.22%, 96.15%, 96.72%, 99.86% and 96.72%, respectively. This indicates that the methods we designed achieve superior performance compared with other methods. In addition, our experiments reveal that the models of multiple species could predict each other with high accuracy, and the results of STREME show that they share a certain common motif.
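The abstract mentions a TF-IDF-based feature extraction method without describing it. The following is a minimal sketch of one common way to apply TF-IDF to DNA, assuming each sequence is treated as a document and its overlapping k-mers as words. The choice of k, the unsmoothed IDF, and all function names are assumptions, not the authors' method.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_features(sequences, k=3):
    """Return (vocabulary, matrix) of TF-IDF weights, one row per sequence.

    tf  = count of the k-mer in the sequence / total k-mers in the sequence
    idf = log(number of sequences / number of sequences containing the k-mer)
    """
    docs = [Counter(kmers(s.upper(), k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: in how many sequences each k-mer occurs.
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    matrix = []
    for doc in docs:
        total = sum(doc.values())
        matrix.append(
            [(doc[t] / total) * idf[t] if total else 0.0 for t in vocab]
        )
    return vocab, matrix
```

With this unsmoothed IDF, a k-mer that occurs in every sequence receives weight zero, so only k-mers that distinguish sequences contribute to the feature vector.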
Affiliation(s)
- Yongxian Fan
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
- Wanru Wang
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
|
48
|
Virág D, Kremmer T, Lőrincz K, Kiss N, Jobbágy A, Bozsányi S, Gulyás L, Wikonkál N, Schlosser G, Borbély A, Huba Z, Dalmadi Kiss B, Antal I, Ludányi K. Altered Glycosylation of Human Alpha-1-Acid Glycoprotein as a Biomarker for Malignant Melanoma. Molecules 2021; 26:6003. [PMID: 34641547 PMCID: PMC8513036 DOI: 10.3390/molecules26196003] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 09/15/2021] [Accepted: 10/01/2021] [Indexed: 11/16/2022]
Abstract
A high-resolution HILIC-MS/MS method was developed to analyze anthranilic acid derivatives of N-glycans released from human serum alpha-1-acid glycoprotein (AGP). The method was applied to samples obtained from 18 patients suffering from high-risk malignant melanoma as well as 19 healthy individuals. It enabled the identification of 102 glycan isomers separating isomers that differ only in sialic acid linkage (α-2,3, α-2,6) or in fucose positions (core, antenna). Comparative assessment of the samples revealed that upregulation of certain fucosylated glycans and downregulation of their nonfucosylated counterparts occurred in cancer patients. An increased ratio of isomers with more α-2,6-linked sialic acids was also observed. Linear discriminant analysis (LDA) combining 10 variables with the highest discriminatory power was employed to categorize the samples based on their glycosylation pattern. The performance of the method was tested by cross-validation, resulting in an overall classification success rate of 96.7%. The approach presented here is significantly superior to serological marker S100B protein in terms of sensitivity and negative predictive power in the population studied. Therefore, it may effectively support the diagnosis of malignant melanoma as a biomarker.
Affiliation(s)
- Dávid Virág
  - Department of Pharmaceutics, Semmelweis University, Hőgyes Endre utca 7., H-1092 Budapest, Hungary; (D.V.); (T.K.); (Z.H.); (B.D.K.); (I.A.)
- Tibor Kremmer
  - Department of Pharmaceutics, Semmelweis University, Hőgyes Endre utca 7., H-1092 Budapest, Hungary; (D.V.); (T.K.); (Z.H.); (B.D.K.); (I.A.)
- Kende Lőrincz
  - Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Mária utca. 41., H-1085 Budapest, Hungary; (K.L.); (N.K.); (A.J.); (S.B.); (L.G.); (N.W.)
- Norbert Kiss
  - Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Mária utca. 41., H-1085 Budapest, Hungary; (K.L.); (N.K.); (A.J.); (S.B.); (L.G.); (N.W.)
- Antal Jobbágy
  - Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Mária utca. 41., H-1085 Budapest, Hungary; (K.L.); (N.K.); (A.J.); (S.B.); (L.G.); (N.W.)
- Szabolcs Bozsányi
  - Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Mária utca. 41., H-1085 Budapest, Hungary; (K.L.); (N.K.); (A.J.); (S.B.); (L.G.); (N.W.)
- Lili Gulyás
  - Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Mária utca. 41., H-1085 Budapest, Hungary; (K.L.); (N.K.); (A.J.); (S.B.); (L.G.); (N.W.)
- Norbert Wikonkál
  - Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Mária utca. 41., H-1085 Budapest, Hungary; (K.L.); (N.K.); (A.J.); (S.B.); (L.G.); (N.W.)
- Gitta Schlosser
  - MTA-ELTE Lendület Ion Mobility Mass Spectrometry Research Group, Institute of Chemistry, Faculty of Science, ELTE Eötvös Loránd University, Pázmány Péter sétány 1/A, H-1117 Budapest, Hungary; (G.S.); (A.B.)
- Adina Borbély
  - MTA-ELTE Lendület Ion Mobility Mass Spectrometry Research Group, Institute of Chemistry, Faculty of Science, ELTE Eötvös Loránd University, Pázmány Péter sétány 1/A, H-1117 Budapest, Hungary; (G.S.); (A.B.)
- Zsófia Huba
  - Department of Pharmaceutics, Semmelweis University, Hőgyes Endre utca 7., H-1092 Budapest, Hungary; (D.V.); (T.K.); (Z.H.); (B.D.K.); (I.A.)
- Borbála Dalmadi Kiss
  - Department of Pharmaceutics, Semmelweis University, Hőgyes Endre utca 7., H-1092 Budapest, Hungary; (D.V.); (T.K.); (Z.H.); (B.D.K.); (I.A.)
- István Antal
  - Department of Pharmaceutics, Semmelweis University, Hőgyes Endre utca 7., H-1092 Budapest, Hungary; (D.V.); (T.K.); (Z.H.); (B.D.K.); (I.A.)
- Krisztina Ludányi
  - Department of Pharmaceutics, Semmelweis University, Hőgyes Endre utca 7., H-1092 Budapest, Hungary; (D.V.); (T.K.); (Z.H.); (B.D.K.); (I.A.)
|
49
|
Zhang S, Zhao L, Zheng CH, Xia J. A feature-based approach to predict hot spots in protein-DNA binding interfaces. Brief Bioinform 2021; 21:1038-1046. [PMID: 30957840 DOI: 10.1093/bib/bbz037] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 01/14/2019] [Revised: 02/20/2019] [Accepted: 03/07/2019] [Indexed: 12/21/2022]
Abstract
DNA-binding hot spot residues of proteins are dominant and fundamental interface residues that contribute most of the binding free energy of protein-DNA interfaces. As experimental methods for identifying hot spots are expensive and time-consuming, computational approaches are urgently required for predicting hot spots on a large scale. In this work, we systematically assessed 114 features derived from protein sequence, structure, network and solvent accessibility information, and their combinations, along with various feature selection strategies for hot spot prediction. We then trained and compared four commonly used machine learning models, namely, support vector machine (SVM), random forest, Naïve Bayes and k-nearest neighbor, for the identification of hot spots using 10-fold cross-validation and the independent test set. Our results show that (1) features based on the solvent accessible surface area have a significant effect on hot spot prediction; (2) different but complementary features generally enhance the prediction performance; and (3) SVM outperforms other machine learning methods on both training and independent test sets. In an effort to improve predictive performance, we developed a feature-based method, namely, PrPDH (Prediction of Protein-DNA binding Hot spots), for the prediction of hot spots in protein-DNA binding interfaces using SVM based on the selected 10 optimal features. Comparative results on benchmark data sets indicate that our predictor generally achieves better performance in predicting hot spots than the state-of-the-art predictors. A user-friendly web server for PrPDH has been established and is freely available at http://bioinfo.ahu.edu.cn:8080/PrPDH.
Affiliation(s)
- Sijia Zhang
  - Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
- Le Zhao
  - Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
- Chun-Hou Zheng
  - Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
- Junfeng Xia
  - Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
|
50
|
Akmal MA, Hussain W, Rasool N, Khan YD, Khan SA, Chou KC. Using CHOU'S 5-Steps Rule to Predict O-Linked Serine Glycosylation Sites by Blending Position Relative Features and Statistical Moment. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2045-2056. [PMID: 31985438 DOI: 10.1109/tcbb.2020.2968441] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 06/10/2023]
Abstract
Glycosylation of proteins in eukaryotic cells is an important and complicated post-translational modification due to its pivotal role in, and association with, crucial physiological functions of most proteins. Identification of glycosylation sites in a polypeptide chain is not an easy task due to multiple impediments. Analytical identification of these sites is expensive and laborious. There is a dire need to develop a reliable computational method for precise determination of such sites, which can help researchers save time and effort. Herein, we propose a novel predictor, namely iGlycoS-PseAAC, by integrating Chou's Pseudo Amino Acid Composition (PseAAC) and relative/absolute position-based features. The self-consistency results show that the accuracy of the model on the benchmark dataset for prediction of O-linked glycosylation at serine sites is 98.8 percent. The overall accuracy of the predictor achieved through 10-fold cross-validation by combining the positive and negative results is 97.2 percent. The overall accuracy achieved through the jackknife test is 96.195 percent, obtained by aggregating all the prediction results. Thus, the proposed predictor can help in predicting O-linked glycosylated serine sites in an efficient and accurate way. The overall results show that the accuracy of iGlycoS-PseAAC is higher than that of the existing tools.
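The jackknife test reported above is, in this literature, leave-one-out cross-validation: each sample is held out in turn, the model is trained on the rest, and the held-out predictions are aggregated into one accuracy. A minimal sketch follows. The `fit` and `predict` callables and the toy nearest-class-mean classifier are hypothetical placeholders, not the iGlycoS-PseAAC model.

```python
def jackknife_accuracy(samples, labels, fit, predict):
    """Leave-one-out accuracy: train on all but one sample, test on that one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def fit_nearest_mean(xs, ys):
    """Toy placeholder model: the mean of a single numeric feature per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_nearest_mean(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda y: abs(means[y] - x))


# Toy usage on made-up one-dimensional data (illustrative values only).
toy_x = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
toy_y = [0, 0, 0, 1, 1, 1]
toy_accuracy = jackknife_accuracy(toy_x, toy_y, fit_nearest_mean, predict_nearest_mean)
```

Unlike the self-consistency figure, where the model is evaluated on the same data it was trained on, every jackknife prediction is made for a sample the model has not seen. This is why the jackknife accuracy in the abstract is lower than the self-consistency accuracy.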
|