1
Li J, Zhao Z, Tai C, Sun T, Tan L, Li X, He W, Li H, Zhang J. VirusImmu: a novel ensemble machine learning approach for viral immunogenicity prediction. Brief Funct Genomics 2025; 24:elaf008. [PMID: 40323648 PMCID: PMC12051847 DOI: 10.1093/bfgp/elaf008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Revised: 03/01/2025] [Accepted: 04/07/2025] [Indexed: 05/07/2025] Open
Abstract
Viral threats raise concerns about sustained epidemic transmission, making vaccine development particularly important. In the prolonged and costly process of vaccine development, the most important initial step is to identify protective immunogens. Machine learning (ML) approaches are productive in analyzing big data such as microbial proteomes and can markedly reduce the cost of experimental work in developing novel vaccine candidates. We intensively evaluated the B cell epitope immunogenicity prediction power of eight commonly used ML methods by random-sampling cross-validation on a large dataset of known viral immunogens and non-immunogens that we manually curated from the public domain. Three of them (Extreme Gradient Boosting, K Nearest Neighbours, and Random Forest) showed the strongest predictive power. We then proposed a novel soft-voting-based ensemble approach (VirusImmu), which demonstrated a powerful and stable capability for viral immunogenicity prediction across the test set and external test set, irrespective of protein sequence length. VirusImmu was successfully applied to help identify linear B cell epitopes against African Swine Fever Virus, as confirmed by indirect ELISA in vitro. In short, VirusImmu exhibited tremendous potential in predicting the immunogenicity of viral protein segments. It is freely accessible at https://github.com/zhangjbig/VirusImmu.
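As a rough illustration of the soft-voting idea named in this abstract, the sketch below averages per-class probabilities from several classifiers and picks the class with the highest mean. It is not the VirusImmu implementation: the function name `soft_vote`, the equal weighting of models, and all probability values are assumptions made for this example.

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Average class probabilities across models; return the argmax per sample.

    prob_sets[m][i][c] is model m's probability that sample i belongs to class c.
    Equal model weights are an assumption of this sketch.
    """
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    labels = []
    for i in range(n_samples):
        n_classes = len(prob_sets[0][i])
        avg = [sum(m[i][c] for m in prob_sets) / n_models for c in range(n_classes)]
        labels.append(max(range(n_classes), key=avg.__getitem__))
    return labels


# Hypothetical [P(non-immunogen), P(immunogen)] outputs from three models.
xgb = [[0.2, 0.8], [0.6, 0.4]]
knn = [[0.4, 0.6], [0.7, 0.3]]
rf = [[0.6, 0.4], [0.4, 0.6]]

print(soft_vote([xgb, knn, rf]))  # [1, 0]
```

In the second sample one model favours class 1, yet the averaged probability still favours class 0; soft voting weighs how confident each model is rather than only counting votes.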
Affiliation(s)
- Jing Li
  - Key Laboratory for Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Centre for Biomedical Engineering, School of Engineering Medicine, School of Biological Science and Medical Engineering, Beihang University, 37 Xueyuan Road, Haidian District, Beijing 100083, P. R. China
- Zhongpeng Zhao
  - Department of Immunology, School of Basic Medical Sciences, Anhui Medical University, No 81 Meishan Road, Shushan District, Hefei 230032, China
- ChengZheng Tai
  - Key Laboratory for Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Centre for Biomedical Engineering, School of Engineering Medicine, School of Biological Science and Medical Engineering, Beihang University, 37 Xueyuan Road, Haidian District, Beijing 100083, P. R. China
- Ting Sun
  - Key Laboratory for Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Centre for Biomedical Engineering, School of Engineering Medicine, School of Biological Science and Medical Engineering, Beihang University, 37 Xueyuan Road, Haidian District, Beijing 100083, P. R. China
- Lingyun Tan
  - Department of Immunology, School of Basic Medical Sciences, Anhui Medical University, No 81 Meishan Road, Shushan District, Hefei 230032, China
- Xinyu Li
  - Department of Immunology, School of Basic Medical Sciences, Anhui Medical University, No 81 Meishan Road, Shushan District, Hefei 230032, China
- Wei He
  - Department of Immunology, School of Basic Medical Sciences, Anhui Medical University, No 81 Meishan Road, Shushan District, Hefei 230032, China
- HongJun Li
  - Department of Radiology, Beijing YouAn Hospital, Capital Medical University, No. 8 Xitoutiao, You'anmen wai, Fengtai District, Beijing 100069, China
- Jing Zhang
  - Key Laboratory for Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Centre for Biomedical Engineering, School of Engineering Medicine, School of Biological Science and Medical Engineering, Beihang University, 37 Xueyuan Road, Haidian District, Beijing 100083, P. R. China
2
Zhu L, Chen H, Yang S. LncSL: A Novel Stacked Ensemble Computing Tool for Subcellular Localization of lncRNA by Amino Acid-Enhanced Features and Two-Stage Automated Selection Strategy. Int J Mol Sci 2024; 25:13734. [PMID: 39769496 PMCID: PMC11678684 DOI: 10.3390/ijms252413734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2024] [Revised: 12/17/2024] [Accepted: 12/19/2024] [Indexed: 01/11/2025] Open
Abstract
Long non-coding RNA (lncRNA) is a non-coding RNA longer than 200 nucleotides that is crucial for functions such as cell cycle regulation and gene transcription. Accurate localization prediction from sequence information is vital for understanding the biological roles of lncRNA. Computational methods offer an effective alternative to traditional experimental methods for annotating lncRNA subcellular positions. Existing machine learning-based methods are limited and often overlook regions with coding potential that affect the function of lncRNA. Therefore, we propose a new model called LncSL. For feature encoding, both lncRNA sequences and amino acid sequences from open reading frames (ORFs) are employed. We selected the most suitable features with CatBoost and integrated them into a new feature set. A voting process with seven feature selection algorithms then identified the most contributive features for training our final stacked model. In addition, an automatic model selection strategy was constructed to find a better-performing meta-model for assembling LncSL. This study focuses specifically on predicting the subcellular localization of lncRNA in the nucleus and cytoplasm. On two benchmark datasets, S1 and S2, LncSL outperformed existing methods by 6.3% to 12.3% in the Matthews correlation coefficient on a balanced test dataset. On an unbalanced independent test dataset sourced from S1, LncSL improved by 4.7% to 18.6% in the Matthews correlation coefficient, which further demonstrates that LncSL is superior to the other compared methods. In all, this study presents an effective method for predicting lncRNA subcellular localization by enhancing sequence information, which is often overlooked by traditional methods, and by addressing the problem of selecting a contributive meta-model, which can offer new insights for other bioinformatics problems.
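For readers unfamiliar with the Matthews correlation coefficient (MCC) reported above, here is a minimal sketch of the standard binary-classification formula, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name `mcc`, the convention of returning 0.0 when the denominator is zero, and the confusion-matrix counts are assumptions for illustration; none of the numbers come from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention, assumed here).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical unbalanced test set: 90 positives, 10 negatives,
# and a classifier that labels every sample as positive.
accuracy = (90 + 0) / 100
print(accuracy)            # 0.9
print(mcc(90, 0, 10, 0))   # 0.0
```

The example shows why MCC is a common choice on unbalanced data: the all-positive classifier reaches 90% accuracy but scores an MCC of 0, the same as an uninformative predictor.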
Affiliation(s)
- Sen Yang
  - School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China; (L.Z.); (H.C.)
3
Liu X, Zhu B, Dai XW, Xu ZA, Li R, Qian Y, Lu YP, Zhang W, Liu Y, Zheng J. GBDT_KgluSite: An improved computational prediction model for lysine glutarylation sites based on feature fusion and GBDT classifier. BMC Genomics 2023; 24:765. [PMID: 38082413 PMCID: PMC10712101 DOI: 10.1186/s12864-023-09834-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND: Lysine glutarylation (Kglu) is one of the most important post-translational modifications (PTMs) and plays significant roles in various cellular functions, including metabolism, mitochondrial processes, and translation. Accurate identification of Kglu sites is therefore important for elucidating protein molecular function. Because traditional biological experiments are time-consuming and expensive, computational Kglu site prediction is gaining more and more attention.
RESULTS: In this paper, we proposed GBDT_KgluSite, a novel Kglu site prediction model based on GBDT and appropriate feature combinations, which achieved satisfactory performance. Specifically, seven features, including sequence-based, physicochemical property-based, structure-based, and evolutionary-derived features, were used to characterize proteins. NearMiss-3 and Elastic Net were applied to address data imbalance and feature redundancy, respectively. The experimental results show that GBDT_KgluSite has good robustness and generalization ability, with accuracy and AUC values of 93.73% and 98.14% on five-fold cross-validation and 90.11% and 96.75% on the independent test dataset, respectively.
CONCLUSION: GBDT_KgluSite is an effective computational method for identifying Kglu sites in protein sequences. It has good stability and generalization ability and could be useful for identifying new Kglu sites in the future. The relevant code and dataset are available at https://github.com/flyinsky6/GBDT_KgluSite.
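The five-fold cross-validation mentioned in the results can be sketched as an index split: the samples are partitioned into five folds, and each fold serves once as the held-out test set while the other four are used for training. The helper `k_fold_indices` below is hypothetical and deliberately simple; it assigns samples to folds round-robin without shuffling or stratification, which may differ from the protocol the authors used.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Samples are assigned to folds round-robin; no shuffling or stratification
    is performed (a simplification assumed for this sketch).
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    print(sorted(train), test)
```

A metric such as accuracy or AUC would be computed on each held-out fold and then averaged over the five folds.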
Affiliation(s)
- Xin Liu
  - School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Bao Zhu
  - Cancer Institute, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Jiangsu Center for the Collaboration and Innovation of Cancer Biotherapy, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Xia-Wei Dai
  - School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Zhi-Ao Xu
  - School of Life Sciences, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Rui Li
  - School of Life Sciences, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Yuting Qian
  - Jiangsu Center for the Collaboration and Innovation of Cancer Biotherapy, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Ya-Ping Lu
  - School of Humanities and Arts, China University of Mining and Technology, Xuzhou, Jiangsu, 221116, China
- Wenqing Zhang
  - School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Yong Liu
  - Cancer Institute, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Jiangsu Center for the Collaboration and Innovation of Cancer Biotherapy, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Junnian Zheng
  - Jiangsu Center for the Collaboration and Innovation of Cancer Biotherapy, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Center of Clinical Oncology, The Affiliated Hospital of Xuzhou Medical University, Xuzhou, Jiangsu, 221002, China
4
Identification of adaptor proteins using the ANOVA feature selection technique. Methods 2022; 208:42-47. [DOI: 10.1016/j.ymeth.2022.10.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2022] [Revised: 10/01/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022] Open
5
AoP-LSE: Antioxidant Proteins Classification Using Deep Latent Space Encoding of Sequence Features. Curr Issues Mol Biol 2021; 43:1489-1501. [PMID: 34698113 PMCID: PMC8928959 DOI: 10.3390/cimb43030105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2021] [Revised: 09/28/2021] [Accepted: 09/29/2021] [Indexed: 11/16/2022] Open
Abstract
It is of utmost importance to develop a computational method for accurate prediction of antioxidants, as they play a vital role in the prevention of several diseases caused by oxidative stress. In this correspondence, we present an effective computational methodology based on the notion of deep latent space encoding. A deep neural network classifier fused with an auto-encoder learns class labels in a pruned latent space. This strategy eliminates the need to develop the classifier and the feature selection model separately, allowing the standalone model to effectively harness the discriminating feature space and make improved predictions. A thorough analytical study is presented along with PCA/tSNE visualization and PCA-GCNR scores to show the discriminating power of the proposed method. The proposed method showed a high MCC value of 0.43 and a balanced accuracy of 76.2%, which is superior to the existing models. The model was also evaluated on an independent dataset, on which it outperformed contemporary methods by correctly identifying novel proteins with an accuracy of 95%.
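The balanced accuracy quoted in this abstract is, by its usual definition, the mean of sensitivity (true-positive rate) and specificity (true-negative rate). A minimal sketch follows; the function name `balanced_accuracy`, the handling of empty classes, and the counts are all hypothetical and are not taken from the paper.

```python
def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Mean of sensitivity and specificity from binary confusion-matrix counts.

    A class with no samples contributes a rate of 0.0 (an assumption of this sketch).
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Hypothetical counts: 3 of 4 positives and 1 of 2 negatives are classified correctly.
print(balanced_accuracy(tp=3, tn=1, fp=1, fn=1))  # 0.625
```

Because each class contributes equally regardless of its size, this metric is less flattering than plain accuracy when one class dominates the dataset.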