1. Fu Y, Yu S, Li J, Lao Z, Yang X, Lin Z. DeepMineLys: Deep mining of phage lysins from human microbiome. Cell Rep 2024; 43:114583. [PMID: 39110597 DOI: 10.1016/j.celrep.2024.114583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 06/21/2024] [Accepted: 07/19/2024] [Indexed: 09/01/2024]
Abstract
Vast shotgun metagenomics data remain an underutilized resource for novel enzymes. Artificial intelligence (AI) has increasingly been applied to protein mining, but its conventional performance evaluation is interpolative in nature, and these trained models often struggle to extrapolate effectively when challenged with unknown data. In this study, we present a framework (DeepMineLys [deep mining of phage lysins from human microbiome]) based on the convolutional neural network (CNN) to identify phage lysins from three human microbiome datasets. When validated with an independent dataset, our method achieved an F1-score of 84.00%, surpassing existing methods by 20.84%. We expressed 16 lysin candidates from the top 100 sequences in E. coli, confirming 11 as active. The best one displayed an activity 6.2-fold that of lysozyme derived from hen egg white, establishing it as the most potent lysin from the human microbiome. Our study also underscores several important issues when applying AI to biology questions. This framework should be applicable for mining other proteins.
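The F1-score quoted in the abstract is the harmonic mean of precision and recall. A minimal sketch of the computation follows; the confusion-matrix counts are hypothetical, chosen only so that the result equals 0.84, and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score, the harmonic mean of precision and recall.

    Equivalent to 2*P*R / (P + R) with P = tp/(tp+fp) and R = tp/(tp+fn).
    """
    denom = 2 * tp + fp + fn
    if denom == 0:
        # No positives predicted and none present: define F1 as 0.
        return 0.0
    return 2 * tp / denom


# Hypothetical counts (NOT from the paper): 42 true positives,
# 8 false positives, 8 false negatives.
example = f1_score(tp=42, fp=8, fn=8)
print(f"F1 = {example:.2f}")
```

Because F1 ignores true negatives, it is a common choice when positives (here, lysins) are rare relative to the background.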
Affiliation(s)
- Yiran Fu
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Shuting Yu
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Jianfeng Li
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Zisha Lao
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Xiaofeng Yang
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
- Zhanglin Lin
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
2. Hao T, Zhang M, Song Z, Gou Y, Wang B, Sun J. Reconstruction of Eriocheir sinensis Protein-Protein Interaction Network Based on DGO-SVM Method. Curr Issues Mol Biol 2024; 46:7353-7372. [PMID: 39057077 PMCID: PMC11276262 DOI: 10.3390/cimb46070436] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2024] [Revised: 06/25/2024] [Accepted: 07/10/2024] [Indexed: 07/28/2024]
Abstract
Eriocheir sinensis is an economically important aquatic animal. Its regulatory mechanisms underlying many biological processes are still vague due to the lack of systematic analysis tools. The protein-protein interaction network (PIN) is an important tool for the systematic analysis of regulatory mechanisms. In this work, a novel machine learning method, DGO-SVM, was applied to predict the protein-protein interaction (PPI) in E. sinensis, and its PIN was reconstructed. With the domain, biological process, molecular functions and subcellular locations of proteins as the features, DGO-SVM showed excellent performance in Bombyx mori, humans and five aquatic crustaceans, with 92-96% accuracy. With DGO-SVM, the PIN of E. sinensis was reconstructed, containing 14,703 proteins and 7,243,597 interactions, in which 35,604 interactions were associated with 566 novel proteins mainly involved in the response to exogenous stimuli, cellular macromolecular metabolism and regulation. The DGO-SVM demonstrated that the biological process, molecular functions and subcellular locations of proteins are significant factors for the precise prediction of PPIs. We reconstructed the largest PIN for E. sinensis, which provides a systematic tool for the regulatory mechanism analysis. Furthermore, the novel-protein-related PPIs in the PIN may provide important clues for the mechanism analysis of the underlying specific physiological processes in E. sinensis.
Affiliation(s)
- Bin Wang
  - Tianjin Key Laboratory of Animal and Plant Resistance, College of Life Sciences, Tianjin Normal University, Tianjin 300387, China; (T.H.); (M.Z.); (Z.S.); (Y.G.)
- Jinsheng Sun
  - Tianjin Key Laboratory of Animal and Plant Resistance, College of Life Sciences, Tianjin Normal University, Tianjin 300387, China; (T.H.); (M.Z.); (Z.S.); (Y.G.)
3. Zhu Y, Sun A. LGC-DBP: the method of DNA-binding protein identification based on PSSM and deep learning. Front Genet 2024; 15:1411847. [PMID: 38903752 PMCID: PMC11188361 DOI: 10.3389/fgene.2024.1411847] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Accepted: 05/14/2024] [Indexed: 06/22/2024]
Abstract
The recognition of DNA-binding proteins (DBPs) plays a crucial role in understanding biological functions such as replication, transcription, and repair. Although current sequence-based methods have shown some effectiveness, they often fail to fully utilize the potential of deep learning in capturing complex patterns. This study introduces a novel model, LGC-DBP, which integrates Long Short-Term Memory (LSTM), Gated Inception Convolution, and Improved Channel Attention mechanisms to enhance the prediction of DBPs. Initially, the model transforms protein sequences into Position-Specific Scoring Matrices (PSSMs), which are then processed through our deep learning framework. Within this framework, Gated Inception Convolution merges the concepts of gating units with the advantages of Graph Convolutional Network (GCN) and Dilated Convolution, significantly surpassing traditional convolution methods. The Improved Channel Attention mechanism substantially enhances the model's responsiveness and accuracy by shifting from a single input to three inputs and integrating three sigmoid functions along with an additional layer output. These innovative combinations have significantly improved model performance, enabling LGC-DBP to recognize and interpret the complex relationships within DBP features more accurately. The evaluation results show that LGC-DBP achieves an accuracy of 88.26% and a Matthews correlation coefficient of 0.701, both surpassing existing methods. These achievements demonstrate the model's strong capability in integrating and analyzing multi-dimensional data and mark a significant advancement over traditional methods by capturing deeper, nonlinear interactions within the data.
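The two metrics reported above, accuracy and the Matthews correlation coefficient (MCC), can both be computed from a binary confusion matrix. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not the paper's results.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    0 corresponds to chance-level prediction.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (NOT from the paper).
counts = dict(tp=90, tn=85, fp=15, fn=10)
print(f"accuracy = {accuracy(**counts):.3f}")
print(f"MCC      = {matthews_corrcoef(**counts):.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it is less easily inflated by class imbalance, which is one reason DBP predictors commonly report both.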
Affiliation(s)
- Yiqi Zhu
  - Department of Computer Science and Technology, College of Computer and Control Engineering, Northeast Forestry University, Harbin, China
4. Sun A, Li H, Dong G, Zhao Y, Zhang D. DBPboost: A method of classification of DNA-binding proteins based on improved differential evolution algorithm and feature extraction. Methods 2024; 223:56-64. [PMID: 38237792 DOI: 10.1016/j.ymeth.2024.01.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 12/29/2023] [Accepted: 01/13/2024] [Indexed: 02/01/2024]
Abstract
DNA-binding proteins are a class of proteins that can interact with DNA molecules through physical and chemical interactions. Their main functions include regulating gene expression, maintaining chromosome structure and stability, and more. DNA-binding proteins play a crucial role in cellular and molecular biology, as they are essential for maintaining normal cellular physiological functions and adapting to environmental changes. The prediction of DNA-binding proteins has been a hot topic in the field of bioinformatics. The key to accurately classifying DNA-binding proteins is to find suitable feature sources and explore the information they contain. Although there are already many models for predicting DNA-binding proteins, there is still room for improvement in mining feature source information and calculation methods. In this study, we created a model called DBPboost to better identify DNA-binding proteins. The innovation of this study lies in the use of eight feature extraction methods, the improvement of the feature selection step, which involves selecting some features first and then performing feature selection again after feature fusion, and the optimization of the differential evolution algorithm in feature fusion, which improves the performance of feature fusion. The experimental results show that the prediction accuracy of the model on the UniSwiss dataset is 89.32%, and the sensitivity is 89.01%, which is better than most existing models.
Affiliation(s)
- Ailun Sun
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Hongfei Li
  - College of Life Science, Northeast Forestry University, Harbin 150040, China
- Guanghui Dong
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Yuming Zhao
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Dandan Zhang
  - Department of Obstetrics and Gynecology, the First Affiliated Hospital of Harbin Medical University, Harbin, Heilongjiang, China
5. Zhang J, Basu S, Kurgan L. HybridDBRpred: improved sequence-based prediction of DNA-binding amino acids using annotations from structured complexes and disordered proteins. Nucleic Acids Res 2024; 52:e10. [PMID: 38048333 PMCID: PMC10810184 DOI: 10.1093/nar/gkad1131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Accepted: 11/10/2023] [Indexed: 12/06/2023]
Abstract
Current predictors of DNA-binding residues (DBRs) from protein sequences belong to two distinct groups, those trained on binding annotations extracted from structured protein-DNA complexes (structure-trained) vs. intrinsically disordered proteins (disorder-trained). We complete the first empirical analysis of predictive performance across the structure- and disorder-annotated proteins for a representative collection of ten predictors. The majority of the structure-trained tools perform well on the structure-annotated proteins while doing relatively poorly on the disorder-annotated proteins, and vice versa. Several methods make accurate predictions for the structure-annotated proteins or the disorder-annotated proteins, but none performs highly accurately for both annotation types. Moreover, most predictors make excessive cross-predictions for the disorder-annotated proteins, where residues that interact with non-DNA ligand types are predicted as DBRs. Motivated by these results, we design, validate and deploy an innovative meta-model, hybridDBRpred, that uses a deep transformer network to combine predictions generated by the three best current predictors. HybridDBRpred provides accurate predictions and low levels of cross-predictions across the two annotation types, and is statistically more accurate than each of the ten tools and baseline meta-predictors that rely on averaging and logistic regression. We deploy hybridDBRpred as a convenient web server at http://biomine.cs.vcu.edu/servers/hybridDBRpred/ and provide the corresponding source code at https://github.com/jianzhang-xynu/hybridDBRpred.
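The averaging baseline mentioned in the abstract can be illustrated with a short sketch: per-residue propensity scores from several predictors are averaged position by position and then thresholded. The function names, the 0.5 threshold, and the example scores are assumptions made for illustration; this is not the hybridDBRpred model, which the abstract describes as a deep transformer network.

```python
from typing import List, Sequence


def average_meta_predictor(score_tracks: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue scores from several predictors.

    Each inner sequence holds one predictor's scores for the same protein,
    one value per residue, so all tracks must have equal length.
    """
    if not score_tracks:
        raise ValueError("at least one predictor track is required")
    length = len(score_tracks[0])
    if any(len(track) != length for track in score_tracks):
        raise ValueError("all predictor tracks must cover the same residues")
    n = len(score_tracks)
    return [sum(track[i] for track in score_tracks) / n for i in range(length)]


def binarize(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label a residue as DNA-binding (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical scores from three predictors for a three-residue protein.
tracks = [
    [0.9, 0.2, 0.6],
    [0.7, 0.1, 0.3],
    [0.8, 0.3, 0.3],
]
consensus = average_meta_predictor(tracks)
print(binarize(consensus))
```

A simple average treats every input predictor as equally reliable at every residue; a learned meta-model can instead weight the inputs differently depending on context, which is the motivation the abstract gives for moving beyond such baselines.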
Affiliation(s)
- Jian Zhang
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, PR China
- Sushmita Basu
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
6. Manavi F, Sharma A, Sharma R, Tsunoda T, Shatabda S, Dehzangi I. CNN-Pred: Prediction of single-stranded and double-stranded DNA-binding protein using convolutional neural networks. Gene 2023; 853:147045. [PMID: 36503892 DOI: 10.1016/j.gene.2022.147045] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/21/2022] [Revised: 10/10/2022] [Accepted: 11/08/2022] [Indexed: 11/27/2022]
Abstract
DNA-binding proteins play a vital role in biological activity including DNA replication, DNA packing, and DNA repair. DNA-binding proteins can be classified into single-stranded DNA-binding proteins (SSBs) or double-stranded DNA-binding proteins (DSBs). Determining whether a protein is DSB or SSB helps determine the protein's function. Therefore, many studies have been conducted to accurately identify DSB and SSB in recent years. Despite all the efforts made so far, the DSB and SSB prediction performance remains limited. In this study, we propose a new method called CNN-Pred to accurately predict DSB and SSB. To build CNN-Pred, we first extract evolutionary-based features in the form of mono-gram and bi-gram profiles using the position-specific scoring matrix (PSSM). We then use a 1D convolutional neural network (CNN) as the classifier for our extracted features. Our results demonstrate that CNN-Pred can enhance the DSB and SSB prediction accuracies by more than 4% on the independent test set compared to previous studies found in the literature. CNN-Pred as a standalone tool and all its source code are publicly available at: https://github.com/MLBC-lab/CNN-Pred.
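The mono-gram and bi-gram profiles mentioned above are commonly defined as follows: for an L x 20 PSSM P, the mono-gram is the column-wise mean (20 values), and the bi-gram sums the products of scores at consecutive positions, B[m][n] = sum over i of P[i][m] * P[i+1][n] (400 values), giving 420 features in total. The sketch below follows these common definitions; the preprocessing and normalization actually used by CNN-Pred are not stated in the abstract, so treat the details as assumptions. A toy two-column matrix stands in for a real 20-column PSSM.

```python
from typing import List

Matrix = List[List[float]]  # one row per residue, one column per amino acid type


def monogram(pssm: Matrix) -> List[float]:
    """Column-wise mean of the PSSM (20 values for a real PSSM)."""
    length = len(pssm)
    width = len(pssm[0])
    return [sum(row[j] for row in pssm) / length for j in range(width)]


def bigram(pssm: Matrix) -> List[float]:
    """Flattened bi-gram profile (400 values for a real PSSM).

    Entry (m, n) accumulates P[i][m] * P[i+1][n] over consecutive
    residue positions i, capturing pairwise transition information.
    """
    width = len(pssm[0])
    features = []
    for m in range(width):
        for n in range(width):
            features.append(
                sum(pssm[i][m] * pssm[i + 1][n] for i in range(len(pssm) - 1))
            )
    return features


# Toy 3-residue, 2-column matrix (hypothetical values, not a real PSSM).
toy = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
features = monogram(toy) + bigram(toy)
print(len(features), features)
```

Both profiles have a fixed length regardless of protein length, which is what allows sequences of different sizes to be fed to a fixed-input classifier such as a 1D CNN.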
Affiliation(s)
- Farnoush Manavi
  - Computer Science and Engineering and Information Technology Department, Shiraz University, Shiraz, Iran
- Alok Sharma
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama 230-0045, Japan; Institute for Integrated and Intelligent Systems, Griffith University, Nathan, Brisbane, QLD 4111, Australia
- Ronesh Sharma
  - School of Electrical and Electronics Engineering, Fiji National University, Suva, Fiji
- Tatsuhiko Tsunoda
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama 230-0045, Japan; Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo 113-0033, Japan; Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo 113-0033, Japan
- Swakkhar Shatabda
  - Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
- Iman Dehzangi
  - Department of Computer Science, Rutgers University, Camden, NJ, USA; Center for Computational and Integrative Biology, Rutgers University, Camden, USA
7. Zhao X, Zhai J, Liu T, Wang G. Ensemble classification based feature selection: a case of identification on plant pentatricopeptide repeat proteins. Brief Bioinform 2022; 23:6760138. [PMID: 36239380 DOI: 10.1093/bib/bbac369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Revised: 07/20/2022] [Accepted: 08/05/2022] [Indexed: 12/14/2022]
Abstract
In order to identify plant pentatricopeptide repeat (PPR) proteins, a framework of variable selection has been proposed. In fact, it is an effective feature selection strategy that focuses on the performance of classification. Random forest has been used as the classifier with certain variables automatically selected for discrimination between PPR functional and non-functional proteins. However, it is found that samples regarded as PPR functional proteins are wrongly classified at a high rate. In this paper, we plan to improve the framework in order to achieve better classification results. Modifications are made to the framework to better identify PPR functional proteins. Instead of random forest, a hybrid ensemble classifier is built with its base classifiers derived from six different classification methods. In addition, an incremental strategy and a clustering by search in descending order are alternately used for feature selection, which can effectively select the most representative variables for the identification of PPR proteins. It is also found that different base classifiers alternately play an important role in the ensemble classifier as the feature dimension increases. The experimental results demonstrate the effectiveness of our improvements.
Affiliation(s)
- Xudong Zhao
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China
- Jingwen Zhai
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China
- Tong Liu
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China
- Guohua Wang
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China; State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China