1
Wu J, Liu Y, Zhu Y, Yu DJ. Improving Antifreeze Proteins Prediction With Protein Language Models and Hybrid Feature Extraction Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:2349-2358. [PMID: 39316498 DOI: 10.1109/tcbb.2024.3467261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2024]
Abstract
Accurate identification of antifreeze proteins (AFPs) is crucial in developing biomimetic synthetic anti-icing materials and low-temperature organ preservation materials. Although numerous machine learning-based methods have been proposed for AFPs prediction, the complex and diverse nature of AFPs limits the prediction performance of existing methods. In this study, we propose AFP-Deep, a new deep learning method to predict antifreeze proteins by integrating embedding from protein sequences with pre-trained protein language models and evolutionary contexts with hybrid feature extraction networks. The experimental results demonstrated that the main advantage of AFP-Deep is its utilization of pre-trained protein language models, which can extract discriminative global contextual features from protein sequences. Additionally, the hybrid deep neural networks designed for protein language models and evolutionary context feature extraction enhance the correlation between embeddings and antifreeze pattern. The performance evaluation results show that AFP-Deep achieves superior performance compared to state-of-the-art models on benchmark datasets, achieving an AUPRC of 0.724 and 0.924, respectively.
2
Xu Y, Zhu F, Zhou Z, Ma S, Zhang P, Tan C, Luo Y, Qin R, Chen J, Pan P. A novel mRNA multi-epitope vaccine of Acinetobacter baumannii based on multi-target protein design in immunoinformatic approach. BMC Genomics 2024; 25:791. [PMID: 39160492 PMCID: PMC11334330 DOI: 10.1186/s12864-024-10691-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Accepted: 08/06/2024] [Indexed: 08/21/2024] Open
Abstract
Acinetobacter baumannii is a gram-negative bacillus prevalent in nature, capable of thriving under various environmental conditions. As an opportunistic pathogen, it frequently causes nosocomial infections such as urinary tract infections, bacteremia, and pneumonia, contributing to increased morbidity and mortality in clinical settings. Consequently, developing novel vaccines against Acinetobacter baumannii is of utmost importance. In our study, we identified 10 highly conserved antigenic proteins from the NCBI and UniProt databases for epitope mapping. We subsequently screened and selected 8 CTL, HTL, and LBL epitopes, integrating them into three distinct vaccines constructed with adjuvants. Following comprehensive evaluations of immunological and physicochemical parameters, we conducted molecular docking and molecular dynamics simulations to assess the efficacy and stability of these vaccines. Our findings indicate that all three multi-epitope mRNA vaccines designed against Acinetobacter baumannii are promising; however, further animal studies are required to confirm their reliability and effectiveness.
Affiliation(s)
- Yizhong Xu, Fei Zhu, Ziyou Zhou, Shiyang Ma, Peipei Zhang, Caixia Tan, Yuying Luo, Rongliu Qin, Jie Chen, Pinhua Pan (each author is listed with the same five affiliations)
  - Department of Respiratory Medicine, National Key Clinical Specialty, Branch of National Clinical Research Center for Respiratory Disease, Xiangya Hospital, Central South University, Changsha, Hunan, China
  - Center of Respiratory Medicine, Xiangya Hospital, Central South University, Changsha, Hunan, China
  - Clinical Research Center for Respiratory Diseases in Hunan Province, Changsha, Hunan, China
  - Hunan Engineering Research Center for Intelligent Diagnosis and Treatment of Respiratory Disease, Changsha, Hunan, China
  - Department of Infection Control Center of Xiangya Hospital, Central South University, Changsha, Hunan, China
3
Hu F, Chang F, Tao L, Sun X, Liu L, Zhao Y, Han Z, Li C. Prediction of Protein Allosteric Sites with Transfer Entropy and Spatial Neighbor-Based Evolutionary Information Learned by an Ensemble Model. J Chem Inf Model 2024; 64:6197-6204. [PMID: 39075972 DOI: 10.1021/acs.jcim.4c00544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/31/2024]
Abstract
Allostery is one of the most direct and efficient ways to regulate protein functions. The diversity of allosteric sites makes it possible to design allosteric modulators with differential selectivity and improved safety compared with orthosteric drugs, which target conserved orthosteric sites. Here, we develop AllosES, an ensemble machine learning method for predicting protein allosteric sites. It uses new and effective features, including the entropy transfer-based dynamic property, secondary structure features, and our previously proposed spatial neighbor-based evolutionary information, in addition to the traditional physicochemical properties. To overcome the class imbalance problem, a multiple grouping strategy is proposed and applied to feature selection and model construction. The ensemble model is constructed by training multiple submodels on multiple training subsets and integrating their results into the final output. AllosES achieves an MCC of 0.556 on the independent test set D24. In addition, AllosES ranks the real allosteric sites in the top three for 83.3/89.3% of allosteric proteins from the test sets D24/D28, outperforming state-of-the-art peer methods. The comprehensive results demonstrate that AllosES is a promising method for protein allosteric site prediction. The source code is available at https://github.com/ChunhuaLab/AllosES.
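For readers unfamiliar with the MCC figure quoted in this abstract, the following is a minimal Python sketch of how the Matthews correlation coefficient is computed for binary site labels. It is an illustration only and is not taken from the AllosES repository; the function name `matthews_corrcoef` and the 0/1 label encoding are assumptions made for this example.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = site, 0 = non-site)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is taken as 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

Unlike plain accuracy, the MCC uses all four confusion-matrix cells, which is why it is a common choice for imbalanced tasks such as site prediction, where non-site residues far outnumber site residues.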
Affiliation(s)
- Fangrui Hu, Fubin Chang, Lianci Tao, Xiaohan Sun, Lamei Liu, Yingchun Zhao, Zhongjie Han, Chunhua Li
  - College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
4
Sun X, Yang S, Wu Z, Su J, Hu F, Chang F, Li C. PMSPcnn: Predicting protein stability changes upon single point mutations with convolutional neural network. Structure 2024; 32:838-848.e3. [PMID: 38508191 DOI: 10.1016/j.str.2024.02.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Revised: 12/19/2023] [Accepted: 02/22/2024] [Indexed: 03/22/2024]
Abstract
Protein missense mutations and the resulting changes in protein stability are important causes of many human genetic diseases. However, accurately predicting stability changes due to mutations remains a challenging problem. To address it, we have developed an unbiased and effective model, PMSPcnn, based on a convolutional neural network. We use the anti-symmetry property to build a balanced training dataset, which improves prediction, in particular for stabilizing mutations. Persistent homology, an effective approach for characterizing protein structures, is used to obtain topological features. Additionally, a regression stratification cross-validation scheme is proposed to improve prediction for mutations with extreme ΔΔG. On three test datasets (Ssym, p53, and myoglobin), PMSPcnn performs better than existing predictors. PMSPcnn also outperforms currently available methods for membrane proteins. Overall, PMSPcnn is a promising method for predicting protein stability changes caused by single point mutations.
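The anti-symmetry property mentioned in this abstract states that reversing a mutation flips the sign of its stability change, ΔΔG(A→B) = −ΔΔG(B→A). A minimal Python sketch of how such a property could be used to balance a dataset is given below. It is an assumption-laden illustration, not the PMSPcnn implementation: the `Mutation` record, its field names, and `add_reverse_mutations` are hypothetical, and a real pipeline would also need the mutant structure to compute features for each reverse mutation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    """Hypothetical record for a single point mutation and its measured ddG."""
    wild: str       # wild-type residue, one-letter code
    position: int   # residue position in the sequence
    mutant: str     # mutant residue, one-letter code
    ddg: float      # stability change in kcal/mol (sign convention is dataset-specific)


def add_reverse_mutations(mutations):
    """Return the input mutations plus one reverse mutation for each entry.

    Each reverse mutation swaps wild-type and mutant residues and negates ddG,
    so the augmented set contains equally many stabilizing and destabilizing
    examples whenever no input ddG is exactly zero.
    """
    augmented = list(mutations)
    for m in mutations:
        augmented.append(Mutation(m.mutant, m.position, m.wild, -m.ddg))
    return augmented


if __name__ == "__main__":
    data = [Mutation("A", 10, "G", 1.5), Mutation("L", 42, "P", -0.4)]
    for m in add_reverse_mutations(data):
        print(m)
```

Because every forward example gains a sign-flipped partner, the ΔΔG values of the augmented set sum to zero, which removes the bias toward destabilizing mutations that is typical of experimental datasets.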
Affiliation(s)
- Xiaohan Sun, Shuang Yang, Zhixiang Wu, Jingjie Su, Fangrui Hu, Fubin Chang, Chunhua Li
  - College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
5
Li X, Wang GA, Wei Z, Wang H, Zhu X. Protein-DNA interface hotspots prediction based on fusion features of embeddings of protein language model and handcrafted features. Comput Biol Chem 2023; 107:107970. [PMID: 37866116 DOI: 10.1016/j.compbiolchem.2023.107970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 10/06/2023] [Accepted: 10/07/2023] [Indexed: 10/24/2023]
Abstract
The identification of hotspot residues at protein-DNA binding interfaces plays a crucial role in areas such as drug discovery and disease treatment. Although experimental methods such as alanine scanning mutagenesis have been developed to determine the hotspot residues on protein-DNA interfaces, they are both inefficient and costly. Therefore, efficient and accurate computational methods for predicting hotspot residues are needed. Several computational methods have been developed; however, they are mainly based on hand-crafted features, which may not represent all the information of proteins. In this regard, we propose a model called PDH-EH, which utilizes fused features of embeddings extracted from a protein language model (PLM) and handcrafted features. After extracting features with a total of 1141 dimensions, we used mRMR to select the optimal feature subset. Based on the optimal feature subset, several learning algorithms, such as Random Forest, Support Vector Machine, and XGBoost, were used to build the models. The cross-validation results on the training dataset show that the model built with Random Forest achieves the highest AUROC. Further evaluation on the independent test set shows that our model outperforms the existing state-of-the-art models. Moreover, the effectiveness and interpretability of embeddings extracted from the PLM were demonstrated in our analysis. The codes and datasets used in this study are available at: https://github.com/lixiangli01/PDH-EH.
Affiliation(s)
- Xiang Li, Gang-Ao Wang, Zhuoyu Wei, Hong Wang, Xiaolei Zhu
  - School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
6
Li K, Wu H, Yue Z, Sun Y, Xia C. A convolutional network and attention mechanism-based approach to predict protein-RNA binding residues. Comput Biol Chem 2023; 105:107901. [PMID: 37327559 DOI: 10.1016/j.compbiolchem.2023.107901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Revised: 05/29/2023] [Accepted: 05/31/2023] [Indexed: 06/18/2023]
Abstract
Protein-RNA interactions play a key role in various biological cellular processes, and many experimental and computational studies have been initiated to analyze their interactions. However, experimental determination is quite complex and expensive. Therefore, researchers have worked to develop efficient computational tools to detect protein-RNA binding residues. The accuracy of existing methods is limited by the features of the target and the performance of the computational models; there remains room for improvement. To solve the problem of the accurate detection of protein-RNA binding residues, we propose a convolutional network model named PBRPre based on improved MobileNet. First, by extracting the position information of the target complex and the 3-mer amino acid feature data, the position-specific scoring matrix (PSSM) is improved by using spatial neighbor smoothing processing and discrete wavelet transform to fully exploit the spatial structure information of the target and enrich the feature dataset. Second, the deep learning model MobileNet is used to integrate and optimize the potential features in the target complexes; then, by introducing the Vision Transformer (ViT) network classification layer, the deep-level information of the target is mined to enhance the processing ability of the model for global information and to improve the detection accuracy of the classifiers. The results show that the AUC value of the model can reach 0.866 in the independent testing dataset, which shows that PBRPre can effectively realize the detection of protein-RNA binding residues. All datasets and resource codes of PBRPre are available at https://github.com/linglewu/PBRPre for academic use.
Affiliation(s)
- Ke Li
  - School of Information & Computer, Anhui Agricultural University, Hefei, Anhui 230036, China
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei, Anhui 230601, China
  - Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, Anhui 230036, China
- Hongwei Wu
  - School of Information & Computer, Anhui Agricultural University, Hefei, Anhui 230036, China
  - Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, Anhui 230036, China
- Zhenyu Yue
  - School of Information & Computer, Anhui Agricultural University, Hefei, Anhui 230036, China
  - Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, Anhui 230036, China
- Yu Sun
  - School of Information & Computer, Anhui Agricultural University, Hefei, Anhui 230036, China
  - Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, Anhui 230036, China
- Chuan Xia
  - Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, Anhui 230036, China
7
Bheemireddy S, Sandhya S, Srinivasan N, Sowdhamini R. Computational tools to study RNA-protein complexes. Front Mol Biosci 2022; 9:954926. [PMID: 36275618 PMCID: PMC9585174 DOI: 10.3389/fmolb.2022.954926] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/27/2022] [Accepted: 09/20/2022] [Indexed: 11/19/2022] Open
Abstract
RNA is the key player in many cellular processes such as signal transduction, replication, transport, cell division, transcription, and translation. These diverse functions are accomplished through interactions of RNA with proteins. However, protein–RNA interactions are still poorly understood in contrast to protein–protein and protein–DNA interactions. This knowledge gap can be attributed to the limited availability of protein–RNA structures along with the experimental difficulties in studying these complexes. Recent progress in computational resources has expanded the number of tools available for studying protein–RNA interactions at various molecular levels. These include tools for predicting interacting residues from primary sequences, modelling protein–RNA complexes, predicting hotspots in these complexes, and understanding the dynamics of their interactions. Each of these tools has its strengths and limitations, which makes it important to select an optimal approach for the question of interest. Here we present a mini review of computational tools to study different aspects of protein–RNA interactions, with a focus on overall application, development of the field, and future perspectives.
Affiliation(s)
- Sneha Bheemireddy
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
- Sankaran Sandhya
  - Department of Biotechnology, Faculty of Life and Allied Health Sciences, M.S. Ramaiah University of Applied Sciences, Bengaluru, India
- Ramanathan Sowdhamini
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
  - National Centre for Biological Sciences, TIFR, GKVK Campus, Bangalore, India
  - Institute of Bioinformatics and Applied Biotechnology, Bangalore, India
- *Correspondence: Sankaran Sandhya; Ramanathan Sowdhamini
8
Evaluation of the Effectiveness of Derived Features of AlphaFold2 on Single-Sequence Protein Binding Site Prediction. Biology 2022; 11:biology11101454. [PMID: 36290358 PMCID: PMC9598995 DOI: 10.3390/biology11101454] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/05/2022] [Revised: 09/30/2022] [Accepted: 09/30/2022] [Indexed: 11/06/2022]
Simple Summary
With the development of artificial intelligence, researchers can roughly predict the crystal structure of a protein by computer without the need for biological experiments, which provides new ideas and solutions to problems such as protein-protein interaction and drug-target predictions. In this study, we proposed strategies to combine predicted protein structures with deep learning networks and evaluated them on different protein binding site prediction tasks. Our computational experiment results showed that all proposed strategies could effectively encode structural information for deep learning models.
Abstract
Though AlphaFold2 has attained considerably high precision on protein structure prediction, it is reported that directly inputting coordinates into deep learning networks cannot achieve desirable results on downstream tasks. Thus, how to process and encode the predicted results into effective forms that deep learning models can understand to improve the performance of downstream tasks is worth exploring. In this study, we tested the effects of five processing strategies of coordinates on two single-sequence protein binding site prediction tasks. These five strategies are spatial filtering, the singular value decomposition of a distance map, calculating the secondary structure feature, and the relative accessible surface area feature of proteins. The computational experiment results showed that all strategies were suitable and effective methods to encode structural information for deep learning models. In addition, by performing a case study of a mutated protein, we showed that the spatial filtering strategy could introduce structural changes into HHblits profiles and deep learning networks when protein mutation happens. In sum, this work provides new insight into the downstream tasks of protein-molecule interaction prediction, such as predicting the binding residues of proteins and estimating the effects of mutations.
9
BERT-PPII: The Polyproline Type II Helix Structure Prediction Model Based on BERT and Multichannel CNN. Biomed Research International 2022; 2022:9015123. [PMID: 36060139 PMCID: PMC9433275 DOI: 10.1155/2022/9015123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Revised: 08/01/2022] [Accepted: 08/03/2022] [Indexed: 11/26/2022]
Abstract
Predicting the polyproline type II (PPII) helix structure is crucially important in many research areas, such as protein folding mechanisms, drug targets, and protein functions. However, many existing PPII helix prediction algorithms encode protein sequence information in a single way, which leads to insufficient learning of sequence features. To improve protein sequence encoding, this paper proposes a BERT-based PPII helix structure prediction algorithm (BERT-PPII), which learns protein sequence information with the BERT model. The BERT model's CLS vector fuses information from every amino acid residue in a sample, so we use the CLS vector as the global feature representing the sample's global contextual information. Because interactions among local amino acid residues in a protein chain strongly influence the formation of the PPII helix, we use a CNN to extract local residue features, which further enriches the representation of protein sequence samples. We fuse the CLS vectors with the CNN local features to improve PPII structure prediction. Compared to the state-of-the-art PPIIPRED method, the proposed method improves accuracy on the unbalanced dataset by 1% on the strict dataset and 2% on the less strict dataset. On the balanced dataset, the AUCs of the proposed method are 0.826 on the strict dataset and 0.785 on the less strict dataset. On the independent test set, the proposed method achieves an AUC of 0.827 on the strict dataset and 0.783 on the less strict dataset. These results show that BERT-PPII achieves superior performance in predicting the PPII helix.
10
Mohammadi A, Zahiri J, Mohammadi S, Khodarahmi M, Arab SS. PSSMCOOL: A Comprehensive R Package for Generating Evolutionary-based Descriptors of Protein Sequences from PSSM Profiles. Biology Methods and Protocols 2022; 7:bpac008. [PMID: 35388370 PMCID: PMC8977839 DOI: 10.1093/biomethods/bpac008] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 03/03/2021] [Revised: 01/21/2022] [Indexed: 11/14/2022]
Abstract
The position-specific scoring matrix (PSSM), also called a profile, is broadly used for representing the evolutionary history of a given protein sequence. Several investigations reported that PSSM-based feature descriptors can improve the prediction of various protein attributes such as interaction, function, subcellular localization, secondary structure, disorder regions, and accessible surface area. While plenty of algorithms have been suggested for extracting evolutionary features from PSSMs in recent years, there is no integrated standalone tool for providing these descriptors. Here, we introduce PSSMCOOL, a flexible and comprehensive R package that generates 38 PSSM-based feature vectors. To the best of our knowledge, PSSMCOOL is the first PSSM-based feature extraction tool implemented in R. With the growing demand for exploiting machine-learning algorithms in computational biology, this package would be a practical tool for machine-learning predictions.
Affiliation(s)
- Alireza Mohammadi
  - Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
- Javad Zahiri
  - Department of Neuroscience, University of California San Diego, California, USA
  - Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Saber Mohammadi
  - Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
- Mohsen Khodarahmi
  - Department of Radiology, Shahid Madani Hospital, Karaj, Iran
  - Bahar Medical Imaging Center, Karaj, Iran
  - Dr. Khodarahmi Medical Imaging Center, Karaj, Iran
- Seyed Shahriar Arab
  - Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
11
Zhou T, Rong J, Liu Y, Gong W, Li C. An ensemble approach to predict binding hotspots in protein-RNA interactions based on SMOTE data balancing and random grouping feature selection strategies. Bioinformatics 2022; 38:2452-2458. [PMID: 35253843 DOI: 10.1093/bioinformatics/btac138] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/17/2021] [Revised: 01/15/2022] [Accepted: 03/02/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: The identification of binding hotspots in protein-RNA interactions is crucial for understanding their potential recognition mechanisms and for drug design. Experimental methods have many limitations, since they are usually time-consuming and labor-intensive. Thus, an effective and efficient theoretical method is urgently needed.
RESULTS: Here we present SREPRHot, a method to predict hotspots, defined as the residues whose mutation to alanine generates a binding free energy change ≥ 2.0 kcal/mol, while others use a cutoff of 1.0 kcal/mol to obtain balanced datasets. To deal with the dataset imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is used to generate minority samples and balance the dataset. Besides conventional features, we use two new types of features: the residue interface propensity previously developed by us, and topological features obtained using node-weighted networks. We also propose an effective Random Grouping feature selection strategy combined with a two-step method to determine an optimal feature set. Finally, a stacking ensemble classifier is adopted to build our model. The results show that SREPRHot achieves a good performance, with SEN, MCC and AUC of 0.900, 0.557 and 0.829 on the independent testing dataset. The comparison study indicates that SREPRHot shows a promising performance.
AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/ChunhuaLiLab/SREPRHot.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
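The SMOTE step named in this abstract creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a minimal, standard-library Python illustration of that idea, not the SREPRHot code; the function name `smote`, its parameters, and the toy feature vectors are assumptions chosen for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from a list of minority-class feature vectors."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Indices of the k nearest minority neighbours of x (excluding x itself).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(x, minority[j]),
        )[:k]
        nb = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


if __name__ == "__main__":
    hotspots = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy minority class
    for point in smote(hotspots, n_new=3, k=2, seed=1):
        print(point)
```

Every synthetic point lies on the line segment between two real minority samples, so oversampling adds no values outside the observed feature ranges. To avoid leaking information, such oversampling is normally applied to the training split only.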
Affiliation(s)
- Tong Zhou
- Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing, 100124, China
- Jie Rong
- Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing, 100124, China
- Yang Liu
- Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing, 100124, China
- Weikang Gong
- Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing, 100124, China
- Chunhua Li
- Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing, 100124, China
|
12
|
Wang P, Zhang G, Yu ZG, Huang G. A Deep Learning and XGBoost-Based Method for Predicting Protein-Protein Interaction Sites. Front Genet 2021; 12:752732. [PMID: 34764983 PMCID: PMC8576272 DOI: 10.3389/fgene.2021.752732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2021] [Accepted: 09/20/2021] [Indexed: 11/29/2022] Open
Abstract
Knowledge about protein-protein interactions is beneficial in understanding cellular mechanisms. Protein-protein interactions are usually determined according to their protein-protein interaction sites. Owing to the limitations of current techniques, detecting protein-protein interaction sites remains a challenging task. In this article, we present a method based on deep learning and XGBoost (called DeepPPISP-XGB) for predicting protein-protein interaction sites. The deep learning model serves as a feature extractor to remove redundant information from protein sequences, and the Extreme Gradient Boosting algorithm is used to construct a classifier for predicting protein-protein interaction sites. DeepPPISP-XGB achieved an area under the receiver operating characteristic curve of 0.681, a recall of 0.624, and an area under the precision-recall curve of 0.339, which is competitive with state-of-the-art methods. We also validated the positive role of global features in predicting protein-protein interaction sites.
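The three metrics reported in this abstract can be computed from per-residue labels and scores roughly as follows. This is a minimal sketch on hypothetical toy data, not the authors' evaluation code. It assumes average precision as the estimator of the area under the precision-recall curve (the abstract does not say which estimator was used) and an illustrative decision threshold of 0.5 for recall.

```python
def recall(y_true, y_pred):
    """Fraction of true interaction sites that were predicted as sites."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auroc(y_true, scores):
    """Area under the ROC curve as the probability that a random positive
    outscores a random negative (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Average precision: mean of the precision values at the rank of
    each positive, with residues ranked by decreasing score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical toy labels (1 = interaction site) and classifier scores.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

rec = recall(y_true, y_pred)
roc = auroc(y_true, scores)
ap = average_precision(y_true, scores)
```

Interaction sites are typically a minority of residues, which is one reason the precision-recall curve is often reported alongside the ROC curve for this kind of task.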
Affiliation(s)
- Pan Wang
- School of Electrical Engineering, Shaoyang University, Shaoyang, China
- Guiyang Zhang
- School of Electrical Engineering, Shaoyang University, Shaoyang, China
- Zu-Guo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
- Guohua Huang
- School of Electrical Engineering, Shaoyang University, Shaoyang, China
|