1
|
Teimouri H, Medvedeva A, Kolomeisky AB. Unraveling the role of physicochemical differences in predicting protein-protein interactions. J Chem Phys 2024; 161:045102. [PMID: 39051836 DOI: 10.1063/5.0219501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Accepted: 07/09/2024] [Indexed: 07/27/2024]
Abstract
The ability to accurately predict protein-protein interactions is critically important for understanding major cellular processes. However, current experimental and computational approaches for identifying them are technically very challenging and still have limited success. We propose a new computational method for predicting protein-protein interactions using only primary sequence information. It utilizes the concept of physicochemical similarity to determine which interactions will most likely occur. In our approach, the physicochemical features of proteins are extracted using bioinformatics tools for different organisms. They are then used in a machine-learning method to identify successful protein-protein interactions via correlation analysis. The property that correlates most strongly with protein-protein interactions in all studied organisms was found to be dipeptide amino acid composition (the frequency of specific amino acid pairs in a protein sequence). While current approaches often overlook the organism specificity of protein-protein interactions, our method yields context-specific features that determine protein-protein interactions. The analysis is specifically applied to the bacterial two-component system that includes histidine kinase and transcriptional response regulators, as well as to the barnase-barstar complex, demonstrating the method's versatility across different biological systems. Our approach can be applied to predict protein-protein interactions in any biological system, providing an important tool for investigating the mechanisms of complex biological processes.
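The dipeptide composition named in this abstract (the frequency of specific amino acid pairs in a sequence) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `dipeptide_composition`, the restriction to the 20 standard residues, and the choice to skip pairs containing other characters are assumptions made for illustration.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(sequence):
    """Return the relative frequency of each of the 400 standard dipeptides.

    Hypothetical helper: overlapping adjacent pairs are counted, and pairs
    that contain a non-standard character are skipped (an assumption).
    """
    seq = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or no valid pairs: all frequencies are zero.
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    dpc = dipeptide_composition("ACACA")
    print(dpc["AC"], dpc["CA"])
```

The resulting 400-element vector is the kind of fixed-length feature that can be compared between two proteins or passed to a classifier, whatever the length of the input sequence.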
Affiliation(s)
- Hamid Teimouri
- Department of Chemistry, Rice University, Houston, Texas 77005, USA
- Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, USA
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas 77005, USA
| | - Angela Medvedeva
- Department of Chemistry, Rice University, Houston, Texas 77005, USA
- Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, USA
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas 77005, USA
| | - Anatoly B Kolomeisky
- Department of Chemistry, Rice University, Houston, Texas 77005, USA
- Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, USA
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas 77005, USA
| |
|
2
|
Rodella C, Lazaridi S, Lemmin T. TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms. BIOINFORMATICS ADVANCES 2024; 4:vbae103. [PMID: 39040220 PMCID: PMC11262459 DOI: 10.1093/bioadv/vbae103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Revised: 06/14/2024] [Accepted: 07/12/2024] [Indexed: 07/24/2024]
Abstract
Motivation: Understanding protein thermostability is essential for numerous biotechnological applications, but traditional experimental methods are time-consuming, expensive, and error-prone. Recently, deep learning (DL) techniques from natural language processing (NLP) have been extended to the field of biology, since the primary sequence of proteins can be viewed as a string of amino acids that follow a physicochemical grammar. Results: In this study, we developed TemBERTure, a DL framework that predicts thermostability class and melting temperature from protein sequences. Our findings emphasize the importance of data diversity for training robust models, especially by including sequences from a wider range of organisms. Additionally, we suggest using attention scores from DL models to gain deeper insights into protein thermostability. Analyzing these scores in conjunction with the 3D protein structure can enhance understanding of the complex interactions among amino acid properties, their positioning, and the surrounding microenvironment. By addressing the limitations of current prediction methods and introducing new exploration avenues, this research paves the way for more accurate and informative protein thermostability predictions, ultimately accelerating advancements in protein engineering. Availability and implementation: The TemBERTure model and the data are available at: https://github.com/ibmm-unibe-ch/TemBERTure.
Affiliation(s)
- Chiara Rodella
- Institute of Biochemistry and Molecular Medicine (IBMM), University of Bern, Bern CH-3012, Switzerland
- Graduate School for Cellular and Biomedical Sciences (GCB), University of Bern, Bern CH-3012, Switzerland
| | - Symela Lazaridi
- Institute of Biochemistry and Molecular Medicine (IBMM), University of Bern, Bern CH-3012, Switzerland
- Graduate School for Cellular and Biomedical Sciences (GCB), University of Bern, Bern CH-3012, Switzerland
| | - Thomas Lemmin
- Institute of Biochemistry and Molecular Medicine (IBMM), University of Bern, Bern CH-3012, Switzerland
| |
|
3
|
Karim T, Shaon MSH, Sultan MF, Hasan MZ, Kafy AA. ANNprob-ACPs: A novel anticancer peptide identifier based on probabilistic feature fusion approach. Comput Biol Med 2024; 169:107915. [PMID: 38171261 DOI: 10.1016/j.compbiomed.2023.107915] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2023] [Revised: 12/28/2023] [Accepted: 12/29/2023] [Indexed: 01/05/2024]
Abstract
Anticancer peptides (ACPs) offer significant potential as cancer treatment drugs. Quickly identifying active compounds from protein sequences is crucial for healthcare and cancer treatment. In this paper, ANNprob-ACPs, a novel and effective model for detecting ACPs, is implemented based on nine feature encoding techniques: AAC, CC, W2V, DPC, PAAC, QSO, CTDC, CTDT, and CKSAAGP. After analyzing the performance of several machine learning models, the six best models were selected based on their overall performance across all evaluation metrics. The probability scores of each model were subsequently aggregated and used as input to our meta-model, called ANNprob-ACPs. Our model outperformed all others, showing its potential for identifying ACPs. The results showed notable improvement in 10-fold cross-validation and on an independent test, with accuracies of 93.72% and 90.62%, respectively. Our proposed model, ANNprob-ACPs, outperformed existing approaches in terms of accuracy and effectiveness in discovering ACPs. Using SHAP, this study found that the physicochemical properties of QSO and the compositional properties of DPC, AAC, and PAAC are the most impactful for our model's performance; such properties have a major impact on a drug's interactions and on future discoveries. Consequently, this model is expected to be valuable in future work and has a high probability of detecting ACPs. We developed a web server for ANNprob-ACPs, which is accessible at ANNprob-ACPs webserver.
Affiliation(s)
- Tasmin Karim
- Department of Computer Science & Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh; Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh.
| | - Md Shazzad Hossain Shaon
- Department of Computer Science & Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh; Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh.
| | - Md Fahim Sultan
- Department of Computer Science & Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh; Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh.
| | - Md Zahid Hasan
- Department of Computer Science & Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh; Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh.
| | - Abdulla-Al Kafy
- Department of Urban & Regional Planning, Rajshahi University of Engineering & Technology (RUET), Rajshahi, 6204, Bangladesh.
| |
|
4
|
Yang B, Wang L, Bao W. Identify Diabetes-related Targets based on ForgeNet_GPC. Curr Comput Aided Drug Des 2024; 20:1042-1054. [PMID: 38173214 DOI: 10.2174/0115734099258183230929173855] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 08/06/2023] [Accepted: 08/15/2023] [Indexed: 01/05/2024]
Abstract
BACKGROUND Research on potential therapeutic targets and new mechanisms of action can greatly improve the efficiency of new drug development. AIMS Polygenic genetic diseases, such as diabetes, are caused by the interaction of multiple gene loci and environmental factors. OBJECTIVES In this study, a disease target identification algorithm based on protein recognition is proposed. MATERIALS AND METHODS In this method, the related and unrelated targets are collected from literature databases for treating diabetes. The transcribed proteins corresponding to each target are queried in order to construct a protein dataset. Six protein feature extraction algorithms (AAC, CKSAAGP, DDE, DPC, GAAP, and TPC) are utilized to obtain the feature vectors of each protein, which are merged into the full feature vectors. RESULTS A novel classifier (forgeNet_GPC) based on forgeNet and Gaussian process classifier (GPC) is proposed to classify the proteins. CONCLUSION In forgeNet_GPC, forgeNet is utilized to select the important features, and GPC is utilized to solve the classification problem. The experimental results reveal that forgeNet_GPC performs better than 22 classifiers in terms of ROC-AUC, PR-AUC, MCC, Youden Index, and Kappa.
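Three of the evaluation measures listed in this abstract (MCC, Youden Index, and Kappa) can be computed directly from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions. The function name `binary_metrics` and the example counts are illustrative assumptions, not values from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute MCC, Youden index and Cohen's kappa from confusion counts.

    Hypothetical helper; assumes both classes are present
    (tp + fn > 0 and tn + fp > 0).
    """
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    # Youden index: J = sensitivity + specificity - 1.
    youden = sensitivity + specificity - 1

    # Matthews correlation coefficient; defined as 0 if a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {"mcc": mcc, "youden": youden, "kappa": kappa}


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

ROC-AUC and PR-AUC, the other two measures named in the abstract, need ranked prediction scores rather than a single confusion matrix, so they are not covered by this sketch.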
Affiliation(s)
- Bin Yang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| | - Linlin Wang
- School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
| | - Wenzheng Bao
- School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou, 221018, China
| |
|
5
|
Yan J, Zhang B, Zhou M, Campbell-Valois FX, Siu SWI. A deep learning method for predicting the minimum inhibitory concentration of antimicrobial peptides against Escherichia coli using Multi-Branch-CNN and Attention. mSystems 2023; 8:e0034523. [PMID: 37431995 PMCID: PMC10506472 DOI: 10.1128/msystems.00345-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Accepted: 05/31/2023] [Indexed: 07/12/2023]
Abstract
Antimicrobial peptides (AMPs) are a promising alternative to antibiotics to combat drug resistance in pathogenic bacteria. However, the development of AMPs with high potency and specificity remains a challenge, and new tools to evaluate antimicrobial activity are needed to accelerate the discovery process. Therefore, we propose MBC-Attention, a combination of a multi-branch convolutional neural network architecture and attention mechanisms, to predict the experimental minimum inhibitory concentration of peptides against Escherichia coli. The optimal MBC-Attention model achieved an average Pearson correlation coefficient (PCC) of 0.775 and a root mean squared error (RMSE) of 0.533 (log μM) in three independent tests of randomly drawn sequences from the data set. This corresponds to a 5-12% improvement in PCC and a 6-13% improvement in RMSE compared to 17 traditional machine learning models and 2 optimally tuned models using random forest and support vector machine. Ablation studies confirmed that the two proposed attention mechanisms, global attention and local attention, contributed substantially to the performance improvement. IMPORTANCE Antimicrobial peptides (AMPs) are potential candidates for replacing conventional antibiotics to combat drug resistance in pathogenic bacteria. Therefore, it is necessary to evaluate the antimicrobial activity of AMPs quantitatively. However, wet-lab experiments are labor-intensive and time-consuming. To accelerate the evaluation process, we develop a deep learning method called MBC-Attention to regress the experimental minimum inhibitory concentration of AMPs against Escherichia coli. The proposed model outperforms traditional machine learning methods. Data, scripts to reproduce experiments, and the final production models are available on GitHub.
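The two regression measures reported in this abstract, the Pearson correlation coefficient (PCC) and the root mean squared error (RMSE), follow standard definitions that can be sketched in a few lines of Python. The function names and the sample values below are illustrative assumptions. The inputs stand in for measured and predicted log-scale MIC values and are not data from the paper.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the inputs."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Made-up log-scale MIC values for illustration only.
    measured = [0.5, 1.2, 1.8, 2.4]
    predicted = [0.7, 1.0, 2.0, 2.1]
    print(pearson_correlation(measured, predicted), rmse(measured, predicted))
```

Because RMSE is computed on log-transformed concentrations, an error of 0.533 describes a multiplicative deviation in MIC rather than an additive one.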
Affiliation(s)
- Jielu Yan
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
| | - Bob Zhang
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
| | - Mingliang Zhou
- School of Computer Science, Chongqing University, Shapingba, Chongqing, China
| | - François-Xavier Campbell-Valois
- Host-Microbe Interactions Laboratory, Center for Chemical and Synthetic Biology, Department of Chemistry and Biomolecular Sciences, University of Ottawa, Ottawa, Ontario, Canada
- Centre for Infection, Immunity, and Inflammation, University of Ottawa, Ottawa, Ontario, Canada
- Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, Ontario, Canada
| | - Shirley W. I. Siu
- Institute of Science and Environment, University of Saint Joseph, Macau, China
| |
|
6
|
El-Behery H, Attia AF, El-Fishawy N, Torkey H. An ensemble-based drug-target interaction prediction approach using multiple feature information with data balancing. J Biol Eng 2022; 16:21. [PMID: 35941686 PMCID: PMC9361677 DOI: 10.1186/s13036-022-00296-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2022] [Accepted: 06/02/2022] [Indexed: 11/16/2022]
Abstract
Background Recently, drug repositioning has received considerable attention for its advantage to pharmaceutical industries in drug development. Artificial intelligence techniques have greatly enhanced drug reproduction by discovering therapeutic drug profiles, side effects, and new target proteins. However, as the number of drugs increases, their targets and the enormous number of interactions produce imbalanced data that may not be suitable as direct input to a prediction model. Methods This paper proposes a novel scheme for predicting drug–target interactions (DTIs) based on drug chemical structures and protein sequences. The drug Morgan fingerprint, drug constitutional descriptors, protein amino acid composition, and protein dipeptide composition were employed to extract the characteristics of the drugs and proteins. Then, an approach for extracting negative samples using a one-class support vector machine classifier was developed to tackle the imbalanced data problem in the feature sets from the drug–target dataset. Negative and positive samples were constructed and fed into different prediction algorithms to identify DTIs. A 10-fold cross-validation procedure was applied to assess the predictive performance of the proposed method, and the effectiveness of the chemical and physical features in evaluating and discovering drug–target interactions was also studied. Results Our experimental model outperformed existing techniques in terms of the area under the receiver operating characteristic curve (AUC), accuracy, precision, recall, F-score, mean square error, and MCC. The results obtained by the AdaBoost classifier enhanced prediction accuracy by 2.74%, precision by 1.98%, AUC by 1.14%, F-score by 3.53%, and MCC by 4.54% over existing methods.
Affiliation(s)
- Heba El-Behery
- Department of Computer Science and Engineering, Faculty of Engineering, Kafrelsheikh University, Kafr_El_Sheikh, Egypt.
| | - Abdel-Fattah Attia
- Department of Computer Science and Engineering, Faculty of Engineering, Kafrelsheikh University, Kafr_El_Sheikh, Egypt
| | - Nawal El-Fishawy
- Computer Science & Engineering Department, Faculty of Electronic Engineering, Menoufia University, Menouf, Egypt
| | - Hanaa Torkey
- Computer Science & Engineering Department, Faculty of Electronic Engineering, Menoufia University, Menouf, Egypt
| |
|
7
|
Yan J, Zhang B, Zhou M, Kwok HF, Siu SWI. Multi-Branch-CNN: Classification of ion channel interacting peptides using multi-branch convolutional neural network. Comput Biol Med 2022; 147:105717. [PMID: 35752114 DOI: 10.1016/j.compbiomed.2022.105717] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 01/09/2022] [Revised: 05/18/2022] [Accepted: 06/05/2022] [Indexed: 11/03/2022]
Abstract
Ligand peptides that have high affinity for ion channels are critical for regulating ion flux across the plasma membrane. These peptides are now being considered as potential drug candidates for many diseases, such as cardiovascular disease and cancers. In this work, we developed Multi-Branch-CNN, a CNN method with multiple input branches for identifying three types of ion channel peptide binders (sodium, potassium, and calcium) from intra- and inter-feature types. As for its real-world applications, prediction models that are able to recognize novel sequences having high or low similarities to training sequences are required. To this end, we tested our models on two test sets: a general test set including sequences spanning different similarity levels to those of the training set, and a novel-test set consisting of only sequences that bear little resemblance to sequences from the training set. Our experiments showed that the Multi-Branch-CNN method performs better than thirteen traditional ML algorithms (TML13), yielding an improvement in accuracy of 3.2%, 1.2%, and 2.3% on the test sets as well as 8.8%, 14.3%, and 14.6% on the novel-test sets for sodium, potassium, and calcium ion channels, respectively. We confirmed the effectiveness of Multi-Branch-CNN by comparing it to the standard CNN method with one input branch (Single-Branch-CNN) and an ensemble method (TML13-Stack). The data sets, script files to reproduce the experiments, and the final predictive models are freely available at https://github.com/jieluyan/Multi-Branch-CNN.
Affiliation(s)
- Jielu Yan
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China
| | - Bob Zhang
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China.
| | - Mingliang Zhou
- School of Computer Science, Chongqing University, Shapingba, Chongqing, China
| | - Hang Fai Kwok
- Department of Biomedical Sciences, Faculty of Health Sciences, University of Macau, Taipa, Macao Special Administrative Region of China.
| | - Shirley W I Siu
- Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China; Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macao Special Administrative Region of China.
| |
|
8
|
Ahmed Z, Zulfiqar H, Khan AA, Gul I, Dao FY, Zhang ZY, Yu XL, Tang L. iThermo: A Sequence-Based Model for Identifying Thermophilic Proteins Using a Multi-Feature Fusion Strategy. Front Microbiol 2022; 13:790063. [PMID: 35273581 PMCID: PMC8902591 DOI: 10.3389/fmicb.2022.790063] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 10/06/2021] [Accepted: 01/10/2022] [Indexed: 01/20/2023]
Abstract
Thermophilic proteins have important application value in biotechnology and industrial processes. The correct identification of thermophilic proteins provides important information for the application of these proteins in engineering. Biochemistry-based methods for identifying thermophilic proteins are laborious, time-consuming, and costly. Therefore, there is an urgent need for a fast and accurate method to identify thermophilic proteins. Considering this urgency, we constructed a reliable benchmark dataset containing 1,368 thermophilic and 1,443 non-thermophilic proteins. A multi-layer perceptron (MLP) model based on a multi-feature fusion strategy was proposed to discriminate thermophilic proteins from non-thermophilic proteins. On an independent data set, the proposed model achieved an accuracy of 96.26%, which demonstrates that the model has good application prospects. To make the model convenient to use, a user-friendly software package called iThermo was established and can be freely accessed at http://lin-group.cn/server/iThermo/index.html. The high accuracy of the model and the practicability of the developed software package indicate that this study can accelerate the discovery and engineering application of thermally stable proteins.
Affiliation(s)
- Zahoor Ahmed
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hasan Zulfiqar
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Abdullah Aman Khan
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China.,Sichuan Artificial Intelligence Research Institute, Yibin, China
| | - Ijaz Gul
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Tsinghua Shenzhen International Graduate School, Institute of Biopharmaceutical and Health Engineering, Tsinghua University, Shenzhen, China
| | - Fu-Ying Dao
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Zhao-Yue Zhang
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Xiao-Long Yu
- School of Materials Science and Engineering, Hainan University, Haikou, China
| | - Lixia Tang
- School of Life Sciences and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
9
|
Charoenkwan P, Chotpatiwetchkul W, Lee VS, Nantasenamat C, Shoombuatong W. A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides. Sci Rep 2021; 11:23782. [PMID: 34893688 PMCID: PMC8664844 DOI: 10.1038/s41598-021-03293-w] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 08/23/2021] [Accepted: 12/01/2021] [Indexed: 02/08/2023]
Abstract
Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906-0.910) and 2-17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlabstack.pythonanywhere.com/SCMTPP in order to allow easy access to the model. SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and guiding experimental characterization of TPPs.
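The idea of scoring a sequence from propensity scores of g-gap dipeptides can be sketched as follows. This is a simplified illustration and not the SCMTPP implementation: a g-gap dipeptide is taken to be a residue pair separated by g intervening positions, and the sequence score is taken to be the composition-weighted sum of per-dipeptide propensities. The function names, the propensity values, and the default score for unlisted dipeptides are all hypothetical. How SCMTPP estimates and optimizes its propensity scores is not described in the abstract and is not reproduced here.

```python
def g_gap_dipeptide_composition(sequence, g=0):
    """Relative frequency of residue pairs separated by g positions.

    With g=0 this reduces to the ordinary (adjacent) dipeptide composition.
    """
    seq = sequence.upper()
    counts = {}
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        counts[pair] = counts.get(pair, 0) + 1
        total += 1
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


def scoring_card_score(sequence, propensity, g=0, default=0.0):
    """Composition-weighted sum of dipeptide propensity scores.

    `propensity` maps dipeptides to scores; the values used in the example
    are made up, and `default` (for unlisted dipeptides) is an assumption.
    """
    composition = g_gap_dipeptide_composition(sequence, g)
    return sum(propensity.get(pair, default) * freq
               for pair, freq in composition.items())


if __name__ == "__main__":
    # Hypothetical propensity scores for two 1-gap dipeptides.
    toy_propensity = {"AD": 800.0, "CA": 200.0}
    print(scoring_card_score("ACDA", toy_propensity, g=1))
```

In a scoring-card style classifier, a sequence would then be assigned to the positive class when its score exceeds a threshold chosen on training data, which keeps each dipeptide's contribution directly readable.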
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200 Thailand
| | - Warot Chotpatiwetchkul
- Applied Computational Chemistry Research Unit, Department of Chemistry, School of Science, King Mongkut’s Institute of Technology Ladkrabang, Bangkok, 10520 Thailand
| | - Vannajan Sanghiran Lee
- Department of Chemistry, Centre of Theoretical and Computational Physics, Faculty of Science, University of Malaya, 50603 Kuala Lumpur, Malaysia
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700 Thailand
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.
| |
|
10
|
Chen YZ, Wang ZZ, Wang Y, Ying G, Chen Z, Song J. nhKcr: a new bioinformatics tool for predicting crotonylation sites on human nonhistone proteins based on deep learning. Brief Bioinform 2021; 22:6277413. [PMID: 34002774 DOI: 10.1093/bib/bbab146] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 02/22/2021] [Revised: 03/18/2021] [Accepted: 03/25/2021] [Indexed: 12/20/2022]
Abstract
Lysine crotonylation (Kcr) is a newly discovered type of protein post-translational modification and has been reported to be involved in various pathophysiological processes. High-resolution mass spectrometry is the primary approach for identification of Kcr sites. However, experimental approaches for identifying Kcr sites are often time-consuming and expensive when compared with computational approaches. To date, several predictors for Kcr site prediction have been developed, most of which are capable of predicting crotonylation sites on either histones alone or mixed histone and nonhistone proteins together. These methods exhibit high diversity in their algorithms, encoding schemes, feature selection techniques and performance assessment strategies. However, none of them were designed for predicting Kcr sites on nonhistone proteins. Therefore, it is desirable to develop an effective predictor for identifying Kcr sites from the large amount of nonhistone sequence data. For this purpose, we first provide a comprehensive review of six methods for predicting crotonylation sites. Second, we develop a novel deep learning-based computational framework termed CNNrgb for Kcr site prediction on nonhistone proteins by integrating different types of features. We benchmark its performance against multiple commonly used machine learning classifiers (including random forest, logitboost, naïve Bayes and logistic regression) by performing both 10-fold cross-validation and independent tests. The results show that the proposed CNNrgb framework achieves the best performance with high computational efficiency on large datasets. Moreover, to facilitate users' efforts to investigate Kcr sites on human nonhistone proteins, we implement an online server called nhKcr and compare it with other existing tools to illustrate the utility and robustness of our method. The nhKcr web server and all the datasets utilized in this study are freely accessible at http://nhKcr.erc.monash.edu/.
Affiliation(s)
- Yong-Zi Chen
- Laboratory of Tumor Cell Biology, Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
| | | | | | - Guoguang Ying
- Laboratory of Tumor Cell Biology in Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
| | - Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Australia
| |
|
11
|
Abdel-Naby MA, El-Wafa WMA, Salem GEM. Molecular characterization, catalytic, kinetic and thermodynamic properties of protease produced by a mutant of Bacillus cereus-S6-3. Int J Biol Macromol 2020; 160:695-702. [PMID: 32485254 DOI: 10.1016/j.ijbiomac.2020.05.241] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/29/2020] [Revised: 05/18/2020] [Accepted: 05/19/2020] [Indexed: 01/19/2023]
Abstract
The proteolytic strain Bacillus cereus-S6-3 was subjected to mutagenic treatments, viz. UV irradiation and methyl methane sulfonate (MMS). The obtained mutant strain, B. cereus-S6-3/UM90, showed 1.34 fold over the parent strain. Molecular characterization of proteases from the parent (PP/S6-3) and mutant (PM/UM90) strains indicated that they consisted of two domains and bind a zinc ion and 4 calcium ions in the active site. Amino acid sequence alignment of PM/UM90 protease showed that 19 amino acid residues were substituted compared to the wild-type enzyme. However, both proteases contained an equal number of aromatic and hydrophobic amino acids. Protease from PM/UM90 showed an effective improvement in thermal properties in terms of reaction temperature, t1/2, the values of kd, activation energy (Ea), and decimal reduction time (D) within the temperature range from 60 to 80 °C. In addition, the kinetic and thermodynamic parameters for substrate hydrolysis (i.e., Km, Vmax, ΔH*, ΔG*, ΔS*, kcat, Vmax/Km, kcat/Km, ΔG*E-T and ΔG*E-S) showed a significant improvement of the catalytic efficiency for PM/UM90 protease. Furthermore, the correlation between thermodynamic properties and the patterns of amino acid substitution of the wild-type enzyme has been discussed.
Affiliation(s)
- Mohamed A Abdel-Naby
- Department of Chemistry of Natural and Microbial Products, National Research Center, Cairo, Egypt.
|
12
|
Wang L, Wang M, Shi X, Yang J, Qian C, Liu Q, Zong L, Liu X, Zhu Z, Tang D, Zhang X. Investigation into archaeal extremophilic lifestyles through comparative proteogenomic analysis. J Biomol Struct Dyn 2020; 39:7080-7092. [PMID: 32820705 DOI: 10.1080/07391102.2020.1808531] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023]
Abstract
Archaea are a group of primary life forms on Earth and can thrive in many unique environments. Their successful colonization of extreme niches requires corresponding adaptations at the proteogenomic level in order to maintain stable cellular structures and active physiological functions. Although some studies have already investigated the extremophilic lifestyles of archaeal species based on genomic features and protein structures, there is a lack of comparative proteogenomic analysis on a large scale. In this study, we explored 686 high-quality archaeal genomes (proteomes) sourced from the Pathosystems Resource Integration Center (PATRIC) database. General patterns of genomic features such as genome size, coding capacity (coding genes and non-coding regions), and G + C contents were re-confirmed. Protein domain distribution patterns were then identified across archaeal species. Domains with unknown functions (DUFs) and mini proteins were investigated in terms of their distributions due to their importance in archaeal physiological functions. In addition, physicochemical properties of protein sequences, such as stability, hydrophobicity, isoelectric point, aromaticity and amino acid compositions in corresponding archaeal groups were compared. Unique features associated with extremophilic lifestyles were observed, which suggested that evolutionary adaptations to different extreme environments had intrinsic impacts on archaeal protein features. Taken together, this systematic study facilitates a better understanding of the mechanisms behind the extremophilic lifestyles of archaeal species, which will further contribute to the evolutionary exploration of archaeal adaptations both experimentally and theoretically in future studies. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Liang Wang
- Key Laboratory of Carbohydrate Chemistry and Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, Jiangsu, China.,Department of Bioinformatics, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, China.,Jiangsu Key Lab of New Drug Research and Clinical Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China
| | - Mengmeng Wang
- Jiangsu Key Lab of New Drug Research and Clinical Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China.,Department of Pharmaceutical Analysis, School of Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China
| | - Xinyi Shi
- School of Life Science, Xuzhou Medical University, Xuzhou, Jiangsu, China
| | - Jianye Yang
- School of Life Science, Xuzhou Medical University, Xuzhou, Jiangsu, China
| | - Chenlu Qian
- School of Life Science, Xuzhou Medical University, Xuzhou, Jiangsu, China
| | - Qinghua Liu
- Jiangsu Key Lab of New Drug Research and Clinical Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China.,Department of Pharmaceutical Analysis, School of Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China
| | - Lixin Zong
- School of Life Science, Xuzhou Medical University, Xuzhou, Jiangsu, China
| | - Xin Liu
- Department of Bioinformatics, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, China
| | - Zuobin Zhu
- School of Life Science, Xuzhou Medical University, Xuzhou, Jiangsu, China
| | - Daoquan Tang
- Jiangsu Key Lab of New Drug Research and Clinical Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China.,Department of Pharmaceutical Analysis, School of Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu, China
| | - Xiao Zhang
- Department of Bioinformatics, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, China.,Department of Computer Science, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, Jiangsu, China
|
13
|
Yu Z, Shi D, Liu W, Meng Y, Meng F. Metabolome responses of Enterococcus faecium to acid shock and nitrite stress. Biotechnol Bioeng 2020; 117:3559-3571. [PMID: 32662876 DOI: 10.1002/bit.27497] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2020] [Revised: 06/26/2020] [Accepted: 07/08/2020] [Indexed: 01/21/2023]
Abstract
Enterococcus faecium is gaining increasing interest due to its virulence and tolerance to a range of stresses (e.g., acid shock and nitrite stress in the human stomach). The chemical taxonomy and basic structural features of cellular metabolites can provide a deeper understanding of bacterial tolerance at the molecular level. Here, we used hierarchical classification and molecular composition analysis to investigate the metabolome responses of E. faecium to acid shock and nitrite stress. Our results showed that considerable amounts of highly biodegradable compounds (e.g., dipeptides) were produced by E. faecium under acid shock, while nitrite stress induced the accumulation of some poorly biodegradable compounds (e.g., organoheterocyclic compounds and benzenoids). Complete genome analysis and metabolic pathway profiling suggested that E. faecium produced highly biodegradable metabolites responsible for proton translocation and biofilm formation, which increase its tolerance to acid shock. Yet the presence of poorly biodegradable metabolites due to nitrite exposure could disturb the bacterial production of surface proteins, thus inhibiting biofilm formation. Our approach uncovered the hidden interactions between intracellular metabolites and exogenous stress, and will improve the understanding of host-microbe interactions.
Affiliation(s)
- Zhong Yu
- School of Environmental Science and Engineering, Sun Yat-sen University, Guangzhou, China.,Guangdong Provincial Key Laboratory of Environmental Pollution Control and Remediation Technology, Sun Yat-sen University, Guangzhou, China
| | - Dongchen Shi
- School of Environmental Science and Engineering, Sun Yat-sen University, Guangzhou, China.,Guangdong Provincial Key Laboratory of Environmental Pollution Control and Remediation Technology, Sun Yat-sen University, Guangzhou, China
| | - Wencong Liu
- School of Environmental Science and Engineering, Sun Yat-sen University, Guangzhou, China.,Guangdong Provincial Key Laboratory of Environmental Pollution Control and Remediation Technology, Sun Yat-sen University, Guangzhou, China
| | - Yabing Meng
- School of Environmental Science and Engineering, Sun Yat-sen University, Guangzhou, China.,Guangdong Provincial Key Laboratory of Environmental Pollution Control and Remediation Technology, Sun Yat-sen University, Guangzhou, China
| | - Fangang Meng
- School of Environmental Science and Engineering, Sun Yat-sen University, Guangzhou, China.,Guangdong Provincial Key Laboratory of Environmental Pollution Control and Remediation Technology, Sun Yat-sen University, Guangzhou, China
|
14
|
Feng C, Ma Z, Yang D, Li X, Zhang J, Li Y. A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features. Front Bioeng Biotechnol 2020; 8:285. [PMID: 32432088 PMCID: PMC7214540 DOI: 10.3389/fbioe.2020.00285] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2020] [Accepted: 03/18/2020] [Indexed: 11/13/2022] Open
Abstract
The thermostability of proteins is a key factor considered during enzyme engineering, and finding a method that can identify thermophilic and non-thermophilic proteins will be helpful for enzyme design. In this study, we established a novel method combining mixed features and machine learning to achieve this recognition task. In this method, an amino acid reduction scheme was adopted to recode the amino acid sequence. Then, the physicochemical characteristics, auto-cross covariance (ACC), and reduced dipeptides were calculated and integrated to form a mixed feature set, which was processed using correlation analysis, feature selection, and principal component analysis (PCA) to remove redundant information. Finally, four machine learning methods and a dataset containing 500 random observations out of 915 thermophilic proteins and 500 random samples out of 793 non-thermophilic proteins were used for training and prediction. The experimental results showed that 98.2% of thermophilic and non-thermophilic proteins were correctly identified using 10-fold cross-validation. Moreover, our analysis of the final reserved and removed features yielded information about the crucial, unimportant, and insensitive elements and provided essential information for enzyme design.
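The evaluation protocol in this abstract (drawing 500 proteins per class from pools of 915 and 793, then running 10-fold cross-validation) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the seeds, and the use of integer IDs in place of real protein records are assumptions, and the classifier itself is left out.

```python
import random

def balanced_sample(pos_ids, neg_ids, n_per_class, seed=0):
    """Draw the same number of items, without replacement, from each class."""
    rng = random.Random(seed)
    return rng.sample(pos_ids, n_per_class), rng.sample(neg_ids, n_per_class)

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    # Slice the shuffled indices into k interleaved folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Hypothetical usage: integer IDs stand in for the protein records.
thermo, non_thermo = balanced_sample(list(range(915)), list(range(793)), 500)
splits = list(k_fold_indices(len(thermo) + len(non_thermo), k=10))
```

Each item appears in exactly one test fold, so every protein is predicted once by a model that never saw it during training.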
Affiliation(s)
- Changli Feng
- College of Information Science and Technology, Taishan University, Tai’an, China
| | - Zhaogui Ma
- College of Information Science and Technology, Taishan University, Tai’an, China
| | - Deyun Yang
- College of Information Science and Technology, Taishan University, Tai’an, China
| | - Xin Li
- College of Information Science and Technology, Taishan University, Tai’an, China
| | - Jun Zhang
- Department of Rehabilitation, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
| | - Yanjuan Li
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
|
15
|
Manavalan B, Subramaniyam S, Shin TH, Kim MO, Lee G. Machine-Learning-Based Prediction of Cell-Penetrating Peptides and Their Uptake Efficiency with Improved Accuracy. J Proteome Res 2018; 17:2715-2726. [PMID: 29893128 DOI: 10.1021/acs.jproteome.8b00148] [Citation(s) in RCA: 143] [Impact Index Per Article: 20.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Cell-penetrating peptides (CPPs) can enter cells as a variety of biologically active conjugates and have various biomedical applications. To offset the cost and effort of designing novel CPPs in laboratories, computational methods are needed to identify candidate CPPs before in vitro experimental studies. We developed a two-layer prediction framework called machine-learning-based prediction of cell-penetrating peptides (MLCPP). The first layer predicts whether a given peptide is a CPP or non-CPP, whereas the second layer predicts the uptake efficiency of the predicted CPPs. To construct the two-layer prediction framework, we employed four different machine-learning methods and five different compositions including amino acid composition (AAC), dipeptide composition, amino acid index, composition-transition-distribution, and physicochemical properties (PCPs). In the first layer, hybrid features (combination of AAC and PCP) and extremely randomized tree outperformed state-of-the-art predictors in CPP prediction with an accuracy of 0.896 when tested on independent data sets, whereas in the second layer, hybrid features obtained through a feature selection protocol and random forest produced an accuracy of 0.725, which is better than state-of-the-art predictors. We anticipate that our method MLCPP will become a valuable tool for predicting CPPs and their uptake efficiency and might facilitate hypothesis-driven experimental design. The MLCPP server interface along with the benchmarking and independent data sets are freely accessible at www.thegleelab.org/MLCPP.
Affiliation(s)
- Balachandran Manavalan
- Department of Physiology, Ajou University School of Medicine, Suwon 443380, Republic of Korea
| | - Tae Hwan Shin
- Department of Physiology, Ajou University School of Medicine, Suwon 443380, Republic of Korea.,Institute of Molecular Science and Technology, Ajou University, Suwon 443721, Republic of Korea
| | - Myeong Ok Kim
- Division of Life Science and Applied Life Science (BK21 Plus), College of Natural Sciences, Gyeongsang National University, Jinju 52828, Republic of Korea
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon 443380, Republic of Korea.,Institute of Molecular Science and Technology, Ajou University, Suwon 443721, Republic of Korea
|
16
|
A novel strategy to improve the thermostability of Penicillium camembertii mono- and di-acylglycerol lipase. Biochem Biophys Res Commun 2018; 500:639-644. [DOI: 10.1016/j.bbrc.2018.04.123] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2018] [Accepted: 04/14/2018] [Indexed: 01/24/2023]
|
17
|
Meher PK, Sahu TK, Mohanty J, Gahoi S, Purru S, Grover M, Rao AR. nifPred: Proteome-Wide Identification and Categorization of Nitrogen-Fixation Proteins of Diaztrophs Based on Composition-Transition-Distribution Features Using Support Vector Machine. Front Microbiol 2018; 9:1100. [PMID: 29896173 PMCID: PMC5986947 DOI: 10.3389/fmicb.2018.01100] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2017] [Accepted: 05/08/2018] [Indexed: 11/13/2022] Open
Abstract
As inorganic nitrogen compounds are essential for basic building blocks of life (e.g., nucleotides and amino acids), the role of biological nitrogen-fixation (BNF) is indispensable. All nitrogen-fixing microbes rely on the same nitrogenase enzyme for nitrogen reduction, which is in fact an enzyme complex involving as many as 20 genes. However, the occurrence of six genes, viz. nifB, nifD, nifE, nifH, nifK, and nifN, has been proposed to be essential for a functional nitrogenase enzyme. Therefore, identification of these genes is important to understand the mechanism of BNF as well as to explore the possibilities for improving BNF from an agricultural sustainability point of view. Further, though computational tools are available for the annotation and phylogenetic analysis of nifH gene sequences alone, to the best of our knowledge no tool is available for the computational prediction of the above-mentioned six categories of nitrogen-fixation (nif) genes or proteins. Thus, we proposed an approach, which is the first of its kind, for the computational identification of nif proteins encoded by the six categories of nif genes. Sequence-derived features were employed to map the input sequences into vectors of numeric observations that were subsequently fed to the support vector machine as input. Two types of classifier were constructed: (i) a binary classifier for classification of nif and non-nitrogen-fixation (non-nif) proteins, and (ii) a multi-class classifier for classification of six categories of nif proteins. Higher accuracies were observed for the combination of the composition-transition-distribution (CTD) feature set and radial kernel, as compared to the other feature-kernel combinations. Overall accuracies of >90% were observed in both binary and multi-class classifications. The developed approach further achieved >92% accuracy when evaluated with blind (independent) test datasets. The developed approach also produced higher accuracy in identifying nif proteins when evaluated using proteome-wide datasets of several species. Furthermore, we established a prediction server nifPred (http://webapp.cabgrid.res.in/nifPred) to assist the scientific community in proteome-wide identification of six categories of nif proteins. Besides, the source code of nifPred is also available at https://github.com/PrabinaMeher/nifPred. The developed web server is expected to supplement the transcriptional profiling and comparative genomics studies for the identification and functional annotation of genes related to BNF.
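The composition-transition-distribution (CTD) encoding mentioned in this abstract can be illustrated for a single physicochemical property. The sketch below assumes one commonly used three-class hydrophobicity grouping and one particular indexing convention for the distribution terms; the grouping, the function names, and the convention are assumptions rather than details confirmed by the abstract, and nifPred's own implementation may differ.

```python
# Assumed three-class hydrophobicity grouping (polar, neutral, hydrophobic).
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
CLASS_OF = {aa: label for label, members in GROUPS.items() for aa in members}

def encode(seq):
    """Recode a protein sequence as a string of class labels, skipping unknowns."""
    return "".join(CLASS_OF[aa] for aa in seq.upper() if aa in CLASS_OF)

def composition(enc):
    """Fraction of residues falling in each class (3 values)."""
    n = len(enc)
    return [enc.count(c) / n if n else 0.0 for c in "123"]

def transition(enc):
    """Fraction of adjacent pairs that switch between two classes (3 values)."""
    n = len(enc) - 1
    pairs = [enc[i:i + 2] for i in range(max(n, 0))]
    out = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for p in pairs if p in (a + b, b + a))
        out.append(hits / n if n > 0 else 0.0)
    return out

def distribution(enc):
    """Relative positions of the first, 25%, 50%, 75% and last residue of each class (15 values)."""
    n = len(enc)
    out = []
    for c in "123":
        positions = [i + 1 for i, x in enumerate(enc) if x == c]
        if not positions:
            out.extend([0.0] * 5)
            continue
        k = len(positions)
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            idx = max(int(frac * k) - 1, 0)  # one indexing convention among several
            out.append(positions[idx] / n)
    return out

def ctd(seq):
    """Concatenate C, T and D into a 21-dimensional vector for one property."""
    enc = encode(seq)
    return composition(enc) + transition(enc) + distribution(enc)
```

Repeating this for several properties and concatenating the results gives the kind of fixed-length numeric vector that can be passed to a support vector machine regardless of sequence length.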
Affiliation(s)
- Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Tanmaya K Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Jyotilipsa Mohanty
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India.,Department of Bioinformatics, Orissa University of Agriculture and Technology, Bhubaneswar, India
| | - Shachi Gahoi
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Supriya Purru
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Monendra Grover
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Atmakuri R Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
|
18
|
Wang S, Yue Y. Protein subnuclear localization based on a new effective representation and intelligent kernel linear discriminant analysis by dichotomous greedy genetic algorithm. PLoS One 2018; 13:e0195636. [PMID: 29649330 PMCID: PMC5896989 DOI: 10.1371/journal.pone.0195636] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2017] [Accepted: 03/26/2018] [Indexed: 01/03/2023] Open
Abstract
A wide variety of methods have been proposed in protein subnuclear localization to improve the prediction accuracy. However, an important trend among these methods is to build a fusion representation from multiple feature representations, and the fusion process is time-consuming. In view of this, this paper proposes a novel method combining a new single feature representation and a new algorithm to obtain a good recognition rate. Specifically, based on the position-specific scoring matrix (PSSM), we proposed a new expression, correlation position-specific scoring matrix (CoPSSM), as the protein feature representation. Based on the classic nonlinear dimension reduction algorithm, kernel linear discriminant analysis (KLDA), we added a new discriminant criterion and proposed a dichotomous greedy genetic algorithm (DGGA) to intelligently select its kernel bandwidth parameter. Two public datasets with Jackknife test and KNN classifier were used for the numerical experiments. The results showed that the overall success rate (OSR) with the single representation CoPSSM is larger than that with many relevant representations. The OSR of the proposed method can reach as high as 87.444% and 90.3361% for these two datasets, respectively, outperforming many current methods. To show the generalization of the proposed algorithm, two extra standard datasets of protein subcellular localization were chosen to conduct an extended experiment, and the prediction accuracy by Jackknife test and independent test is still considerable.
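The Jackknife test with a KNN classifier used in this abstract amounts to leave-one-out evaluation: each protein is held out in turn and classified from all the others. The sketch below is a minimal standard-library illustration using Euclidean distance on plain feature tuples; it does not implement CoPSSM, KLDA, or DGGA, and the function names are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=1):
    """Predict a label by majority vote among the k nearest training points."""
    ranked = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def jackknife_accuracy(xs, ys, k=1):
    """Leave-one-out success rate: hold out each sample once and predict it."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)

# Hypothetical toy data: two well-separated classes in two dimensions.
features = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = ["a", "a", "b", "b"]
overall_success_rate = jackknife_accuracy(features, labels, k=1)
```

Because every sample serves as the test case exactly once, the result is deterministic for a given dataset, which is one reason this test is popular for comparing predictors on small benchmarks.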
Affiliation(s)
- Shunfang Wang
- School of Information Science and Engineering, Yunnan University, Kunming, PR China
| | - Yaoting Yue
- School of Information Science and Engineering, Yunnan University, Kunming, PR China
|
19
|
Meher PK, Sahu TK, Gahoi S, Rao AR. ir-HSP: Improved Recognition of Heat Shock Proteins, Their Families and Sub-types Based On g-Spaced Di-peptide Features and Support Vector Machine. Front Genet 2018; 8:235. [PMID: 29379521 PMCID: PMC5770798 DOI: 10.3389/fgene.2017.00235] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2017] [Accepted: 12/27/2017] [Indexed: 12/24/2022] Open
Abstract
Heat shock proteins (HSPs) play a pivotal role in cell growth and variability. Since conventional approaches are expensive and voluminous protein sequence information is available in the post-genomic era, development of an automated and accurate computational tool is highly desirable for prediction of HSPs, their families and sub-types. Thus, we propose a computational approach for reliable prediction of all these components in a single framework and with higher accuracy as well. The proposed approach achieved an overall accuracy of ~84% in predicting HSPs, ~97% in predicting six different families of HSPs, and ~94% in predicting four types of DnaJ proteins, with benchmark datasets. The developed approach also achieved higher accuracy as compared to most of the existing approaches. For easy prediction of HSPs by experimental scientists, a user-friendly web server ir-HSP is made freely accessible at http://cabgrid.res.in:8080/ir-hsp. The ir-HSP was further evaluated for proteome-wide identification of HSPs by using proteome datasets of eight different species, and ~50% of the predicted HSPs in each species were found to be annotated with InterPro HSP families/domains. Thus, the developed computational method is expected to supplement the currently available approaches for prediction of HSPs, to the extent of their families and sub-types.
Affiliation(s)
- Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Tanmaya K Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Shachi Gahoi
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Atmakuri R Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
|
20
|
Tang H, Cao RZ, Wang W, Liu TS, Wang LM, He CM. A two-step discriminated method to identify thermophilic proteins. INT J BIOMATH 2017. [DOI: 10.1142/s1793524517500504] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Improving thermostability of an enzyme can accelerate the relevant chemical reaction. Thus, the analysis and prediction of thermophilic proteins are conducive to protein engineering and enzyme design. In this study, a novel method based on two-step discrimination was proposed to distinguish between thermophilic and non-thermophilic proteins. The model was rigorously benchmarked on an objective dataset including 915 thermophilic proteins and 793 non-thermophilic proteins. Results showed that the overall accuracy of our method is 94.44% in 5-fold cross-validation, which is higher than those of other published methods. We believe that the two-step discriminated strategy will become a promising method in the relevant field of protein bioinformatics.
Affiliation(s)
- Hua Tang
- Department of Pathophysiology, Southwest Medical University, Luzhou 646000, P. R. China
| | - Ren-Zhi Cao
- Computer Science Department, Pacific Lutheran University, Tacoma WA 98447, USA
| | - Wen Wang
- Computer Science Department, Pacific Lutheran University, Tacoma WA 98447, USA
| | - Tie-Shan Liu
- Maize Institute, Shandong Academy of Agricultural Science, Jinan 250100, P. R. China
| | - Li-Ming Wang
- Maize Institute, Shandong Academy of Agricultural Science, Jinan 250100, P. R. China
| | - Chun-Mei He
- Maize Institute, Shandong Academy of Agricultural Science, Jinan 250100, P. R. China
|
21
|
Meher PK, Sahu TK, Banchariya A, Rao AR. DIRProt: a computational approach for discriminating insecticide resistant proteins from non-resistant proteins. BMC Bioinformatics 2017; 18:190. [PMID: 28340571 PMCID: PMC5364559 DOI: 10.1186/s12859-017-1587-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2016] [Accepted: 03/09/2017] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND: Insecticide resistance is a major challenge for the control program of insect pests in the fields of crop protection, human and animal health, etc. Resistance to different insecticides is conferred by the proteins encoded by certain classes of insect genes. To distinguish insecticide resistant proteins from non-resistant proteins, no computational tool is available to date. Thus, development of such a computational tool will be helpful in predicting the insecticide resistant proteins, which can be targeted for developing appropriate insecticides. RESULTS: Five different feature sets, viz. amino acid composition (AAC), di-peptide composition (DPC), pseudo amino acid composition (PAAC), composition-transition-distribution (CTD) and auto-correlation function (ACF), were used to map the protein sequences into numeric feature vectors. The encoded numeric vectors were then used as input to a support vector machine (SVM) for classification of insecticide resistant and non-resistant proteins. Higher accuracies were obtained under the RBF kernel than under other kernels. Further, accuracies were observed to be higher for the DPC feature set as compared to others. The proposed approach achieved an overall accuracy of >90% in discriminating resistant from non-resistant proteins. Further, the two classes of resistant proteins, i.e., detoxification-based and target-based, were discriminated from non-resistant proteins with >95% accuracy. Besides, >95% accuracy was also observed for discrimination of proteins involved in detoxification- and target-based resistance mechanisms. The proposed approach not only outperformed the Blastp, PSI-Blast and Delta-Blast algorithms, but also achieved >92% accuracy when assessed using an independent dataset of 75 insecticide resistant proteins. CONCLUSIONS: This paper presents the first computational approach for discriminating insecticide resistant proteins from non-resistant proteins. Based on the proposed approach, an online prediction server DIRProt has also been developed for computational prediction of insecticide resistant proteins, which is accessible at http://cabgrid.res.in:8080/dirprot/. The proposed approach is believed to supplement the efforts needed to develop dynamic insecticides in the wet lab by targeting the insecticide resistant proteins.
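Two of the feature sets named in this abstract, amino acid composition (AAC) and di-peptide composition (DPC), are simple frequency counts that turn a sequence of any length into a fixed-length vector. The sketch below shows one straightforward way to compute them; the function names are illustrative, and skipping non-standard residues is an assumption rather than a detail taken from DIRProt.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(seq):
    """Fraction of each standard amino acid in the sequence (20 values)."""
    residues = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    n = len(residues)
    return {aa: (residues.count(aa) / n if n else 0.0) for aa in AMINO_ACIDS}

def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair in the sequence (400 values)."""
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
            total += 1
    return {pair: (c / total if total else 0.0) for pair, c in counts.items()}
```

AAC ignores residue order entirely, while DPC retains local order between neighbours, which is one plausible reason a dipeptide encoding can separate classes that plain composition cannot.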
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India
| | - Tanmaya Kumar Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India
| | - Anjali Banchariya
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.,Department of Bioinformatics, Janta Vedic College, Baraut, Baghpat, 250611, Uttar Pradesh, India
| | - Atmakuri Ramakrishna Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
|
22
|
Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA. Int J Mol Sci 2015; 16:30343-61. [PMID: 26703574 PMCID: PMC4691178 DOI: 10.3390/ijms161226237] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2015] [Revised: 12/07/2015] [Accepted: 12/11/2015] [Indexed: 01/01/2023] Open
Abstract
An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and position specific scoring matrix (PSSM), are insufficient to represent protein sequence due to their single perspectives. Thus, this paper proposes two fusion feature representations of DipPSSM and PseAAPSSM to integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce the balance factors to value the importance of its components. The optimal values of the balance factors are sought by genetic algorithm. Due to the high dimensionality of the proposed representations, linear discriminant analysis (LDA) is used to find its important low dimensional structure, which is essential for classification and location prediction. The numerical experiments on two public datasets with KNN classifier and cross-validation tests showed that in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusing representations outperform the traditional representations in protein sub-nuclear localization, and the representation treated by LDA outperforms the untreated one.
|
23
|
Govindan G, Nair AS. Bagging with CTD--a novel signature for the hierarchical prediction of secreted protein trafficking in eukaryotes. GENOMICS PROTEOMICS & BIOINFORMATICS 2013; 11:385-90. [PMID: 24316328 PMCID: PMC4357838 DOI: 10.1016/j.gpb.2013.07.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/02/2013] [Revised: 07/01/2013] [Accepted: 07/17/2013] [Indexed: 11/19/2022]
Abstract
Protein trafficking or protein sorting in eukaryotes is a complicated process and is carried out based on the information contained in the protein. Many methods have been reported for predicting the subcellular location of proteins from sequence information. However, most of these prediction methods use a flat structure or parallel architecture to perform prediction. In this work, we introduce ensemble classifiers with features that are extracted directly from full-length protein sequences to predict locations in the protein-sorting pathway hierarchically. Sequence-driven features, sequence-mapped features and sequence autocorrelation features were tested with ensemble learners and their performances were compared. When evaluated by independent data testing, ensemble-based bagging algorithms with the sequence feature composition, transition and distribution (CTD) successfully classified two datasets with accuracies greater than 90%. We compared our results with similar published methods, and our method performed on par with the others at two levels in the secreted pathway. This study shows that the feature CTD extracted from protein sequences is effective in capturing biological features among compartments in secreted pathways.
Affiliation(s)
- Geetha Govindan
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram 695581, India.
| | - Achuthsankar S Nair
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram 695581, India
|
24
|
Support vector machine with a Pearson VII function kernel for discriminating halophilic and non-halophilic proteins. Comput Biol Chem 2013; 46:16-22. [DOI: 10.1016/j.compbiolchem.2013.05.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2013] [Revised: 04/24/2013] [Accepted: 05/03/2013] [Indexed: 01/15/2023]
|
25
|
Determination of amino acids and dipeptides is correlated significantly with optimum temperatures of microbial lipases. Ann Microbiol 2013. [DOI: 10.1007/s13213-012-0475-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022] Open
|
26
|
Van Damme P, Hole K, Pimenta-Marques A, Helsens K, Vandekerckhove J, Martinho RG, Gevaert K, Arnesen T. NatF contributes to an evolutionary shift in protein N-terminal acetylation and is important for normal chromosome segregation. PLoS Genet 2011; 7:e1002169. [PMID: 21750686 PMCID: PMC3131286 DOI: 10.1371/journal.pgen.1002169] [Citation(s) in RCA: 151] [Impact Index Per Article: 10.8] [Received: 01/17/2011] [Accepted: 05/20/2011] [Indexed: 01/31/2023] Open
Abstract
N-terminal acetylation (N-Ac) is a highly abundant eukaryotic protein modification. Proteomics revealed a significant increase in the occurrence of N-Ac from lower to higher eukaryotes, but evidence explaining the underlying molecular mechanism(s) is currently lacking. We first analysed protein N-termini and their acetylation degrees, suggesting that evolution of substrates is not a major cause for the evolutionary shift in N-Ac. Further, we investigated the presence of putative N-terminal acetyltransferases (NATs) in higher eukaryotes. The purified recombinant human and Drosophila homologues of a novel NAT candidate was subjected to in vitro peptide library acetylation assays. This provided evidence for its NAT activity targeting Met-Lys- and other Met-starting protein N-termini, and the enzyme was termed Naa60p and its activity NatF. Its in vivo activity was investigated by ectopically expressing human Naa60p in yeast followed by N-terminal COFRADIC analyses. hNaa60p acetylated distinct Met-starting yeast protein N-termini and increased general acetylation levels, thereby altering yeast in vivo acetylation patterns towards those of higher eukaryotes. Further, its activity in human cells was verified by overexpression and knockdown of hNAA60 followed by N-terminal COFRADIC. NatF's cellular impact was demonstrated in Drosophila cells where NAA60 knockdown induced chromosomal segregation defects. In summary, our study revealed a novel major protein modifier contributing to the evolution of N-Ac, redundancy among NATs, and an essential regulator of normal chromosome segregation. With the characterization of NatF, the co-translational N-Ac machinery appears complete since all the major substrate groups in eukaryotes are accounted for.
Affiliation(s)
- Petra Van Damme
- Department of Medical Protein Research, Ghent University, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
- Kristine Hole
- Department of Molecular Biology, University of Bergen, Bergen, Norway
- Department of Surgical Sciences, University of Bergen, Bergen, Norway
- Kenny Helsens
- Department of Medical Protein Research, Ghent University, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
- Joël Vandekerckhove
- Department of Medical Protein Research, Ghent University, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
- Kris Gevaert
- Department of Medical Protein Research, Ghent University, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
- Thomas Arnesen
- Department of Molecular Biology, University of Bergen, Bergen, Norway
- Department of Surgery, Haukeland University Hospital, Bergen, Norway
|
27
|
Ma X, Guo J, Wu J, Liu H, Yu J, Xie J, Sun X. Prediction of RNA-binding residues in proteins from primary sequence using an enriched random forest model with a novel hybrid feature. Proteins 2011; 79:1230-9. [DOI: 10.1002/prot.22958] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.6] [Received: 06/02/2010] [Revised: 11/02/2010] [Accepted: 11/24/2010] [Indexed: 11/10/2022]
|
28
|
New Feature Vector for Apoptosis Protein Subcellular Localization Prediction. Advances in Computing and Communications 2011. [DOI: 10.1007/978-3-642-22709-7_30] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/28/2023]
|
29
|
Chakravorty D, Parameswaran S, Dubey VK, Patra S. In silico characterization of thermostable lipases. Extremophiles 2010; 15:89-103. [PMID: 21153672 DOI: 10.1007/s00792-010-0337-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Received: 04/20/2010] [Accepted: 11/15/2010] [Indexed: 11/28/2022]
Abstract
Thermostable lipases are of high priority for industrial applications as they are endowed with the capability of carrying out diversified reactions at elevated temperatures. Extremophiles are their potential source. Sequence and structure annotation of thermostable lipases can elucidate evolution of lipases from their mesophilic counterparts with enhanced thermostability hence better industrial potential. Sequence analysis highlighted the conserved residues in bacterial and fungal thermostable lipases. Higher frequency of AXXXA motif and poly Ala residues in lid domain of thermostable Bacillus lipases were distinguishing characteristics. Comparison of amino acid composition among thermostable and mesostable lipases brought into light the role of neutral, charged and aromatic amino acid residues in enhancement of thermostability. Structural annotation of thermostable lipases with that of mesostable lipases revealed some striking features which are increment of gamma turns in thermostable lipases; being first time reported in our paper, longer beta strands, lesser beta-branched residues in helices, increase in charged-neutral hydrogen bonding pair, hydrophobic-hydrophobic contact and differences in the N-cap and C-cap residues of the α helices. Conclusively, it can be stated that subtle changes in the arrangement of amino acid residues in the tertiary structure of lipases contributes to enhanced thermostability.
Affiliation(s)
- Debamitra Chakravorty
- Department of Biotechnology, Indian Institute of Technology, Guwahati 781039, Assam, India
|
30
|
Lin H, Chen W. Prediction of thermophilic proteins using feature selection technique. J Microbiol Methods 2010; 84:67-70. [PMID: 21044646 DOI: 10.1016/j.mimet.2010.10.013] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.2] [Received: 08/17/2010] [Revised: 10/15/2010] [Accepted: 10/19/2010] [Indexed: 11/16/2022]
Abstract
The thermostability of proteins is particularly relevant for enzyme engineering. Developing a computational method to identify mesophilic proteins would be helpful for protein engineering and design. In this work, we developed a support vector machine-based method to predict thermophilic proteins using the information of amino acid distribution and selected amino acid pairs. A reliable benchmark dataset including 915 thermophilic proteins and 793 non-thermophilic proteins was constructed for training and testing the proposed models. Results showed that 93.8% of thermophilic proteins and 92.7% of non-thermophilic proteins could be correctly predicted by using jackknife cross-validation. The high prediction success rate indicates that this model can be applied for designing stable proteins.
Affiliation(s)
- Hao Lin
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
|
31
|
Wang J, Wang J, Gong X, Zheng G. A novel esterase Sso2518 from Sulfolobus solfataricus with a much lower temperature optimum than the growth temperature. Biotechnol Lett 2010; 32:1103-8. [DOI: 10.1007/s10529-010-0261-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/08/2010] [Accepted: 03/23/2010] [Indexed: 10/19/2022]
|
32
|
Gromiha MM, Suresh MX. Discrimination of mesophilic and thermophilic proteins using machine learning algorithms. Proteins 2008; 70:1274-9. [PMID: 17876820 DOI: 10.1002/prot.21616] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.1] [Indexed: 12/29/2022]
Abstract
Discriminating thermophilic proteins from their mesophilic counterparts is a challenging task and it would help to design stable proteins. In this work, we have systematically analyzed the amino acid compositions of 3075 mesophilic and 1609 thermophilic proteins belonging to 9 and 15 families, respectively. We found that the charged residues Lys, Arg, and Glu as well as the hydrophobic residues, Val and Ile have higher occurrence in thermophiles than mesophiles. Further, we have analyzed the performance of different methods, based on Bayes rules, logistic functions, neural networks, support vector machines, decision trees and so forth for discriminating mesophilic and thermophilic proteins. We found that most of the machine learning techniques discriminate these classes of proteins with similar accuracy. The neural network-based method could discriminate the thermophiles from mesophiles at the five-fold cross-validation accuracy of 89% in a dataset of 4684 proteins. Moreover, this method is tested with 325 mesophiles in Xylella fastidiosa and 382 thermophiles in Aquifex aeolicus and it could successfully discriminate them with the accuracy of 91%. These accuracy levels are better than other methods in the literature and we suggest that this method could be effectively used to discriminate mesophilic and thermophilic proteins.
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan.
|
33
|
Zhang G, Fang B. The influence of dipeptide composition on optimum temperature of alcohol dehydrogenase. Enzyme Microb Technol 2006. [DOI: 10.1016/j.enzmictec.2006.01.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
|
34
|
Zhang G, Fang B. Application of amino acid distribution along the sequence for discriminating mesophilic and thermophilic proteins. Process Biochem 2006. [DOI: 10.1016/j.procbio.2006.03.026] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
|
35
|
Liu L, Dong H, Wang S, Chen H, Shao W. Computational analysis of di-peptides correlated with the optimal temperature in G/11 xylanase. Process Biochem 2006. [DOI: 10.1016/j.procbio.2005.06.027] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
|