1
|
Arif M, Musleh S, Ghulam A, Fida H, Alqahtani Y, Alam T. StackDPPred: Multiclass prediction of defensin peptides using stacked ensemble learning with optimized features. Methods 2024; 230:129-139. [PMID: 39173785 DOI: 10.1016/j.ymeth.2024.08.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Revised: 07/30/2024] [Accepted: 08/13/2024] [Indexed: 08/24/2024] Open
Abstract
Host defense or antimicrobial peptides (AMPs) are promising candidates for protecting the host against microbial pathogens such as bacteria, viruses, fungi, and yeast. Defensins are a type of AMP that act as potential therapeutic agents and play a vital role in various biological processes. Conventional experiments to identify defensin peptides (DPs) are time-consuming and expensive. Computational methods are therefore used to overcome the shortcomings of wet-lab experiments and accurately predict the functional types of DPs. In this paper, we propose a novel multi-class ensemble-based prediction model called StackDPPred for identifying the properties of DPs. The peptide sequences are encoded using split amino acid composition (SAAC), segmented position specific scoring matrix (SegPSSM), histogram of oriented gradients-based PSSM (HOGPSSM) and feature extraction based graphical and statistical (FEGS) descriptors. Next, principal component analysis (PCA) is used to select the best subset of attributes. After that, the optimized features are fed into single machine learning and stacking-based ensemble classifiers. Furthermore, the ablation study demonstrates the robustness and efficacy of the stacking approach using reduced features for predicting DPs and their families. The proposed StackDPPred method improves the overall accuracy by 13.41% and 7.62% compared to the existing DP predictors iDPF-PseRAAC and iDEF-PseRAAC, respectively, on the validation test. Additionally, we applied the local interpretable model-agnostic explanations (LIME) algorithm to understand the contribution of selected features to the overall prediction. We believe StackDPPred could serve as a valuable tool for accelerating the screening of large-scale DPs and the peptide-based drug discovery process.
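Editorial illustration (not taken from the paper): the abstract above names split amino acid composition (SAAC) as one of the encodings. A minimal Python sketch of the general idea follows, assuming the common formulation in which a sequence is cut into N-terminal, middle, and C-terminal segments and the 20 amino acid frequencies of each segment are concatenated. The function names and the `terminal_len=5` default are hypothetical choices for this sketch; the segment lengths used by StackDPPred are not stated here.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(segment):
    """Return the 20 amino acid frequencies of one segment (all zeros if empty)."""
    counts = Counter(segment)
    n = len(segment)
    return [counts[a] / n if n else 0.0 for a in AMINO_ACIDS]


def split_aac(sequence, terminal_len=5):
    """Concatenate AAC vectors of the N-terminal, middle, and C-terminal segments.

    terminal_len is an assumed parameter for this sketch, not a value from the paper.
    """
    seq = sequence.upper()
    n_term = seq[:terminal_len]
    c_term = seq[max(0, len(seq) - terminal_len):]
    middle = seq[terminal_len:len(seq) - terminal_len]
    return aac(n_term) + aac(middle) + aac(c_term)


features = split_aac("ACDEFGHIKLMNPQRSTVWY", terminal_len=5)
print(len(features))  # 60 values: 3 segments x 20 amino acids
```

For sequences shorter than twice `terminal_len` the terminal segments overlap and the middle segment is empty, which this sketch encodes as zeros.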
Affiliation(s)
- Muhammad Arif
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
- Saleh Musleh
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
- Ali Ghulam
  - Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan
- Huma Fida
  - Department of Microbiology, Abdul Wali Khan University Mardan, 23200, KPK, Pakistan
- Tanvir Alam
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
|
2
|
Abstract
OBJECTIVE To summarize current research progress on machine learning in venous thromboembolism. METHODS Recent literature on machine learning applied to the risk factors, diagnosis, prevention and prognosis of venous thromboembolism was reviewed. RESULTS Machine learning is the future of biomedical research, personalized medicine, and computer-aided diagnosis, and will significantly promote the development of biomedical research and healthcare. However, many medical professionals are not familiar with it. In this review, we introduce several commonly used machine learning algorithms in medicine, discuss the application of machine learning in venous thromboembolism, and reveal the challenges and opportunities of machine learning in medicine. CONCLUSION The incidence of venous thromboembolism is high and its diagnostic measures are diverse, so classification and treatment are necessary; with machine learning as a research tool, it is all the more necessary to strengthen dedicated research on venous thromboembolism and machine learning.
Affiliation(s)
- Shirong Zou
  - West China Hospital of Medicine, West China Hospital Operation Room / West China School of Nursing, Sichuan University, Chengdu, China
- Zhoupeng Wu
  - Department of Vascular Surgery, West China Hospital, Sichuan University, Chengdu, China
|
3
|
Arif M, Fang G, Ghulam A, Musleh S, Alam T. DPI_CDF: druggable protein identifier using cascade deep forest. BMC Bioinformatics 2024; 25:145. [PMID: 38580921 PMCID: PMC11334562 DOI: 10.1186/s12859-024-05744-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Accepted: 03/13/2024] [Indexed: 04/07/2024] Open
Abstract
BACKGROUND Drug targets in living beings play pivotal roles in the discovery of potential drugs. Conventional wet-lab characterization of drug targets, although accurate, is generally expensive, slow, and resource intensive. Therefore, computational methods are highly desirable as an alternative to expedite the large-scale identification of druggable proteins (DPs); however, the performance of existing in silico predictors is still not satisfactory. METHODS In this study, we developed a novel deep learning-based model, DPI_CDF, for predicting DPs based on protein sequence only. DPI_CDF utilizes evolutionary-based (i.e., histograms of oriented gradients for position-specific scoring matrix), physicochemical-based (i.e., component protein sequence representation), and compositional-based (i.e., normalized qualitative characteristic) properties of the protein sequence to generate features. Then a hierarchical deep forest model fuses these three encoding schemes to build the proposed model DPI_CDF. RESULTS The empirical outcomes on 10-fold cross-validation demonstrate that the proposed model achieved 99.13% accuracy and a Matthews correlation coefficient (MCC) of 0.982 on the training dataset. The generalization power of the trained model was further examined on an independent dataset, where it achieved a maximum accuracy of 95.01% and an MCC of 0.900. When compared to current state-of-the-art methods, DPI_CDF improves accuracy by 4.27% and 4.31% on the training and testing datasets, respectively. We believe DPI_CDF will support the research community in identifying druggable proteins and escalate the drug discovery process. AVAILABILITY The benchmark datasets and source codes are available on GitHub: http://github.com/Muhammad-Arif-NUST/DPI_CDF.
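Editorial illustration (not taken from the paper): the abstract above reports accuracy alongside the Matthews correlation coefficient (MCC). As a reminder of how the two are computed from a binary confusion matrix, a minimal Python sketch follows. The counts in the usage line are made-up numbers chosen for easy arithmetic, not results from DPI_CDF, and returning 0.0 when the denominator vanishes is a common convention assumed here.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when any marginal is empty (assumed convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, for illustration only.
print(accuracy(tp=90, tn=80, fp=20, fn=10), mcc(tp=90, tn=80, fp=20, fn=10))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported together with accuracy on imbalanced datasets.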
Affiliation(s)
- Muhammad Arif
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Ge Fang
  - State Key Laboratory for Organic Electronics and Information Displays, Institute of Advanced Materials (IAM), Nanjing 210023, China
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Ali Ghulam
  - Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan
- Saleh Musleh
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Tanvir Alam
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
|
4
|
Abbass J, Parisi C. Machine learning-based prediction of proteins' architecture using sequences of amino acids and structural alphabets. J Biomol Struct Dyn 2024:1-16. [PMID: 38505995 DOI: 10.1080/07391102.2024.2328736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Accepted: 03/05/2024] [Indexed: 03/21/2024]
Abstract
In addition to the growth of protein structures generated through wet laboratory experiments and deposited in the PDB repository, AlphaFold predictions have significantly contributed to the creation of a much larger database of protein structures. Annotating such a vast number of structures has become an increasingly challenging task. CATH is widely recognized as one of the most common platforms for addressing this challenge, as it classifies proteins based on their structural and evolutionary relationships, offering the scientific community an invaluable resource for uncovering various properties, including functional annotations. While CATH annotation involves, to some extent, human intervention, keeping up with the classification of the rapidly expanding repositories of protein structures has become exceedingly difficult. Therefore, there is a pressing need for a fully automated approach. On the other hand, the abundance of protein sequences stemming from next-generation sequencing technologies, lacking structural annotations, presents an additional challenge to the scientific community. Consequently, 'pre-annotating' protein sequences with structural features, ensuring a high level of precision, could prove highly advantageous. In this paper, after a thorough investigation, we introduce a novel machine-learning model capable of classifying any protein domain, whether it has a known structure or not, into one of the 40 main CATH Architectures. We achieve an F1 Score of 0.92 using only the amino acid sequence and a score of 0.94 using both the sequence of amino acids and the sequence of structural alphabets.
Affiliation(s)
- Jad Abbass
  - School of Computer Science and Mathematics, Kingston University, London, UK
- Charles Parisi
  - School of Computer Science and Mathematics, Kingston University, London, UK
  - Telecom Physique Strasbourg, Strasbourg University, Strasbourg, France
|
5
|
Shen J, Xia Y, Lu Y, Lu W, Qian M, Wu H, Fu Q, Chen J. Identification of membrane protein types via deep residual hypergraph neural network. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:20188-20212. [PMID: 38052642 DOI: 10.3934/mbe.2023894] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]
Abstract
A membrane protein's functions are significantly associated with its type, so it is crucial to identify the types of membrane proteins. Conventional computational methods for identifying the types of membrane proteins tend to ignore two issues: high-order correlation among membrane proteins and the scenarios of multi-modal representations of membrane proteins, which leads to information loss. To tackle these two issues, we propose in this paper a deep residual hypergraph neural network (DRHGNN), which enhances the hypergraph neural network (HGNN) with initial residual and identity mapping. We carried out extensive experiments on four benchmark datasets of membrane proteins and compared DRHGNN with recently developed advanced methods. Experimental results showed the better performance of DRHGNN on the membrane protein classification task on all four datasets. Experiments also showed that, compared with HGNN, DRHGNN can handle the over-smoothing issue as the number of model layers increases. The code is available at https://github.com/yunfighting/Identification-of-Membrane-Protein-Types-via-deep-residual-hypergraph-neural-network.
Affiliation(s)
- Jiyun Shen
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Yiyi Xia
  - Tianping College of Suzhou University of Science and Technology, Suzhou, China
- Yiming Lu
  - Tianping College of Suzhou University of Science and Technology, Suzhou, China
- Weizhong Lu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
  - Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, China
- Meiling Qian
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Hongjie Wu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Qiming Fu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Jing Chen
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
|
6
|
Qian Y, Ding Y, Zou Q, Guo F. Multi-View Kernel Sparse Representation for Identification of Membrane Protein Types. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1234-1245. [PMID: 35857734 DOI: 10.1109/tcbb.2022.3191325] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Indexed: 05/04/2023]
Abstract
Membrane proteins are the main carriers of biomembrane functions and play a vital role in many biological activities of organisms. Prediction of membrane protein types is of great help in determining the function of proteins and understanding the interactions of membrane proteins. However, biochemical experiments are expensive and not suitable for the large-scale identification of membrane protein types. Therefore, computational methods have been used to improve the efficiency of biological experiments. Most existing computational methods use only a single feature of the protein, or use multiple features but do not integrate them well. In our study, the protein sequence is described via three different views (features): amino acid composition, evolutionary information and physicochemical properties of amino acids. To exploit information among all views (features), we introduce a coupling strategy for Kernel Sparse Representation based Classification (KSRC) and construct a new model called Multi-view KSRC (MvKSRC). We implement our method on 4 benchmark data sets of membrane proteins. The comparison results indicate that our method is much superior to all existing methods.
|
7
|
Ju Z, Wang SY. Prediction of lysine HMGylation sites using multiple feature extraction and fuzzy support vector machine. Anal Biochem 2023; 663:115032. [PMID: 36592921 DOI: 10.1016/j.ab.2022.115032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Accepted: 12/25/2022] [Indexed: 12/31/2022]
Abstract
Protein 3-hydroxyl-3-methylglutarylation (HMGylation) is a newly discovered lysine acylation modification in the mitochondrion. The accurate identification of HMGylation sites is the premise and key to further exploring the molecular mechanisms of HMGylation. In this study, a novel bioinformatics tool named HMGPred is developed to predict HMGylation sites. Multiple effective features, including amino acid composition, amino acid factors, binary encoding, and the composition of k-spaced amino acid pairs, are integrated to encode HMGylation sites. F-score feature ranking with incremental feature selection is used to eliminate redundant features. Moreover, a fuzzy support vector machine algorithm is used to effectively reduce the influence of noise by assigning different samples to different fuzzy membership degrees. As illustrated by 10-fold cross-validation, HMGPred achieves a satisfactory performance with an area under the receiver operating characteristic curve of 0.9110. Feature analysis indicates that some k-spaced amino acid pair features, such as 'KxxxT' and 'DxxxE', play a critical role in the prediction of HMGylation sites. The results of prediction and analysis might be helpful for investigating the mechanisms of HMGylation. For the convenience of experimental researchers, HMGPred is implemented as a web server at http://123.206.31.171/HMGPred/.
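Editorial illustration (not taken from the paper): the abstract above refers to the composition of k-spaced amino acid pairs and to features such as 'KxxxT', that is, K and T separated by three arbitrary residues (k = 3). A minimal Python sketch of this kind of encoding is given below. The function name, the normalisation by the number of windows, and the example peptide are assumptions made for illustration; HMGPred's exact window size and range of k are not stated here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_spaced_pair_composition(sequence, k):
    """Frequency of each residue pair (a, b) with exactly k residues between them.

    Returns a dict with 400 keys such as 'KT'; with k=3, 'KT' corresponds to
    the pattern 'KxxxT'. Pairs containing non-standard residues are skipped.
    """
    seq = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    windows = max(len(seq) - k - 1, 0)  # number of positions where a pair fits
    for i in range(windows):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    return {p: (c / windows if windows else 0.0) for p, c in counts.items()}


# Hypothetical peptide: 'KAAAT' and 'KCCCT' both match 'KxxxT'.
comp = k_spaced_pair_composition("KAAATKCCCT", k=3)
print(comp["KT"])
```

In practice such vectors are usually computed for several values of k and concatenated before feature selection.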
Affiliation(s)
- Zhe Ju
  - College of Science, Shenyang Aerospace University, 110136, People's Republic of China
- Shi-Yun Wang
  - College of Science, Shenyang Aerospace University, 110136, People's Republic of China
|
8
|
Sun J, Kulandaisamy A, Liu J, Hu K, Gromiha MM, Zhang Y. Machine learning in computational modelling of membrane protein sequences and structures: From methodologies to applications. Comput Struct Biotechnol J 2023; 21:1205-1226. [PMID: 36817959 PMCID: PMC9932300 DOI: 10.1016/j.csbj.2023.01.036] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/16/2022] [Revised: 01/16/2023] [Accepted: 01/25/2023] [Indexed: 01/29/2023] Open
Abstract
Membrane proteins mediate a wide spectrum of biological processes, such as signal transduction and cell communication. Due to the arduous and costly nature inherent to the experimental process, membrane proteins have long been devoid of well-resolved atomic-level tertiary structures and, consequently, the understanding of their functional roles underlying a multitude of life activities has been hampered. Currently, computational tools dedicated to furthering the structure-function understanding are primarily focused on utilizing intelligent algorithms to address a variety of site-wise prediction problems (e.g., topology and interaction sites), but are scattered across different computing sources. Moreover, the recent advent of deep learning techniques has immensely expedited the development of computational tools for membrane protein-related prediction problems. Given the growing number of applications optimized particularly by manifold deep neural networks, we herein provide a review on the current status of computational strategies mainly in membrane protein type classification, topology identification, interaction site detection, and pathogenic effect prediction. Meanwhile, we provide an overview of how the entire prediction process proceeds, including database collection, data pre-processing, feature extraction, and method selection. This review is expected to be useful for developing more extendable computational tools specific to membrane proteins.
Affiliation(s)
- Jianfeng Sun
  - Botnar Research Centre, Nuffield Department of Orthopedics, Rheumatology, and Musculoskeletal Sciences, University of Oxford, Headington, Oxford OX3 7LD, UK
- Arulsamy Kulandaisamy
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of BioSciences, Indian Institute of Technology Madras, Chennai 600 036, Tamilnadu, India
- Jacklyn Liu
  - UCL Cancer Institute, University College London, 72 Huntley Street, London WC1E 6BT, UK
- Kai Hu
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan 411105, China
- M. Michael Gromiha (corresponding author)
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of BioSciences, Indian Institute of Technology Madras, Chennai 600 036, Tamilnadu, India
- Yuan Zhang (corresponding author)
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan 411105, China
|
9
|
Jan A, Hayat M, Wedyan M, Alturki R, Gazzawe F, Ali H, Alarfaj FK. Target-AMP: Computational prediction of antimicrobial peptides by coupling sequential information with evolutionary profile. Comput Biol Med 2022; 151:106311. [PMID: 36410097 DOI: 10.1016/j.compbiomed.2022.106311] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/23/2022] [Revised: 11/02/2022] [Accepted: 11/13/2022] [Indexed: 11/18/2022]
Abstract
Antimicrobial peptides (AMPs) are gaining a lot of attention as cutting-edge treatments for many infectious disorders. The effectiveness of AMPs against bacteria, fungi, and viruses has persisted for a long period, making them the greatest option for addressing the growing problem of antibiotic resistance. Due to their wide-ranging actions, AMPs have become more prominent, particularly in therapeutic applications. The prediction of AMPs has become a difficult task for academics due to the explosive increase of AMPs documented in databases. Wet-lab investigations to find antimicrobial peptides are exceedingly costly, time-consuming, and even impossible for some species. Therefore, in order to choose the optimal AMP candidates prior to in vitro trials, an efficient computational method must be developed. In this study, an effort was made to develop a machine learning-based classification system that is effective, accurate, and can distinguish between antimicrobial peptides. The position-specific scoring matrix (PSSM), pseudo amino acid composition, dipeptide composition, and a combination of these three were utilized in the suggested scheme to extract salient aspects from AMP sequences. The classification techniques K-nearest neighbor (KNN), Random Forest (RF), and Support Vector Machine (SVM) were employed. On the independent dataset and training dataset, the accuracy levels achieved by the suggested predictor (Target-AMP) are 97.07% and 95.71%, respectively. The results show that, when compared to other techniques currently used in the literature, our Target-AMP had the best success rate.
Affiliation(s)
- Asad Jan
  - Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
- Maqsood Hayat
  - Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
- Mohammad Wedyan
  - Department of Autonomous Systems, Faculty of Artificial Intelligence, Al-Balqa Applied University, Al-Salt, 19117, Jordan
- Ryan Alturki
  - Department of Information Science, College of Computer and Information Systems, Umm Al-Qura University, Makkah, Saudi Arabia
- Foziah Gazzawe
  - Department of Information Science, College of Computer and Information Systems, Umm Al-Qura University, Makkah, Saudi Arabia
- Hashim Ali
  - Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
- Fawaz Khaled Alarfaj
  - College of Computer & Information Technology, King Faisal University, Saudi Arabia
|
10
|
Robust and accurate prediction of self-interacting proteins from protein sequence information by exploiting weighted sparse representation based classifier. BMC Bioinformatics 2022; 23:518. [PMID: 36457083 PMCID: PMC9713954 DOI: 10.1186/s12859-022-04880-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2022] [Accepted: 08/03/2022] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Self-interacting proteins (SIPs), in which two or more copies of a protein expressed by one gene can interact with each other, play a central role in the regulation of most living cells and cellular functions. Although numerous SIP data can be obtained using high-throughput experimental techniques, the experimental identification of SIPs still has several shortcomings: it is time-consuming, costly, inefficient, and inherently prone to high false-positive rates. Therefore, it is increasingly important to develop efficient and accurate automatic approaches that supplement experimental methods to assist and accelerate the prediction of SIPs from protein sequence information. RESULTS In this paper, we present a novel framework, termed GLCM-WSRC (gray level co-occurrence matrix-weighted sparse representation based classification), for predicting SIPs automatically based on protein evolutionary information from protein primary sequences. More specifically, we first convert the protein sequence into a Position Specific Scoring Matrix (PSSM) containing protein sequence evolutionary information, using the Position Specific Iterated BLAST (PSI-BLAST) tool. Second, using an efficient feature extraction approach, i.e., GLCM, we extract abstract salient and invariant feature vectors from the PSSM, and then perform a pre-processing operation, the adaptive synthetic (ADASYN) technique, to balance the SIPs dataset and generate new feature vectors for classification. Finally, we employ an efficient and reliable WSRC model to identify SIPs according to the known information of self-interacting and non-interacting proteins.
CONCLUSIONS Extensive experimental results show that the proposed approach exhibits high prediction performance with 98.10% accuracy on the yeast dataset and 91.51% accuracy on the human dataset, which further reveals that the proposed model could be a useful tool for large-scale self-interacting protein prediction and other bioinformatics detection tasks in the future.
|
11
|
A novel deep learning-assisted hybrid network for plasmodium falciparum parasite mitochondrial proteins classification. PLoS One 2022; 17:e0275195. [PMID: 36201724 PMCID: PMC9536844 DOI: 10.1371/journal.pone.0275195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Accepted: 09/12/2022] [Indexed: 11/18/2022] Open
Abstract
Plasmodium falciparum is a parasitic protozoan that can cause malaria, which is a deadly disease. Therefore, the accurate identification of malaria parasite mitochondrial proteins is essential for understanding their functions and identifying novel drug targets. For classifying protein sequences, several adaptive statistical techniques have been devised. Despite significant gains, prediction performance is still constrained by the lack of appropriate feature descriptors and learning strategies in current systems. Moreover, good ground truth data is important for Artificial Intelligence (AI)-based models, but such data is lacking in the literature. Therefore, in this work, we propose a novel hybrid network that combines a 1D Convolutional Neural Network (CNN) and a Bidirectional Gated Recurrent Unit (BGRU) to classify malaria parasite mitochondrial proteins. Furthermore, we curate sequence data collected from the National Center for Biotechnology Information (NCBI) and UniProtKB/Swiss-Prot protein databanks to prepare a dataset that can be used by the research community for evaluating AI-based algorithms. We obtain 4204 cases after preprocessing the collected data and denote this set of proteins as PF4204. Finally, we conduct an ablation study on several conventional and deep models using the PF4204 and benchmark PF2095 datasets. The proposed model 'CNN-BGRU' obtains accuracy values of 0.9096 and 0.9857 on the PF4204 and PF2095 datasets, respectively. In addition, CNN-BGRU is compared with state-of-the-art methods, where the results illustrate that it can extract robust features and identify proteins accurately.
|
12
|
Hayat M, Tahir M, Alarfaj FK, Alturki R, Gazzawe F. NLP-BCH-Ens: NLP-based intelligent computational model for discrimination of malaria parasite. Comput Biol Med 2022; 149:105962. [DOI: 10.1016/j.compbiomed.2022.105962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Revised: 07/29/2022] [Accepted: 08/13/2022] [Indexed: 11/03/2022]
|
13
|
Hosen MF, Mahmud SH, Ahmed K, Chen W, Moni MA, Deng HW, Shoombuatong W, Hasan MM. DeepDNAbP: A deep learning-based hybrid approach to improve the identification of deoxyribonucleic acid-binding proteins. Comput Biol Med 2022; 145:105433. [DOI: 10.1016/j.compbiomed.2022.105433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2021] [Revised: 03/11/2022] [Accepted: 03/20/2022] [Indexed: 11/03/2022]
|
14
|
Tahir M, Khan F, Hayat M, Alshehri MD. An effective machine learning-based model for the prediction of protein–protein interaction sites in health systems. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07024-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
|
15
|
Ge F, Hu J, Zhu YH, Arif M, Yu DJ. TargetMM: Accurate Missense Mutation Prediction by Utilizing Local and Global Sequence Information with Classifier Ensemble. Comb Chem High Throughput Screen 2022; 25:38-52. [PMID: 33280588 DOI: 10.2174/1386207323666201204140438] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/01/2020] [Revised: 10/22/2020] [Accepted: 10/26/2020] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE Missense mutation (MM) may lead to various human diseases by disabling proteins. Accurate prediction of MM is important and challenging for both protein function annotation and drug design. Although several computational methods yielded acceptable success rates, there is still room for further enhancing the prediction performance of MM. MATERIALS AND METHODS In the present study, we designed a new feature extracting method, which considers the impact degree of residues in the microenvironment range to the mutation site. Stringent cross-validation and independent test on benchmark datasets were performed to evaluate the efficacy of the proposed feature extracting method. Furthermore, three heterogeneous prediction models were trained and then ensembled for the final prediction. By combining the feature representation method and classifier ensemble technique, we reported a novel MM predictor called TargetMM for identifying the pathogenic mutations from the neutral ones. RESULTS Comparison outcomes based on statistical evaluation demonstrate that TargetMM outperforms the prior advanced methods on the independent test data. The source codes and benchmark datasets of TargetMM are freely available at https://github.com/sera616/TargetMM.git for academic use.
Affiliation(s)
- Fang Ge
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Jun Hu
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Yi-Heng Zhu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Muhammad Arif
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
|
16
|
Zhang Y, Ni J, Gao Y. RF-SVM: Identification of DNA-binding proteins based on comprehensive feature representation methods and support vector machine. Proteins 2021; 90:395-404. [PMID: 34455627 DOI: 10.1002/prot.26229] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/31/2021] [Revised: 08/10/2021] [Accepted: 08/24/2021] [Indexed: 01/07/2023]
Abstract
Protein-DNA interactions play an important role in biological processes, such as DNA replication, repair, and modification. To better understand their functions, one of the most important steps is the identification of DNA-binding proteins. We propose a DNA-binding protein predictor, namely RF-SVM, which uses four types of features: pseudo amino acid composition (PseAAC), amino acid distribution (AAD), adjacent amino acid composition frequency (ACF) and Local-DPP. The Random Forest algorithm is used to select the top 174 features, which are used to build the predictor model with the support vector machine (SVM) on the training dataset UniSwiss-Tr. Finally, the RF-SVM method is compared with other existing methods on the test dataset UniSwiss-Tst. The experimental results demonstrate that RF-SVM has an accuracy of 84.25%. Meanwhile, we find that the physicochemical properties of amino acids OOBM770101(H), CIDH920104(H), MIYS990104(H), NISK860101(H), VINM940103(H), and SNEP660101(A) contribute to predicting DNA-binding proteins. The main code and datasets are available at https://github.com/NiJianWei996/RF-SVM.
Affiliation(s)
- Yanping Zhang
- Department of Mathematics, School of Science, Hebei University of Engineering, Handan, China
- Jianwei Ni
- Department of Mathematics, School of Science, Hebei University of Engineering, Handan, China
- Ya Gao
- Department of Mathematics, School of Science, Hebei University of Engineering, Handan, China
|
17
|
Alballa M, Butler G. Integrative approach for detecting membrane proteins. BMC Bioinformatics 2020; 21:575. [PMID: 33349234 PMCID: PMC7751106 DOI: 10.1186/s12859-020-03891-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 11/16/2020] [Accepted: 11/18/2020] [Indexed: 11/16/2022] Open
Abstract
Background: Membrane proteins are key gates that control various vital cellular functions. They are often detected using transmembrane topology prediction tools. While these tools can detect integral membrane proteins, they do not address surface-bound proteins. In this study, we focused on finding the best techniques for distinguishing all types of membrane proteins. Results: This research first demonstrates the shortcomings of using only transmembrane topology prediction tools to detect all types of membrane proteins. Then, the performance of various feature extraction techniques in combination with different machine learning algorithms was explored. The experimental results obtained by cross-validation and independent testing suggest that an integrative approach, which combines the results of transmembrane topology prediction with those of position-specific scoring matrix (Pse-PSSM) optimized evidence-theoretic k nearest neighbor (OET-KNN) predictors, yields the best performance. Conclusion: The integrative approach outperforms the state-of-the-art methods in terms of accuracy and MCC, reaching an accuracy of 92.51% in independent testing, compared with the 89.53% and 79.42% achieved by the state-of-the-art methods.
Affiliation(s)
- Munira Alballa
- Department of Computer Science and Software Engineering, Concordia University, Montreal, QC, Canada.
- College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
- Gregory Butler
- Department of Computer Science and Software Engineering, Concordia University, Montreal, QC, Canada.
- Centre for Structural and Functional Genomics, Concordia University, Montreal, QC, 24105, Canada
|
18
|
Zhang X, Chen L. Prediction of membrane protein types by fusing protein-protein interaction and protein sequence information. Biochimica et Biophysica Acta - Proteins and Proteomics 2020; 1868:140524. [PMID: 32858174 DOI: 10.1016/j.bbapap.2020.140524] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/26/2020] [Revised: 07/17/2020] [Accepted: 07/30/2020] [Indexed: 11/30/2022]
Abstract
Membrane proteins are gatekeepers to the cell and essential in determining cell function. Identifying the types of membrane proteins is an essential problem in cell biology, but doing so with traditional experimental methods is time-consuming and expensive. The alternative is to design effective computational methods that provide quick and reliable predictions. To date, several computational methods have been proposed in this regard, several of which use features extracted from the sequence information of individual proteins. Recently, networks have become increasingly popular for tackling protein-related problems, because they organize proteins at a system level and give an overview of all proteins. However, such a representation weakens the essential properties of proteins, such as their sequence information. In this study, a novel feature fusion scheme was proposed that integrates the information of protein sequences and the protein-protein interaction network. The fused features of a protein were defined as the linear combination of the sequence features of all proteins in the network, where the combination coefficients were the probabilities yielded by the random walk with restart algorithm with that protein as the seed node. Several models with such fused features and different classification algorithms were built and evaluated. Their performance in predicting the type of membrane proteins improved on that of models using only the sequence features or only the network information.
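The fusion step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the restart probability of 0.5, the convergence tolerance, and the toy network are all assumptions made for illustration, and the sketch assumes every protein has at least one interaction so that the visiting probabilities sum to one.

```python
def random_walk_with_restart(adjacency, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probabilities of a walk restarting at `seed`.

    `adjacency` is a square list of lists of non-negative edge weights.
    The restart probability and tolerance are illustrative assumptions.
    """
    n = len(adjacency)
    # Column-normalise so that transition[i][j] is the chance of stepping from j to i.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    probs = start[:]
    for _ in range(max_iter):
        updated = [
            (1 - restart) * sum(transition[i][j] * probs[j] for j in range(n))
            + restart * start[i]
            for i in range(n)
        ]
        # Stop once the probability vector no longer changes noticeably.
        if sum(abs(a - b) for a, b in zip(updated, probs)) < tol:
            return updated
        probs = updated
    return probs


def fused_features(adjacency, sequence_features, seed, restart=0.5):
    """Linear combination of all proteins' sequence features, weighted by RWR probabilities."""
    probs = random_walk_with_restart(adjacency, seed, restart)
    dim = len(sequence_features[0])
    return [
        sum(probs[k] * sequence_features[k][d] for k in range(len(probs)))
        for d in range(dim)
    ]


# Hypothetical three-protein path network (0 - 1 - 2) with two-dimensional features.
toy_network = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(fused_features(toy_network, toy_features, seed=0))
```

In this sketch the seed protein always keeps at least the restart probability, so its own sequence features dominate the fused vector while its network neighbours contribute in proportion to how often the walk visits them.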
Affiliation(s)
- Xiaolin Zhang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China.
|
19
|
MPPIF-Net: Identification of Plasmodium Falciparum Parasite Mitochondrial Proteins Using Deep Features with Multilayer Bi-directional LSTM. Processes (Basel) 2020. [DOI: 10.3390/pr8060725] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 12/29/2022] Open
Abstract
Mitochondrial proteins of Plasmodium falciparum (MPPF) are an important target for anti-malarial drugs, but their identification through manual experimentation is costly, which in turn prolongs the production of related drugs by pharmaceutical institutions. Therefore, it is highly desirable for pharmaceutical companies to develop a computationally automated and reliable approach that identifies these proteins precisely, so that appropriate drugs can be produced in a timely manner. In this direction, several computationally intelligent techniques have been developed that extract local features from biological sequences using machine learning methods and then apply various classifiers to discriminate the nature of proteins. Unfortunately, these techniques perform poorly at capturing contextual features from sequence patterns, yielding non-representative classifiers. In this paper, we propose a sequence-based framework to extract deep and representative features that are trustworthy for identifying Plasmodium mitochondrial proteins. The backbone of the proposed framework is MPPF identification-net (MPPFI-Net), which is based on a convolutional neural network (CNN) with multilayer bi-directional long short-term memory (MBD-LSTM). MPPIF-Net takes protein sequences as input and passes them through several convolution and pooling layers to extract learned features. These features are passed to the sequence learning mechanism, MBD-LSTM, which is trained to classify them into their relevant classes. The proposed model is experimentally evaluated on a newly prepared dataset, PF2095, and two existing benchmark datasets, PF175 and MPD, using the holdout method. It achieved 97.6%, 97.1%, and 99.5% testing accuracy on the PF2095, PF175, and MPD datasets, respectively, outperforming the state-of-the-art approaches.
|
20
|
Liu B, Leng L, Sun X, Wang Y, Ma J, Zhu Y. ECMPride: prediction of human extracellular matrix proteins based on the ideal dataset using hybrid features with domain evidence. PeerJ 2020; 8:e9066. [PMID: 32377454 PMCID: PMC7195829 DOI: 10.7717/peerj.9066] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/10/2019] [Accepted: 04/05/2020] [Indexed: 01/28/2023] Open
Abstract
Extracellular matrix (ECM) proteins play an essential role in various biological processes in multicellular organisms, and their abnormal regulation can lead to many diseases. For large-scale ECM protein identification, especially through proteomic-based techniques, a theoretical reference database of ECM proteins is required. In this study, based on the experimentally verified ECM datasets and by the integration of protein domain features and a machine learning model, we developed ECMPride, a flexible and scalable tool for predicting ECM proteins. ECMPride achieved excellent performance in predicting ECM proteins, with appropriate balanced accuracy and sensitivity, and the performance of ECMPride was shown to be superior to the previously developed tool. A new theoretical dataset of human ECM components was also established by applying ECMPride to all human entries in the SwissProt database, containing a significant number of putative ECM proteins as well as the abundant biological annotations. This dataset might serve as a valuable reference resource for ECM protein identification.
Affiliation(s)
- Binghui Liu
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, China
- Ling Leng
- Department of Central Laboratory, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
- Xuer Sun
- Tissue Engineering Lab, Institute of Health Service and Transfusion Medicine, Beijing, China
- Yunfang Wang
- Tissue Engineering Lab, Institute of Health Service and Transfusion Medicine, Beijing, China
- Jie Ma
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, China
- Yunping Zhu
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, China.
- Basic Medical School, Anhui Medical University, Anhui, China
|
21
|
Identification of membrane protein types via multivariate information fusion with Hilbert–Schmidt Independence Criterion. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.11.103] [Citation(s) in RCA: 88] [Impact Index Per Article: 17.6] [Indexed: 11/24/2022]
|
22
|
Lin SY, Miao YR, Hu FF, Hu H, Zhang Q, Li Q, Chen Z, Guo AY. A 6-Membrane Protein Gene score for prognostic prediction of cytogenetically normal acute myeloid leukemia in multiple cohorts. J Cancer 2020; 11:251-259. [PMID: 31892991 PMCID: PMC6930412 DOI: 10.7150/jca.35382] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/01/2019] [Accepted: 09/27/2019] [Indexed: 12/14/2022] Open
Abstract
Background: Cytogenetically normal acute myeloid leukemia (CN-AML) accounts for a large proportion of AMLs and has diverse prognostic outcomes. Identifying membrane protein genes as prognostic factors to stratify CN-AML patients will be critical to improving their outcomes. Purpose: This study aims to identify prognostic factors that stratify CN-AML patients, so that better treatments can be chosen and outcomes improved. Methods: CN-AML data came from the TCGA cohort (n = 79) and four GEO datasets. We identified independent prognostic genes by Cox regression and Kaplan-Meier methods, and constructed a linear regression model using the LASSO algorithm. The prediction error curve was calculated using the R package “pec”. Results: Based on independent prognostic membrane genes, we constructed a regression model for CN-AML prognosis prediction: score = (0.0492 * CD52) - (0.0018 * CD96) + (0.0131 * EMP1) + (0.2058 * TSPAN2) + (0.0234 * STAB1) - (0.3658 * MBTPS1), which was named the MPG6 (6-Membrane Protein Gene) score. Tested in multiple CN-AML datasets, the results consistently showed that CN-AML patients with a high MPG6 score had poor survival, higher WBC count and shorter EFS. Compared with other reported scoring models, MPG6 achieved a better association with survival in multiple cohorts. Moreover, combined with other clinical indicators in CN-AML, MPG6 could improve the performance of survival prediction and serve as a robust prognostic factor. Conclusions: We identified the MPG6 score as a stable indicator with great potential for clinical application in risk stratification and outcome prediction in CN-AML.
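The MPG6 formula quoted in this abstract is a plain weighted sum, which the following minimal sketch evaluates. The coefficients are copied from the abstract; the function name and the example expression values are hypothetical, and the abstract does not state the expression units or normalisation, so the input is assumed to be on whatever scale the authors used.

```python
# Coefficients exactly as printed in the abstract's MPG6 formula.
MPG6_COEFFICIENTS = {
    "CD52": 0.0492,
    "CD96": -0.0018,
    "EMP1": 0.0131,
    "TSPAN2": 0.2058,
    "STAB1": 0.0234,
    "MBTPS1": -0.3658,
}


def mpg6_score(expression):
    """Return the weighted sum of the six gene expression values.

    `expression` maps gene symbol to expression value; the scale is an
    assumption, since the abstract does not specify it.
    """
    missing = [gene for gene in MPG6_COEFFICIENTS if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(coef * expression[gene] for gene, coef in MPG6_COEFFICIENTS.items())


# Hypothetical expression values, for illustration only (not taken from the paper).
example_patient = {
    "CD52": 10.0, "CD96": 5.0, "EMP1": 2.0,
    "TSPAN2": 1.0, "STAB1": 4.0, "MBTPS1": 3.0,
}
print(round(mpg6_score(example_patient), 4))
```

How a score is turned into a "high" or "low" risk group (for example, a median split) is not given in the abstract, so no cut-off is applied here.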
Affiliation(s)
- Sheng-Yan Lin
- Hubei Bioinformatics & Molecular Imaging Key Laboratory, Department of Bioinformatics and Systems Biology, Key Laboratory of Molecular Biophysics of the Ministry of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
- Ya-Ru Miao
- Hubei Bioinformatics & Molecular Imaging Key Laboratory, Department of Bioinformatics and Systems Biology, Key Laboratory of Molecular Biophysics of the Ministry of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
- Fei-Fei Hu
- Hubei Bioinformatics & Molecular Imaging Key Laboratory, Department of Bioinformatics and Systems Biology, Key Laboratory of Molecular Biophysics of the Ministry of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
- Hui Hu
- Hubei Bioinformatics & Molecular Imaging Key Laboratory, Department of Bioinformatics and Systems Biology, Key Laboratory of Molecular Biophysics of the Ministry of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
- Qiong Zhang
- Hubei Bioinformatics & Molecular Imaging Key Laboratory, Department of Bioinformatics and Systems Biology, Key Laboratory of Molecular Biophysics of the Ministry of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
- Qiubai Li
- Institute of Hematology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Zhichao Chen
- Institute of Hematology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- An-Yuan Guo
- Hubei Bioinformatics & Molecular Imaging Key Laboratory, Department of Bioinformatics and Systems Biology, Key Laboratory of Molecular Biophysics of the Ministry of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
|
23
|
Javed F, Hayat M. Predicting subcellular localization of multi-label proteins by incorporating the sequence features into Chou's PseAAC. Genomics 2019; 111:1325-1332. [DOI: 10.1016/j.ygeno.2018.09.004] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 09/02/2018] [Accepted: 09/04/2018] [Indexed: 12/13/2022]
|
24
|
Akbar S, Hayat M, Kabir M, Iqbal M. iAFP-gap-SMOTE: An Efficient Feature Extraction Scheme Gapped Dipeptide Composition is Coupled with an Oversampling Technique for Identification of Antifreeze Proteins. Lett Org Chem 2019. [DOI: 10.2174/1570178615666180816101653] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/22/2022]
Abstract
Antifreeze proteins (AFPs) play distinct roles in maintaining the homeostasis of living organisms and protect their cells and bodies from freezing in extremely cold conditions. Owing to the high diversity of protein sequences and structures, discriminating AFPs from non-AFPs through experimental approaches is expensive and slow. It is therefore highly desirable to develop an intelligent, high-throughput computational model that identifies AFPs quickly and accurately. To this end, a new predictor called “iAFP-gap-SMOTE” is proposed for the identification of AFPs. Protein sequences are represented using three numerical feature extraction schemes: Split Amino Acid Composition, G-gap Dipeptide Composition and Reduced Amino Acid Alphabet Composition. On an imbalanced dataset, a classifier is usually biased towards the majority class. The Synthetic Minority Over-sampling Technique (SMOTE) is therefore employed to increase the number of minority-class instances and control this bias. A 10-fold cross-validation test is applied to assess the success rates of the “iAFP-gap-SMOTE” model. In the empirical investigation, the “iAFP-gap-SMOTE” model obtained 95.02% accuracy. The comparison suggests that the accuracy of the “iAFP-gap-SMOTE” model is higher than that of the techniques reported in the literature so far. The proposed “iAFP-gap-SMOTE” model might be helpful for the research community and academia.
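Two of the feature schemes named in this abstract, split amino acid composition and g-gap dipeptide composition, can be sketched as follows. This is an illustrative reading of the general techniques, not the authors' code: the function names, the 25-residue terminal segments, and the gap convention (number of residues skipped between the paired positions) are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(segment):
    """Relative frequency of each standard amino acid in `segment` (20 values)."""
    total = sum(segment.count(a) for a in AMINO_ACIDS)
    return [segment.count(a) / total if total else 0.0 for a in AMINO_ACIDS]


def split_aac(sequence, n_term=25, c_term=25):
    """Concatenate the compositions of the N-terminal, middle and C-terminal parts.

    Segment lengths must be positive and are assumed values; for sequences shorter
    than n_term + c_term the terminal segments overlap and the middle is empty.
    """
    head = sequence[:n_term]
    middle = sequence[n_term:-c_term]
    tail = sequence[-c_term:]
    return composition(head) + composition(middle) + composition(tail)


def g_gap_dipeptide(sequence, gap=1):
    """Frequency of residue pairs separated by `gap` intervening residues (400 values)."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(sequence) - gap - 1):
        key = sequence[i] + sequence[i + gap + 1]
        if key in counts:  # skip pairs containing non-standard residues
            counts[key] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in pairs]


# Hypothetical peptide, for illustration only.
peptide = "ACDEFGHIKLMNPQRSTVWY" * 3
features = split_aac(peptide) + g_gap_dipeptide(peptide, gap=1)
print(len(features))
```

With these assumptions the combined vector has 60 + 400 entries; setting `gap=0` reduces the second scheme to ordinary dipeptide composition.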
Affiliation(s)
- Shahid Akbar
- Department of Computer Science, Abdul Wali Khan University, Mardan, KP 23200, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University, Mardan, KP 23200, Pakistan
- Muhammad Kabir
- Department of Computer Science, Abdul Wali Khan University, Mardan, KP 23200, Pakistan
- Muhammad Iqbal
- Department of Computer Science, Abdul Wali Khan University, Mardan, KP 23200, Pakistan
|
25
|
MFSC: Multi-voting based feature selection for classification of Golgi proteins by adopting the general form of Chou's PseAAC components. J Theor Biol 2019; 463:99-109. [DOI: 10.1016/j.jtbi.2018.12.017] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Received: 09/03/2018] [Revised: 12/02/2018] [Accepted: 12/14/2018] [Indexed: 12/29/2022]
|
26
|
Prediction of membrane protein types by exploring local discriminative information from evolutionary profiles. Anal Biochem 2019; 564-565:123-132. [DOI: 10.1016/j.ab.2018.10.027] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 08/25/2018] [Revised: 10/23/2018] [Accepted: 10/25/2018] [Indexed: 11/17/2022]
|
27
|
Sankari ES, Manimegalai D. Predicting membrane protein types by incorporating a novel feature set into Chou's general PseAAC. J Theor Biol 2018; 455:319-328. [DOI: 10.1016/j.jtbi.2018.07.032] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Received: 03/22/2018] [Revised: 06/27/2018] [Accepted: 07/23/2018] [Indexed: 10/28/2022]
|
28
|
Akbar S, Hayat M. iMethyl-STTNC: Identification of N6-methyladenosine sites by extending the idea of SAAC into Chou's PseAAC to formulate RNA sequences. J Theor Biol 2018; 455:205-211. [PMID: 30031793 DOI: 10.1016/j.jtbi.2018.07.018] [Citation(s) in RCA: 86] [Impact Index Per Article: 12.3] [Received: 05/02/2018] [Revised: 07/14/2018] [Accepted: 07/17/2018] [Indexed: 11/17/2022]
Abstract
N6-methyladenosine (m6A) is a vital post-transcriptional modification that adds another layer of epigenetic regulation at the RNA level. It chemically modifies mRNA, which affects protein expression. RNA sequences contain many GAC genetic code motifs, and determining whether a GAC motif is methylated or not is indispensable. However, with the large number of RNA sequences generated in the post-genomic era, characterizing these sequences accurately and quickly has become a challenging task. In view of this, an intelligent computational model that reflects the motifs of the desired classes accurately and quickly is needed. Such a model, "iMethyl-STTNC", is proposed for the identification of methyladenosine sites in RNA. In the proposed study, four feature extraction techniques, namely pseudo-dinucleotide composition, pseudo-trinucleotide composition, split-trinucleotide composition, and split-tetra-nucleotide composition (STTNC), are used to obtain numerical descriptors. Three classification algorithms, namely probabilistic neural network, support vector machine (SVM), and K-nearest neighbor, are adopted for prediction. After examining the outcomes of the prediction model on each feature space, SVM using the STTNC feature space reported the highest accuracies of 69.84% and 91.84% on dataset1 and dataset2, respectively. The reported results show that the proposed predictor achieves encouraging results compared with the approaches reported so far. The developed model might be beneficial for in-depth analysis of genomes and for drug development.
Affiliation(s)
- Shahid Akbar
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan.
|
29
|
Zhang B, Li L, Lü Q. Protein Solvent-Accessibility Prediction by a Stacked Deep Bidirectional Recurrent Neural Network. Biomolecules 2018; 8:biom8020033. [PMID: 29799510 PMCID: PMC6023031 DOI: 10.3390/biom8020033] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 04/21/2018] [Revised: 05/18/2018] [Accepted: 05/22/2018] [Indexed: 12/12/2022] Open
Abstract
Residue solvent accessibility is closely related to the spatial arrangement and packing of residues. Predicting the solvent accessibility of a protein is an important step to understand its structure and function. In this work, we present a deep learning method to predict residue solvent accessibility, which is based on a stacked deep bidirectional recurrent neural network applied to sequence profiles. To capture more long-range sequence information, a merging operator was proposed when bidirectional information from hidden nodes was merged for outputs. Three types of merging operators were used in our improved model, with a long short-term memory network performing as a hidden computing node. The trained database was constructed from 7361 proteins extracted from the PISCES server using a cut-off of 25% sequence identity. Sequence-derived features including position-specific scoring matrix, physical properties, physicochemical characteristics, conservation score and protein coding were used to represent a residue. Using this method, predictive values of continuous relative solvent-accessible area were obtained, and then, these values were transformed into binary states with predefined thresholds. Our experimental results showed that our deep learning method improved prediction quality relative to current methods, with mean absolute error and Pearson’s correlation coefficient values of 8.8% and 74.8%, respectively, on the CB502 dataset and 8.2% and 78%, respectively, on the Manesh215 dataset.
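The post-processing and evaluation steps mentioned in this abstract (thresholding continuous relative solvent accessibility into binary states, then scoring with mean absolute error and Pearson's correlation) can be sketched as below. The abstract does not give its threshold values, so the 0.25 cut-off, the "e"/"b" labels, and all function names are assumptions for illustration.

```python
import math


def to_binary_states(rsa_values, threshold=0.25):
    """Label a residue 'e' (exposed) if its relative accessibility exceeds the threshold, else 'b' (buried).

    The 0.25 threshold is an assumed example value, not one taken from the paper.
    """
    return ["e" if value > threshold else "b" for value in rsa_values]


def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def pearson_correlation(predicted, observed):
    """Pearson's correlation coefficient; undefined if either input is constant."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    covariance = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    spread_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    return covariance / (spread_p * spread_o)


# Hypothetical per-residue values on a 0-1 scale, for illustration only.
predicted_rsa = [0.10, 0.30, 0.25, 0.80]
observed_rsa = [0.05, 0.40, 0.20, 0.70]
print(to_binary_states(predicted_rsa))
print(mean_absolute_error(predicted_rsa, observed_rsa))
print(pearson_correlation(predicted_rsa, observed_rsa))
```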
Affiliation(s)
- Buzhong Zhang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
- School of Computer and Information, Anqing Normal University, Anqing 246011, China.
- Linqing Li
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
- Qiang Lü
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
|
30
|
Jia C, Yang Q, Zou Q. NucPosPred: Predicting species-specific genomic nucleosome positioning via four different modes of general PseKNC. J Theor Biol 2018; 450:15-21. [PMID: 29678692 DOI: 10.1016/j.jtbi.2018.04.025] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 01/26/2018] [Revised: 04/13/2018] [Accepted: 04/16/2018] [Indexed: 11/20/2022]
Abstract
The nucleosome is the basic structure of chromatin in eukaryotic cells, with essential roles in the regulation of many biological processes, such as DNA transcription, replication and repair, and RNA splicing. Because of the importance of nucleosomes, the factors that determine their positioning within genomes should be investigated. High-resolution nucleosome-positioning maps are now available for organisms including Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans, enabling the identification of nucleosome positioning by application of computational tools. Here, we describe a novel predictor called NucPosPred, which was specifically designed for large-scale identification of nucleosome positioning in C. elegans and D. melanogaster genomes. NucPosPred was separately optimized for each species for four types of DNA sequence feature extraction, with consideration of two classification algorithms (gradient-boosting decision tree and support vector machine). The overall accuracy obtained with NucPosPred was 92.29% for C. elegans and 88.26% for D. melanogaster, outperforming previous methods and demonstrating the potential for species-specific prediction of nucleosome positioning. For the convenience of most experimental scientists, a web-server for the predictor NucPosPred is available at http://121.42.167.206/NucPosPred/index.jsp.
Affiliation(s)
- Cangzhi Jia
- Science of College, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China.
- Qing Yang
- Science of College, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
- Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, China.
|
31
|
iMem-2LSAAC: A two-level model for discrimination of membrane proteins and their types by extending the notion of SAAC into Chou's pseudo amino acid composition. J Theor Biol 2018; 442:11-21. [DOI: 10.1016/j.jtbi.2018.01.008] [Citation(s) in RCA: 83] [Impact Index Per Article: 11.9] [Received: 10/02/2017] [Revised: 12/23/2017] [Accepted: 01/10/2018] [Indexed: 02/08/2023]
|
32
|
Bi-PSSM: Position specific scoring matrix based intelligent computational model for identification of mycobacterial membrane proteins. J Theor Biol 2017; 435:116-124. [DOI: 10.1016/j.jtbi.2017.09.013] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 05/28/2017] [Revised: 09/12/2017] [Accepted: 09/15/2017] [Indexed: 02/08/2023]
|
33
|
Sankari ES, Manimegalai D. Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets. J Theor Biol 2017; 435:208-217. [PMID: 28941868 DOI: 10.1016/j.jtbi.2017.09.018] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 04/13/2017] [Revised: 09/15/2017] [Accepted: 09/18/2017] [Indexed: 12/19/2022]
Abstract
Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditionally, biophysical methods are used to classify membrane protein types. Given the rapid growth of uncharacterized protein sequences in databases, these traditional methods are very time-consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced datasets and large datasets are often handled well by decision tree classifiers. Because the datasets used here are imbalanced, the performance of various decision tree classifiers, such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree and REP (Reduced Error Pruning) tree, and of ensemble methods, such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest, is analysed. Among these classifiers, Random forest performs well in less time, with a good accuracy of 96.35%. Another finding is that the RUS boost decision tree classifier is able to classify one or two samples in the class with very few samples, while other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive to the classes with fewer samples. The performance of the decision tree classifiers is also compared with that of SVM (Support Vector Machine) and Naive Bayes classifiers.
Affiliation(s)
- E Siva Sankari
- Department of CSE, Government College of Engineering, Tirunelveli, Tamil Nadu, India.
- D Manimegalai
- Department of IT, National Engineering College, Kovilpatti, Tamil Nadu, India.
|
34
|
A Two-Layer Computational Model for Discrimination of Enhancer and Their Types Using Hybrid Features Pace of Pseudo K-Tuple Nucleotide Composition. Arabian Journal for Science and Engineering 2017. [DOI: 10.1007/s13369-017-2818-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 12/12/2022]
|
35
|
Tahir M, Hayat M. iNuc-STNC: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou's PseAAC. Molecular BioSystems 2017; 12:2587-93. [PMID: 27271822 DOI: 10.1039/c6mb00221h] [Citation(s) in RCA: 89] [Impact Index Per Article: 11.1] [Indexed: 01/14/2023]
Abstract
The nucleosome is the fundamental unit of eukaryotic chromatin, which participates in regulating different cellular processes. Owing to the rapid accumulation of new DNA primary sequences, it is indispensable to develop an automated model. However, identification of novel protein sequences using conventional methods is difficult or sometimes impossible because of vague motifs and the intricate structure of DNA. In this regard, an effective, high-throughput automated model, "iNuc-STNC", has been proposed to identify nucleosome positioning in genomes accurately and reliably. In the proposed model, DNA sequences are represented using three distinct feature extraction strategies: dinucleotide composition, trinucleotide composition and split trinucleotide composition (STNC). Various statistical models were used as learners. The jackknife test was employed to evaluate the success rates of the proposed model. The experimental results showed that SVM, in combination with STNC, obtained outstanding performance on all benchmark datasets. The success rates of the proposed "iNuc-STNC" model are higher than those of the state-of-the-art methods reported in the literature so far. The "iNuc-STNC" model is expected to provide a rudimentary framework for the pharmaceutical industry in drug design.
Affiliation(s)
- Muhammad Tahir
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan.
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan.
|
36
|
Tahir M, Hayat M, Kabir M. Sequence based predictor for discrimination of enhancer and their types by applying general form of Chou's trinucleotide composition. Computer Methods and Programs in Biomedicine 2017; 146:69-75. [PMID: 28688491 DOI: 10.1016/j.cmpb.2017.05.008] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 08/04/2016] [Revised: 05/05/2017] [Accepted: 05/19/2017] [Indexed: 06/07/2023]
Abstract
BACKGROUND AND OBJECTIVES Enhancers are pivotal DNA elements that are widely used in eukaryotes to activate gene transcription. On the basis of their strength, enhancers are further classified into two groups: strong enhancers and weak enhancers. Given the huge number of available DNA sequences, a fast, reliable and robust intelligent computational method is needed that not only identifies enhancers but also determines their strength. Considerable progress has been achieved in this regard; however, timely and precise identification of enhancers is still a challenging task. METHODS A two-level intelligent computational model for identification of enhancers and their subgroups is proposed. Two feature extraction techniques, di-nucleotide composition and tri-nucleotide composition, were adopted for extraction of numerical descriptors. Four classification methods, namely probabilistic neural network, support vector machine, k-nearest neighbor and random forest, were utilized for classification. RESULTS Using the jackknife cross-validation test, the proposed method yielded 77.25% accuracy on dataset S1, which contains enhancers and non-enhancers, and 64.70% accuracy on dataset S2, which comprises strong enhancer and weak enhancer sequences. CONCLUSION The predictive results indicate that the proposed method performs better than the existing approaches reported in the literature so far. The developed method is thus expected to be useful for basic research and academia.
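The jackknife (leave-one-out) test mentioned in this abstract holds out each sample in turn, trains on the rest, and reports the fraction of held-out samples predicted correctly. The sketch below illustrates the procedure only; it uses a 1-nearest-neighbor rule as a stand-in classifier, which is an assumption made for brevity and not the configuration reported by the authors, and the function names are hypothetical.

```python
import math

def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier (assumption): label of the closest training sample."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]

def jackknife_accuracy(features, labels):
    """Leave-one-out test: each sample is predicted from all the others."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)

# Toy feature vectors; real inputs would be composition vectors of sequences.
toy_x = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
toy_y = [0, 0, 1, 1]
print(jackknife_accuracy(toy_x, toy_y))
```

Because every sample is used for testing exactly once, the result does not depend on a random split, which is one reason this test is popular for small benchmark datasets.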
Affiliation(s)
- Muhammad Tahir
- Department of Computer Science, Abdul Wali Khan University Mardan, KP Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, KP Pakistan.
- Muhammad Kabir
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
|
37
|
Ahmad J, Javed F, Hayat M. Intelligent computational model for classification of sub-Golgi protein using oversampling and fisher feature selection methods. Artif Intell Med 2017; 78:14-22. [DOI: 10.1016/j.artmed.2017.05.001] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 03/26/2017] [Revised: 04/19/2017] [Accepted: 05/02/2017] [Indexed: 10/19/2022]
|
38
|
Tahir M, Hayat M. Machine learning based identification of protein–protein interactions using derived features of physiochemical properties and evolutionary profiles. Artif Intell Med 2017; 78:61-71. [DOI: 10.1016/j.artmed.2017.06.006] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 03/15/2017] [Revised: 06/09/2017] [Accepted: 06/11/2017] [Indexed: 02/09/2023]
|
39
|
Khan M, Hayat M, Khan SA, Iqbal N. Unb-DPC: Identify mycobacterial membrane protein types by incorporating un-biased dipeptide composition into Chou's general PseAAC. J Theor Biol 2017; 415:13-19. [DOI: 10.1016/j.jtbi.2016.12.004] [Citation(s) in RCA: 88] [Impact Index Per Article: 11.0] [Received: 08/02/2016] [Revised: 10/24/2016] [Accepted: 12/07/2016] [Indexed: 01/22/2023]
|
40
|
Mal-Lys: prediction of lysine malonylation sites in proteins integrated sequence-based features with mRMR feature selection. Sci Rep 2016; 6:38318. [PMID: 27910954 PMCID: PMC5133563 DOI: 10.1038/srep38318] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 08/18/2015] [Accepted: 11/08/2016] [Indexed: 12/25/2022] Open
Abstract
Lysine malonylation is an important post-translational modification (PTM) in proteins, and has been characterized to be associated with diseases. However, identifying malonyllysine sites still remains to be a great challenge due to the labor-intensive and time-consuming experiments. In view of this situation, the establishment of a useful computational method and the development of an efficient predictor are highly desired. In this study, a predictor Mal-Lys which incorporated residue sequence order information, position-specific amino acid propensity and physicochemical properties was proposed. A feature selection method of minimum Redundancy Maximum Relevance (mRMR) was used to select optimal ones from the whole features. With the leave-one-out validation, the value of the area under the curve (AUC) was calculated as 0.8143, whereas 6-, 8- and 10-fold cross-validations had similar AUC values which showed the robustness of the predictor Mal-Lys. The predictor also showed satisfying performance in the experimental data from the UniProt database. Meanwhile, a user-friendly web-server for Mal-Lys is accessible at http://app.aporc.org/Mal-Lys/.
|
41
|
Butt AH, Rasool N, Khan YD. A Treatise to Computational Approaches Towards Prediction of Membrane Protein and Its Subtypes. J Membr Biol 2016; 250:55-76. [DOI: 10.1007/s00232-016-9937-7] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 04/24/2016] [Accepted: 11/02/2016] [Indexed: 10/20/2022]
|
42
|
Identification of DNA binding proteins using evolutionary profiles position specific scoring matrix. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.03.025] [Citation(s) in RCA: 57] [Impact Index Per Article: 6.3] [Indexed: 11/20/2022]
|
43
|
Protein subcellular localization of fluorescence microscopy images: Employing new statistical and Texton based image features and SVM based ensemble classification. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2016.01.064] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Indexed: 01/10/2023]
|
44
|
Iqbal M, Hayat M. "iSS-Hyb-mRMR": Identification of splicing sites using hybrid space of pseudo trinucleotide and pseudo tetranucleotide composition. Comput Methods Programs Biomed 2016; 128:1-11. [PMID: 27040827 DOI: 10.1016/j.cmpb.2016.02.006] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 08/24/2015] [Accepted: 02/16/2016] [Indexed: 06/05/2023]
Abstract
BACKGROUND AND OBJECTIVES Gene splicing is a vital source of protein diversity. Precise removal of introns and joining of exons is a prominent task in eukaryotic gene expression, as exons are usually interrupted by introns. Identification of splicing sites through experimental techniques is a complicated and time-consuming task. With the avalanche of genome sequences generated in the post-genomic age, it remains a complicated and challenging task to develop an automatic, robust and reliable computational method for fast and effective identification of splicing sites. METHODS In this study, a hybrid model, "iSS-Hyb-mRMR", is proposed for quick and accurate identification of splicing sites. Two sample representation methods, namely pseudo trinucleotide composition (PseTNC) and pseudo tetranucleotide composition (PseTetraNC), were used to extract numerical descriptors from DNA sequences. The hybrid model was developed by concatenating PseTNC and PseTetraNC. In order to select highly discriminative features, the minimum redundancy maximum relevance algorithm was applied to the hybrid feature space. The performance of these feature representation methods was tested using various classification algorithms including K-nearest neighbor, probabilistic neural network, general regression neural network, and fitting network. The jackknife test was used to evaluate performance on two benchmark datasets, S1 and S2. RESULTS The predictor proposed in the current study achieved an accuracy of 93.26%, sensitivity of 88.77%, and specificity of 97.78% for S1, and an accuracy of 94.12%, sensitivity of 87.14%, and specificity of 98.64% for S2. CONCLUSION The performance of the proposed model is higher than that of the existing methods in the literature so far, and the model is expected to be useful for studying the mechanism of RNA splicing and for other academic research.
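The three figures reported in the RESULTS (accuracy, sensitivity, specificity) follow from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name and the 0/1 label encoding are assumptions chosen for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for labels encoded as 0/1 (assumed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # sensitivity: fraction of true sites that were recovered
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # specificity: fraction of non-sites that were correctly rejected
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1])
```

Reporting sensitivity and specificity alongside accuracy matters when the two classes are imbalanced, because accuracy alone can stay high even if most positives are missed.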
Affiliation(s)
- Muhammad Iqbal
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan.
|
45
|
Xu Y, Ding J, Wu LY. iSulf-Cys: Prediction of S-sulfenylation Sites in Proteins with Physicochemical Properties of Amino Acids. PLoS One 2016; 11:e0154237. [PMID: 27104833 PMCID: PMC4841585 DOI: 10.1371/journal.pone.0154237] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 12/28/2015] [Accepted: 04/10/2016] [Indexed: 02/07/2023] Open
Abstract
Cysteine S-sulfenylation is an important post-translational modification (PTM) in proteins, and provides redox regulation of protein functions. Bioinformatics and structural analyses indicated that S-sulfenylation could impact many biological and functional categories and had distinct structural features. However, the major limitations of existing approaches for identifying cysteine S-sulfenylation are that they are expensive and low-throughput. In view of this situation, the establishment of a useful computational method and the development of an efficient predictor are highly desired. In this study, a predictor, iSulf-Cys, which incorporates 14 kinds of physicochemical properties of amino acids, was proposed. With 10-fold cross-validation repeated 20 times on the training dataset, the area under the curve (AUC) was 0.7155 ± 0.0085 and the MCC was 0.3122 ± 0.0144. iSulf-Cys also showed satisfactory performance on the independent testing dataset, with AUC 0.7343 and MCC 0.3315. Features constructed from physicochemical properties and position were carefully analyzed. Meanwhile, a user-friendly web-server for iSulf-Cys is accessible at http://app.aporc.org/iSulf-Cys/.
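The two measures quoted in this abstract, MCC and AUC, can be computed from first principles as sketched below. The function names are hypothetical, labels are assumed to be 0/1, and the AUC is computed as the probability that a randomly chosen positive scores higher than a randomly chosen negative (ties count one half), which is equivalent to the area under the ROC curve.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # by convention, return 0 when any marginal count is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(mcc(tp=2, tn=3, fp=1, fn=1))
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch but would be replaced by a rank-based computation on large datasets.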
Affiliation(s)
- Yan Xu
- Department of Information and Computer Science, University of Science and Technology Beijing, Beijing 100083, China
- Jun Ding
- Department of Information and Computer Science, University of Science and Technology Beijing, Beijing 100083, China
- Ling-Yun Wu
- Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
|
46
|
Ahmad K, Waris M, Hayat M. Prediction of Protein Submitochondrial Locations by Incorporating Dipeptide Composition into Chou's General Pseudo Amino Acid Composition. J Membr Biol 2016; 249:293-304. [PMID: 26746980 DOI: 10.1007/s00232-015-9868-8] [Citation(s) in RCA: 75] [Impact Index Per Article: 8.3] [Received: 11/12/2015] [Accepted: 12/30/2015] [Indexed: 12/15/2022]
Abstract
The mitochondrion is a key organelle of the eukaryotic cell and provides energy for cellular activities. The submitochondrial locations of proteins play a crucial role in understanding different biological processes such as energy metabolism, programmed cell death, and ionic homeostasis. Prediction of submitochondrial locations through conventional methods is expensive and time-consuming because of the large number of protein sequences generated in the last few decades. Therefore, an automated model for identifying the submitochondrial locations of proteins is highly desirable. In this regard, the current study was initiated to develop a fast, reliable, and accurate computational model. Various feature extraction methods such as dipeptide composition (DPC), Split Amino Acid Composition, and Composition and Translation were utilized. In order to overcome the issue of bias, the oversampling technique SMOTE was applied to balance the datasets. Several classification learners including K-Nearest Neighbor, Probabilistic Neural Network, and support vector machine (SVM) were used. The jackknife test was applied to assess the performance of the classification algorithms on two benchmark datasets. Among the classification algorithms, SVM achieved the highest success rates in conjunction with the condensed feature space of DPC: 95.20% accuracy on dataset SML3-317 and 95.11% on dataset SML3-983. The empirical results revealed that the proposed model obtained the highest results reported in the literature so far. It is anticipated that the proposed model might be useful for future studies.
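The abstract names SMOTE for balancing the datasets without giving parameters. As an illustration only, the core idea can be sketched as follows: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbors. The function name, the default `k`, and the fixed `seed` are assumptions, and this is a simplified sketch rather than the implementation used in the study.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic samples by interpolating between minority neighbors.

    `minority` is a list of numeric feature vectors with at least two entries.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen sample (excluding itself)
        neighbors = sorted(
            (x for x in minority if x is not base),
            key=lambda x: math.dist(x, base),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(minority_class, n_synthetic=4, k=2)
```

Because the new points are interpolated rather than duplicated, the classifier sees a denser minority region instead of repeated copies of the same vectors.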
Affiliation(s)
- Khurshid Ahmad
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Muhammad Waris
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan.
|
47
|
Ahmad S, Kabir M, Hayat M. Identification of Heat Shock Protein families and J-protein types by incorporating Dipeptide Composition into Chou's general PseAAC. Comput Methods Programs Biomed 2015; 122:165-174. [PMID: 26233307 DOI: 10.1016/j.cmpb.2015.07.005] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Received: 04/20/2015] [Revised: 06/21/2015] [Accepted: 07/13/2015] [Indexed: 06/04/2023]
Abstract
Heat Shock Proteins (HSPs), which are found in all living organisms, are essential for cell growth and viability. HSPs manage the folding and unfolding of proteins, control the quality of newly synthesized proteins, and protect cellular homeostatic processes from environmental stress. On the basis of functionality, HSPs are categorized into six major families, namely: (i) HSP20 or sHSP, (ii) HSP40 or J-protein types, (iii) HSP60 or GroEL/ES, (iv) HSP70, (v) HSP90 and (vi) HSP100. Identification of HSP families and sub-families through conventional approaches is expensive and laborious. It is therefore highly desirable to establish an automatic, robust and accurate computational method for predicting HSPs quickly and reliably. In this regard, a computational model is developed for the prediction of HSP families. In this model, protein sequences are formulated using three discrete methods, namely Split Amino Acid Composition, Pseudo Amino Acid Composition, and Dipeptide Composition. Several learning algorithms are evaluated to choose the best one for a high-throughput computational model. The leave-one-out test is applied to assess the performance of the proposed model. The empirical results showed that the support vector machine achieved quite promising results using the Dipeptide Composition feature space. The proposed model achieved 90.7% accuracy on the HSPs dataset and 97.04% accuracy on J-protein types, which are higher than the existing methods in the literature so far.
Affiliation(s)
- Saeed Ahmad
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
- Muhammad Kabir
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan.
|
48
|
Zou HL, Xiao X. Predicting the Functional Types of Singleplex and Multiplex Eukaryotic Membrane Proteins via Different Models of Chou's Pseudo Amino Acid Compositions. J Membr Biol 2015; 249:23-9. [PMID: 26458844 DOI: 10.1007/s00232-015-9830-9] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 06/06/2015] [Accepted: 07/30/2015] [Indexed: 01/14/2023]
Abstract
Given a membrane protein sequence, how can we identify its type, particularly when a query protein may have a multiplex character, i.e., simultaneously belong to two or more different types? Most of the existing predictors or methods can only deal with single-type or "singleplex" membrane proteins. However, multiple-type or "multiplex" membrane proteins should not be ignored because they usually possess some unique biological functions worthy of special notice. In this study, three different models were developed that can deal with systems containing both singleplex and multiplex membrane proteins. The overall success rate thus obtained was 0.6440, indicating that the study may become a very useful high-throughput tool for identifying the functional types of membrane proteins.
Affiliation(s)
- Hong-Liang Zou
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333046, China.
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333046, China.
- Information School, ZheJiang Textile & Fashion College, Ningbo, 315211, China.
- Gordon Life Science Institute, 53 South Cottage Road, Belmont, MA, 02478, USA.
|
49
|
Ali F, Hayat M. Classification of membrane protein types using Voting Feature Interval in combination with Chou's Pseudo Amino Acid Composition. J Theor Biol 2015; 384:78-83. [PMID: 26297889 DOI: 10.1016/j.jtbi.2015.07.034] [Citation(s) in RCA: 93] [Impact Index Per Article: 9.3] [Received: 05/08/2015] [Revised: 07/15/2015] [Accepted: 07/29/2015] [Indexed: 12/11/2022]
Abstract
Membrane proteins are a major constituent of the cell and perform numerous crucial functions. These functions are mostly determined by the membrane protein's type. Initially, membrane protein types were classified through traditional methods, and reasonable results were obtained. However, owing to the rapid growth of protein sequences in databases, classification through conventional methods is very difficult or sometimes impossible, because it is laborious and time-consuming. Therefore, a new, powerful discriminating model is indispensable for classifying membrane protein types with high precision. In this work, a promising classification model with effective power to discriminate membrane protein types is developed. In our classification model, salient features of protein sequences are extracted via Pseudo Amino Acid Composition. Five classification algorithms were utilized. Among these, Voting Feature Interval obtained outstanding performance on all three datasets. The accuracy of the proposed model is 93.9% on dataset S1, 89.33% on S2 and 86.9% on S3, applying the 10-fold cross-validation test. The success rates revealed that the proposed model obtained better outcomes than other existing models in the literature so far, and it is expected to play a substantial role in drug design and the pharmaceutical industry.
Affiliation(s)
- Farman Ali
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan.
|
50
|
Abbass J, Nebel JC. Customised fragments libraries for protein structure prediction based on structural class annotations. BMC Bioinformatics 2015; 16:136. [PMID: 25925397 PMCID: PMC4419399 DOI: 10.1186/s12859-015-0576-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 12/08/2014] [Accepted: 04/17/2015] [Indexed: 12/05/2022] Open
Abstract
Background Since experimental techniques are time-consuming and costly, in silico protein structure prediction is essential to produce conformations of protein targets. When homologous structures are not available, fragment-based protein structure prediction has become the approach of choice. However, it still has many issues including poor performance when targets’ lengths are above 100 residues, excessive running times and sub-optimal energy functions. Taking advantage of the reliable performance of structural class prediction software, we propose to address some of the limitations of fragment-based methods by integrating structural constraints in their fragment selection process. Results Using Rosetta, a state-of-the-art fragment-based protein structure prediction package, we evaluated our proposed pipeline on 70 former CASP targets containing up to 150 amino acids. Using either CATH or SCOP-based structural class annotations, enhancement of structure prediction performance is highly significant in terms of both GDT_TS (at least +2.6, p-values < 0.0005) and RMSD (−0.4, p-values < 0.005). Although CATH and SCOP classifications are different, they perform similarly. Moreover, proteins from all structural classes benefit from the proposed methodology. Further analysis also shows that methods relying on class-based fragments produce conformations which are more relevant to the user and converge more quickly towards the best model as estimated by GDT_TS (up to 10% on average). This substantiates our hypothesis that the use of structurally relevant templates leads not only to a reduction in the size of the conformation space to be explored, but also to a focus on a more relevant area. Conclusions Since our methodology produces models whose quality is up to 7% higher on average than those generated by a standard fragment-based predictor, we believe it should be considered before conducting any fragment-based protein structure prediction.
Despite such progress, ab initio prediction remains a challenging task, especially for proteins of average and large sizes. Apart from improving search strategies and energy functions, integration of additional constraints seems a promising route, especially if they can be accurately predicted from sequence alone. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0576-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jad Abbass
- Faculty of Science, Engineering and Computing, Kingston University, London, KT1 2EE, UK.
- Jean-Christophe Nebel
- Faculty of Science, Engineering and Computing, Kingston University, London, KT1 2EE, UK.
|