1
|
Arif M, Fang G, Fida H, Musleh S, Yu DJ, Alam T. iMRSAPred: Improved Prediction of Anti-MRSA Peptides Using Physicochemical and Pairwise Contact-Energy Properties of Amino Acids. ACS OMEGA 2024; 9:2874-2883. [PMID: 38250405 PMCID: PMC10795061 DOI: 10.1021/acsomega.3c08303] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/22/2023] [Revised: 12/06/2023] [Accepted: 12/13/2023] [Indexed: 01/23/2024]
Abstract
Methicillin-resistant Staphylococcus aureus (MRSA) is a growing concern for human lives worldwide. Anti-MRSA peptides act as potential antibiotic agents and play significant role to combat MRSA infection. Traditional laboratory-based methods for annotating Anti-MRSA peptides are although precise but quite challenging, costly, and time-consuming. Therefore, computational methods capable of identifying Anti-MRSA peptides accelerate the drug designing process for treating bacterial infections. In this study, we developed a novel sequence-based predictor "iMRSAPred" for screening Anti-MRSA peptides by incorporating energy estimation and physiochemical and sequential information. We successfully resolved the skewed imbalance phenomena by using synthetic minority oversampling technique plus Tomek link (SMOTETomek) algorithm. Furthermore, the Shapley additive explanation method was leveraged to analyze the impact of top-ranked features in the prediction task. We evaluated multiple machine learning algorithms, i.e., CatBoost, Cascade Deep Forest, Kernel and Tree Boosting, support vector machine, and HistGBoost classifiers by 10-fold cross-validation and independent testing. The proposed iMRSAPred method significantly improved the overall performance in terms of accuracy and Matthew's correlation coefficient (MCC) by 5.45 and 0.083%, respectively, on the training data set. On the independent data set, iMRSAPred improved accuracy and MCC by 3.98 and 0.055%, respectively. We believe that the proposed method would be useful in large-scale Anti-MRSA peptide prediction and provide insights into other bioactive peptides.
Collapse
Affiliation(s)
- Muhammad Arif
- College
of Science and Engineering, Hamad Bin Khalifa
University, Doha 34110, Qatar
| | - Ge Fang
- State
Key Laboratory for Organic Electronics and Information Displays, Institute of Advanced Materials (IAM), Nanjing University of Posts Telecommunications
9 Wenyuan Road, Nanjing 210023, P. R. China
- Center
for Research Innovation and Biomedical Informatics, Faculty of Medical
Technology, Mahidol University, Bankok 10700, Thailand
| | - Huma Fida
- Department
of Microbiology, Abdul Wali Khan University, Mardan 23200, KPK, Pakistan
| | - Saleh Musleh
- College
of Science and Engineering, Hamad Bin Khalifa
University, Doha 34110, Qatar
| | - Dong-Jun Yu
- School
of Computer Science and Engineering, Nanjing
University of Science and Technology, Nanjing 210023, China
| | - Tanvir Alam
- College
of Science and Engineering, Hamad Bin Khalifa
University, Doha 34110, Qatar
| |
Collapse
|
2
|
Guo L, Wang Y, Xu X, Cheng KK, Long Y, Xu J, Li S, Dong J. DeepPSP: A Global-Local Information-Based Deep Neural Network for the Prediction of Protein Phosphorylation Sites. J Proteome Res 2020; 20:346-356. [PMID: 33241931 DOI: 10.1021/acs.jproteome.0c00431] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
Identification of phosphorylation sites is an important step in the function study and drug design of proteins. In recent years, there have been increasing applications of the computational method in the identification of phosphorylation sites because of its low cost and high speed. Most of the currently available methods focus on using local information around potential phosphorylation sites for prediction and do not take the global information of the protein sequence into consideration. Here, we demonstrated that the global information of protein sequences may be also critical for phosphorylation site prediction. In this paper, a new deep neural network model, called DeepPSP, was proposed for the prediction of protein phosphorylation sites. In the DeepPSP model, two parallel modules were introduced to extract both local and global features from protein sequences. Two squeeze-and-excitation blocks and one bidirectional long short-term memory block were introduced into each module to capture effective representations of the sequences. Comparative studies were carried out to evaluate the performance of DeepPSP, and four other prediction methods using public data sets The F1-score, area under receiver operating characteristic curves (AUROC), and area under precision-recall curves (AUPRC) of DeepPSP were found to be 0.4819, 0.82, and 0.50, respectively, for S/T general site prediction and 0.4206, 0.73, and 0.39, respectively, for Y general site prediction. Compared with the MusiteDeep method, the F1-score, AUROC, and AUPRC of DeepPSP were found to increase by 8.6, 2.5, and 8.7%, respectively, for S/T general site prediction and by 20.6, 5.8, and 18.2%, respectively, for Y general site prediction. Among the tested methods, the developed DeepPSP method was also found to produce best results for different kinase-specific site predictions including CDK, mitogen-activated protein kinase, CAMK, AGC, and CMGC. Taken together, the developed DeepPSP method may offer a more accurate phosphorylation site prediction by including global information. It may serve as an alternative model with better performance and interpretability for protein phosphorylation site prediction.
Collapse
Affiliation(s)
- Lei Guo
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
| | - Yongpei Wang
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
| | - Xiangnan Xu
- School of Mathematics and Statistics, The University of Sydney, Sydeny, New South Wales 2006, Australia
| | - Kian-Kai Cheng
- Innovation Centre in Agritechnology, Universiti Teknologi Malaysia, Muar, Johor 84600, Malaysia
| | - Yichi Long
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
| | - Jingjing Xu
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
| | - Sanshu Li
- Institute of Genomics, Medical School, Huaqiao University, Xiamen 361021, China
| | - Jiyang Dong
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
| |
Collapse
|
3
|
Alphonse AS, Mary NAB, Starvin MS. Classification of membrane protein using Tetra Peptide Pattern. Anal Biochem 2020; 606:113845. [PMID: 32739352 DOI: 10.1016/j.ab.2020.113845] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2020] [Revised: 06/17/2020] [Accepted: 06/22/2020] [Indexed: 11/29/2022]
Abstract
Membrane proteins play an important role in the life activities of organisms. The mechanism of cell structures and biological activities can be identified only by knowing the functional types of membrane proteins which accelerate the process. Therefore, it is greatly necessary to build up computational approaches for timely and accurate prediction of the functional types of membrane protein. The proposed method analyzes the structure of the membrane proteins using novel Tetra Peptide Pattern (TPP)-based feature extraction technique. A frequency occurrence matrix is created from which a feature vector is formed. This feature vector captures the pattern among amino acids in a membrane protein sequence. The feature vector is reduced in the dimension using General Kernel-based Supervised Principal Component Analysis (GKSPCA). Stacked Restricted Boltzmann Machines (RBM) in Deep Belief Network (DBN) is used for classification. The RBM is the building block of Deep Belief Network. The proposed method achieves good results on two datasets. The performance of the proposed method was analyzed using Accuracy, Specificity, Sensitivity and Mathew's correlation coefficient. The proposed method achieves good results when compared to other state-of-the-art techniques.
Collapse
Affiliation(s)
| | | | - M S Starvin
- University College of Engineering, Nagercoil, 629004, India.
| |
Collapse
|
4
|
Messerli MA, Sarkar A. Advances in Electrochemistry for Monitoring Cellular Chemical Flux. Curr Med Chem 2019; 26:4984-5002. [PMID: 31057100 DOI: 10.2174/0929867326666190506111629] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2018] [Revised: 03/06/2019] [Accepted: 03/12/2019] [Indexed: 11/22/2022]
Abstract
The transport of organic and inorganic molecules, along with inorganic ions across the plasma membrane results in chemical fluxes that reflect the cellular function in healthy and diseased states. Measurement of these chemical fluxes enables the characterization of protein function and transporter stoichiometry, characterization of a single cell and embryo viability prior to implantation, and screening of pharmaceutical agents. Electrochemical sensors emerge as sensitive and non-invasive tools for measuring chemical fluxes immediately outside the cells in the boundary layer, that are capable of monitoring a diverse range of transported analytes including inorganic ions, gases, neurotransmitters, hormones, and pharmaceutical agents. Used on their own or in combination with other methods, these sensors continue to expand our understanding of the function of rare cells and small tissues. Advances in sensor construction and detection strategies continue to improve sensitivity under physiological conditions, diversify analyte detection, and increase throughput. These advances will be discussed in the context of addressing technical challenges to measuring chemical flux in the boundary layer of cells and measuring the resultant changes to the chemical concentration in the bulk media.
Collapse
Affiliation(s)
- Mark A Messerli
- Department of Biology and Microbiology, South Dakota State University, Brookings, SD. United States
| | - Anyesha Sarkar
- Department of Biology and Microbiology, South Dakota State University, Brookings, SD. United States
| |
Collapse
|
5
|
Prediction of membrane protein types by exploring local discriminative information from evolutionary profiles. Anal Biochem 2019; 564-565:123-132. [DOI: 10.1016/j.ab.2018.10.027] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2018] [Revised: 10/23/2018] [Accepted: 10/25/2018] [Indexed: 11/17/2022]
|
6
|
Sankari ES, Manimegalai D. Predicting membrane protein types by incorporating a novel feature set into Chou's general PseAAC. J Theor Biol 2018; 455:319-328. [DOI: 10.1016/j.jtbi.2018.07.032] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2018] [Revised: 06/27/2018] [Accepted: 07/23/2018] [Indexed: 10/28/2022]
|
7
|
iMem-2LSAAC: A two-level model for discrimination of membrane proteins and their types by extending the notion of SAAC into chou's pseudo amino acid composition. J Theor Biol 2018; 442:11-21. [DOI: 10.1016/j.jtbi.2018.01.008] [Citation(s) in RCA: 83] [Impact Index Per Article: 11.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2017] [Revised: 12/23/2017] [Accepted: 01/10/2018] [Indexed: 02/08/2023]
|
8
|
Sankari ES, Manimegalai D. Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets. J Theor Biol 2017; 435:208-217. [PMID: 28941868 DOI: 10.1016/j.jtbi.2017.09.018] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2017] [Revised: 09/15/2017] [Accepted: 09/18/2017] [Indexed: 12/19/2022]
Abstract
Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditional biophysical methods are used to classify membrane protein types. Due to large exploration of uncharacterized protein sequences in databases, traditional methods are very time consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced datasets and large datasets are often handled well by decision tree classifiers. Since imbalanced datasets are taken, the performance of various decision tree classifiers such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree, REP (Reduced Error Pruning) tree, ensemble methods such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest are analysed. Among the various decision tree classifiers Random forest performs well in less time with good accuracy of 96.35%. Another inference is RUS boost decision tree classifier is able to classify one or two samples in the class with very less samples while the other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive for the classes with fewer samples. Also the performance of decision tree classifiers is compared with SVM (Support Vector Machine) and Naive Bayes classifier.
Collapse
Affiliation(s)
- E Siva Sankari
- Department of CSE, Government College of Engineering, Tirunelveli, Tamil Nadu, India.
| | - D Manimegalai
- Department of IT, National Engineering College, Kovilpatti, Tamil Nadu, India.
| |
Collapse
|
9
|
Butt AH, Rasool N, Khan YD. A Treatise to Computational Approaches Towards Prediction of Membrane Protein and Its Subtypes. J Membr Biol 2016; 250:55-76. [DOI: 10.1007/s00232-016-9937-7] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2016] [Accepted: 11/02/2016] [Indexed: 10/20/2022]
|
10
|
Wan S, Mak MW, Kung SY. Mem-mEN: Predicting Multi-Functional Types of Membrane Proteins by Interpretable Elastic Nets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:706-718. [PMID: 26336143 DOI: 10.1109/tcbb.2015.2474407] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Membrane proteins play important roles in various biological processes within organisms. Predicting the functional types of membrane proteins is indispensable to the characterization of membrane proteins. Recent studies have extended to predicting single- and multi-type membrane proteins. However, existing predictors perform poorly and more importantly, they are often lack of interpretability. To address these problems, this paper proposes an efficient predictor, namely Mem-mEN, which can produce sparse and interpretable solutions for predicting membrane proteins with single- and multi-label functional types. Given a query membrane protein, its associated gene ontology (GO) information is retrieved by searching a compact GO-term database with its homologous accession number, which is subsequently classified by a multi-label elastic net (EN) classifier. Experimental results show that Mem-mEN significantly outperforms existing state-of-the-art membrane-protein predictors. Moreover, by using Mem-mEN, 338 out of more than 7,900 GO terms are found to play more essential roles in determining the functional types. Based on these 338 essential GO terms, Mem-mEN can not only predict the functional type of a membrane protein, but also explain why it belongs to that type. For the reader's convenience, the Mem-mEN server is available online at http://bioinfo.eie.polyu.edu.hk/MemmENServer/.
Collapse
|
11
|
Wan S, Mak MW, Kung SY. Mem-ADSVM: A two-layer multi-label predictor for identifying multi-functional types of membrane proteins. J Theor Biol 2016; 398:32-42. [DOI: 10.1016/j.jtbi.2016.03.013] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2015] [Revised: 03/07/2016] [Accepted: 03/07/2016] [Indexed: 02/06/2023]
|
12
|
A Prediction Model for Membrane Proteins Using Moments Based Features. BIOMED RESEARCH INTERNATIONAL 2016; 2016:8370132. [PMID: 26966690 PMCID: PMC4761391 DOI: 10.1155/2016/8370132] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/27/2015] [Accepted: 01/12/2016] [Indexed: 01/29/2023]
Abstract
The most expedient unit of the human body is its cell. Encapsulated within the cell are many infinitesimal entities and molecules which are protected by a cell membrane. The proteins that are associated with this lipid based bilayer cell membrane are known as membrane proteins and are considered to play a significant role. These membrane proteins exhibit their effect in cellular activities inside and outside of the cell. According to the scientists in pharmaceutical organizations, these membrane proteins perform key task in drug interactions. In this study, a technique is presented that is based on various computationally intelligent methods used for the prediction of membrane protein without the experimental use of mass spectrometry. Statistical moments were used to extract features and furthermore a Multilayer Neural Network was trained using backpropagation for the prediction of membrane proteins. Results show that the proposed technique performs better than existing methodologies.
Collapse
|
13
|
Wu CY, Li QZ, Feng ZX. Non-coding RNA identification based on topology secondary structure and reading frame in organelle genome level. Genomics 2015; 107:9-15. [PMID: 26697761 DOI: 10.1016/j.ygeno.2015.12.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2015] [Revised: 12/08/2015] [Accepted: 12/12/2015] [Indexed: 10/22/2022]
Abstract
Non-coding RNA (ncRNA) genes make transcripts as same as the encoding genes, and ncRNAs directly function as RNAs rather than serve as blueprints for proteins. As the function of ncRNA is closely related to organelle genomes, it is desirable to explore ncRNA function by confirming its provenance. In this paper, the topology secondary structure, motif and the triplets under three reading frames are considered as parameters of ncRNAs. A method of SVM combining the increment of diversity (ID) algorithm is applied to construct the classifier. When the method is applied to the ncRNA dataset less than 80% sequence identity, the overall accuracies reach 95.57%, 96.40% in the five-fold cross-validation and the jackknife test, respectively. Further, for the independent testing dataset, the average prediction success rate of our method achieved 93.24%. The higher predictive success rates indicate that our method is very helpful for distinguishing ncRNAs from various organelle genomes.
Collapse
Affiliation(s)
- Cheng-Yan Wu
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| | - Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China.
| | - Zhen-Xing Feng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| |
Collapse
|
14
|
Wan S, Mak MW, Kung SY. mLASSO-Hum: A LASSO-based interpretable human-protein subcellular localization predictor. J Theor Biol 2015; 382:223-34. [DOI: 10.1016/j.jtbi.2015.06.042] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2015] [Revised: 06/25/2015] [Accepted: 06/26/2015] [Indexed: 02/03/2023]
|
15
|
Han GS, Yu ZG, Anh V. A two-stage SVM method to predict membrane protein types by incorporating amino acid classifications and physicochemical properties into a general form of Chou's PseAAC. J Theor Biol 2013; 344:31-9. [PMID: 24316387 DOI: 10.1016/j.jtbi.2013.11.017] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2013] [Revised: 10/16/2013] [Accepted: 11/24/2013] [Indexed: 01/12/2023]
Abstract
Membrane proteins play important roles in many biochemical processes and are also attractive targets of drug discovery for various diseases. The elucidation of membrane protein types provides clues for understanding the structure and function of proteins. Recently we developed a novel system for predicting protein subnuclear localizations. In this paper, we propose a simplified version of our system for predicting membrane protein types directly from primary protein structures, which incorporates amino acid classifications and physicochemical properties into a general form of pseudo-amino acid composition. In this simplified system, we will design a two-stage multi-class support vector machine combined with a two-step optimal feature selection process, which proves very effective in our experiments. The performance of the present method is evaluated on two benchmark datasets consisting of five types of membrane proteins. The overall accuracies of prediction for five types are 93.25% and 96.61% via the jackknife test and independent dataset test, respectively. These results indicate that our method is effective and valuable for predicting membrane protein types. A web server for the proposed method is available at http://www.juemengt.com/jcc/memty_page.php.
Collapse
Affiliation(s)
- Guo-Sheng Han
- School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China
| | - Zu-Guo Yu
- School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China; School of Mathematical Science, Queensland University of Technology, GPO Box 2434, Brisbane Q 4001, Australia.
| | - Vo Anh
- School of Mathematical Science, Queensland University of Technology, GPO Box 2434, Brisbane Q 4001, Australia
| |
Collapse
|
16
|
Liu B, Wang X, Zou Q, Dong Q, Chen Q. Protein Remote Homology Detection by Combining Chou’s Pseudo Amino Acid Composition and Profile-Based Protein Representation. Mol Inform 2013; 32:775-82. [DOI: 10.1002/minf.201300084] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2013] [Accepted: 06/11/2013] [Indexed: 11/12/2022]
|
17
|
Predicting acidic and alkaline enzymes by incorporating the average chemical shift and gene ontology informations into the general form of Chou's PseAAC. Process Biochem 2013. [DOI: 10.1016/j.procbio.2013.05.012] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
18
|
Hayat M, Khan A. Prediction of Membrane Protein Types Using Pseudo-Amino Acid Composition and Ensemble Classification. ACTA ACUST UNITED AC 2013. [DOI: 10.7763/ijcee.2013.v5.752] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/01/2022]
|
19
|
Chen YK, Li KB. Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's pseudo amino acid composition. J Theor Biol 2012; 318:1-12. [PMID: 23137835 DOI: 10.1016/j.jtbi.2012.10.033] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2012] [Revised: 10/25/2012] [Accepted: 10/26/2012] [Indexed: 01/04/2023]
Abstract
The type information of un-annotated membrane proteins provides an important hint for their biological functions. The experimental determination of membrane protein types, despite being more accurate and reliable, is not always feasible due to the costly laboratory procedures, thereby creating a need for the development of bioinformatics methods. This article describes a novel computational classifier for the prediction of membrane protein types using proteins' sequences. The classifier, comprising a collection of one-versus-one support vector machines, makes use of the following sequence attributes: (1) the cationic patch sizes, the orientation, and the topology of transmembrane segments; (2) the amino acid physicochemical properties; (3) the presence of signal peptides or anchors; and (4) the specific protein motifs. A new voting scheme was implemented to cope with the multi-class prediction. Both the training and the testing sequences were collected from SwissProt. Homologous proteins were removed such that there is no pair of sequences left in the datasets with a sequence identity higher than 40%. The performance of the classifier was evaluated by a Jackknife cross-validation and an independent testing experiments. Results show that the proposed classifier outperforms earlier predictors in prediction accuracy in seven of the eight membrane protein types. The overall accuracy was increased from 78.3% to 88.2%. Unlike earlier approaches which largely depend on position-specific substitution matrices and amino acid compositions, most of the sequence attributes implemented in the proposed classifier have supported literature evidences. The classifier has been deployed as a web server and can be accessed at http://bsaltools.ym.edu.tw/predmpt.
Collapse
Affiliation(s)
- Yen-Kuang Chen
- Institute of Biomedical Informatics, National Yang-Ming University, No.155, Sec 2, Lih-Nong Street, Taipei, 112, Taiwan, ROC
| | | |
Collapse
|
20
|
Huang WL. Ranking Gene Ontology terms for predicting non-classical secretory proteins in eukaryotes and prokaryotes. J Theor Biol 2012; 312:105-13. [PMID: 22967952 DOI: 10.1016/j.jtbi.2012.07.027] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2012] [Revised: 05/30/2012] [Accepted: 07/28/2012] [Indexed: 11/24/2022]
Abstract
Protein secretion is an important biological process for both eukaryotes and prokaryotes. Several sequence-based methods mainly rely on utilizing various types of complementary features to design accurate classifiers for predicting non-classical secretory proteins. Gene Ontology (GO) terms are increasing informative in predicting protein functions. However, the number of used GO terms is often very large. For example, there are 60,020 GO terms used in the prediction method Euk-mPLoc 2.0 for subcellular localization. This study proposes a novel approach to identify a small set of m top-ranked GO terms served as the only type of input features to design a support vector machine (SVM) based method Sec-GO to predict non-classical secretory proteins in both eukaryotes and prokaryotes. To evaluate the Sec-GO method, two existing methods and their used datasets are adopted for performance comparisons. The Sec-GO method using m=436 GO terms yields an independent test accuracy of 96.7% on mammalian proteins, much better than the existing method SPRED (82.2%) which uses frequencies of tri-peptides and short peptides, secondary structure, and physicochemical properties as input features of a random forest classifier. Furthermore, when applying to Gram-positive bacterial proteins, the Sec-GO with m=158 GO terms has a test accuracy of 94.5%, superior to NClassG+ (90.0%) which uses SVM with several feature types, comprising amino acid composition, di-peptides, physicochemical properties and the position specific weighting matrix. Analysis of the distribution of secretory proteins in a GO database indicates the percentage of the non-classical secretory proteins annotated by GO is larger than that of classical secretory proteins in both eukaryotes and prokaryotes. Of the m top-ranked GO features, the top-four GO terms are all annotated by such subcellular locations as GO:0005576 (Extracellular region). Additionally, the method Sec-GO is easily implemented and its web tool of prediction is available at iclab.life.nctu.edu.tw/secgo.
Collapse
Affiliation(s)
- Wen-Lin Huang
- Department of Management Information System, Asia Pacific Institute of Creativity, No. 110 XueFu Rd., Tou Fen, Miaoli, Taiwan, ROC.
| |
Collapse
|
21
|
Fan GL, Li QZ. Predicting protein submitochondria locations by combining different descriptors into the general form of Chou's pseudo amino acid composition. Amino Acids 2011; 43:545-55. [PMID: 22102053 DOI: 10.1007/s00726-011-1143-4] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2011] [Accepted: 10/27/2011] [Indexed: 10/15/2022]
Abstract
Knowledge of the submitochondria location of protein is integral to understanding its function and a necessity in the proteomics era. In this work, a new submitochondria data set is constructed, and an approach for predicting protein submitochondria locations is proposed by combining the amino acid composition, dipeptide composition, reduced physicochemical properties, gene ontology, evolutionary information, and pseudo-average chemical shift. The overall prediction accuracy is 93.57% for the submitochondria location and 97.79% for the three membrane protein types in the mitochondria inner membrane using the algorithm of the increment of diversity combined with the support vector machine. The performance of the pseudo-average chemical shift is excellent. For contrast, the method is also used to predict submitochondria locations in the data set constructed by Du and Li; an accuracy of 94.95% is obtained by our method, which is better than that of other existing methods.
Collapse
Affiliation(s)
- Guo-Liang Fan
- Department of Physics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| | | |
Collapse
|
22
|
Hayat M, Khan A, Yeasin M. Prediction of membrane proteins using split amino acid and ensemble classification. Amino Acids 2011; 42:2447-60. [DOI: 10.1007/s00726-011-1053-5] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2011] [Accepted: 07/29/2011] [Indexed: 02/01/2023]
|
23
|
Khoshniat S, Bourgine A, Julien M, Weiss P, Guicheux J, Beck L. The emergence of phosphate as a specific signaling molecule in bone and other cell types in mammals. Cell Mol Life Sci 2011; 68:205-18. [PMID: 20848155 PMCID: PMC11114507 DOI: 10.1007/s00018-010-0527-z] [Citation(s) in RCA: 127] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2010] [Revised: 08/02/2010] [Accepted: 08/31/2010] [Indexed: 02/07/2023]
Abstract
Although considerable advances in our understanding of the mechanisms of phosphate homeostasis and skeleton mineralization have recently been made, little is known about the initial events involving the detection of changes in the phosphate serum concentrations and the subsequent downstream regulation cascade. Recent data has strengthened a long-established hypothesis that a phosphate-sensing mechanism may be present in various organs. Such a phosphate sensor would detect changes in serum or local phosphate concentration and would inform the body, the local environment, or the individual cell. This suggests that phosphate in itself could represent a signal regulating multiple factors necessary for diverse biological processes such as bone or vascular calcification. This review summarizes findings supporting the possibility that phosphate represents a signaling molecule, particularly in bone and cartilage, but also in other tissues. The involvement of various signaling pathways (ERK1/2), transcription factors (Fra-1, Runx2) and phosphate transporters (PiT1, PiT2) is discussed.
Collapse
Affiliation(s)
- Solmaz Khoshniat
- Group STEP (Skeletal Tissue Engineering and Physiopathology), Centre for Osteoarticular and Dental Tissue Engineering (LIOAD), INSERM, U791, 44042 Nantes, France
- UFR Odontologie, Pres UNAM, 44042 Nantes, France
| | - Annabelle Bourgine
- Group STEP (Skeletal Tissue Engineering and Physiopathology), Centre for Osteoarticular and Dental Tissue Engineering (LIOAD), INSERM, U791, 44042 Nantes, France
- UFR Odontologie, Pres UNAM, 44042 Nantes, France
| | - Marion Julien
- Group STEP (Skeletal Tissue Engineering and Physiopathology), Centre for Osteoarticular and Dental Tissue Engineering (LIOAD), INSERM, U791, 44042 Nantes, France
- UFR Odontologie, Pres UNAM, 44042 Nantes, France
| | - Pierre Weiss
- Group STEP (Skeletal Tissue Engineering and Physiopathology), Centre for Osteoarticular and Dental Tissue Engineering (LIOAD), INSERM, U791, 44042 Nantes, France
- UFR Odontologie, Pres UNAM, 44042 Nantes, France
| | - Jérôme Guicheux
- Group STEP (Skeletal Tissue Engineering and Physiopathology), Centre for Osteoarticular and Dental Tissue Engineering (LIOAD), INSERM, U791, 44042 Nantes, France
- UFR Odontologie, Pres UNAM, 44042 Nantes, France
| | - Laurent Beck
- Growth and Signalling Research Center, INSERM, U845, 75015 Paris, France
- Faculté de Médecine, Centre de Recherche, INSERM U845, Université Paris Descartes, 156 Rue de Vaugirard, 75015 Paris, France
| |
Collapse
|
24
|
Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol 2010; 273:236-47. [PMID: 21168420 PMCID: PMC7125570 DOI: 10.1016/j.jtbi.2010.12.024] [Citation(s) in RCA: 971] [Impact Index Per Article: 64.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2010] [Revised: 12/08/2010] [Accepted: 12/13/2010] [Indexed: 11/29/2022]
Abstract
With the accomplishment of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace is much slower in determining their biological attributes. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. The unbalanced situation, which has critically limited our ability to timely utilize the newly discovered proteins for basic research and drug development, has called for developing computational methods or high-throughput automated tools for fast and reliably identifying various attributes of uncharacterized proteins based on their sequence information alone. Actually, during the last two decades or so, many methods in this regard have been established in hope to bridge such a gap. In the course of developing these methods, the following things were often needed to consider: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. In this review, we are to discuss each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications as well as its recent development, particularly in how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA.
| |
Collapse
|
25
|
Chen L, Qian Z, Fen K, Cai Y. Prediction of interactiveness between small molecules and enzymes by combining gene ontology and compound similarity. J Comput Chem 2010; 31:1766-76. [PMID: 20033913 DOI: 10.1002/jcc.21467] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Determination of whether a small organic molecule interacts with an enzyme can help to understand the molecular and cellular functions of organisms, and the metabolic pathways. In this research, we present a prediction model, by combining compound similarity and enzyme similarity, to predict the interactiveness between small molecules and enzymes. A dataset consisting of 2859 positive couples of small molecule and enzyme and 286,056 negative couples was employed. Compound similarity is a measurement of how similar two small molecules are, proposed by Hattori et al., J Am Chem Soc 2003, 125, 11853 which can be availed at http://www.genome.jp/ligand-bin/search_compound, while enzyme similarity was obtained by three ways, they are blast method, using gene ontology items and functional domain composition. Then a new distance between a pair of couples was established and nearest neighbor algorithm (NNA) was employed to predict the interactiveness of enzymes and small molecules. A data distribution strategy was adopted to get a better data balance between the positive samples and the negative samples during training the prediction model, by singling out one-fourth couples as testing samples and dividing the rest data into seven training datasets-the rest positive samples were added into each training dataset while only the negative samples were divided. In this way, seven NNAs were built. Finally, simple majority voting system was applied to integrate these seven models to predict the testing dataset, which was demonstrated to have better prediction results than using any single prediction model. As a result, the highest overall prediction accuracy achieved 97.30%.
Collapse
Affiliation(s)
- Lei Chen
- Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, People's Republic of China
| | | | | | | |
Collapse
|
26
|
Roslan R, Othman RM, Shah ZA, Kasim S, Asmuni H, Taliba J, Hassan R, Zakaria Z. Utilizing shared interacting domain patterns and Gene Ontology information to improve protein-protein interaction prediction. Comput Biol Med 2010; 40:555-64. [PMID: 20417930 DOI: 10.1016/j.compbiomed.2010.03.009] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2009] [Revised: 02/07/2010] [Accepted: 03/23/2010] [Indexed: 11/24/2022]
Abstract
Protein-protein interactions (PPIs) play a significant role in many crucial cellular operations such as metabolism, signaling and regulations. The computational methods for predicting PPIs have shown tremendous growth in recent years, but problem such as huge false positive rates has contributed to the lack of solid PPI information. We aimed at enhancing the overlap between computational predictions and experimental results in an effort to partially remove PPIs falsely predicted. The use of protein function predictor named PFP() that are based on shared interacting domain patterns is introduced in this study with the purpose of aiding the Gene Ontology Annotations (GOA). We used GOA and PFP() as agents in a filtering process to reduce false positive pairs in the computationally predicted PPI datasets. The functions predicted by PFP() were extracted from cross-species PPI data in order to assign novel functional annotations for the uncharacterized proteins and also as additional functions for those that are already characterized by the GO (Gene Ontology). The implementation of PFP() managed to increase the chances of finding matching function annotation for the first rule in the filtration process as much as 20%. To assess the capability of the proposed framework in filtering false PPIs, we applied it on the available S. cerevisiae PPIs and measured the performance in two aspects, the improvement made indicated as Signal-to-Noise Ratio (SNR) and the strength of improvement, respectively. The proposed filtering framework significantly achieved better performance than without it in both metrics.
Collapse
Affiliation(s)
- Rosfuzah Roslan
- Laboratory of Computational Intelligence and Biotechnology, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 UTM Skudai, Malaysia
| | | | | | | | | | | | | | | |
Collapse
|
27
|
Chen K, Jiang Y, Du L, Kurgan L. Prediction of integral membrane protein type by collocated hydrophobic amino acid pairs. J Comput Chem 2009; 30:163-72. [DOI: 10.1002/jcc.21053] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
28
|
Toward more accuracy in protein structural bioinformatics: Have novel hybrid modeling procedures been born? J Theor Biol 2009; 256:147. [DOI: 10.1016/j.jtbi.2008.09.027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2008] [Revised: 09/10/2008] [Accepted: 09/20/2008] [Indexed: 11/19/2022]
|
29
|
Shen HB, Chou KC. Identification of proteases and their types. Anal Biochem 2008; 385:153-60. [PMID: 19007742 DOI: 10.1016/j.ab.2008.10.020] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2009] [Revised: 10/13/2008] [Accepted: 10/14/2008] [Indexed: 10/21/2022]
Abstract
Called by many as biology's version of Swiss army knives, proteases cut long sequences of amino acids into fragments and regulate most physiological processes. They are vitally important in the life cycle. Different types of proteases have different action mechanisms and biological processes. With the avalanche of protein sequences generated during the postgenomic age, it is highly desirable for both basic research and drug design to develop a fast and reliable method for identifying the types of proteases according to their sequences or even just for whether they are proteases or not. In this article, three recently developed identification methods in this regard are discussed: (i) FunD-PseAAC, (ii) GO-PseAAC, and (iii) FunD-PsePSSM. The first two were established by hybridizing the FunD (functional domain) approach and the GO (gene ontology) approach, respectively, with the PseAAC (pseudo amino acid composition) approach. The third method was established by fusing the FunD approach with the PsePSSM (pseudo position-specific scoring matrix) approach. Of these three methods, only FunD-PsePSSM has provided a server called ProtIdent (protease identifier), which is freely accessible to the public via the website at http://www.csbio.sjtu.edu.cn/bioinf/Protease. For the convenience of users, a step-by-step guide on how to use ProtIdent is illustrated. Meanwhile, the caveat in using ProtIdent and how to understand the success expectancy rate of a statistical predictor are discussed. Finally, the essence of why ProtIdent can yield a high success rate in identifying proteases and their types is elucidated.
Collapse
Affiliation(s)
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200240, China.
| | | |
Collapse
|
30
|
Prediction of membrane protein types by means of wavelet analysis and cascaded neural networks. J Theor Biol 2008; 254:817-20. [PMID: 18692511 DOI: 10.1016/j.jtbi.2008.07.012] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2008] [Revised: 06/29/2008] [Accepted: 07/03/2008] [Indexed: 11/23/2022]
Abstract
In this study, membrane proteins were classified using the information hidden in their sequences. It was achieved by applying the wavelet analysis to the sequences and consequently extracting several features, each of them revealing a proportion of the information content present in the sequence. The resultant features were made normalized and subsequently fed into a cascaded model developed in order to reduce the effect of the existing bias in the dataset, rising from the difference in size of the membrane protein classes. The results indicate an improvement in prediction accuracy of the model in comparison with similar works. The application of the presented model can be extended to other fields of structural biology due to its efficiency, simplicity and flexibility.
Collapse
|
31
|
González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics 2008; 8:750-78. [DOI: 10.1002/pmic.200700638] [Citation(s) in RCA: 145] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]
|
32
|
Jia P, Qian Z, Feng K, Lu W, Li Y, Cai Y. Prediction of membrane protein types in a hybrid space. J Proteome Res 2008; 7:1131-7. [PMID: 18260610 DOI: 10.1021/pr700715c] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Prediction of the types of membrane proteins is of great importance both for genome-wide annotation and for experimental researchers to understand proteins' functions. We describe a new strategy for the prediction of the types of membrane proteins using the Nearest Neighbor Algorithm. We introduced a bipartite feature space consisting of two kinds of disjoint vectors, proteins' domain profile and proteins' physiochemical characters. Jackknife cross validation test shows that a combination of both features greatly improves the prediction accuracy. Furthermore, the contribution of the physiochemical features to the classification of membrane proteins has also been explored using the feature selection method called "mRMR" (Minimum Redundancy, Maximum Relevance) ( IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27 ( 8), 1226- 1238 ). A more compact set of features that are mostly contributive to membrane protein classification are obtained. The analyses highlighted both hydrophobicity and polarity as the most important features. The predictor with 56 most contributive features achieves an acceptable prediction accuracy of 87.02%. Online prediction service is available freely on our Web site http://pcal.biosino.org/TransmembraneProteinClassification.html.
Collapse
Affiliation(s)
- Peilin Jia
- Graduate School of the Chinese Academy of Sciences, 19 Yuquan Road, Beijing 100039, China
| | | | | | | | | | | |
Collapse
|
33
|
Yan C, Hu J, Wang Y. Discrimination of outer membrane proteins using a K-nearest neighbor method. Amino Acids 2008; 35:65-73. [PMID: 18219549 DOI: 10.1007/s00726-007-0628-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2007] [Accepted: 10/28/2007] [Indexed: 11/29/2022]
Abstract
Identification of outer membrane proteins (OMPs) from genome is an important task. This paper presents a k-nearest neighbor (K-NN) method for discriminating outer membrane proteins (OMPs). The method makes predictions based on a weighted Euclidean distance that is computed from residue composition. The method achieves 89.1% accuracy with 0.668 MCC (Matthews correlation coefficient) in discriminating OMPs and non-OMPs. The performance of the method is improved by including homologous information into the calculation of residue composition. The final method achieves an accuracy of 96.1%, with 0.873 MCC, 87.5% sensitivity, and 98.2% specificity. Comparisons with multiple recently published methods show that the method proposed in this study outperforms the others.
Collapse
Affiliation(s)
- C Yan
- Department of Computer Science, Utah State University, Logan, UT 84322-4205, USA.
| | | | | |
Collapse
|
34
|
Pu X, Guo J, Leung H, Lin Y. Prediction of membrane protein types from sequences and position-specific scoring matrices. J Theor Biol 2007; 247:259-65. [PMID: 17433369 DOI: 10.1016/j.jtbi.2007.01.016] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2006] [Revised: 12/22/2006] [Accepted: 01/18/2007] [Indexed: 11/15/2022]
Abstract
Membrane protein plays an important role in some biochemical process such as signal transduction, transmembrane transport, etc. Membrane proteins are usually classified into five types [Chou, K.C., Elrod, D.W., 1999. Prediction of membrane protein types and subcellular locations. Proteins: Struct. Funct. Genet. 34, 137-153] or six types [Chou, K.C., Cai, Y.D., 2005. J. Chem. Inf. Modelling 45, 407-413]. Designing in silico methods to identify and classify membrane protein can help us understand the structure and function of unknown proteins. This paper introduces an integrative approach, IAMPC, to classify membrane proteins based on protein sequences and protein profiles. These modules extract the amino acid composition of the whole profiles, the amino acid composition of N-terminal and C-terminal profiles, the amino acid composition of profile segments and the dipeptide composition of the whole profiles. In the computational experiment, the overall accuracy of the proposed approach is comparable with the functional-domain-based method. In addition, the performance of the proposed approach is complementary to the functional-domain-based method for different membrane protein types.
Collapse
Affiliation(s)
- Xian Pu
- Department of Computer Sciences, The City University of Hong Kong, Hong Kong
| | | | | | | |
Collapse
|
35
|
Guo J, Pu X, Lin Y, Leung H. Protein subcellular localization based on PSI-BLAST and machine learning. J Bioinform Comput Biol 2007; 4:1181-95. [PMID: 17245809 DOI: 10.1142/s0219720006002405] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2006] [Revised: 08/02/2006] [Accepted: 08/02/2006] [Indexed: 11/18/2022]
Abstract
Subcellular location is an important functional annotation of proteins. An automatic, reliable and efficient prediction system for protein subcellular localization is necessary for large-scale genome analysis. This paper describes a protein subcellular localization method which extracts features from protein profiles rather than from amino acid sequences. The protein profile represents a protein family, discards part of the sequence information that is not conserved throughout the family and therefore is more sensitive than the amino acid sequence. The amino acid compositions of whole profile and the N-terminus of the profile are extracted, respectively, to train and test the probabilistic neural network classifiers. On two benchmark datasets, the overall accuracies of the proposed method reach 89.1% and 68.9%, respectively. The prediction results show that the proposed method perform better than those methods based on amino acid sequences. The prediction results of the proposed method are also compared with Subloc on two redundance-reduced datasets.
Collapse
Affiliation(s)
- Jian Guo
- Laboratory of Statistical Computation, Department of Mathematical Sciences, Tsinghua University, China
| | | | | | | |
Collapse
|
36
|
Yang XG, Luo RY, Feng ZP. Using amino acid and peptide composition to predict membrane protein types. Biochem Biophys Res Commun 2006; 353:164-9. [PMID: 17174938 DOI: 10.1016/j.bbrc.2006.12.004] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2006] [Accepted: 12/01/2006] [Indexed: 11/23/2022]
Abstract
Membrane proteins play an important role in many biological processes and are attractive drug targets. Determination of membrane protein structures or topologies by experimental methods is expensive and time consuming. Effective computational method in predicting the membrane protein types can provide useful information for large amount of protein sequences emerging in the post-genomic era. Although numerous algorithms have addressed this issue, the methods of extracting efficient protein sequence information are very limit. In this study, we provide a method of extracting high order sequence information with the stepwise discriminant analysis. Some important amino acids and peptides that are distinct for different types of the membrane proteins have been identified and their occurrence frequencies in membrane proteins can be used to predict the types of the membrane proteins. Consequently, an accuracy of 86.5% in the cross-validation test, and 99.8% in the resubstitution test has been achieved for a non-redundant dataset, which includes type-I, type-II, multipass transmembrane proteins, lipid chain-anchored and GPI-anchored membrane proteins. The fingerprint features of the identified peptides in each membrane protein type are also discussed.
Collapse
Affiliation(s)
- Xiao-Guang Yang
- Department of Biological Engineering, University of Missouri-Columbia, Columbia, MO 65211, USA
| | | | | |
Collapse
|
37
|
Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence. BMC Bioinformatics 2006; 7:518. [PMID: 17134515 PMCID: PMC1716183 DOI: 10.1186/1471-2105-7-518] [Citation(s) in RCA: 137] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2006] [Accepted: 11/30/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Knowing the submitochondria localization of a mitochondria protein is an important step to understand its function. We develop a method which is based on an extended version of pseudo-amino acid composition to predict the protein localization within mitochondria. This work goes one step further than predicting protein subcellular location. We also try to predict the membrane protein type for mitochondrial inner membrane proteins. RESULTS By using leave-one-out cross validation, the prediction accuracy is 85.5% for inner membrane, 94.5% for matrix and 51.2% for outer membrane. The overall prediction accuracy for submitochondria location prediction is 85.2%. For proteins predicted to localize at inner membrane, the accuracy is 94.6% for membrane protein type prediction. CONCLUSION Our method is an effective method for predicting protein submitochondria location. But even with our method or the methods at subcellular level, the prediction of protein submitochondria location is still a challenging problem. The online service SubMito is now available at: http://bioinfo.au.tsinghua.edu.cn/subMito.
Collapse
|
38
|
Guo J, Lin Y, Liu X. GNBSL: A new integrative system to predict the subcellular location for Gram-negative bacteria proteins. Proteomics 2006; 6:5099-105. [PMID: 16955516 DOI: 10.1002/pmic.200600064] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
This paper proposes a new integrative system (GNBSL--Gram-negative bacteria subcellular localization) for subcellular localization specifized on the Gram-negative bacteria proteins. First, the system generates a position-specific frequency matrix (PSFM) and a position-specific scoring matrix (PSSM) for each protein sequence by searching the Swiss-Prot database. Then different features are extracted by four modules from the PSFM and the PSSM. The features include whole-sequence amino acid composition, N- and C-terminus amino acid composition, dipeptide composition, and segment composition. Four probabilistic neural network (PNN) classifiers are used to classify these modules. To further improve the performance, two modules trained by support vector machine (SVM) are added in this system. One module extracts the residue-couple distribution from the amino acid sequence and the other module applies a pairwise profile alignment kernel to measure the local similarity between every two sequences. Finally, an additional SVM is used to fuse the outputs from the six modules. Test on a benchmark dataset shows that the overall success rate of GNBSL is higher than those of PSORT-B, CELLO, and PSLpred. A web server GNBSL can be visited from http://166.111.24.5/webtools/GNBSL/index.htm.
Collapse
Affiliation(s)
- Jian Guo
- Department of Mathematical Sciences, Laboratory of Statistical Computing & Bioinformatics, Tsinghua University, Beijing, P R China.
| | | | | |
Collapse
|
39
|
Using stacked generalization to predict membrane protein types based on pseudo-amino acid composition. J Theor Biol 2006; 242:941-6. [PMID: 16806277 DOI: 10.1016/j.jtbi.2006.05.006] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2006] [Revised: 03/28/2006] [Accepted: 05/03/2006] [Indexed: 11/17/2022]
Abstract
Membrane proteins are vitally important for many biological processes and have become an attractive target for both basic research and drug design. Knowledge of membrane protein types often provides useful clues in deducing the functions of uncharacterized membrane proteins. With the unprecedented increasing of newly found protein sequences in the post-genomic era, it is highly demanded to develop an automated method for fast and accurately identifying the types of membrane proteins according to their amino acid sequences. Although quite a few identifiers have been developed in this regard through various approaches, such as covariant discriminant (CD), support vector machine (SVM), artificial neural network (ANN), and K-nearest neighbor (KNN), classifier the way they operate the identification is basically individual. As is well known, wise persons usually take into account the opinions from several experts rather than rely on only one when they are making critical decisions. Likewise, a sophisticated identifier should be trained by several different modes. In view of this, based on the frame of pseudo-amino acid that can incorporate a considerable amount of sequence-order effects, a novel approach called "stacked generalization" or "stacking" has been introduced. Unlike the "bagging" and "boosting" approaches which only combine the classifiers of a same type, the stacking approach can combine several different types of classifiers through a meta-classifier to maximize the generalization accuracy. The results thus obtained were very encouraging. It is anticipated that the stacking approach may also hold a high potential to improve the identification quality for, among many other protein attributes, subcellular location, enzyme family class, protease type, and protein-protein interaction type. The stacked generalization classifier is available as a web-server named "SG-MPt_Pred" at: http://202.120.37.186/bioinf/wangsq/service.htm.
Collapse
|
40
|
Liu H, Yang J, Wang M, Xue L, Chou KC. Using fourier spectrum analysis and pseudo amino acid composition for prediction of membrane protein types. Protein J 2006; 24:385-9. [PMID: 16323044 DOI: 10.1007/s10930-005-7592-4] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Membrane proteins are generally classified into the following five types: (1) type I membrane protein, (2) type II membrane protein, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins, and (5) GPI-anchored membrane proteins. Given the sequence of an uncharacterized membrane protein, how can we identify which one of the above five types it belongs to? This is important because the biological function of a membrane protein is closely correlated with its type. Particularly, with the explosion of protein sequences entering into databanks, it is in high demand to develop an automated method to address this problem. To realize this, the key is to catch the statistical characteristics for each of the five types. However, it is not easy because they are buried in a pile of long and complicated sequences. In this paper, based on the concept of the pseudo amino acid composition (Chou, K. C. (2001). PROTEINS: Structure, Function, and Genetics 43: 246-255), the technique of Fourier spectrum analysis is introduced. By doing so, the sample of a protein is represented by a set of discrete components that can incorporate a considerable amount of the sequence order effects as well as its amino acid composition information. On the basis of such a statistical frame, the support vector machine (SVM) is introduced to perform predictions. High success rates were yielded by the self-consistency test, jackknife test, and independent dataset test, suggesting that the current approach holds a promising potential to become a high throughput tool for membrane protein type prediction as well as other related areas.
Collapse
Affiliation(s)
- Hui Liu
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, 200030, China
| | | | | | | | | |
Collapse
|
41
|
Zhou GP, Cai YD. Predicting protease types by hybridizing gene ontology and pseudo amino acid composition. Proteins 2006; 63:681-4. [PMID: 16456852 DOI: 10.1002/prot.20898] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
Proteases play a vitally important role in regulating most physiological processes. Different types of proteases perform different functions with different biological processes. Therefore, it is highly desired to develop a fast and reliable means to identify the types of proteases according to their sequences, or even just identify whether they are proteases or nonproteases. The avalanche of protein sequences generated in the postgenomic era has made such a challenge become even more critical and urgent. By hybridizing the gene ontology approach and pseudo amino acid composition approach, a powerful predictor called GO-PseAA predictor was introduced to address the problems. To avoid redundancy and bias, demonstrations were performed on a dataset where none of proteins has >/= 25% sequence identity to any other. The overall success rates thus obtained by the jackknife cross-validation test in identifying protease and nonprotease was 91.82%, and that in identifying the protease type was 85.49% among the following five types: (1) aspartic, (2) cysteine, (3) metallo, (4) serine, and (5) threonine. The high jackknife success rates yielded for such a stringent dataset indicate the GO-PseAA predictor is very powerful and might become a useful tool in bioinformatics and proteomics.
Collapse
Affiliation(s)
- Guo-Ping Zhou
- Center for Vascular Biology Research, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts 02115, USA.
| | | |
Collapse
|
42
|
Cai YD, Feng KY, Lu WC, Chou KC. Using LogitBoost classifier to predict protein structural classes. J Theor Biol 2006; 238:172-6. [PMID: 16043193 DOI: 10.1016/j.jtbi.2005.05.034] [Citation(s) in RCA: 105] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2005] [Revised: 05/04/2005] [Accepted: 05/05/2005] [Indexed: 11/19/2022]
Abstract
Prediction of protein classification is an important topic in molecular biology. This is because it is able to not only provide useful information from the viewpoint of structure itself, but also greatly stimulate the characterization of many other features of proteins that may be closely correlated with their biological functions. In this paper, the LogitBoost, one of the boosting algorithms developed recently, is introduced for predicting protein structural classes. It performs classification using a regression scheme as the base learner, which can handle multi-class problems and is particularly superior in coping with noisy data. It was demonstrated that the LogitBoost outperformed the support vector machines in predicting the structural classes for a given dataset, indicating that the new classifier is very promising. It is anticipated that the power in predicting protein structural classes as well as many other bio-macromolecular attributes will be further strengthened if the LogitBoost and some other existing algorithms can be effectively complemented with each other.
Collapse
Affiliation(s)
- Yu-Dong Cai
- Department of Chemistry, College of Sciences, Shanghai University, 99 Shang-Da Road, Shanghai 200436, China
| | | | | | | |
Collapse
|
43
|
Lubec G, Afjehi-Sadat L, Yang JW, John JPP. Searching for hypothetical proteins: theory and practice based upon original data and literature. Prog Neurobiol 2005; 77:90-127. [PMID: 16271823 DOI: 10.1016/j.pneurobio.2005.10.001] [Citation(s) in RCA: 120] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2005] [Revised: 09/18/2005] [Accepted: 10/02/2005] [Indexed: 12/29/2022]
Abstract
A large part of mammalian proteomes is represented by hypothetical proteins (HP), i.e. proteins predicted from nucleic acid sequences only and protein sequences with unknown function. Databases are far from being complete and errors are expected. The legion of HP is awaiting experiments to show their existence at the protein level and subsequent bioinformatic handling in order to assign proteins a tentative function is mandatory. Two-dimensional gel-electrophoresis with subsequent mass spectrometrical identification of protein spots is an appropriate tool to search for HP in the high-throughput mode. Spots are identified by MS or by MS/MS measurements (MALDI-TOF, MALDI-TOF-TOF) and subsequent software as e.g. Mascot or ProFound. In many cases proteins can thus be unambiguously identified and characterised; if this is not the case, de novo sequencing or Q-TOF analysis is warranted. If the protein is not identified, the sequence is being sent to databases for BLAST searches to determine identities/similarities or homologies to known proteins. If no significant identity to known structures is observed, the protein sequence is examined for the presence of functional domains (databases PROSITE, PRINTS, InterPro, ProDom, Pfam and SMART), subjected to searches for motifs (ELM) and finally protein-protein interaction databases (InterWeaver, STRING) are consulted or predictions from conformations are performed. We here provide information about hypothetical proteins in terms of protein chemical analysis, independent of antibody availability and specificity and bioinformatic handling to contribute to the extension/completion of protein databases and include original work on HP in the brain to illustrate the processes of HP identification and functional assignment.
Collapse
Affiliation(s)
- Gert Lubec
- Department of Pediatrics, Division of Basic Sciences, Medical University of Vienna, Waehringer Guertel 18-20, A-1090, Vienna, Austria.
| | | | | | | |
Collapse
|
44
|
Shen HB, Yang J, Liu XJ, Chou KC. Using supervised fuzzy clustering to predict protein structural classes. Biochem Biophys Res Commun 2005; 334:577-81. [PMID: 16023077 DOI: 10.1016/j.bbrc.2005.06.128] [Citation(s) in RCA: 95] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2005] [Accepted: 06/12/2005] [Indexed: 11/23/2022]
Abstract
Prediction of protein classification is both an important and a tempting topic in protein science. This is because of not only that the knowledge thus obtained can provide useful information about the overall structure of a query protein, but also that the practice itself can technically stimulate the development of novel predictors that may be straightforwardly applied to many other relevant areas. In this paper, a novel approach, the so-called "supervised fuzzy clustering approach" is introduced that is featured by utilizing the class label information during the training process. Based on such an approach, a set of "if-then" fuzzy rules for predicting the protein structural classes are extracted from a training dataset. It has been demonstrated through two different working datasets that the overall success prediction rates obtained by the supervised fuzzy clustering approach are all higher than those by the unsupervised fuzzy c-means introduced by the previous investigators [C.T. Zhang, K.C. Chou, G.M. Maggiora. Protein Eng. (1995) 8, 425-435]. It is anticipated that the current predictor may play an important complementary role to other existing predictors in this area to further strengthen the power in predicting the structural classes of proteins and their other characteristic attributes.
Collapse
Affiliation(s)
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, China
| | | | | | | |
Collapse
|
45
|
Shen H, Chou KC. Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types. Biochem Biophys Res Commun 2005; 334:288-92. [PMID: 16002049 DOI: 10.1016/j.bbrc.2005.06.087] [Citation(s) in RCA: 128] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2005] [Accepted: 06/14/2005] [Indexed: 10/25/2022]
Abstract
Knowledge of membrane protein type often provides crucial hints toward determining the function of an uncharacterized membrane protein. With the avalanche of new protein sequences emerging during the post-genomic era, it is highly desirable to develop an automated method that can serve as a high throughput tool in identifying the types of newly found membrane proteins according to their primary sequences, so as to timely make the relevant annotations on them for the reference usage in both basic research and drug discovery. Based on the concept of pseudo-amino acid composition [K.C. Chou, Proteins: Struct. Funct. Genet. 43 (2001) 246-255; Erratum: Proteins: Struct. Funct. Genet. 44 (2001) 60] that has made it possible to incorporate a considerable amount of sequence-order effects by representing a protein sample in terms of a set of discrete numbers, a novel predictor, the so-called "optimized evidence-theoretic K-nearest neighbor" or "OET-KNN" classifier, was proposed. It was demonstrated via the self-consistency test, jackknife test, and independent dataset test that the new predictor, compared with many previous ones, yielded higher success rates in most cases. The new predictor can also be used to improve the prediction quality for, among many other protein attributes, structural class, subcellular localization, enzyme family class, and G-protein coupled receptor type. The OET-KNN classifier will be available as a web-server at http://www.pami.sjtu.edu.cn/kcchou.
Collapse
Affiliation(s)
- Hongbin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, China
| | | |
Collapse
|
46
|
Cai YD, Chou KC. Predicting membrane protein type by functional domain composition and pseudo-amino acid composition. J Theor Biol 2005; 238:395-400. [PMID: 16040052 DOI: 10.1016/j.jtbi.2005.05.035] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2005] [Revised: 05/25/2005] [Accepted: 05/26/2005] [Indexed: 10/25/2022]
Abstract
Given the sequence of a protein, how can we predict whether it is a membrane protein or non-membrane protein? If it is, what membrane protein type it belongs to? Since these questions are closely relevant to the function of an uncharacterized protein, their importance is self-evident. Particularly, with the explosion of protein sequences entering into databanks and the relatively much slower progress in using biochemical experiments to determine their functions, it is highly desired to develop an automated method that can be used to give a fast answers to these questions. By hybridizing the functional domain (FunD) and pseudo-amino acid composition (PseAA), a new strategy called FunD-PseAA predictor was introduced. To test the power of the predictor, a highly non-homologous data set was constructed where none of proteins has 25% sequence identity to any other. The overall success rates obtained with the FunD-PseAA predictor on such a data set by the jackknife cross-validation test was 85% for the case in identifying membrane protein and non-membrane protein, and 91% in identifying the membrane protein type among the following 5 categories: (1) type-1 membrane protein, (2) type-2 membrane protein, (3) multipass transmembrane protein, (4) lipid chain-anchored membrane protein, and (5) GPI-anchored membrane protein. These rates are much higher than those obtained by the other methods on the same stringent data set, indicating that the FunD-PseAA predictor may become a useful high throughput tool in bioinformatics and proteomics.
Collapse
Affiliation(s)
- Yu-Dong Cai
- Biomolecular Sciences Department, University of Manchester Institute of Science & Technology, P.O. Box 88, Manchester, M60 1QD, UK.
| | | |
Collapse
|