1
Ismail H, White C, Al-Barakati H, Newman RH, Kc DB. FEPS: A Tool for Feature Extraction from Protein Sequence. Methods Mol Biol 2022; 2499:65-104. [PMID: 35696075 DOI: 10.1007/978-1-0716-2317-6_3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/15/2023]
Abstract
Machine learning has become one of the most popular choices for developing computational approaches in protein structural bioinformatics. The ability to extract features from protein sequence/structure is often one of the crucial steps in the development of machine learning-based approaches. Over the years, various sequence, structural, and physicochemical descriptors have been developed for proteins, and these descriptors have been used to predict/solve various bioinformatics problems. Hence, several feature extraction tools have been developed to help researchers generate numeric features from protein sequences. Most of these tools have limitations regarding the number of sequences they can handle and the preprocessing that the generated features require before they can be fed to machine learning methods. Here, we present Feature Extraction from Protein Sequences (FEPS), a toolkit for feature extraction. FEPS is a versatile software package for generating various descriptors from protein sequences and can handle multiple sequences, the number of which is limited only by the available computational resources. In addition, the features extracted by FEPS require no subsequent processing and are ready to be fed to machine learning techniques, because FEPS provides various output formats as well as the ability to concatenate the generated features. FEPS is made freely available via an online web server as well as a stand-alone toolkit. FEPS, a comprehensive toolkit for feature extraction, will help spur the development of machine learning-based models for various bioinformatics problems.
Affiliation(s)
- Hamid Ismail
- Department of Animal Science, North Carolina A&T State University, Greensboro, NC, USA
- Clarence White
- Computational Science and Engineering Department, North Carolina A&T State University, Greensboro, NC, USA
- Hussam Al-Barakati
- Department of Computer Science, Jamoum University College, Umm Al-Qura University, Jamoum, Saudi Arabia
- Robert H Newman
- Department of Biology, North Carolina A&T State University, Greensboro, NC, USA
- Dukka B Kc
- Department of Computer Science, Michigan Technological University, Houghton, MI, USA
2
Furutani Y, Yoshihara Y. Proteomic Analysis of Dendritic Filopodia-Rich Fraction Isolated by Telencephalin and Vitronectin Interaction. Front Synaptic Neurosci 2018; 10:27. [PMID: 30147651 PMCID: PMC6097459 DOI: 10.3389/fnsyn.2018.00027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/21/2017] [Accepted: 07/19/2018] [Indexed: 01/13/2023] Open
Abstract
Dendritic filopodia are thin, long, and highly mobile protrusions functioning as spine precursors. By contrast with a wealth of knowledge on molecular profiles in spines, little is known about structural and functional proteins present in dendritic filopodia. To reveal the molecular constituents of dendritic filopodia, we developed a new method for biochemical preparation of proteins enriched in dendritic filopodia, by taking advantage of specific and strong binding between a dendritic filopodial membrane protein, telencephalin, and its extracellular matrix ligand, vitronectin. When vitronectin-coated magnetic microbeads were added onto cultured hippocampal neurons, phagocytic cup-like membrane protrusions were formed on dendrites through the binding to telencephalin. Magnetically purified membrane protrusion fraction was subjected to comprehensive mass spectrometric analysis and 319 proteins were identified, many of which were confirmed to be localized to dendritic filopodia. Thus, this study provides a useful resource for studying molecular mechanisms underlying dendritic development, synapse formation, and plasticity.
Affiliation(s)
- Yutaka Furutani
- Laboratory for Neurobiology of Synapse, RIKEN Brain Science Institute, Saitama, Japan
- Yoshihiro Yoshihara
- Laboratory for Neurobiology of Synapse, RIKEN Brain Science Institute, Saitama, Japan; Laboratory for Systems Molecular Ethology, RIKEN Center for Brain Science, Saitama, Japan
3
Hoseini ASH, Mirzarezaee M. Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks. Iranian Journal of Biotechnology 2018; 16:e1933. [PMID: 31457027 PMCID: PMC6697825 DOI: 10.15171/ijb.1933] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/29/2017] [Revised: 01/11/2018] [Accepted: 01/13/2018] [Indexed: 01/09/2023]
Abstract
Background: Prediction of protein localization is among the most important problems in bioinformatics; it is used to predict where proteins reside in cells and in organelles such as mitochondria. In this study, several machine learning algorithms are applied for the prediction of intracellular protein locations. These algorithms use features extracted from protein sequences. In contrast, protein interactions have been less investigated. Objectives: As protein interactions usually occur in the same or adjacent places, using this feature to find the location would be efficient and effective. This study did not aim at increasing the total accuracy of previously conducted research; it focused on protein interaction features and on how their use leads to a higher accuracy. Materials and Methods: In this study, we examined the protein interaction network as one of the features for predicting protein localization, and its effect on the prediction results. In this regard, we gathered some of the most common features, including Amino Acid Composition, Dipeptide Composition, Pseudo Amino Acid Composition (PseAAC), Position Specific Scoring Matrix (PSSM), Functional Domain, Gene Ontology information, and pair-wise sequence alignment. The classification results are compared to those obtained using protein interactions. To this end, different machine learning algorithms were tested. Results: The best result using a single feature set was obtained with the SVM classifier on the PseAAC feature. The accuracy of combining all features with PPI data, using the Decision Tree and Random Forest classifiers, was 82.49% and 83.35%, respectively. In another experiment, using just protein interaction data with different cutting points resulted in an accuracy of 93.035% for protein location prediction. Conclusion: Overall, it was shown that protein interaction has a significant impact on the prediction of mitochondrial proteins’ location. This feature alone can distinguish the locations well. Using this feature raises the accuracy of the results by up to 5%.
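Two of the sequence features named in this abstract, amino acid composition and dipeptide composition, are simple to compute. The following is a minimal sketch, not the authors' implementation; the function names and the choice to skip non-standard residues are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in a fixed order, so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return 20 residue frequencies; non-standard residues are ignored (assumption)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 frequencies of adjacent residue pairs made of standard residues."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        raise ValueError("sequence contains no standard dipeptides")
    counts = Counter(valid)
    return [counts[a + b] / len(valid) for a, b in product(AMINO_ACIDS, repeat=2)]
```

Concatenating the two lists gives a 420-dimensional vector that can be passed to a classifier such as an SVM.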
Affiliation(s)
- Mitra Mirzarezaee
- Department of Computer Engineering, Science and Research branch, Islamic Azad University, Tehran, Iran; School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
4
Gong AGW, Duan R, Wang HY, Dong TTX, Tsim KWK. Calycosin Orchestrates Osteogenesis of Danggui Buxue Tang in Cultured Osteoblasts: Evaluating the Mechanism of Action by Omics and Chemical Knock-out Methodologies. Front Pharmacol 2018; 9:36. [PMID: 29449812 PMCID: PMC5799702 DOI: 10.3389/fphar.2018.00036] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 10/12/2017] [Accepted: 01/12/2018] [Indexed: 01/12/2023] Open
Abstract
Danggui Buxue Tang (DBT), an ancient Chinese herbal decoction commonly used to mitigate menopausal osteoporosis, contains two herbs: Astragali Radix (AR) and Angelicae Sinensis Radix (ASR). The exact efficacy of individual chemical(s) within DBT, or in any herbal mixture, is hard to reveal. Calycosin and ferulic acid have been reported to be the predominant chemicals found within DBT, and their roles in regulating osteoblastic differentiation are proposed here. To probe the roles of calycosin and ferulic acid, these chemicals were specifically depleted from the DBT extracts. Here, calycosin-depleted DBT (DBTΔcal) and ferulic acid-depleted DBT (DBTΔfa), generated by semi-preparative HPLC, were coupled with RNA-seq and metabolomics analyses to reveal the synergistic functions of individual chemicals within a complex herbal mixture. The expression of osteogenic differentiation markers was significantly increased under treatment with DBT and DBTΔfa. The DBT-induced genes were markedly reduced in the absence of calycosin, i.e., DBTΔcal. In cultured osteoblasts, the DBT-activated Wnt/β-catenin and MAPK/Erk signaling pathways were greatly affected when calycosin was depleted. By metabolomics analysis in DBT-treated osteoblasts, the profile of metabolites triggered by DBTΔcal was distinct from that of DBT and/or DBTΔfa. Thus, our findings indicate that calycosin, rather than ferulic acid, could be an indispensable chemical in DBT to orchestrate multi-components of DBT in achieving maximal osteogenic properties.
Affiliation(s)
- Amy G W Gong
- HKUST Shenzhen Research Institute, Shenzhen, China; Division of Life Science and Center for Chinese Medicine, The Hong Kong University of Science and Technology, Hong Kong, Hong Kong
- Ran Duan
- HKUST Shenzhen Research Institute, Shenzhen, China; Division of Life Science and Center for Chinese Medicine, The Hong Kong University of Science and Technology, Hong Kong, Hong Kong
- Huai Y Wang
- HKUST Shenzhen Research Institute, Shenzhen, China; Division of Life Science and Center for Chinese Medicine, The Hong Kong University of Science and Technology, Hong Kong, Hong Kong
- Tina T X Dong
- HKUST Shenzhen Research Institute, Shenzhen, China; Division of Life Science and Center for Chinese Medicine, The Hong Kong University of Science and Technology, Hong Kong, Hong Kong
- Karl W K Tsim
- HKUST Shenzhen Research Institute, Shenzhen, China; Division of Life Science and Center for Chinese Medicine, The Hong Kong University of Science and Technology, Hong Kong, Hong Kong
5
Arrighetti N, Cossa G, De Cecco L, Stucchi S, Carenini N, Corna E, Gandellini P, Zaffaroni N, Perego P, Gatti L. PKC-alpha modulation by miR-483-3p in platinum-resistant ovarian carcinoma cells. Toxicol Appl Pharmacol 2016; 310:9-19. [DOI: 10.1016/j.taap.2016.08.005] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 06/10/2016] [Revised: 07/27/2016] [Accepted: 08/05/2016] [Indexed: 12/19/2022]
6
TargetFreeze: Identifying Antifreeze Proteins via a Combination of Weights using Sequence Evolutionary Information and Pseudo Amino Acid Composition. J Membr Biol 2015; 248:1005-14. [PMID: 26058944 DOI: 10.1007/s00232-015-9811-z] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 03/26/2015] [Accepted: 05/19/2015] [Indexed: 11/26/2022]
Abstract
Antifreeze proteins (AFPs) are indispensable for living organisms to survive in an extremely cold environment and have a variety of potential biotechnological applications. The accurate prediction of antifreeze proteins has become an important issue and is urgently needed. Although considerable progress has been made, AFP prediction is still a challenging problem due to the diversity of species. In this study, we proposed a new sequence-based AFP predictor, called TargetFreeze. TargetFreeze utilizes an enhanced feature representation method that weightedly combines multiple protein features and takes the powerful support vector machine as the prediction engine. Computer experiments on benchmark datasets demonstrate the superiority of the proposed TargetFreeze over most recently released AFP predictors. We also implemented a user-friendly web server, which is openly accessible for academic use and is available at http://csbio.njust.edu.cn/bioinf/TargetFreeze. TargetFreeze supplements existing AFP predictors and will have potential applications in AFP-related biotechnology fields.
7
Lin YC, Wang CC, Tung CW. An in silico toxicogenomics approach for inferring potential diseases associated with maleic acid. Chem Biol Interact 2014; 223:38-44. [PMID: 25239558 DOI: 10.1016/j.cbi.2014.09.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 05/07/2014] [Revised: 07/25/2014] [Accepted: 09/05/2014] [Indexed: 10/24/2022]
Abstract
Maleic acid is a multi-functional chemical widely applied in the manufacturing of polymer products including food packaging. However, the contamination of maleic acid in modified starch has raised the concerns about the effects of chronic exposure to maleic acid on human health. This study proposed a novel toxicogenomics approach for inferring functions, pathways and diseases potentially affected by maleic acid on humans by using known interactions between maleic acid and proteins. Neuronal signal transmission and cell metabolism were identified to be most influenced by maleic acid in this study. The top disease categories inferred to be associated with maleic acid were mental disorder, nervous system diseases, cardiovascular diseases, and cancers. The results from the in silico analysis showed that maleic acid could penetrate the blood-brain barrier to affect the nervous system. Several functions and pathways were further analyzed and identified to give insights into the mechanisms of maleic acid-associated diseases. The toxicogenomics approach may offer both a better understanding of the potential risks of maleic-acid exposure to humans and a direction for future toxicological investigation.
Affiliation(s)
- Ying-Chi Lin
- School of Pharmacy, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan; Ph.D. Program in Toxicology, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan.
- Chia-Chi Wang
- School of Pharmacy, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan; Ph.D. Program in Toxicology, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan; Institute of Environmental Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan; National Environmental Health Research Center, National Health Research Institutes, Miaoli County, Taiwan.
- Chun-Wei Tung
- School of Pharmacy, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan; Ph.D. Program in Toxicology, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan; National Environmental Health Research Center, National Health Research Institutes, Miaoli County, Taiwan.
8
Zuo YC, Peng Y, Liu L, Chen W, Yang L, Fan GL. Predicting peroxidase subcellular location by hybridizing different descriptors of Chou’ pseudo amino acid patterns. Anal Biochem 2014; 458:14-9. [DOI: 10.1016/j.ab.2014.04.032] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Received: 02/23/2014] [Revised: 04/22/2014] [Accepted: 04/25/2014] [Indexed: 11/28/2022]
9
Li X, Wu X, Wu G. Robust feature generation for protein subchloroplast location prediction with a weighted GO transfer model. J Theor Biol 2014; 347:84-94. [PMID: 24423409 DOI: 10.1016/j.jtbi.2014.01.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/12/2013] [Revised: 10/17/2013] [Accepted: 01/03/2014] [Indexed: 10/25/2022]
Abstract
Chloroplasts are crucial organelles of green plants and eukaryotic algae since they conduct photosynthesis. Predicting the subchloroplast location of a protein can provide important insights for understanding its biological functions. The performance of subchloroplast location prediction algorithms often depends on deriving predictive and succinct features from genomic and proteomic data. In this work, a novel weighted Gene Ontology (GO) transfer model is proposed to generate discriminating features from sequence data and GO Categories. This model contains two components. First, we transfer the GO terms of the homologous protein, and then assign the bit-score as weights to GO features. Second, we employ term-selection methods to determine weights for GO terms. This model is capable of improving prediction accuracy due to the tolerance of the noise derived from homolog knowledge transfer. The proposed weighted GO transfer method based on bit-score and a logarithmic transformation of CHI-square (WS-LCHI) performs better than the baseline models, and also outperforms the four off-the-shelf subchloroplast prediction methods.
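The first component described above, transferring a homolog's GO terms and weighting them by bit-score, can be sketched as follows. This is only an illustration under stated assumptions: the function name, the input layout, and the rule of keeping the highest bit-score when several homologs share a term are hypothetical and not taken from the paper, and the GO identifiers in the usage note are placeholders.

```python
def weighted_go_features(homolog_hits, vocabulary):
    """Build a GO feature vector weighted by alignment bit-score.

    homolog_hits: list of (bit_score, go_terms) pairs, one per homolog hit.
    vocabulary:   ordered list of GO terms that define the feature space.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    features = [0.0] * len(vocabulary)
    for bit_score, go_terms in homolog_hits:
        for term in set(go_terms):
            if term in index:
                # Assumption: keep the strongest evidence when terms repeat.
                features[index[term]] = max(features[index[term]], bit_score)
    return features
```

For example, with the vocabulary `["GO:A", "GO:B", "GO:C"]` and hits `[(250.0, ["GO:A", "GO:B"]), (120.0, ["GO:A"])]`, the function returns `[250.0, 250.0, 0.0]`.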
Affiliation(s)
- Xiaomei Li
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, PR China.
- Xindong Wu
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, PR China; Department of Computer Science, University of Vermont, Burlington, VT 50405, USA.
- Gongqing Wu
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, PR China.
10
Mei S. SVM ensemble based transfer learning for large-scale membrane proteins discrimination. J Theor Biol 2013; 340:105-10. [PMID: 24050851 DOI: 10.1016/j.jtbi.2013.09.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 03/28/2013] [Revised: 09/04/2013] [Accepted: 09/06/2013] [Indexed: 11/16/2022]
Abstract
Membrane proteins play important roles in molecular trans-membrane transport, ligand-receptor recognition, cell-cell interaction, enzyme catalysis, host immune defense response and infectious disease pathways. To date, discriminating membrane proteins remains a challenging problem from the viewpoints of both biological experimental determination and computational modeling. This work presents an SVM ensemble based transfer learning model for membrane protein discrimination (SVM-TLM). To reduce the data constraints on computational modeling, this method investigates the effectiveness of transferring homolog knowledge to the target membrane proteins under the framework of probability weighted ensemble learning. Compared to the multiple kernel learning based transfer learning model, the method takes advantage of sparseness-based SVM optimization on large data and is thus more computationally efficient for large protein data analysis. The experiments on a large membrane protein benchmark dataset show that SVM-TLM achieves significantly better cross validation performance than the baseline model.
Affiliation(s)
- Suyu Mei
- Software College, Shenyang Normal University, Shenyang, China.
11
Butler GS, Overall CM. Matrix metalloproteinase processing of signaling molecules to regulate inflammation. Periodontol 2000 2013; 63:123-48. [DOI: 10.1111/prd.12035] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Accepted: 02/19/2013] [Indexed: 12/12/2022]
12
Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheor 2013; 61:259-68. [PMID: 23475502 DOI: 10.1007/s10441-013-9181-9] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.8] [Received: 07/03/2012] [Accepted: 02/23/2013] [Indexed: 01/25/2023]
Abstract
The mitochondrion is a key organelle of the eukaryotic cell that provides the energy for cellular activities. Correctly identifying submitochondria locations of proteins can provide plentiful information for understanding their functions. However, using wet-experimental methods to recognize submitochondria locations of proteins is time-consuming and costly. Thus, it is highly desirable to develop a bioinformatics method to predict the submitochondria locations of mitochondrion proteins. In this work, a novel method based on support vector machine was developed to predict the submitochondria locations of mitochondrion proteins by using over-represented tetrapeptides selected using the binomial distribution. A reliable and rigorous benchmark dataset including 495 mitochondrion proteins with sequence identity ≤25% was constructed for testing and evaluating the proposed model. Jackknife cross-validated results showed that 91.1% of the 495 mitochondrion proteins can be correctly predicted. Subsequently, our model was evaluated on three existing benchmark datasets. The overall accuracies are 94.0, 94.7 and 93.4%, respectively, suggesting that the proposed model is potentially useful in the realm of mitochondrion proteome research. Based on this model, we built a predictor called TetraMito which is freely available at http://lin.uestc.edu.cn/server/TetraMito.
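The abstract does not spell out how the binomial distribution is used to select tetrapeptides. One common formulation, assumed here purely for illustration, asks how likely it is that a tetrapeptide occurs at least as often in one class as observed if its occurrences were spread across classes in proportion to class size. The function names and the `p_cutoff` parameter are hypothetical.

```python
from collections import Counter
from math import comb


def tetrapeptide_counts(sequences):
    """Count overlapping 4-residue windows across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 3):
            counts[seq[i:i + 4]] += 1
    return counts


def binomial_tail(n, total, q):
    """P(X >= n) for X ~ Binomial(total, q)."""
    return sum(comb(total, m) * q**m * (1 - q) ** (total - m)
               for m in range(n, total + 1))


def over_represented(class_sequences, p_cutoff=0.05):
    """Return, per class, the tetrapeptides whose tail probability is <= p_cutoff."""
    counts = {label: tetrapeptide_counts(seqs)
              for label, seqs in class_sequences.items()}
    totals = Counter()
    for class_counts in counts.values():
        totals.update(class_counts)
    grand_total = sum(totals.values())
    selected = {label: [] for label in counts}
    if grand_total == 0:
        return selected
    for label, class_counts in counts.items():
        # Prior probability that a random tetrapeptide falls in this class.
        q = sum(class_counts.values()) / grand_total
        hits = [(tp, binomial_tail(n, totals[tp], q))
                for tp, n in class_counts.items()]
        selected[label] = sorted((h for h in hits if h[1] <= p_cutoff),
                                 key=lambda h: h[1])
    return selected
```

The selected tetrapeptides would then define the feature space (for example, per-sequence occurrence frequencies) for the SVM.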
13
Wan S, Mak MW, Kung SY. GOASVM: a subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition. J Theor Biol 2013; 323:40-8. [PMID: 23376577 DOI: 10.1016/j.jtbi.2013.01.012] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.8] [Received: 09/07/2012] [Revised: 01/16/2013] [Accepted: 01/16/2013] [Indexed: 01/03/2023]
Abstract
Prediction of protein subcellular localization is an important yet challenging problem. Recently, several computational methods based on Gene Ontology (GO) have been proposed to tackle this problem and have demonstrated superiority over methods based on other features. Existing GO-based methods, however, do not fully use the GO information. This paper proposes an efficient GO method called GOASVM that exploits the information from the GO term frequencies and distant homologs to represent a protein in the general form of Chou's pseudo-amino acid composition. The method first selects a subset of relevant GO terms to form a GO vector space. Then for each protein, the method uses the accession number (AC) of the protein or the ACs of its homologs to find the number of occurrences of the selected GO terms in the Gene Ontology annotation (GOA) database as a means to construct GO vectors for support vector machines (SVMs) classification. With the advantages of GO term frequencies and a new strategy to incorporate useful homologous information, GOASVM can achieve a prediction accuracy of 72.2% on a new independent test set comprising novel proteins that were added to Swiss-Prot six years later than the creation date of the training set. GOASVM and Supplementary materials are available online at http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/GOASVM.html.
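The term-frequency construction described in this abstract can be sketched in a few lines. This is a simplified illustration, not the GOASVM code: the `goa` dictionary stands in for a lookup against the GOA database, and the fallback order (the protein's own accession first, then homologs in the order given) is an assumption.

```python
from collections import Counter


def go_frequency_vector(accession, homolog_accessions, goa, selected_terms):
    """Count occurrences of each selected GO term for one protein.

    goa: hypothetical mapping from accession number to the list of GO terms
         annotated to it (a term may appear more than once).
    """
    # Assumption: use the first accession that has any annotation at all.
    for ac in [accession, *homolog_accessions]:
        annotations = goa.get(ac, [])
        if annotations:
            counts = Counter(annotations)
            return [counts[term] for term in selected_terms]
    # No annotation found for the protein or any homolog.
    return [0] * len(selected_terms)
```

Unlike a binary presence vector, this keeps how many times each term is annotated, which is the information the abstract says earlier GO-based methods left unused.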
Affiliation(s)
- Shibiao Wan
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China.
14
Predicting plant protein subcellular multi-localization by Chou's PseAAC formulation based multi-label homolog knowledge transfer learning. J Theor Biol 2012; 310:80-7. [DOI: 10.1016/j.jtbi.2012.06.028] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.5] [Received: 01/25/2012] [Revised: 05/12/2012] [Accepted: 06/18/2012] [Indexed: 11/21/2022]
15
Mei S. Multi-label multi-kernel transfer learning for human protein subcellular localization. PLoS One 2012; 7:e37716. [PMID: 22719847 PMCID: PMC3374840 DOI: 10.1371/journal.pone.0037716] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 09/29/2011] [Accepted: 04/28/2012] [Indexed: 11/19/2022] Open
Abstract
Recent years have witnessed much progress in computational modelling for protein subcellular localization. However, the existing sequence-based predictive models demonstrate moderate or unsatisfactory performance, and the gene ontology (GO) based models may run the risk of performance overestimation for novel proteins. Furthermore, many human proteins have multiple subcellular locations, which renders the computational modelling more complicated. To date, far too few studies have specialized in predicting the subcellular localization of human proteins that may reside in multiple cellular compartments. In this paper, we propose a multi-label multi-kernel transfer learning model for human protein subcellular localization (MLMK-TLM). MLMK-TLM proposes a multi-label confusion matrix, formally formulates three multi-labelling performance measures and adapts one-against-all multi-class probabilistic outputs to the multi-label learning scenario; on this basis, it further extends our published work GO-TLM (gene ontology based transfer learning model for protein subcellular localization) and MK-TLM (multi-kernel transfer learning based on Chou's PseAAC formulation for protein submitochondria localization) for multiplex human protein subcellular localization. With the advantages of proper homolog knowledge transfer, a comprehensive survey of model performance for novel proteins and multi-labelling capability, MLMK-TLM will gain more practical applicability. The experiments on a human protein benchmark dataset show that MLMK-TLM significantly outperforms the baseline model and demonstrates good multi-labelling ability for novel human proteins. Some findings (predictions) are validated by the latest Swiss-Prot database. The software can be freely downloaded at http://soft.synu.edu.cn/upload/msy.rar.
Affiliation(s)
- Suyu Mei
- Software College, Shenyang Normal University, Shenyang, China.
16
Li L, Zhang Y, Zou L, Li C, Yu B, Zheng X, Zhou Y. An ensemble classifier for eukaryotic protein subcellular location prediction using gene ontology categories and amino acid hydrophobicity. PLoS One 2012; 7:e31057. [PMID: 22303481 PMCID: PMC3268814 DOI: 10.1371/journal.pone.0031057] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Received: 10/31/2011] [Accepted: 12/31/2011] [Indexed: 02/05/2023] Open
Abstract
With the rapid increase of protein sequences in the post-genomic age, it is challenging to develop accurate and automated methods for reliably and quickly predicting their subcellular localizations. Until now, many efforts have been made, but most of them used only a single algorithm. In this paper, we propose an ensemble classifier of KNN (k-nearest neighbor) and SVM (support vector machine) algorithms to predict the subcellular localization of eukaryotic proteins based on a voting system. The overall prediction accuracies by the one-versus-one strategy are 78.17%, 89.94% and 75.55% for three benchmark datasets of eukaryotic proteins. The improved prediction accuracies reveal that GO annotations and hydrophobicity of amino acids help to predict subcellular locations of eukaryotic proteins.
Affiliation(s)
- Liqi Li
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
- Yuan Zhang
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
- Lingyun Zou
- Department of Microbiology, College of Basic Medical Sciences, Third Military Medical University, Chongqing, China
- Changqing Li
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
- Bo Yu
- Department of Orthopedics, Yichun People's Hospital, Yichun, China
- Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai, China
- Scientific Computing Key Laboratory of Shanghai Universities, Shanghai, China
- Yue Zhou
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
17
Mei S. Multi-kernel transfer learning based on Chou's PseAAC formulation for protein submitochondria localization. J Theor Biol 2012; 293:121-30. [DOI: 10.1016/j.jtbi.2011.10.015] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Received: 08/03/2011] [Revised: 10/09/2011] [Accepted: 10/13/2011] [Indexed: 10/16/2022]
18
Du P, Li T, Wang X. Recent progress in predicting protein sub-subcellular locations. Expert Rev Proteomics 2011; 8:391-404. [PMID: 21679119 DOI: 10.1586/epr.11.20] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Indexed: 11/08/2022]
Abstract
In the last two decades, the number of the known protein sequences increased very rapidly. However, a knowledge of protein function only exists for a small portion of these sequences. Since the experimental approaches for determining protein functions are costly and time consuming, in silico methods have been introduced to bridge the gap between knowledge of protein sequences and their functions. Knowing the subcellular location of a protein is considered to be a critical step in understanding its biological functions. Many efforts have been undertaken to predict the protein subcellular locations in silico. With the accumulation of available data, the substructures of some subcellular organelles, such as the cell nucleus, mitochondria and chloroplasts, have been taken into consideration by several studies in recent years. These studies create a new research topic, namely 'protein sub-subcellular location prediction', which goes one level deeper than classic protein subcellular location prediction.
Affiliation(s)
- Pufeng Du
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
19
Mei S, Fei W, Zhou S. Gene ontology based transfer learning for protein subcellular localization. BMC Bioinformatics 2011; 12:44. [PMID: 21284890 PMCID: PMC3039576 DOI: 10.1186/1471-2105-12-44] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Received: 12/30/2010] [Accepted: 02/02/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of the data may not tell the whole story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources to exploit multi-aspect protein feature information. Gene ontology, hereinafter referred to as GO, uses a controlled vocabulary to describe biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenate the GO terms into a flat binary vector or apply majority-vote based ensemble learning for protein subcellular localization, neither of which can estimate the individual discriminative abilities of the three aspects of gene ontology. RESULTS In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse effect of potentially false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time and space complexities are greatly reduced compared with the more complicated semi-definite programming and semi-indefinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. 
We evaluate GO-TLM performance against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets that the baseline models adopted. 5-fold cross validation experiments show that GO-TLM achieves substantial accuracy improvements over the baseline models: 80.38% against 67.40% for Euk-mPLoc, an increase of 12.98%; 96.65% and 96.27% against 89.60% and 89.60% for MultiLoc-GO, increases of 7.05% and 6.67% on dataset MultiLoc plant and dataset MultiLoc animal, respectively; and 97.14%, 95.90% and 96.85% against 83.70%, 90.10% and 85.70% for MultiLoc-GO, increases of 13.44%, 5.80% and 11.15% on dataset BaCelLoc plant, dataset BaCelLoc fungi and dataset BaCelLoc animal, respectively. For the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on dataset BaCelLoc plant holdout, dataset BaCelLoc fungi holdout and dataset BaCelLoc animal holdout, respectively, compared with 76.00%, 60.00% and 73.00% for the baseline model MultiLoc-GO, increases of 5.25%, 20.45% and 6.46%, respectively. CONCLUSIONS Since direct homology-based GO term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test experimental results show that the homology-based GO term transfer and the explicit weighting of the GO kernels substantially improve the prediction performance.
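The linear kernel merging described in the abstract above can be illustrated with a minimal sketch. The abstract does not state how cross-validation results are turned into kernel weights, so the rule used here (weights proportional to each kernel's cross-validation accuracy) is an assumption, and the function names `weights_from_cv_accuracy` and `combine_kernels` are hypothetical and not taken from GO-TLM.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def weights_from_cv_accuracy(accuracies: Sequence[float]) -> List[float]:
    """Normalise per-kernel cross-validation accuracies so they sum to 1.

    Assumption: weight proportional to accuracy. The abstract only says the
    kernels are weighted by cross validation, not which rule is used.
    """
    total = sum(accuracies)
    if total <= 0:
        raise ValueError("at least one accuracy must be positive")
    return [a / total for a in accuracies]


def combine_kernels(kernels: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Return the weighted sum of square kernel matrices of equal size."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for kernel, weight in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * kernel[i][j]
    return combined


# Toy example with two 2x2 kernels; all values are made up for illustration.
toy_kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
toy_weights = weights_from_cv_accuracy([0.5, 0.5])
merged = combine_kernels(toy_kernels, toy_weights)
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so the merged matrix can be passed to any kernel-based classifier that accepts a precomputed kernel.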
Affiliation(s)
- Suyu Mei
- Software College, Shenyang Normal University, Shenyang, PR China.
|
20
|
Scott MS, Boisvert FM, Lamond AI, Barton GJ. PNAC: a protein nucleolar association classifier. BMC Genomics 2011; 12:74. [PMID: 21272300 PMCID: PMC3038921 DOI: 10.1186/1471-2164-12-74] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 09/30/2010] [Accepted: 01/27/2011] [Indexed: 01/11/2023]
Abstract
Background Although primarily known as the site of ribosome subunit production, the nucleolus is involved in numerous and diverse cellular processes. Recent large-scale proteomics projects have identified thousands of human proteins that associate with the nucleolus. However, in most cases, we know neither the fraction of each protein pool that is nucleolus-associated nor whether their association is permanent or conditional. Results To describe the dynamic localisation of proteins in the nucleolus, we investigated the extent of nucleolar association of proteins by first collating an extensively curated literature-derived dataset. This dataset then served to train a probabilistic predictor which integrates gene and protein characteristics. Unlike most previous experimental and computational studies of the nucleolar proteome that produce large static lists of nucleolar proteins regardless of their extent of nucleolar association, our predictor models the fluidity of the nucleolus by considering different classes of nucleolar-associated proteins. The new method predicts all human proteins as either nucleolar-enriched, nucleolar-nucleoplasmic, nucleolar-cytoplasmic or non-nucleolar. Leave-one-out cross validation tests reveal sensitivity values for these four classes ranging from 0.72 to 0.90 and positive predictive values ranging from 0.63 to 0.94. The overall accuracy of the classifier was measured to be 0.85 on an independent literature-based test set and 0.74 using a large independent quantitative proteomics dataset. While the three nucleolar-association groups display vastly different Gene Ontology biological process signatures and evolutionary characteristics, they collectively represent the most well characterised nucleolar functions. Conclusions Our proteome-wide classification of nucleolar association provides a novel representation of the dynamic content of the nucleolus. 
This model of nucleolar localisation thus increases the coverage while providing accurate and specific annotations of the nucleolar proteome. It will be instrumental in better understanding the central role of the nucleolus in the cell and its interaction with other subcellular compartments.
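The per-class sensitivity and positive predictive value reported in the abstract above are standard quantities derived from a confusion matrix. The sketch below shows how they are computed for a multi-class predictor. The function names are hypothetical and the counts in the usage example are invented for illustration; they are not data from the PNAC study.

```python
from typing import Dict, List, Sequence, Tuple


def sensitivity_and_ppv(
    confusion: Sequence[Sequence[int]], labels: Sequence[str]
) -> Dict[str, Tuple[float, float]]:
    """Compute (sensitivity, PPV) per class.

    confusion[i][j] is the number of items whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    result: Dict[str, Tuple[float, float]] = {}
    for k, label in enumerate(labels):
        true_positives = confusion[k][k]
        actual = sum(confusion[k])                    # row sum: true members
        predicted = sum(row[k] for row in confusion)  # column sum: predictions
        sensitivity = true_positives / actual if actual else float("nan")
        ppv = true_positives / predicted if predicted else float("nan")
        result[label] = (sensitivity, ppv)
    return result


def overall_accuracy(confusion: Sequence[Sequence[int]]) -> float:
    """Fraction of all items that lie on the diagonal of the matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Invented two-class counts, for illustration only.
toy_labels: List[str] = ["nucleolar", "non-nucleolar"]
toy_confusion = [
    [8, 2],  # true nucleolar: 8 predicted nucleolar, 2 predicted non-nucleolar
    [1, 9],  # true non-nucleolar: 1 predicted nucleolar, 9 predicted non-nucleolar
]
toy_metrics = sensitivity_and_ppv(toy_confusion, toy_labels)
toy_accuracy = overall_accuracy(toy_confusion)
```

The same functions apply unchanged to a four-class matrix such as one covering the nucleolar-enriched, nucleolar-nucleoplasmic, nucleolar-cytoplasmic and non-nucleolar classes.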
Affiliation(s)
- Michelle S Scott
- Division of Biological Chemistry and Drug Discovery, College of Life Sciences, University of Dundee, Dow Street, Dundee DD1 5EH, UK.
|
21
|
Ma J, Gu H. A novel method for predicting protein subcellular localization based on pseudo amino acid composition. BMB Rep 2010; 43:670-6. [DOI: 10.5483/bmbrep.2010.43.10.670] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/20/2022]
|