1
Varshney N, Mishra AK. Deep Learning in Phosphoproteomics: Methods and Application in Cancer Drug Discovery. Proteomes 2023; 11:proteomes11020016. [PMID: 37218921 DOI: 10.3390/proteomes11020016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/13/2023] [Revised: 04/24/2023] [Accepted: 04/25/2023] [Indexed: 05/24/2023] Open
Abstract
Protein phosphorylation is a key post-translational modification (PTM) and a central regulatory mechanism of many cellular signaling pathways. Several protein kinases and phosphatases precisely control this biochemical process. Defects in the functions of these proteins have been implicated in many diseases, including cancer. Mass spectrometry (MS)-based analysis of biological samples provides in-depth coverage of the phosphoproteome. The large amount of MS data available in public repositories has brought big data to the field of phosphoproteomics. To address the challenges associated with handling large data and to increase confidence in phosphorylation site prediction, the development of computational algorithms and machine learning-based approaches has gained momentum in recent years. Together, the emergence of experimental methods with high resolution and sensitivity and of data mining algorithms has provided robust analytical platforms for quantitative proteomics. In this review, we compile a comprehensive collection of bioinformatic resources used for the prediction of phosphorylation sites, and their potential therapeutic applications in the context of cancer.
Affiliation(s)
- Neha Varshney
  - Division of Biological Sciences, Department of Cellular and Molecular Medicine, University of California, San Diego, CA 93093, USA
  - Ludwig Institute for Cancer Research, La Jolla, CA 92093, USA
- Abhinava K Mishra
  - Molecular, Cellular and Developmental Biology Department, University of California, Santa Barbara, CA 93106, USA
2
A Transfer-Learning-Based Deep Convolutional Neural Network for Predicting Leukemia-Related Phosphorylation Sites from Protein Primary Sequences. Int J Mol Sci 2022; 23:ijms23031741. [PMID: 35163663 PMCID: PMC8915183 DOI: 10.3390/ijms23031741] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/14/2022] [Revised: 01/27/2022] [Accepted: 01/29/2022] [Indexed: 12/27/2022] Open
Abstract
As one of the most important post-translational modifications (PTMs), phosphorylation refers to the binding of a phosphate group to amino acid residues such as Ser (S), Thr (T) and Tyr (Y), resulting in diverse functions at the molecular level. Abnormal phosphorylation has been shown to be closely related to human diseases. To our knowledge, no research has been reported on the prediction of specific disease-associated phosphorylation sites, which is of great significance for a comprehensive understanding of disease mechanisms. In this work, focusing on three types of leukemia, we aim to develop reliable leukemia-related phosphorylation site prediction models by combining a deep convolutional neural network (CNN) with transfer learning. A CNN can automatically discover complex representations of phosphorylation patterns from raw sequences, and hence provides a powerful tool for improving leukemia-related phosphorylation site prediction. With the largest dataset, that of myelogenous leukemia, the optimal models for S/T/Y phosphorylation sites give AUC values of 0.8784, 0.8328 and 0.7716, respectively. When transferred to the small datasets, the models for T-cell and lymphoid leukemia also give promising performance by sharing the optimal parameters. Compared with five other machine-learning methods, our CNN models show superior performance. Finally, the leukemia-related pathogenesis analysis and distribution analysis of phosphorylated proteins, along with K-means clustering analysis and position-specific conservation profiles of the phosphorylation sites, all indicate the strong practical feasibility of our easy-to-use CNN models.
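To make "raw sequences" as CNN input more concrete, the sketch below shows one common way of turning a protein sequence into fixed-length, one-hot encoded windows centred on each S/T/Y residue. It is an illustration under assumptions, not the authors' pipeline: the window half-width, the `-` padding symbol, and the function names `extract_windows` and `one_hot` are all hypothetical choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"  # assumed placeholder for positions beyond the sequence ends


def extract_windows(sequence, half_width=7, targets="STY"):
    """Return (1-based position, residue, window) for every candidate site.

    half_width is an illustrative default, not a value taken from the paper.
    """
    padded = PAD * half_width + sequence + PAD * half_width
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # Residue i of the sequence sits at index i + half_width in `padded`,
            # so this slice is centred on the candidate site.
            window = padded[i : i + 2 * half_width + 1]
            windows.append((i + 1, residue, window))
    return windows


def one_hot(window):
    """Encode a window as a list of 20-dimensional 0/1 rows (padding -> all zeros)."""
    index = {aa: j for j, aa in enumerate(AMINO_ACIDS)}
    matrix = []
    for ch in window:
        row = [0] * len(AMINO_ACIDS)
        if ch in index:
            row[index[ch]] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for position, residue, window in extract_windows("MKSPT", half_width=2):
        print(position, residue, window, len(one_hot(window)))
```

Each encoded window has shape (2 × half_width + 1) × 20 and could be fed to a one-dimensional convolutional layer; labels would come from a curated set of known sites, which is outside the scope of this sketch.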
3
Guo X, He H, Yu J, Shi S. PKSPS: a novel method for predicting kinase of specific phosphorylation sites based on maximum weighted bipartite matching algorithm and phosphorylation sequence enrichment analysis. Brief Bioinform 2021; 23:6398688. [PMID: 34661630 DOI: 10.1093/bib/bbab436] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/29/2021] [Revised: 09/10/2021] [Accepted: 09/21/2021] [Indexed: 11/14/2022] Open
Abstract
With the development of biotechnology, a large number of phosphorylation sites have been experimentally confirmed and collected, but only a few of them have kinase annotations. Since experimental methods to detect kinases at specific phosphorylation sites are expensive and accidental, some computational methods have been proposed to predict the kinase of these sites, but most methods only consider single sequence information or single functional network information. In this study, a new method Predicting Kinase of Specific Phosphorylation Sites (PKSPS) is developed to predict kinases of specific phosphorylation sites in human proteins by combining PKSPS-Net with PKSPS-Seq, which considers protein-protein interaction (PPI) network information and sequence information. For PKSPS-Net, kinase-kinase and substrate-substrate similarity are quantified based on the topological similarity of proteins in the PPI network, and maximum weighted bipartite matching algorithm is proposed to predict kinase-substrate relationship. In PKSPS-Seq, phosphorylation sequence enrichment analysis is used to analyze the similarity of local sequences around phosphorylation sites and predict the kinase of specific phosphorylation sites (KSP). PKSPS has been proved to be more effective than the PKSPS-Net or PKSPS-Seq on different sets of kinases. Further comparison results show that the PKSPS method performs better than existing methods. Finally, the case study demonstrates the effectiveness of the PKSPS in predicting kinases of specific phosphorylation sites. The open source code and data of the PKSPS can be obtained from https://github.com/guoxinyunncu/PKSPS.
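The abstract does not spell out how the maximum weighted bipartite matching is computed, so the following is only a generic illustration of the idea: given a score for every (substrate, kinase) pair, pick a one-to-one assignment with the largest total score. The brute-force search, the function name `max_weight_matching`, and the integer scores are all assumptions made for a small, checkable example; a real implementation would use a polynomial-time method such as the Hungarian algorithm.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Exhaustively find the best one-to-one assignment of rows to columns.

    weights[i][j] is a hypothetical similarity score for assigning
    substrate i to kinase j. Requires len(weights) <= len(weights[0]).
    Runtime grows factorially, so this is only suitable for tiny examples.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    best_score, best_assignment = float("-inf"), None
    # Each permutation picks a distinct column for every row.
    for cols in permutations(range(n_cols), n_rows):
        score = sum(weights[r][c] for r, c in enumerate(cols))
        if score > best_score:
            best_score, best_assignment = score, cols
    return best_score, list(best_assignment)


if __name__ == "__main__":
    # Illustrative scores (scaled to integers); not data from the paper.
    scores = [
        [9, 8],  # substrate 0 vs kinase 0, kinase 1
        [8, 1],  # substrate 1 vs kinase 0, kinase 1
    ]
    print(max_weight_matching(scores))
```

In this toy matrix a greedy choice (substrate 0 takes kinase 0 because 9 is the largest score) yields a total of 10, whereas the optimal matching assigns substrate 0 to kinase 1 and substrate 1 to kinase 0 for a total of 16, which is why a global matching is preferred over per-site maxima.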
Affiliation(s)
- Xinyun Guo
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Huan He
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Jialin Yu
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Shaoping Shi
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
4
Yang H, Wang M, Liu X, Zhao XM, Li A. PhosIDN: an integrated deep neural network for improving protein phosphorylation site prediction by combining sequence and protein-protein interaction information. Bioinformatics 2021; 37:4668-4676. [PMID: 34320631 PMCID: PMC8665744 DOI: 10.1093/bioinformatics/btab551] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 11/23/2020] [Revised: 06/22/2021] [Accepted: 07/27/2021] [Indexed: 11/29/2022] Open
Abstract
Motivation: Phosphorylation is one of the most studied post-translational modifications, which plays a pivotal role in various cellular processes. Recently, deep learning methods have achieved great success in prediction of phosphorylation sites, but most of them are based on convolutional neural networks that may not capture enough information about long-range dependencies between residues in a protein sequence. In addition, existing deep learning methods only make use of sequence information for predicting phosphorylation sites, and it is highly desirable to develop a deep learning architecture that can combine heterogeneous sequence and protein–protein interaction (PPI) information for more accurate phosphorylation site prediction. Results: We present a novel integrated deep neural network named PhosIDN, for phosphorylation site prediction by extracting and combining sequence and PPI information. In PhosIDN, a sequence feature encoding sub-network is proposed to capture not only local patterns but also long-range dependencies from protein sequences. Meanwhile, useful PPI features are also extracted in PhosIDN by a PPI feature encoding sub-network adopting a multi-layer deep neural network. Moreover, to effectively combine sequence and PPI information, a heterogeneous feature combination sub-network is introduced to fully exploit the complex associations between sequence and PPI features, and their combined features are used for final prediction. Comprehensive experiment results demonstrate that the proposed PhosIDN significantly improves the prediction performance of phosphorylation sites and compares favorably with existing general and kinase-specific phosphorylation site prediction methods. Availability and implementation: PhosIDN is freely available at https://github.com/ustchangyuanyang/PhosIDN. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hangyuan Yang
  - School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
- Minghui Wang
  - School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
  - Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
- Xia Liu
  - School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
- Xing-Ming Zhao
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence and Frontiers Center for Brain Science, China
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Ao Li
  - School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
  - Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
5
Mahoney KE, Shabanowitz J, Hunt DF. MHC Phosphopeptides: Promising Targets for Immunotherapy of Cancer and Other Chronic Diseases. Mol Cell Proteomics 2021; 20:100112. [PMID: 34129940 PMCID: PMC8724925 DOI: 10.1016/j.mcpro.2021.100112] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/23/2021] [Revised: 05/11/2021] [Accepted: 06/02/2021] [Indexed: 12/27/2022] Open
Abstract
Major histocompatibility complex-associated peptides have been considered as potential immunotherapeutic targets for many years. MHC class I phosphopeptides result from dysregulated cell signaling pathways that are common across cancers and both viral and bacterial infections. These antigens are recognized by central memory T cells from healthy donors, indicating that they are considered antigenic by the immune system and that they are presented across different individuals and diseases. Based on these responses and the similar dysregulation, phosphorylated antigens are promising candidates for prevention or treatment of different cancers as well as a number of other chronic diseases.
Affiliation(s)
- Keira E Mahoney
  - Department of Chemistry, University of Virginia, Charlottesville, Virginia, USA
- Jeffrey Shabanowitz
  - Department of Chemistry, University of Virginia, Charlottesville, Virginia, USA
- Donald F Hunt
  - Department of Chemistry, University of Virginia, Charlottesville, Virginia, USA
  - Department of Pathology, University of Virginia, Charlottesville, Virginia, USA
6
Jamal S, Ali W, Nagpal P, Grover A, Grover S. Predicting phosphorylation sites using machine learning by integrating the sequence, structure, and functional information of proteins. J Transl Med 2021; 19:218. [PMID: 34030700 PMCID: PMC8142496 DOI: 10.1186/s12967-021-02851-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 12/01/2020] [Accepted: 04/18/2021] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND: Post-translational modification (PTM) is a biological process that alters proteins and is therefore involved in the regulation of various cellular activities and pathogenesis. Protein phosphorylation is an essential process and one of the most-studied PTMs: it occurs when a phosphate group is added to a serine (Ser, S), threonine (Thr, T), or tyrosine (Tyr, Y) residue. Dysregulation of protein phosphorylation can lead to various diseases (most commonly neurological disorders, Alzheimer's disease, and Parkinson's disease), thus necessitating the prediction of S/T/Y residues that can be phosphorylated in an uncharacterized amino acid sequence. Despite a surplus of sequencing data, current experimental methods of PTM prediction are time-consuming, costly, and error-prone, so a number of computational methods have been proposed to replace them. However, phosphorylation prediction remains limited, owing to substrate specificity, performance, and the diversity of its features. METHODS: In the present study we propose machine-learning-based predictors that use the physicochemical, sequence, structural, and functional information of proteins to classify S/T/Y phosphorylation sites. Rigorous feature selection, the minimum redundancy/maximum relevance approach, and the symmetrical uncertainty method were employed to extract the most informative features to train the models. RESULTS: The RF and SVM models generated using diverse feature types in the present study were highly accurate, as is evident from good values for different statistical measures. Moreover, independent test sets and benchmark validations indicated that the proposed method clearly outperformed the existing methods, demonstrating its ability to accurately predict protein phosphorylation. CONCLUSIONS: The results obtained in the present work indicate that the proposed computational methodology can be effectively used for predicting putative phosphorylation sites, further facilitating the discovery of the mechanisms of various biological processes.
Affiliation(s)
- Salma Jamal
  - JH-Institute of Molecular Medicine, Jamia Hamdard, New Delhi, India
- Waseem Ali
  - JH-Institute of Molecular Medicine, Jamia Hamdard, New Delhi, India
- Priya Nagpal
  - School of Biotechnology, Jawaharlal Nehru University, New Delhi, India
- Abhinav Grover
  - School of Biotechnology, Jawaharlal Nehru University, New Delhi, India
- Sonam Grover
  - JH-Institute of Molecular Medicine, Jamia Hamdard, New Delhi, India
7
Yang Y, Wang H, Li W, Wang X, Wei S, Liu Y, Xu Y. Prediction and analysis of multiple protein lysine modified sites based on conditional wasserstein generative adversarial networks. BMC Bioinformatics 2021; 22:171. [PMID: 33789579 PMCID: PMC8010967 DOI: 10.1186/s12859-021-04101-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/28/2020] [Accepted: 03/23/2021] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND: Protein post-translational modification (PTM) is a key issue in investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of the in-depth study and analysis of PTMs in proteins. METHOD: We proposed a new multi-classification machine learning pipeline, MultiLyGAN, to identify seven types of lysine modified sites. Using eight different sequential and five structural construction methods, 1497 valid features remained after filtering by Pearson correlation coefficient. To solve the data imbalance problem, Conditional Generative Adversarial Network (CGAN) and Conditional Wasserstein Generative Adversarial Network (CWGAN), two influential deep generative methods, were leveraged and compared to generate new samples for the types with fewer samples. Finally, the random forest algorithm was utilized to predict seven categories. RESULTS: In the tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better solved the existing data imbalance and stabilized the training error. In addition, an accumulated feature importance analysis reported that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN. CONCLUSIONS: The CWGAN greatly improved the predictive performance in all experiments. Features derived from the CKSAAP, PWM and structure schemes are the most informative and made the greatest contribution to the prediction of PTM.
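CKSAAP (composition of k-spaced amino acid pairs), named above as one of the most informative encodings, is commonly described as the frequency of each of the 400 residue pairs separated by exactly k intervening positions. The sketch below follows that common description; it is not taken from the MultiLyGAN code, and the function name `cksaap`, the normalisation by the number of pairs in the fragment, and the choice to skip non-standard symbols are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(fragment, k):
    """Return a dict mapping each of the 400 residue pairs to its frequency.

    A pair (a, b) is counted when residue a at position i is followed by
    residue b at position i + k + 1, i.e. with k residues in between.
    Frequencies are normalised by the number of such position pairs.
    """
    total = len(fragment) - k - 1
    if total <= 0:
        raise ValueError("fragment is too short for this value of k")
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(total):
        pair = fragment[i] + fragment[i + k + 1]
        if pair in counts:  # pairs containing non-standard symbols are skipped
            counts[pair] += 1
    return {pair: counts[pair] / total for pair in pairs}


if __name__ == "__main__":
    features = cksaap("AKAKA", k=1)
    print(len(features), features["AA"], features["KK"])
```

In practice the vectors for several values of k (for example k = 0 to 4) are usually concatenated into one feature vector per fragment; which values the authors used is not stated in the abstract.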
Affiliation(s)
- Yingxi Yang
  - Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
- Hui Wang
  - Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080, China
- Wen Li
  - Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
- Xiaobo Wang
  - Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
- Shizhao Wei
  - No. 15 Research Institute, China Electronics Technology Group Corporation, Beijing, 100083, China
- Yulong Liu
  - No. 15 Research Institute, China Electronics Technology Group Corporation, Beijing, 100083, China
- Yan Xu
  - Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, 100083, China
8
Chen CW, Huang LY, Liao CF, Chang KP, Chu YW. GasPhos: Protein Phosphorylation Site Prediction Using a New Feature Selection Approach with a GA-Aided Ant Colony System. Int J Mol Sci 2020; 21:E7891. [PMID: 33114312 PMCID: PMC7660635 DOI: 10.3390/ijms21217891] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/10/2020] [Revised: 10/20/2020] [Accepted: 10/20/2020] [Indexed: 02/06/2023] Open
Abstract
Protein phosphorylation is one of the most important post-translational modifications, and many biological processes are related to phosphorylation, such as DNA repair, transcriptional regulation and signal transduction. Abnormal regulation of phosphorylation therefore usually causes diseases, and accurate prediction of human phosphorylation sites could help to address them. We developed a kinase-specific phosphorylation prediction system, GasPhos, and proposed a new feature selection approach, called Gas, based on the ant colony system and a genetic algorithm. We used performance evaluation strategies focused on different kinases to choose the best learning model. Gas uses the mean decrease Gini index (MDGI) as a heuristic value for path selection and adopts binary transformation strategies and new state transition rules. GasPhos can predict phosphorylation sites for six kinases and showed better performance than other phosphorylation prediction tools. The disease-related phosphorylated proteins that were predicted with GasPhos are also discussed. Finally, Gas can be applied to other issues that require feature selection, which could help to improve prediction performance. GasPhos is available at http://predictor.nchu.edu.tw/GasPhos.
Affiliation(s)
- Chi-Wei Chen
  - Department of Computer Science and Engineering, National Chung-Hsing University, Taichung City 402, Taiwan
  - Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung City 402, Taiwan
- Lan-Ying Huang
  - Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung City 402, Taiwan
- Chia-Feng Liao
  - Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung City 402, Taiwan
- Kai-Po Chang
  - Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung City 402, Taiwan
  - Department of Pathology, China Medical University Hospital, Taichung 404, Taiwan
- Yen-Wei Chu
  - Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung City 402, Taiwan
  - Institute of Molecular Biology, National Chung Hsing University, Taichung City 402, Taiwan
  - Agricultural Biotechnology Center, National Chung Hsing University, Taichung City 402, Taiwan
  - Biotechnology Center, National Chung Hsing University, Taichung City 402, Taiwan
  - Program in Translational Medicine, National Chung Hsing University, Taichung City 402, Taiwan
  - Rong Hsing Research Center for Translational Medicine, National Chung Hsing University, Taichung City 402, Taiwan
9
Ahmed S, Kabir M, Arif M, Khan ZU, Yu DJ. DeepPPSite: A deep learning-based model for analysis and prediction of phosphorylation sites using efficient sequence information. Anal Biochem 2020; 612:113955. [PMID: 32949607 DOI: 10.1016/j.ab.2020.113955] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 02/20/2020] [Revised: 08/30/2020] [Accepted: 09/11/2020] [Indexed: 12/29/2022]
Abstract
Phosphorylation is a ubiquitous type of post-translational modification (PTM) that occurs in both eukaryotic and prokaryotic cells, wherein a phosphate group binds to amino acid residues. These specific residues, i.e., serine (S), threonine (T), and tyrosine (Y), exhibit diverse functions at the molecular level. Recent studies have determined that some diseases such as cancer, diabetes, and neurodegenerative diseases are caused by abnormal phosphorylation. Based on its potential applications in biological research and drug development, the large-scale identification of phosphorylation sites has attracted interest. Existing wet-lab technologies for targeting phosphorylation sites are expensive and time-consuming. Thus, computational algorithms that can efficiently accelerate the annotation of phosphorylation sites from massive protein sequences are needed. Numerous machine learning-based methods have been implemented for phosphorylation site prediction. However, despite extensive efforts, existing computational approaches continue to have inadequate performance, particularly in terms of overall accuracy (ACC), Matthews correlation coefficient (MCC), and area under the curve (AUC). In this paper, we report a novel deep learning-based predictor to overcome these performance hurdles, DeepPPSite, which was constructed using a stacked long short-term memory recurrent network for predicting phosphorylation sites. The proposed technique learns the protein representations from conjoint protein descriptors. The experimental results indicated that our model achieved superior performance on the training dataset for S, T and Y, with MCC values of 0.608, 0.602, and 0.558, respectively, using a 10-fold cross-validation test. We further determined the generalization efficacy of the proposed predictor DeepPPSite by conducting a rigorous independent test. The predictive MCC values were 0.358, 0.356, and 0.350 for the S, T, and Y phosphorylation sites, respectively. Rigorous cross-validation and independent validation tests for the three types of phosphorylation sites demonstrated that the designed DeepPPSite tool significantly outperforms state-of-the-art methods.
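For readers unfamiliar with the metrics quoted above, the binary Matthews correlation coefficient is defined from the confusion-matrix counts as

MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),

and accuracy as (TP+TN) / (TP+TN+FP+FN). The sketch below computes both; the function names and the example counts are hypothetical and are not results from the paper. Returning 0.0 when the denominator is zero is a common convention, adopted here as an assumption.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient, in the range [-1, 1]."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Undefined when any marginal total is zero; 0.0 is a common convention.
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts for one residue type, for illustration only.
    tp, tn, fp, fn = 40, 45, 5, 10
    print(round(accuracy(tp, tn, fp, fn), 4), round(mcc(tp, tn, fp, fn), 4))
```

Unlike accuracy, MCC stays informative when positive and negative sites are heavily imbalanced, which is one reason it is reported alongside ACC and AUC for phosphorylation site predictors.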
Affiliation(s)
- Saeed Ahmed
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Muhammad Kabir
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Muhammad Arif
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Zaheer Ullah Khan
  - School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
10
Deznabi I, Arabaci B, Koyutürk M, Tastan O. DeepKinZero: zero-shot learning for predicting kinase-phosphosite associations involving understudied kinases. Bioinformatics 2020; 36:3652-3661. [PMID: 32044914 PMCID: PMC7320620 DOI: 10.1093/bioinformatics/btaa013] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/05/2019] [Revised: 12/17/2019] [Accepted: 01/06/2020] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: Protein phosphorylation is a key regulator of protein function in signal transduction pathways. Kinases are the enzymes that catalyze the phosphorylation of other proteins in a target-specific manner. The dysregulation of phosphorylation is associated with many diseases including cancer. Although the advances in phosphoproteomics enable the identification of phosphosites at the proteome level, most of the phosphoproteome is still in the dark: more than 95% of the reported human phosphosites have no known kinases. Determining which kinase is responsible for phosphorylating a site remains an experimental challenge. Existing computational methods require several examples of known targets of a kinase to make accurate kinase-specific predictions, yet for a large body of kinases, only a few or no target sites are reported. RESULTS: We present DeepKinZero, the first zero-shot learning approach to predict the kinase acting on a phosphosite for kinases with no known phosphosite information. DeepKinZero transfers knowledge from kinases with many known target phosphosites to those kinases with no known sites through a zero-shot learning model. The kinase-specific positional amino acid preferences are learned using a bidirectional recurrent neural network. We show that DeepKinZero achieves significant improvement in accuracy for kinases with no known phosphosites in comparison to the baseline model and other methods available. By expanding our knowledge on understudied kinases, DeepKinZero can help to chart the phosphoproteome atlas. AVAILABILITY AND IMPLEMENTATION: The source codes are available at https://github.com/Tastanlab/DeepKinZero. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Iman Deznabi
  - Computer Engineering Department, Bilkent University, Ankara 06800, Turkey
  - College of Information and Computer Sciences, University of Massachusetts, Amherst, MA 01003, USA
- Busra Arabaci
  - Computer Engineering Department, Bilkent University, Ankara 06800, Turkey
- Mehmet Koyutürk
  - Department of Computer and Data Sciences
  - Center for Proteomics & Bioinformatics, Case Western Reserve University, Cleveland, OH 44106, USA
- Oznur Tastan
  - Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul 34956, Turkey
11
Duan Z, Dou S, Liu Z, Li B, Yi B, Shen J, Tu J, Fu T, Dai C, Ma C. Comparative phosphoproteomic analysis of compatible and incompatible pollination in Brassica napus L. Acta Biochim Biophys Sin (Shanghai) 2020; 52:446-456. [PMID: 32268372 DOI: 10.1093/abbs/gmaa011] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/06/2019] [Revised: 11/27/2019] [Accepted: 02/14/2020] [Indexed: 12/31/2022] Open
Abstract
Self-incompatibility (SI) promotes outbreeding and prevents self-fertilization, thereby maintaining genetic diversity in angiosperms. Several studies have been carried out to investigate SI signaling in plants; however, the roles of protein phosphorylation and dephosphorylation in the fine-tuning of the SI response remain insufficiently understood. Here, we performed a phosphoproteomic analysis to identify the phosphoproteins in the stigma of self-compatible 'Westar' and self-incompatible 'W-3' Brassica napus lines. A total of 4109 phosphopeptides representing 1978 unique protein groups were identified. Moreover, 405 and 248 phosphoproteins were significantly changed in response to SI and self-compatibility, respectively. Casein kinase II (CK II) phosphorylation motifs were enriched in the self-incompatible response and were identified 127 times in 827 dominant SI phosphorylation residues. Functional annotation of the identified phosphoproteins revealed their major roles in plant-pathogen interactions, cell wall modification, mRNA surveillance, RNA degradation, and plant hormone signal transduction. In particular, levels of the homologous proteins ABF3, BKI1, BZR2/BSE1, and EIN2 were significantly increased in pistils pollinated with incompatible pollen. Abscisic acid and ethephon treatment partially inhibited seed set, while brassinolide promoted pollen germination and tube growth in the SI response. Collectively, our results provide an overview of protein phosphorylation during compatible/incompatible pollination, which may be a potential component of B. napus SI responses.
Affiliation(s)
- Zhiqiang Duan
- National Key Laboratory of Crop Genetic Improvement, National Center of Rapeseed Improvement in Wuhan, Huazhong Agricultural University, Wuhan 430070, China
| | - Shengwei Dou
- National Key Laboratory of Crop Genetic Improvement, National Center of Rapeseed Improvement in Wuhan, Huazhong Agricultural University, Wuhan 430070, China
| | - Zhiquan Liu
- National Key Laboratory of Crop Genetic Improvement, National Center of Rapeseed Improvement in Wuhan, Huazhong Agricultural University, Wuhan 430070, China
| | - Bing Li
- National Key Laboratory of Crop Genetic Improvement, National Center of Rapeseed Improvement in Wuhan, Huazhong Agricultural University, Wuhan 430070, China
| | - Bin Yi
- National Key Laboratory of Crop Genetic Improvement, National Center of Rapeseed Improvement in Wuhan, Huazhong Agricultural University, Wuhan 430070, China
| | - Jinxiong Shen
- National Key Laboratory of Crop Genetic Improvement, National Center of Rapeseed Improvement in Wuhan, Huazhong Agricultural University, Wuhan 430070, China
| | - Jinxing Tu
- National Key Laboratory of Crop Genetic Improvement, National Center of Rapeseed Improvement in Wuhan, Huazhong Agricultural University, Wuhan 430070, China
| | - Tingdong Fu
- National Key Laboratory of Crop Genetic Improvement, National Center of Rapeseed Improvement in Wuhan, Huazhong Agricultural University, Wuhan 430070, China
| | - Cheng Dai
- National Key Laboratory of Crop Genetic Improvement, National Center of Rapeseed Improvement in Wuhan, Huazhong Agricultural University, Wuhan 430070, China
| | - Chaozhi Ma
- National Key Laboratory of Crop Genetic Improvement, National Center of Rapeseed Improvement in Wuhan, Huazhong Agricultural University, Wuhan 430070, China
|
12
|
Common Functions of Disordered Proteins across Evolutionary Distant Organisms. Int J Mol Sci 2020; 21:ijms21062105. [PMID: 32204351 PMCID: PMC7139818 DOI: 10.3390/ijms21062105] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 02/11/2020] [Revised: 03/16/2020] [Accepted: 03/17/2020] [Indexed: 12/14/2022] Open
Abstract
Intrinsically disordered proteins and regions typically lack a well-defined structure and thus fall outside the scope of the classic sequence–structure–function relationship. Hence, classic sequence- or structure-based bioinformatic approaches are often poorly suited to identifying homology or predicting the function of unknown intrinsically disordered proteins. Here, we give selected examples of intrinsic disorder in plant proteins and show how protein function is shared, altered or distinct in evolutionarily distant organisms. Furthermore, we explore how examining the specific role of disorder across different phyla can provide a better understanding of the common features that protein disorder contributes to the respective biological mechanism.
|
13
|
Wu M, Lu P, Yang Y, Liu L, Wang H, Xu Y, Chu J. LipoSVM: Prediction of Lysine Lipoylation in Proteins based on the Support Vector Machine. Curr Genomics 2019; 20:362-370. [PMID: 32476993 PMCID: PMC7235397 DOI: 10.2174/1389202919666191014092843] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/25/2019] [Revised: 08/09/2019] [Accepted: 09/05/2019] [Indexed: 12/21/2022] Open
Abstract
Background: Lysine lipoylation, a rare and highly conserved post-translational modification of proteins, is considered one of the most important processes in biology. Identifying lysine lipoylation sites is key to a comprehensive understanding of its regulatory mechanism. Because experimental methods are expensive and laborious, computational methods for predicting lipoylation sites are urgently needed. Methodology: In this work, a predictor named LipoSVM is developed to predict lipoylation sites accurately. To overcome the class imbalance, the synthetic minority over-sampling technique (SMOTE) is used to balance negative and positive samples. Furthermore, different ratios of positive to negative samples are evaluated as training sets. Results: After comparing five encoding schemes and five classification algorithms, LipoSVM is constructed from a training set with a 1:1 ratio of positive to negative samples, combining a position-specific scoring matrix with a support vector machine. The best model achieves an accuracy of 99.98% and an AUC of 0.9996 in 10-fold cross-validation. The AUC on the independent test set reaches 0.9997, which demonstrates the robustness of LipoSVM. Comparison of lysine lipoylation and non-lipoylation fragments shows statistically significant differences. Conclusion: A good predictor for lysine lipoylation is built on the position-specific scoring matrix and support vector machine. The LipoSVM web server can be freely downloaded from https://github.com/stars20180811/LipoSVM.
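To illustrate the oversampling step mentioned in the abstract above, the following is a minimal pure-Python sketch of the SMOTE idea: each synthetic positive sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. It is not the LipoSVM code; the function name `smote`, the parameters `k` and `seed`, and the toy two-dimensional data are assumptions made only for this example.

```python
import random
from math import dist


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration; `minority` is a list of
    equal-length numeric tuples with at least two entries.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (excluding base itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data (assumed): 4 positive and 10 negative samples.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
negatives = [(5.0 + i, 5.0) for i in range(10)]

# Oversample the positives until the classes reach a 1:1 ratio.
balanced_positives = positives + smote(
    positives, len(negatives) - len(positives), k=2
)
print(len(balanced_positives), len(negatives))
```

Because the synthetic points are interpolated rather than duplicated, they stay inside the region spanned by the original minority samples, which is the property that distinguishes SMOTE from simple resampling.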
Affiliation(s)
- Meiqi Wu
- Department of Applied Mathematics, University of Science and Technology Beijing, Beijing 100083, China
| | - Pengchao Lu
- Equipment Leasing Company of China Petroleum Pipeline Engineering Co., Ltd. 065000 Langfang City, Hebei Province, China
| | - Yingxi Yang
- Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology, Hong Kong, China
| | - Liwen Liu
- Department of Applied Mathematics, University of Science and Technology Beijing, Beijing 100083, China
| | - Hui Wang
- Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
| | - Yan Xu
- Department of Applied Mathematics, University of Science and Technology Beijing, Beijing 100083, China
| | - Jixun Chu
- Department of Applied Mathematics, University of Science and Technology Beijing, Beijing 100083, China
|
14
|
Wang J, Yang B, An Y, Marquez-Lago T, Leier A, Wilksch J, Hong Q, Zhang Y, Hayashida M, Akutsu T, Webb GI, Strugnell RA, Song J, Lithgow T. Systematic analysis and prediction of type IV secreted effector proteins by machine learning approaches. Brief Bioinform 2019; 20:931-951. [PMID: 29186295 PMCID: PMC6585386 DOI: 10.1093/bib/bbx164] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 08/20/2017] [Revised: 11/08/2017] [Indexed: 12/13/2022] Open
Abstract
In the course of infecting their hosts, pathogenic bacteria secrete numerous effectors, namely, bacterial proteins that pervert host cell biology. Many Gram-negative bacteria, including context-dependent human pathogens, use a type IV secretion system (T4SS) to translocate effectors directly into the cytosol of host cells. Various type IV secreted effectors (T4SEs) have been experimentally validated to play crucial roles in virulence by manipulating host cell gene expression and other processes. Consequently, the identification of novel effector proteins is an important step in increasing our understanding of host-pathogen interactions and bacterial pathogenesis. Here, we train and compare six machine learning models, namely, Naïve Bayes (NB), K-nearest neighbor (KNN), logistic regression (LR), random forest (RF), support vector machines (SVMs) and multilayer perceptron (MLP), for the identification of T4SEs using 10 types of selected features and 5-fold cross-validation. Our study shows that: (1) including different but complementary features generally enhances the predictive performance for T4SEs; (2) ensemble models, obtained by integrating individual single-feature models, exhibit significantly improved predictive performance; and (3) the 'majority voting strategy' led to more stable and accurate classification performance when applied to an ensemble learning model with distinct single features. We further developed a new method to effectively predict T4SEs, Bastion4 (Bacterial secretion effector predictor for T4SS), and we show that our ensemble classifier clearly outperforms two recent prediction tools. In summary, we developed a state-of-the-art T4SE predictor by conducting a comprehensive performance evaluation of different machine learning algorithms along with a detailed analysis of single- and multi-feature selections.
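The 'majority voting strategy' named in the abstract above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the Bastion4 implementation; the function name `majority_vote` and the three toy prediction lists are assumptions. Each inner list stands for the binary calls (1 = effector, 0 = non-effector) of one single-feature model on the same four proteins.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample.

    `predictions` holds one list of labels per model, all of equal length.
    With an odd number of binary classifiers there are no ties.
    """
    voted = []
    for labels in zip(*predictions):  # labels for one sample across models
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Toy outputs (assumed) of three single-feature models on four proteins.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Each model disagrees with the other two on at most one protein here, and the vote overrules that single dissenting call, which is the intuition behind the more stable performance reported for the ensemble.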
Affiliation(s)
- Jiawei Wang
- Biomedicine Discovery Institute and the Department of Microbiology at Monash University, Australia
| | - Bingjiao Yang
- National Engineering Research Center for Equipment and Technology of Cold Strip Rolling, College of Mechanical Engineering from Yanshan University, China
| | - Yi An
- College of Information Engineering, Northwest A&F University, China
| | - Tatiana Marquez-Lago
- Department of Genetics, University of Alabama at Birmingham (UAB) School of Medicine, USA
| | - André Leier
- Department of Genetics and the Informatics Institute, University of Alabama at Birmingham (UAB) School of Medicine, USA
| | - Jonathan Wilksch
- Department of Microbiology and Immunology at the University of Melbourne, Australia
| | | | - Yang Zhang
- Computer Science and Engineering in 2015 from Northwestern Polytechnical University, China
| | | | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan
| | - Geoffrey I Webb
- Faculty of Information Technology, Monash Centre for Data Science, Monash University
| | - Richard A Strugnell
- Department of Microbiology and Immunology, Faculty of Medicine Dentistry and Health Sciences, University of Melbourne
| | - Jiangning Song
- Department of Biochemistry and Molecular Biology, Monash University, Australia
| | - Trevor Lithgow
- Department of Microbiology at Monash University, Australia
|
15
|
Zhang S, Li X, Fan C, Wu Z, Liu Q. Application of Machine Learning Techniques to Predict Protein Phosphorylation Sites. LETT ORG CHEM 2019. [DOI: 10.2174/1570178615666180907150928] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Abstract
Protein phosphorylation is one of the most important post-translational modifications of proteins. Almost all processes that regulate the life activities of an organism, as well as almost all physiological and pathological processes, involve protein phosphorylation. In this paper, we summarize the implementation and application of methods used in protein phosphorylation site prediction, such as the support vector machine algorithm, random forest, Jensen-Shannon divergence combined with quadratic discriminant analysis, the AdaBoost algorithm, increment of diversity with quadratic discriminant analysis, the modified CKSAAP algorithm, a Bayes classifier combined with phosphorylation sequence enrichment analysis, the least absolute shrinkage and selection operator, stochastic search variable selection, partial least squares and deep learning. On this basis, we use the k-nearest neighbor algorithm with the BLOSUM80 matrix to predict phosphorylation sites. First, we construct the dataset and remove redundancy from the positive and negative samples, that is, we remove protein sequences with similarity of more than 30%. Next, the proposed method is evaluated with four metrics: sensitivity (Sn), specificity (Sp), accuracy (ACC) and the Matthews correlation coefficient (MCC). Finally, tenfold cross-validation is employed to evaluate this method. The tenfold cross-validation results show that the average values of Sn, Sp, ACC and MCC across the three types of amino acid (serine, threonine, and tyrosine) are 90.44%, 86.95%, 88.74% and 0.7742, respectively. A comparison with PhosphoSVM and Musite shows that the proposed method predicts better, and it has the advantages of simplicity, practicality and low time complexity in classification.
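The four metrics named in the abstract above follow directly from the confusion-matrix counts. The Python sketch below uses the standard definitions; the function name `classification_metrics` and the toy counts are assumptions for illustration and are not taken from the paper.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Return Sn, Sp, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on positives
    sp = tn / (tn + fp)                    # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a margin is empty; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Toy counts (assumed): 100 true sites and 100 non-sites.
metrics = classification_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when positive and negative samples are unbalanced.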
Affiliation(s)
- Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
| | - Xian Li
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
| | - Chengcheng Fan
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
| | - Zhehui Wu
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
| | - Qian Liu
- Centre for Biostatistics, School of Health Sciences, The University of Manchester, Manchester, M13 9PL, United Kingdom
|
16
|
Liu Y, Wang M, Xi J, Luo F, Li A. PTM-ssMP: A Web Server for Predicting Different Types of Post-translational Modification Sites Using Novel Site-specific Modification Profile. Int J Biol Sci 2018; 14:946-956. [PMID: 29989096 PMCID: PMC6036757 DOI: 10.7150/ijbs.24121] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 12/01/2017] [Accepted: 01/24/2018] [Indexed: 12/26/2022] Open
Abstract
Protein post-translational modifications (PTMs) are chemical modifications of a protein after its translation. Because PTMs play an important role in understanding various biological processes and in developing effective drugs, PTM site prediction has become a hot topic in bioinformatics. Recently, many online tools have been developed to predict various types of PTM sites, most of which are based on local sequence and some biological information. However, few existing tools consider the relations between different PTMs in their prediction task. Here, we develop a web server called PTM-ssMP to predict PTM sites, which adopts a site-specific modification profile (ssMP) to efficiently extract and encode the information of both proximal PTMs and local sequence simultaneously. PTM-ssMP provides efficient prediction of multiple types of PTM site, including phosphorylation, lysine acetylation, ubiquitination, sumoylation, methylation, O-GalNAc, O-GlcNAc, sulfation and proteolytic cleavage. To assess the performance of PTM-ssMP, a large number of experimentally verified PTM sites are collected from several sources and used to train and test the prediction models. Our results suggest that ssMP consistently contributes to a remarkable improvement in prediction performance. In addition, results of independent tests demonstrate that PTM-ssMP compares favorably with other existing tools for different PTM types. PTM-ssMP is implemented as an online web server with a user-friendly interface, which is freely available at http://bioinformatics.ustc.edu.cn/PTM-ssMP/index/.
Affiliation(s)
- Yu Liu
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
| | - Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China.,Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
| | - Jianing Xi
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
| | - Fenglin Luo
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
| | - Ao Li
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China.,Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
|
17
|
Zhang C, Zhai Z, Tang M, Cheng Z, Li T, Wang H, Zhu WG. Quantitative proteome-based systematic identification of SIRT7 substrates. Proteomics 2017; 17. [PMID: 28556401 DOI: 10.1002/pmic.201600395] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 10/06/2016] [Revised: 05/18/2017] [Accepted: 05/23/2017] [Indexed: 12/22/2022]
Abstract
SIRT7 is a class III histone deacetylase that is involved in numerous cellular processes. Only six substrates of SIRT7 have been reported thus far, so we aimed to systematically identify SIRT7 substrates using stable-isotope labeling with amino acids in cell culture (SILAC) coupled with quantitative mass spectrometry (MS). Using SIRT7+/+ and SIRT7-/- mouse embryonic fibroblasts as our model system, we identified and quantified 1493 acetylation sites in 789 proteins, of which 261 acetylation sites in 176 proteins showed ≥2-fold change in acetylation state between SIRT7-/- and SIRT7+/+ cells. These proteins were considered putative SIRT7 substrates and were carried forward for further analysis. We then validated the predictive efficiency of the SILAC-MS experiment by assessing substrate acetylation status in vitro in six predicted proteins. We also performed a bioinformatic analysis of the MS data, which indicated that many of the putative protein substrates were involved in metabolic processes. Finally, we expanded our list of candidate substrates by performing a bioinformatics-based prediction analysis of putative SIRT7 substrates, using our list of putative substrates as a positive training set, and again validated a subset of the proteins in vitro. In summary, we have generated a comprehensive list of SIRT7 candidate substrates.
Affiliation(s)
- Chaohua Zhang
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Beijing Key Laboratory of Protein Posttranslational Modifications and Cell Function, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Peking University Health Science Center, Beijing, P. R. China
| | - Zichao Zhai
- Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University Health Science Center, Beijing, P. R. China
| | - Ming Tang
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Beijing Key Laboratory of Protein Posttranslational Modifications and Cell Function, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Peking University Health Science Center, Beijing, P. R. China.,Department of Biochemistry and Molecular Biology, School of Medicine, Shenzhen University, Shenzhen, P. R. China
| | - Zhongyi Cheng
- Jingjie PTM Biolab (Hangzhou) Co. Ltd, Hangzhou, P. R. China
| | - Tingting Li
- Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University Health Science Center, Beijing, P. R. China
| | - Haiying Wang
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Beijing Key Laboratory of Protein Posttranslational Modifications and Cell Function, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Peking University Health Science Center, Beijing, P. R. China
| | - Wei-Guo Zhu
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Beijing Key Laboratory of Protein Posttranslational Modifications and Cell Function, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Peking University Health Science Center, Beijing, P. R. China.,Department of Biochemistry and Molecular Biology, School of Medicine, Shenzhen University, Shenzhen, P. R. China.,Peking University-Tsinghua University Center for Life Sciences, Beijing, P. R. China
|
18
|
Wang M, Wang T, Li A. ksrMKL: a novel method for identification of kinase-substrate relationships using multiple kernel learning. PeerJ 2017; 5:e4182. [PMID: 29340231 PMCID: PMC5741978 DOI: 10.7717/peerj.4182] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 07/13/2017] [Accepted: 12/01/2017] [Indexed: 01/24/2023] Open
Abstract
Phosphorylation, which is catalyzed by protein kinases, plays a crucial role in many cellular processes and is closely related to many diseases. Identification of kinase-substrate relationships is important for understanding phosphorylation and provides a fundamental basis for further disease-related research and drug design. In this study, we develop a novel computational method to identify kinase-substrate relationships based on multiple kernel learning. The comparative analysis is based on 10-fold cross-validation and a dataset collected from the Phospho.ELM database. The results show that ksrMKL improves greatly on the single-kernel support vector machine across various measures. Furthermore, with an independent test dataset extracted from the PhosphoSitePlus database, we compare ksrMKL with two existing kinase-substrate relationship prediction tools, namely iGPS and PKIS. The experimental results show that ksrMKL has better prediction performance than these existing tools.
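The 10-fold cross-validation protocol referred to in the abstract above can be sketched as follows. This is a generic illustration in plain Python rather than the ksrMKL evaluation code; the generator name `k_fold_indices`, the `seed` parameter, and the sample count of 25 are assumptions. The samples are shuffled once and dealt into ten folds, and each fold serves exactly once as the held-out test set.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy usage (assumed): 25 kinase-substrate pairs, 10 folds.
splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

A model would be trained on each `train` list and scored on the matching `test` list, with the ten scores averaged to give the cross-validated estimate.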
Affiliation(s)
- Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei, China.,Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, China
| | - Tao Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei, China
| | - Ao Li
- School of Information Science and Technology, University of Science and Technology of China, Hefei, China.,Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, China
|
19
|
Du PF, Zhao W, Miao YY, Wei LY, Wang L. UltraPse: A Universal and Extensible Software Platform for Representing Biological Sequences. Int J Mol Sci 2017; 18:ijms18112400. [PMID: 29135934 PMCID: PMC5713368 DOI: 10.3390/ijms18112400] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 10/10/2017] [Revised: 11/01/2017] [Accepted: 11/03/2017] [Indexed: 01/12/2023] Open
Abstract
With the avalanche of biological sequences in public databases, one of the most challenging problems in computational biology is to predict their biological functions and cellular attributes. Most existing prediction algorithms can only handle fixed-length numerical vectors. Therefore, it is important to be able to represent biological sequences of various lengths as fixed-length numerical vectors. Although several algorithms and software implementations have been developed to address this problem, existing programs provide only a fixed number of representation modes. Every time a new sequence representation mode is developed, a new program is needed. In this paper, we propose UltraPse as a universal software platform for this problem. UltraPse not only generates various existing sequence representation modes, but also simplifies future programming work in developing novel representation modes. The extensibility of UltraPse is particularly enhanced. It allows users to define their own representation modes, their own physicochemical properties, or even their own types of biological sequences. Moreover, UltraPse is also the fastest software of its kind. The source code package, as well as the executables for both Linux and Windows platforms, can be downloaded from the GitHub repository.
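The core problem described in the abstract above, mapping sequences of different lengths to vectors of one fixed length, can be illustrated with the simplest such representation, amino acid composition. The Python sketch below is not UltraPse and does not use its interface; the function name `amino_acid_composition` and the two example peptides are assumptions chosen for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies.

    Residues outside the 20 standard amino acids are ignored.
    """
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


# Two peptides of different length (assumed examples) map to vectors
# of the same length, so one classifier can consume both.
short_vec = amino_acid_composition("MKT")
long_vec = amino_acid_composition("MKTAYIAKQRQISFVKSHFSRQ")
print(len(short_vec), len(long_vec))
```

Plain composition discards residue order, which is why richer modes such as pseudo amino acid composition add sequence-order terms on top of these 20 values.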
Affiliation(s)
- Pu-Feng Du
- School of Computer Science and Technology, Tianjin University, Tianjin 300350, China.
| | - Wei Zhao
- School of Computer Science and Technology, Tianjin University, Tianjin 300350, China.
| | - Yang-Yang Miao
- School of Computer Science and Technology, Tianjin University, Tianjin 300350, China.
- School of Chemical Engineering, Tianjin University, Tianjin 300350, China.
| | - Le-Yi Wei
- School of Computer Science and Technology, Tianjin University, Tianjin 300350, China.
| | - Likun Wang
- Institute of Systems Biomedicine, Beijing Key Laboratory of Tumor Systems Biology, Department of Pathology, School of Basic Medical Sciences, Peking University Health Science Center, Beijing 100191, China.
|
20
|
Characterizing Gene and Protein Crosstalks in Subjects at Risk of Developing Alzheimer’s Disease: A New Computational Approach. Processes (Basel) 2017. [DOI: 10.3390/pr5030047] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/26/2022] Open
|
21
|
PhosphoPredict: A bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection. Sci Rep 2017; 7:6862. [PMID: 28761071 PMCID: PMC5537252 DOI: 10.1038/s41598-017-07199-4] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Received: 11/24/2016] [Accepted: 06/27/2017] [Indexed: 12/31/2022] Open
Abstract
Protein phosphorylation is a major form of post-translational modification (PTM) that regulates diverse cellular processes. In silico methods for phosphorylation site prediction can provide a useful and complementary strategy for complete phosphoproteome annotation. Here, we present a novel bioinformatics tool, PhosphoPredict, that combines protein sequence and functional features to predict kinase-specific substrates and their associated phosphorylation sites for 12 human kinases and kinase families, including ATM, CDKs, GSK-3, MAPKs, PKA, PKB, PKC, and SRC. To elucidate critical determinants, we identified feature subsets that were most informative and relevant for predicting substrate specificity for each individual kinase family. Extensive benchmarking experiments based on both five-fold cross-validation and independent tests indicated that the performance of PhosphoPredict is competitive with that of several other popular prediction tools, including KinasePhos, PPSP, GPS, and Musite. We found that combining protein functional and sequence features significantly improves phosphorylation site prediction performance across all kinases. Application of PhosphoPredict to the entire human proteome identified 150 to 800 potential phosphorylation substrates for each of the 12 kinases or kinase families. PhosphoPredict significantly extends the bioinformatics portfolio for kinase function analysis and will facilitate high-throughput identification of kinase-specific phosphorylation sites, thereby contributing to both basic and translational research programs.
|
22
|
Jiao Y, Du P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. QUANTITATIVE BIOLOGY 2016. [DOI: 10.1007/s40484-016-0081-2] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Indexed: 10/20/2022]
|
23
|
Du Y, Zhai Z, Li Y, Lu M, Cai T, Zhou B, Huang L, Wei T, Li T. Prediction of Protein Lysine Acylation by Integrating Primary Sequence Information with Multiple Functional Features. J Proteome Res 2016; 15:4234-4244. [PMID: 27774790 DOI: 10.1021/acs.jproteome.6b00240] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Indexed: 01/05/2023]
Abstract
Liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based proteomic methods have been widely used to identify lysine-acylated proteins. However, these experimental approaches often fail to detect proteins that are in low abundance or absent in specific biological samples. To circumvent these problems, we developed a computational method to predict lysine acylation, including acetylation, malonylation, succinylation, and glutarylation. The prediction algorithm integrated flanking primary sequence determinants and evolutionary conservation of acylated lysine as well as multiple protein functional annotation features including gene ontology, conserved domains, and protein-protein interactions. The inclusion of functional annotation features increases predictive power over simple sequence considerations for four of the acylation species evaluated. For example, the Matthews correlation coefficient (MCC) for the prediction of malonylation increased from 0.26 to 0.73. The performance of prediction was validated against an independent data set for malonylation. Likewise, when tested with independent data sets, the algorithm displayed improved sensitivity and specificity over existing methods. Experimental validation by Western blot experiments and LC-MS/MS detection further attested to the performance of prediction. We then applied our algorithm to the mouse proteome and reported the global-scale prediction of lysine acetylation, malonylation, succinylation, and glutarylation, which should serve as a valuable resource for future functional studies.
Affiliation(s)
| | | | | | | | | | - Bo Zhou
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Lei Huang
- College of Information Science and Engineering, Ocean University of China, Qingdao, China
|
24
|
Wang M, Jiang Y, Xu X. A novel method for predicting post-translational modifications on serine and threonine sites by using site-modification network profiles. MOLECULAR BIOSYSTEMS 2016; 11:3092-100. [PMID: 26344496 DOI: 10.1039/c5mb00384a] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 11/21/2022]
Abstract
Post-translational modifications (PTMs) regulate many aspects of biological behaviours including protein-protein interactions and cellular processes. Identification of PTM sites is helpful for understanding the PTM regulatory mechanisms. The PTMs on serine and threonine sites include phosphorylation, O-linked glycosylation and acetylation. Although a lot of computational approaches have been developed for PTM site prediction, currently most of them generate the predictive models by employing only local sequence information and few of them consider the relationship between different PTMs. In this paper, by adopting the site-modification network (SMNet) profiles that efficiently incorporate in situ PTM information, we develop a novel method to predict PTM sites on serine and threonine. PTM data are collected from various PTM databases and the SMNet is built to reflect the relationship between multiple PTMs, from which SMNet profiles are extracted to train predictive models based on SVM. Performance analysis of the SVM models shows that the SMNet profiles play an important role in accurately predicting PTM sites on serine and threonine. Furthermore, the proposed method is compared with existing PTM prediction approaches. The results from 10-fold cross-validation demonstrate that the proposed method with SMNet profiles performs remarkably better than existing methods, suggesting the power of SMNet profiles in identifying PTM sites.
Affiliation(s)
- Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, People's Republic of China
|
25
|
Karabulut NP, Frishman D. Sequence- and Structure-Based Analysis of Tissue-Specific Phosphorylation Sites. PLoS One 2016; 11:e0157896. [PMID: 27332813 PMCID: PMC4917084 DOI: 10.1371/journal.pone.0157896] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 03/20/2016] [Accepted: 06/07/2016] [Indexed: 01/22/2023] Open
Abstract
Phosphorylation is the most widespread and well studied reversible posttranslational modification. Discovering tissue-specific preferences of phosphorylation sites is important as phosphorylation plays a role in regulating almost every cellular activity and disease state. Here we present a comprehensive analysis of global and tissue-specific sequence and structure properties of phosphorylation sites utilizing recent proteomics data. We identified tissue-specific motifs in both sequence and spatial environments of phosphorylation sites. Target site preferences of kinases across tissues indicate that, while many kinases mediate phosphorylation in all tissues, there are also kinases that exhibit more tissue-specific preferences which, notably, are not caused by tissue-specific kinase expression. We also demonstrate that many metabolic pathways are differentially regulated by phosphorylation in different tissues.
Affiliation(s)
- Nermin Pinar Karabulut
- Department of Genome Oriented Bioinformatics, Technische Universität München, Freising, Germany
| | - Dmitrij Frishman
- Department of Genome Oriented Bioinformatics, Technische Universität München, Freising, Germany
- Helmholtz Zentrum Munich; German Research Center for Environmental Health (GmbH), Institute of Bioinformatics and Systems Biology, Neuherberg, Germany
- St Petersburg State Polytechnical University, St Petersburg, Russia
- * E-mail:
| |
Collapse
|
26
|
Prediction of aptamer-protein interacting pairs using an ensemble classifier in combination with various protein sequence attributes. BMC Bioinformatics 2016; 17:225. [PMID: 27245069 PMCID: PMC4888498 DOI: 10.1186/s12859-016-1087-5] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 01/07/2016] [Accepted: 05/17/2016] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND Aptamer-protein interacting pairs serve a variety of physiological functions and have therapeutic potential. Rapid and effective prediction of aptamer-protein interacting pairs is important for designing aptamers that bind to proteins of interest, which will give insight into the mechanisms of aptamer-protein interaction and support the development of aptamer-based therapies. RESULTS In this study, an ensemble method is presented to predict aptamer-protein interacting pairs with hybrid features. The features for aptamers are extracted from Pseudo K-tuple Nucleotide Composition (PseKNC), while the features for proteins incorporate Discrete Cosine Transformation (DCT), disorder information, and bi-gram Position Specific Scoring Matrix (PSSM). We investigate the predictive capabilities of various feature spaces. The proposed ensemble method obtains the best performance, with a Youden’s Index of 0.380, using the hybrid feature space of PseKNC, DCT, bi-gram PSSM, and disorder information under 10-fold cross-validation. The Relief-Incremental Feature Selection (IFS) method is adopted to obtain the optimal feature set. Based on the optimal feature set, the proposed method achieves a balanced performance with a sensitivity of 0.753 and a specificity of 0.725 on the training dataset, which indicates that this method can handle the imbalanced data problem effectively. To evaluate the prediction performance objectively, an independent testing dataset is used. Encouragingly, the proposed method performs better than the previous study, with a sensitivity of 0.738 and a Youden’s Index of 0.451. CONCLUSIONS These results suggest that the proposed method can be a potential candidate for aptamer-protein interacting pair prediction, which may contribute to finding novel aptamer-protein interacting pairs and understanding the relationship between aptamers and proteins.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1087-5) contains supplementary material, which is available to authorized users.
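For readers comparing the figures quoted in this abstract, Youden's Index is by definition sensitivity + specificity - 1. The short Python sketch below only applies that definition to the numbers given in the abstract; the function names are illustrative and do not come from the paper. Note that the 0.380 value was reported for the hybrid feature space under cross-validation, so it need not match the index implied by the training-set figures obtained after feature selection.

```python
def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def implied_specificity(sensitivity: float, j: float) -> float:
    """Rearranged definition: specificity = J - sensitivity + 1."""
    return j - sensitivity + 1.0


# Training-set figures quoted in the abstract (optimal feature set).
j_train = youden_index(0.753, 0.725)

# Independent-test figures quoted in the abstract: sensitivity and J are given,
# so the specificity below is only what the definition implies.
spec_test = implied_specificity(0.738, 0.451)

print(round(j_train, 3))    # 0.478
print(round(spec_test, 3))  # 0.713
```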
27
|
Wang B, Wang M, Jiang Y, Sun D, Xu X. A novel network-based computational method to predict protein phosphorylation on tyrosine sites. J Bioinform Comput Biol 2016; 13:1542005. [DOI: 10.1142/s0219720015420056] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
Phosphorylation plays a great role in regulating a variety of cellular processes and the identification of tyrosine phosphorylation sites is fundamental for understanding the post-translational modification (PTM) regulation processes. Although a lot of computational methods have been developed, most of them only concern local sequence information and few studies focus on the tyrosine sites with in situ PTM information, which refers to different types of PTM occurring on the same modification site. In this study, by constructing the site-modification network that efficiently incorporates in situ PTM information, we introduce a novel network-based computational method, site-modification network-based inference (SMNBI) to predict tyrosine phosphorylation. In order to verify the effectiveness of the proposed method, we compare it with other network-based computational methods. The results clearly show the superior performance of SMNBI. Besides, we extensively compare SMNBI with other sequence-based methods including SVM and Bayesian decision theory. The evaluation demonstrates the power of site-modification network in predicting tyrosine phosphorylation. The proposed method is freely available at http://bioinformatics.ustc.edu.cn/smnbi/ .
Affiliation(s)
- Binghua Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
- Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
- Centers for Biomedical Engineering, University of Science and Technology of China, Hefei 230027, China
- Yujie Jiang
- School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
- Dongdong Sun
- School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
- Xiaoyi Xu
- School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
28
|
Shi SP, Xu HD, Wen PP, Qiu JD. Progress and challenges in predicting protein methylation sites. Molecular BioSystems 2015; 11:2610-2619. [PMID: 26080040 DOI: 10.1039/c5mb00259a] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 01/03/2025]
Abstract
Protein methylation catalyzed by methyltransferases carries many important biological functions. Methylation and their regulatory enzymes are involved in a variety of human disease states, raising the possibility that abnormally methylated proteins can be disease markers and methyltransferases are potential therapeutic targets. Identification of methylation sites is a prerequisite for decoding methylation regulatory networks in living cells and understanding their physiological roles that have been implicated in the pathological processes. Due to various limitations of experimental methods, in silico approaches for identifying novel methylation sites have become increasingly popular. In this review, we summarize the progress in the prediction of protein methylation sites from the dataset, feature representation, prediction algorithm and online resources in the past ten years. We also discuss the challenges that are faced while developing novel predictors in the future. The development and application of methylation site prediction is a promising field of systematic biology, provided that protein methyltransferases, species and functional information will be taken into account.
Affiliation(s)
- Shao-Ping Shi
- Department of Chemistry, Nanchang University, Nanchang, 330031, China.
29
|
Tyrosine Kinase Ligand-Receptor Pair Prediction by Using Support Vector Machine. Adv Bioinformatics 2015; 2015:528097. [PMID: 26347773 PMCID: PMC4548105 DOI: 10.1155/2015/528097] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/09/2015] [Revised: 07/14/2015] [Accepted: 07/15/2015] [Indexed: 01/22/2023] Open
Abstract
Receptor tyrosine kinases are essential proteins involved in cellular differentiation and proliferation in vivo and are heavily involved in allergic diseases, diabetes, and onset/proliferation of cancerous cells. Identifying the interacting partner of this protein, a growth factor ligand, will provide a deeper understanding of cellular proliferation/differentiation and other cell processes. In this study, we developed a method for predicting tyrosine kinase ligand-receptor pairs from their amino acid sequences. We collected tyrosine kinase ligand-receptor pairs from the Database of Interacting Proteins (DIP) and UniProtKB, filtered them by removing sequence redundancy, and used them as a dataset for machine learning and assessment of predictive performance. Our prediction method is based on support vector machines (SVMs), and we evaluated several input features suitable for tyrosine kinase for machine learning and compared and analyzed the results. Using sequence pattern information and domain information extracted from sequences as input features, we obtained 0.996 of the area under the receiver operating characteristic curve. This accuracy is higher than that obtained from general protein-protein interaction pair predictions.
30
|
Xu X, Li A, Wang M. Prediction of human disease-associated phosphorylation sites with combined feature selection approach and support vector machine. IET Syst Biol 2015; 9:155-163. [PMID: 26243832 PMCID: PMC8687269 DOI: 10.1049/iet-syb.2014.0051] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 12/05/2014] [Revised: 01/25/2015] [Accepted: 02/02/2015] [Indexed: 12/01/2024] Open
Abstract
Phosphorylation is a crucial post-translational modification that regulates almost all cellular processes. It has long been recognised that protein phosphorylation is closely related to disease, and many studies have therefore been undertaken to predict phosphorylation sites for disease treatment and drug design. However, despite the success achieved by these approaches, no method focuses on predicting disease-associated phosphorylation sites. Herein, for the first time, the authors propose a novel approach that is specially designed to identify associations between phosphorylation sites and human diseases. To take full advantage of local sequence information, a combined feature selection method-based support vector machine (CFS-SVM) that incorporates a minimum-redundancy-maximum-relevance filtering process and a forward feature selection process is developed. Performance evaluation shows that CFS-SVM is significantly better than widely used classifiers including Bayesian decision theory, k nearest neighbour and random forest. Even at the extremely high specificity of 99%, CFS-SVM still achieves a high sensitivity. Besides, tests on extra data confirm the effectiveness and general applicability of the CFS-SVM approach to a variety of diseases. Finally, the analysis of selected features and corresponding kinases also helps in understanding the potential mechanisms of disease-phosphorylation relationships and guides further experimental validation.
Affiliation(s)
- Xiaoyi Xu
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, People's Republic of China
- Ao Li
- Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, People's Republic of China
- Minghui Wang
- Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, People's Republic of China.
31
|
Accurate in silico identification of species-specific acetylation sites by integrating protein sequence-derived and functional features. Sci Rep 2014; 4:5765. [PMID: 25042424 PMCID: PMC4104576 DOI: 10.1038/srep05765] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Received: 02/24/2014] [Accepted: 07/03/2014] [Indexed: 11/08/2022] Open
Abstract
Lysine acetylation is a reversible post-translational modification, playing an important role in cytokine signaling, transcriptional regulation, and apoptosis. To fully understand acetylation mechanisms, identification of substrates and specific acetylation sites is crucial. Experimental identification is often time-consuming and expensive. Alternative bioinformatics methods are cost-effective and can be used in a high-throughput manner to generate relatively precise predictions. Here we develop a method termed as SSPKA for species-specific lysine acetylation prediction, using random forest classifiers that combine sequence-derived and functional features with two-step feature selection. Feature importance analysis indicates functional features, applied for lysine acetylation site prediction for the first time, significantly improve the predictive performance. We apply the SSPKA model to screen the entire human proteome and identify many high-confidence putative substrates that are not previously identified. The results along with the implemented Java tool, serve as useful resources to elucidate the mechanism of lysine acetylation and facilitate hypothesis-driven experimental design and validation.
32
|
Prediction of substrate sites for protein phosphatases 1B, SHP-1, and SHP-2 based on sequence features. Amino Acids 2014; 46:1919-28. [PMID: 24760585 DOI: 10.1007/s00726-014-1739-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/22/2013] [Accepted: 03/31/2014] [Indexed: 10/25/2022]
Abstract
Tyrosine phosphorylation plays crucial roles in numerous physiological processes. The level of phosphorylation depends on the combined action of protein tyrosine kinases and protein tyrosine phosphatases. Detection of possible phosphorylation and dephosphorylation sites can provide useful information for functional studies of the relevant proteins. Several studies have focused on the identification of protein tyrosine kinase substrates. However, the prediction of substrates of protein tyrosine phosphatases, which are also involved in balancing protein phosphorylation levels, lags behind. This paper describes a method that uses the k-nearest neighbor algorithm to identify the substrate sites of three protein tyrosine phosphatases based on the sequence features of manually collected dephosphorylation sites. In the performance evaluation, both sensitivity and specificity exceeded 75% for all three protein tyrosine phosphatases. Finally, the method was applied to a set of known tyrosine phosphorylation sites to search for candidate substrates.
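As a rough illustration of how a k-nearest neighbor classifier can be applied to sequence windows around candidate sites, the Python sketch below votes over the most similar training windows. It is a minimal sketch under stated assumptions: the identity-count similarity, the 7-residue windows, the value of k, and the toy sequences are all hypothetical and are not taken from the paper, whose actual features and distance measure are not described in this abstract.

```python
from collections import Counter


def window_similarity(a: str, b: str) -> int:
    """Count positions at which two equal-length windows share a residue (assumed measure)."""
    return sum(x == y for x, y in zip(a, b))


def knn_predict(query: str, training: list[tuple[str, bool]], k: int = 3) -> bool:
    """Majority vote over the k training windows most similar to the query."""
    ranked = sorted(
        training,
        key=lambda item: window_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes[True] > votes[False]


# Made-up 7-residue windows centred on a tyrosine (Y); labels are illustrative only.
training = [
    ("DEEYLIP", True),
    ("DEDYVVP", True),
    ("EEEYLLP", True),
    ("KRKYGSA", False),
    ("RKRYGTA", False),
    ("KKRYSSA", False),
]

print(knn_predict("DEEYVIP", training))  # True
print(knn_predict("KRRYGSA", training))  # False
```

In practice a substitution-matrix score would usually replace the plain identity count, and k would be tuned by cross-validation.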
33
|
Suo SB, Qiu JD, Shi SP, Chen X, Liang RP. PSEA: Kinase-specific prediction and analysis of human phosphorylation substrates. Sci Rep 2014; 4:4524. [PMID: 24681538 PMCID: PMC3970127 DOI: 10.1038/srep04524] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 10/22/2013] [Accepted: 03/11/2014] [Indexed: 11/09/2022] Open
Abstract
Protein phosphorylation catalysed by kinases plays crucial regulatory roles in intracellular signal transduction. With the increasing number of kinase-specific phosphorylation sites and disease-related phosphorylation substrates that have been identified, the desire to explore the regulatory relationship between protein kinases and disease-related phosphorylation substrates is motivated. In this work, we analysed the kinases' characteristic of all disease-related phosphorylation substrates by using our developed Phosphorylation Set Enrichment Analysis (PSEA) method. We evaluated the efficiency of our method with independent test and concluded that our approach is reliable for identifying kinases responsible for phosphorylated substrates. In addition, we found that Mitogen-activated protein kinase (MAPK) and Glycogen synthase kinase (GSK) families are more associated with abnormal phosphorylation. It can be anticipated that our method might be helpful to identify the mechanism of phosphorylation and the relationship between kinase and phosphorylation related diseases. A user-friendly web interface is now freely available at http://bioinfo.ncu.edu.cn/PKPred_Home.aspx.
Affiliation(s)
- Sheng-Bao Suo
- Department of Chemistry, Nanchang University, Nanchang, 330031, China
- Jian-Ding Qiu
- [1] Department of Chemistry, Nanchang University, Nanchang, 330031, China [2] Department of Chemical Engineering, Pingxiang College, Pingxiang, 337055, China
- Shao-Ping Shi
- [1] Department of Chemistry, Nanchang University, Nanchang, 330031, China [2] Department of Mathematics, Nanchang University, Nanchang, 330031, China
- Xiang Chen
- Department of Chemistry, Nanchang University, Nanchang, 330031, China
- Ru-Ping Liang
- Department of Chemistry, Nanchang University, Nanchang, 330031, China
34
|
Sobolev BN, Veselovsky AV, Poroikov VV. Prediction of protein post-translational modifications: main trends and methods. Russian Chemical Reviews 2014. [DOI: 10.1070/rc2014v083n02abeh004377] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/08/2022]
35
|
Fan W, Xu X, Shen Y, Feng H, Li A, Wang M. Prediction of protein kinase-specific phosphorylation sites in hierarchical structure using functional information and random forest. Amino Acids 2014; 46:1069-78. [PMID: 24452754 DOI: 10.1007/s00726-014-1669-3] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Received: 10/03/2013] [Accepted: 01/08/2014] [Indexed: 10/25/2022]
Abstract
Reversible protein phosphorylation is one of the most important post-translational modifications, which regulates various biological cellular processes. Identification of the kinase-specific phosphorylation sites is helpful for understanding the phosphorylation mechanism and regulation processes. Although a number of computational approaches have been developed, currently few studies are concerned about hierarchical structures of kinases, and most of the existing tools use only local sequence information to construct predictive models. In this work, we conduct a systematic and hierarchy-specific investigation of protein phosphorylation site prediction in which protein kinases are clustered into hierarchical structures with four levels including kinase, subfamily, family and group. To enhance phosphorylation site prediction at all hierarchical levels, functional information of proteins, including gene ontology (GO) and protein-protein interaction (PPI), is adopted in addition to primary sequence to construct prediction models based on random forest. Analysis of selected GO and PPI features shows that functional information is critical in determining protein phosphorylation sites for every hierarchical level. Furthermore, the prediction results of Phospho.ELM and additional testing dataset demonstrate that the proposed method remarkably outperforms existing phosphorylation prediction methods at all hierarchical levels. The proposed method is freely available at http://bioinformatics.ustc.edu.cn/phos_pred/.
Affiliation(s)
- Wenwen Fan
- School of Information Science and Technology, University of Science and Technology of China, 443 Huangshan Road, Hefei, 230027, China
36
|
Wang M, Zhao XM, Tan H, Akutsu T, Whisstock JC, Song J. Cascleave 2.0, a new approach for predicting caspase and granzyme cleavage targets. Bioinformatics 2013; 30:71-80. [PMID: 24149049 DOI: 10.1093/bioinformatics/btt603] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Indexed: 11/13/2022]
Abstract
MOTIVATION Caspases and granzyme B (GrB) are important proteases involved in fundamental cellular processes and play essential roles in programmed cell death, necrosis and inflammation. Although a number of substrates for both types have been experimentally identified, the complete repertoire of caspases and granzyme B substrates remained to be fully characterized. Accordingly, systematic bioinformatics studies of known cleavage sites may provide important insights into their substrate specificity and facilitate the discovery of novel substrates. RESULTS We develop a new bioinformatics tool, termed Cascleave 2.0, which builds on previous success of the Cascleave tool for predicting generic caspase cleavage sites. It can be efficiently used to predict potential caspase-specific cleavage sites for the human caspase-1, 3, 6, 7, 8 and GrB. In particular, we integrate heterogeneous sequence and protein functional information from various sources to improve the prediction accuracy of Cascleave 2.0. During classification, we use both maximum relevance minimum redundancy and forward feature selection techniques to quantify the relative contribution of each feature to prediction and thus remove redundant as well as irrelevant features. A systematic evaluation of Cascleave 2.0 using the benchmark data and comparison with other state-of-the-art tools using independent test data indicate that Cascleave 2.0 outperforms other tools on protease-specific cleavage site prediction of caspase-1, 3, 6, 7 and GrB. Cascleave 2.0 is anticipated to be used as a powerful tool for identifying novel substrates and cleavage sites of caspases and GrB and help understand the functional roles of these important proteases in human proteolytic cascades. AVAILABILITY AND IMPLEMENTATION http://www.structbioinfor.org/cascleave2/.
Affiliation(s)
- Mingjun Wang
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, Department of Computer Science, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China, Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia, Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan and ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Melbourne, Victoria 3800, Australia
37
|
Abstract
Protein phosphorylation is one of the most pervasive post-translational modifications, regulating diverse cellular processes in various organisms. As mass spectrometry-based experimental approaches for identifying phosphorylation events are resource-intensive, many computational methods have been proposed, in which phosphorylation site prediction is formulated as a classification problem. They differ in several ways, and one crucial issue is the construction of training data and test data for unbiased performance evaluation. In this article, we categorize the existing data construction methods and try to answer three questions: (i) Is it equivalent to use different data construction methods in the assessment of phosphorylation site prediction algorithms? (ii) What kind of test data set is unbiased for assessing the prediction performance of a trained algorithm in different real world scenarios? (iii) Among the summarized training data construction methods, which one(s) has better generalization performance for most scenarios? To answer these questions, we conduct comprehensive experimental studies for both non-kinase-specific and kinase-specific prediction tasks. The experimental results show that: (i) different data construction methods can lead to significantly different prediction performance; (ii) there can be different test data construction methods that are unbiased with respect to different real world scenarios; and (iii) different data construction methods have different generalization performance in different real world scenarios. Therefore, when developing new algorithms in future research, people should concentrate on what kind of scenario their algorithm will work for, what the corresponding unbiased test data are and which training data construction method can generate best generalization performance.
38
|
Lai ACW, Nguyen Ba AN, Moses AM. Predicting kinase substrates using conservation of local motif density. Bioinformatics 2012; 28:962-9. [PMID: 22302575 DOI: 10.1093/bioinformatics/bts060] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 01/09/2023]
Abstract
MOTIVATION Protein kinases represent critical links in cell signaling. A central problem in computational biology is to systematically identify their substrates. RESULTS This study introduces a new method to predict kinase substrates by extracting evolutionary information from multiple sequence alignments in a manner that is tolerant to degenerate motif positioning. Given a known consensus, the new method (ConDens) compares the observed density of matches to a null model of evolution and does not require labeled training data. We confirmed that ConDens has improved performance compared with several existing methods in the field. Further, we show that it is generalizable and can predict interesting substrates for several important eukaryotic kinases where training data is not available. AVAILABILITY AND IMPLEMENTATION ConDens can be found at http://www.moseslab.csb.utoronto.ca/andyl/. CONTACT alan.moses@utoronto.ca SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Andy C W Lai
- Department of Cell and Systems Biology, University of Toronto, Toronto, Canada M5S 3G5
39
|
Li T, Du Y, Wang L, Huang L, Li W, Lu M, Zhang X, Zhu WG. Characterization and prediction of lysine (K)-acetyl-transferase specific acetylation sites. Mol Cell Proteomics 2011; 11:M111.011080. [PMID: 21964354 DOI: 10.1074/mcp.m111.011080] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 11/06/2022] Open
Abstract
Lysine acetylation is a well-studied post-translational modification on both histone and nonhistone proteins. More than 2000 acetylated proteins and 4000 lysine acetylation sites have been identified by large scale mass spectrometry or traditional experimental methods. Although over 20 lysine (K)-acetyl-transferases (KATs) have been characterized, which KAT is responsible for a given protein or lysine site acetylation is mostly unknown. In this work, we collected KAT-specific acetylation sites manually and analyzed sequence features surrounding the acetylated lysine of substrates from three main KAT families (CBP/p300, GCN5/PCAF, and the MYST family). We found that each of the three KAT families acetylates lysines with different sequence features. Based on these differences, we developed a computer program, Acetylation Set Enrichment Based method to predict which KAT-families are responsible for acetylation of a given protein or lysine site. Finally, we evaluated the efficiency of our method, and experimentally detected four proteins that were predicted to be acetylated by two KAT families when one representative member of the KAT family is over expressed. We conclude that our approach, combined with more traditional experimental methods, may be useful for identifying KAT families responsible for acetylated substrates proteome-wide.
Affiliation(s)
- Tingting Li
- Department of Biomedical Informatics, Peking University Health Science Center, Beijing 100191, China; Institute of Systems Biomedicine, Peking University Health Science Center, Beijing 100191, China.
- Yipeng Du
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Department of Biochemistry and Molecular Biology, Peking University Health Science Center, Beijing 100191, China
- Likun Wang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Lei Huang
- Advanced Computing Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Graduate University of Chinese Academy of Sciences, Beijing 100049, China
- Wenlin Li
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China
- Ming Lu
- Department of Biomedical Informatics, Peking University Health Science Center, Beijing 100191, China
- Xuegong Zhang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China
- Wei-Guo Zhu
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Department of Biochemistry and Molecular Biology, Peking University Health Science Center, Beijing 100191, China; The Center for Life Science, Peking University, Beijing 100871, China.
40
|
Trost B, Kusalik A. Computational prediction of eukaryotic phosphorylation sites. Bioinformatics 2011; 27:2927-35. [DOI: 10.1093/bioinformatics/btr525] [Citation(s) in RCA: 121] [Impact Index Per Article: 8.6] [Indexed: 11/14/2022] Open