1
Jiang L, Wang D, Xu D. A Pretrained ELECTRA Model for Kinase-Specific Phosphorylation Site Prediction. Methods Mol Biol 2022; 2499:105-124. [PMID: 35696076 DOI: 10.1007/978-1-0716-2317-6_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Phosphorylation plays a vital role in signal transduction and the cell cycle. Identifying and understanding phosphorylation through machine-learning methods has a long history. However, existing methods learn representations of a protein sequence segment only from the labeled dataset itself, which can result in biased or incomplete features, especially for kinase-specific phosphorylation site prediction, where training data are typically sparse. To learn a comprehensive contextual representation of a protein sequence segment for kinase-specific phosphorylation site prediction, we pretrained our model on over 24 million unlabeled sequence fragments using ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). The pretrained model was applied to kinase-specific site prediction for the kinases CDK, PKA, CK2, MAPK, and PKC. The pretrained ELECTRA model achieves a 9.02% improvement over BERT and an 11.10% improvement over MusiteDeep in the area under the precision-recall curve on the benchmark data.
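The comparison above is reported as area under the precision-recall curve. As a minimal sketch of how that metric can be estimated (not the authors' evaluation code), the function below computes average precision, a common step-wise approximation of the area. The function name and the toy labels and scores are illustrative assumptions.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Approximate the area under the precision-recall curve.

    Examples are ranked by descending score, and the precision observed at
    the rank of each positive example is averaged. Tied scores keep their
    input order, which is a simplification.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    positives_seen = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            positives_seen += 1
            # Precision among the top `rank` predictions.
            precision_sum += positives_seen / rank
    if positives_seen == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / positives_seen


# Toy data (hypothetical): two true sites among four candidate residues.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_ap = average_precision(toy_labels, toy_scores)
```

Because true phosphorylation sites are rare among candidate residues, a precision-based summary like this tends to be more sensitive to false positives than ROC AUC.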
Affiliation(s)
- Lei Jiang
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Duolin Wang
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Dong Xu
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
2
Arico DS, Beati P, Wengier DL, Mazzella MA. A novel strategy to uncover specific GO terms/phosphorylation pathways in phosphoproteomic data in Arabidopsis thaliana. BMC Plant Biology 2021; 21:592. [PMID: 34906086 PMCID: PMC8670200 DOI: 10.1186/s12870-021-03377-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/20/2021] [Accepted: 11/29/2021] [Indexed: 06/14/2023]
Abstract
BACKGROUND: Proteins are the workforce of the cell, and their phosphorylation status tailors specific responses efficiently. One of the main challenges of phosphoproteomic approaches is to deconvolute, from a list of phosphoproteins, the biological processes that respond specifically to an experimental query. Comparing the frequency distribution of GO (Gene Ontology) terms in a given phosphoproteome set with that observed in the genome reference set (GenRS) is the most widely used way to infer biological significance. Yet this comparison assumes that the GO term distributions of the phosphoproteome and the genome are identical, a hypothesis that has not been tested because a comprehensive phosphoproteome database has been lacking. RESULTS: In this study, we test this hypothesis by constructing three phosphoproteome databases for Arabidopsis thaliana: one based on experimental data (ExpRS), another based on in silico phosphorylation protein prediction (PredRS), and a third that is the union of both (UnRS). Our results show that all three phosphoproteome reference sets show default enrichment of several GO terms compared to GenRS, indicating that the GO term distribution in the phosphoproteomes does not match that of the genome. Moreover, these differences overshadow the identification of GO terms that are specifically enriched in a particular condition. To overcome this limitation, we present an additional comparison of the sample of interest with UnRS to uncover GO terms specifically enriched in a particular phosphoproteome experiment. Using this strategy, we found that mRNA splicing and cytoplasmic microtubule compounds are important processes specifically enriched in the phosphoproteome of dark-grown Arabidopsis seedlings. CONCLUSIONS: This study provides a novel strategy to uncover specific GO terms in phosphoproteome data of Arabidopsis that could be applied to any other organism. We also highlight the importance of specific phosphorylation pathways that take place during dark-grown Arabidopsis development.
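The effect of the reference set on GO enrichment can be illustrated with a small sketch. The code below is not the authors' pipeline: it assumes a one-sided hypergeometric test, and every count in it is hypothetical, chosen only to show how a term can look enriched against a genome-wide reference yet unremarkable against a phosphoproteome reference such as UnRS.

```python
from math import comb


def enrichment_pvalue(sample_hits: int, sample_size: int,
                      ref_hits: int, ref_size: int) -> float:
    """Upper-tail hypergeometric p-value for one GO term.

    Probability of drawing at least `sample_hits` annotated proteins when
    `sample_size` proteins are drawn without replacement from a reference
    set of `ref_size` proteins, `ref_hits` of which carry the term.
    Assumes the sample is a subset of the reference set.
    """
    if not (0 <= sample_hits <= sample_size <= ref_size
            and 0 <= ref_hits <= ref_size):
        raise ValueError("inconsistent counts")
    total = comb(ref_size, sample_size)
    upper = min(sample_size, ref_hits)
    # math.comb returns 0 when k > n, so impossible draws contribute nothing.
    tail = sum(comb(ref_hits, i) * comb(ref_size - ref_hits, sample_size - i)
               for i in range(sample_hits, upper + 1))
    return tail / total


# Hypothetical counts: 20 of 100 sampled phosphoproteins carry the term.
# In the genome reference the term covers 5% of proteins; in the
# phosphoproteome reference it covers 20%.
p_vs_genome = enrichment_pvalue(20, 100, 1350, 27000)
p_vs_phospho = enrichment_pvalue(20, 100, 1600, 8000)
```

With these made-up numbers the term is strongly enriched relative to the genome but not relative to the phosphoproteome reference, which is the kind of default enrichment the abstract describes.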
Affiliation(s)
- Denise S Arico
  - INGEBI-CONICET Instituto de Investigaciones en Ingeniería Genética y Biología Molecular "Dr. Héctor Torres", Vuelta de Obligado 2490, 1428, CABA, Argentina
- Paula Beati
  - INGEBI-CONICET Instituto de Investigaciones en Ingeniería Genética y Biología Molecular "Dr. Héctor Torres", Vuelta de Obligado 2490, 1428, CABA, Argentina
- Diego L Wengier
  - INGEBI-CONICET Instituto de Investigaciones en Ingeniería Genética y Biología Molecular "Dr. Héctor Torres", Vuelta de Obligado 2490, 1428, CABA, Argentina
  - Department of Chemical Engineering, Stanford University, 443 Via Ortega, Stanford, CA, 94305, USA
- Maria Agustina Mazzella
  - INGEBI-CONICET Instituto de Investigaciones en Ingeniería Genética y Biología Molecular "Dr. Héctor Torres", Vuelta de Obligado 2490, 1428, CABA, Argentina
3
Qiu WR, Xu A, Xu ZC, Zhang CH, Xiao X. Identifying Acetylation Protein by Fusing Its PseAAC and Functional Domain Annotation. Front Bioeng Biotechnol 2019; 7:311. [PMID: 31867311 PMCID: PMC6908504 DOI: 10.3389/fbioe.2019.00311] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/12/2019] [Accepted: 10/22/2019] [Indexed: 11/13/2022]
Abstract
Acetylation is a post-translational modification (PTM) that often involves a reaction with acetic acid and introduces an acetyl group into an organic compound. Correctly identifying acetylated proteins helps in understanding the mechanism of acetylation in biological systems. Although many acetylation sites have been identified by high-throughput experimental studies using mass spectrometry, many more remain to be discovered. Computational methods have shown their power for identifying acetylation sites with informatics techniques, which usually reduce experimental cost and improve effectiveness and efficiency. An approach that can distinguish acetylated proteins from non-acetylated ones would therefore be a meaningful and effective contribution to this problem. Here, we propose a novel computational method for identifying acetylated proteins by extracting features from sequence conservation information via a gray system model and from KNN scores based on functional domain annotation and subcellular localization. We performed 5-fold cross-validation on three datasets, along with extensive feature analysis and the Relief feature selection algorithm. The results are satisfactory: the mean accuracy is 77.10%, the Matthews correlation coefficient is 0.5457, and the AUC is 0.8389. This work may provide useful insights for related experimental validation and for further studies of other PTM processes. For the convenience of researchers, a web server named “iACetyP” was established and is accessible at http://www.jci-bioinfo.cn/iAcetyP.
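For readers less familiar with the metrics reported above, the following minimal sketch computes accuracy and the Matthews correlation coefficient from a binary confusion matrix. It is a generic illustration, not the authors' evaluation code, and the counts in the example are hypothetical.

```python
from math import sqrt


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for binary classification.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction). Returns 0.0 when any marginal total is
    zero, a common convention for the undefined case.
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical fold: 100 acetylated and 100 non-acetylated proteins.
example_acc = accuracy(tp=80, tn=75, fp=25, fn=20)
example_mcc = matthews_corrcoef(tp=80, tn=75, fp=25, fn=20)
```

Unlike accuracy, the coefficient uses all four cells of the confusion matrix, so it stays informative when the two classes differ in size.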
Affiliation(s)
- Wang-Ren Qiu
  - School of Information and Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Ao Xu
  - School of Information and Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Zhao-Chun Xu
  - School of Information and Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Chun-Hua Zhang
  - School of Information and Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Xuan Xiao
  - School of Information and Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
4
Maiti S, Hassan A, Mitra P. Boosting phosphorylation site prediction with sequence feature-based machine learning. Proteins 2019; 88:284-291. [PMID: 31412138 DOI: 10.1002/prot.25801] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 03/07/2019] [Revised: 07/13/2019] [Accepted: 08/08/2019] [Indexed: 12/13/2022]
Abstract
Protein phosphorylation is one of the essential post-translational modifications and plays a vital role in the regulation of many fundamental cellular processes. We propose a LightGBM-based computational approach that uses evolutionary, geometric, sequence environment, and amino acid-specific features to decipher phosphate binding sites from a protein sequence. Compared with existing methods on 2429 protein sequences taken from the standard Phospho.ELM (P.ELM) benchmark data set, which features 11 organisms, our method reports a higher F1 score (0.504; the harmonic mean of precision and recall) and ROC AUC (0.836; the area under the receiver operating characteristic curve). The computation time of our approach is much lower than that of a recently developed deep learning-based framework. Structural analysis of selected protein sequences indicates that our predictions form a superset of the phosphorylation sites annotated in the P.ELM data set. The foundation of our scheme is manual feature engineering and decision tree-based classification. Hence, it is intuitive, and one can interpret the final tree as a set of rules, giving a deeper understanding of the relationships between biophysical features and phosphorylation sites. Our problem transformation method permits more control over precision and recall, as demonstrated by the fact that incorporating the output probability of the existing deep learning framework as an additional feature improves our prediction (F1 score = 0.546; ROC AUC = 0.849). The implementation of our method can be accessed at http://cse.iitkgp.ac.in/~pralay/resources/PPSBoost/ and is mirrored at https://cosmos.iitkgp.ac.in/PPSBoost.
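The F1 score quoted above is the harmonic mean of precision and recall. The sketch below shows that calculation and how moving the decision threshold on predicted probabilities trades one for the other. It is a generic illustration with hypothetical toy data, not the PPSBoost implementation.

```python
from typing import Sequence, Tuple


def precision_recall_f1(labels: Sequence[int],
                        probabilities: Sequence[float],
                        threshold: float = 0.5) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) at a given probability threshold.

    A residue is predicted as a phosphorylation site when its probability
    is at least `threshold`. Undefined ratios are reported as 0.0.
    """
    tp = fp = fn = 0
    for label, probability in zip(labels, probabilities):
        predicted_site = probability >= threshold
        if predicted_site and label == 1:
            tp += 1
        elif predicted_site:
            fp += 1
        elif label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): two true sites among four candidate residues.
toy_labels = [1, 1, 0, 0]
toy_probabilities = [0.9, 0.4, 0.6, 0.1]
strict = precision_recall_f1(toy_labels, toy_probabilities, threshold=0.5)
lenient = precision_recall_f1(toy_labels, toy_probabilities, threshold=0.3)
```

In this toy case, lowering the threshold recovers the second true site, raising recall from 0.5 to 1.0 and F1 from 0.5 to 0.8.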
Affiliation(s)
- Shyantani Maiti
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal, India
- Atif Hassan
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal, India
- Pralay Mitra
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal, India
5
Wang D, Zeng S, Xu C, Qiu W, Liang Y, Joshi T, Xu D. MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction. Bioinformatics 2018; 33:3909-3916. [PMID: 29036382 DOI: 10.1093/bioinformatics/btx496] [Citation(s) in RCA: 165] [Impact Index Per Article: 23.6] [Received: 05/09/2017] [Accepted: 08/01/2017] [Indexed: 11/14/2022]
Abstract
Motivation: Computational methods for phosphorylation site prediction play important roles in protein function studies and experimental design. Most existing methods are based on feature extraction, which may result in incomplete or biased features. Deep learning, as a cutting-edge machine learning method, can automatically discover complex representations of phosphorylation patterns from raw sequences, and hence provides a powerful tool for improving phosphorylation site prediction. Results: We present MusiteDeep, the first deep-learning framework for predicting general and kinase-specific phosphorylation sites. MusiteDeep takes raw sequence data as input and uses convolutional neural networks with a novel two-dimensional attention mechanism. It achieves over a 50% relative improvement in the area under the precision-recall curve in general phosphorylation site prediction and obtains competitive results in kinase-specific prediction compared to other well-known tools on the benchmark data. Availability and implementation: MusiteDeep is provided as an open-source tool available at https://github.com/duolinwang/MusiteDeep. Contact: xudong@missouri.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Duolin Wang
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
  - Department of Electrical Engineering and Computer Science, Informatics Institute, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Shuai Zeng
  - Department of Electrical Engineering and Computer Science, Informatics Institute, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Chunhui Xu
  - Department of Electrical Engineering and Computer Science, Informatics Institute, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
- Wangren Qiu
  - Department of Electrical Engineering and Computer Science, Informatics Institute, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
  - Computer Department, Jingdezhen Ceramic Institute, Jingdezhen 333403, China
- Yanchun Liang
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
  - Department of Computer Science and Technology, Zhuhai College of Jilin University, Zhuhai 519041, China
- Trupti Joshi
  - Department of Electrical Engineering and Computer Science, Informatics Institute, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
  - Department of Health Management and Informatics, School of Medicine, University of Missouri, Columbia, MO 65211, USA
- Dong Xu
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
  - Department of Electrical Engineering and Computer Science, Informatics Institute, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
6
Bolger ME, Arsova B, Usadel B. Plant genome and transcriptome annotations: from misconceptions to simple solutions. Brief Bioinform 2018; 19:437-449. [PMID: 28062412 PMCID: PMC5952960 DOI: 10.1093/bib/bbw135] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Received: 08/22/2016] [Revised: 11/29/2016] [Indexed: 12/14/2022]
Abstract
Next-generation sequencing has triggered an explosion of available genomic and transcriptomic resources in the plant sciences. Although genome and transcriptome sequencing has become orders of magnitude cheaper and more efficient, the functional annotation process often lags behind. This might be due to the lack of a comprehensive enumeration of simple-to-use tools available to the plant researcher. In this comprehensive review, we present (i) typical ontologies to be used in the plant sciences, (ii) useful databases and resources used for functional annotation, (iii) what to expect from an annotated plant genome, (iv) an automated annotation pipeline and (v) a recipe and reference chart outlining typical steps used to annotate plant genomes/transcriptomes using publicly available resources.
Affiliation(s)
- Marie E Bolger
  - Forschungszentrum Jülich, Wilhelm Johnen Str, Jülich, Germany
- Borjana Arsova
  - Forschungszentrum Jülich, Wilhelm Johnen Str, Jülich, Germany
  - FRS-FNRS Chargé de Recherches, Functional Genomics and Plant Molecular Imaging Center for Protein Engineering (CIP), Dpt of Life Sciences, University of Liège, Quartier de la Vallée, 1, Chemin de la Vallée, 4 - Bât B22, 4000 LIEGE, Belgium
- Björn Usadel
  - Forschungszentrum Jülich, Wilhelm Johnen Str, Jülich, Germany
  - RWTH Aachen University, Institute for Biology I Botany, BioSC, Worringer Weg 3, Aachen, Germany
7
Yao Q, Xu D. Bioinformatics Analysis of Protein Phosphorylation in Plant Systems Biology Using P3DB. Methods Mol Biol 2017; 1558:127-138. [PMID: 28150236 DOI: 10.1007/978-1-4939-6783-4_6] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 12/20/2022]
Abstract
Protein phosphorylation is one of the most pervasive protein post-translational modification events in plant cells. It is involved in many plant biological processes, such as plant growth, organ development, and plant immunology, by regulating or switching signaling and metabolic pathways. High-throughput experimental methods like mass spectrometry can easily characterize hundreds to thousands of phosphorylation events in a single experiment. With the increasing volume of data sets, the Plant Protein Phosphorylation DataBase (P3DB, http://p3db.org) provides a comprehensive, systematic, and interactive online platform to deposit, query, analyze, and visualize these phosphorylation events in many plant species. It stores protein phosphorylation sites in the context of identified mass spectra, phosphopeptides, and phosphoproteins contributed by various plant proteome studies. In addition, P3DB associates these plant phosphorylation sites with protein physicochemical information in protein charts and tertiary structures, and various protein annotations from hierarchical kinase phosphatase families, protein domains, and gene ontology are also included in the database. P3DB not only provides rich information but also interconnects the data and visualizes them as networks in a systems biology context. Currently, P3DB includes the KiC (Kinase Client) assay network, the protein-protein interaction network, the kinase-substrate network, the phosphatase-substrate network, and the protein domain co-occurrence network. All of these can be used to query and visualize existing phosphorylation events. Although P3DB hosts only experimentally identified phosphorylation data, it provides a plant phosphorylation prediction model for any unknown queries on the fly. P3DB is an entry point for the plant phosphorylation community to deposit and visualize any customized data sets within this systems biology framework. P3DB has become one of the major bioinformatics platforms for protein phosphorylation in plant biology.
Affiliation(s)
- Qiuming Yao
  - Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, 1201 Rollins St., Columbia, MO, 65211, USA
- Dong Xu
  - Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, 1201 Rollins St., Columbia, MO, 65211, USA