1
|
Pan T, Bi Y, Wang X, Zhang Y, Webb GI, Gasser RB, Kurgan L, Song J. SCREEN: A Graph-based Contrastive Learning Tool to Infer Catalytic Residues and Assess Enzyme Mutations. GENOMICS, PROTEOMICS & BIOINFORMATICS 2025; 22:qzae094. [PMID: 39724324 PMCID: PMC11961199 DOI: 10.1093/gpbjnl/qzae094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/22/2024] [Revised: 12/05/2024] [Accepted: 12/06/2024] [Indexed: 12/28/2024]
Abstract
The accurate identification of catalytic residues contributes to our understanding of enzyme functions in biological processes and pathways. The increasing number of protein sequences necessitates computational tools for the automated prediction of catalytic residues in enzymes. Here, we introduce SCREEN, a graph neural network for the high-throughput prediction of catalytic residues via the integration of enzyme functional and structural information. SCREEN constructs residue representations based on spatial arrangements and incorporates enzyme function priors into such representations through contrastive learning. We demonstrate that SCREEN (1) consistently outperforms currently-available predictors; (2) provides accurate results when applied to inferred enzyme structures; and (3) generalizes well to enzymes dissimilar from those in the training set. We also show that the putative catalytic residues predicted by SCREEN mimic key structural and biophysical characteristics of native catalytic residues. Moreover, using experimental datasets, we show that SCREEN's predictions can be used to distinguish residues with a high mutation tolerance from those likely to cause functional loss when mutated, indicating that this tool might be used to infer disease-associated mutations. SCREEN is publicly available at https://github.com/BioColLab/SCREEN and https://ngdc.cncb.ac.cn/biocode/tool/7580.
Collapse
Affiliation(s)
- Tong Pan
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
- Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton, VIC 3800, Australia
| | - Yue Bi
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
- Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton, VIC 3800, Australia
| | - Xiaoyu Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
- Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton, VIC 3800, Australia
| | - Ying Zhang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Geoffrey I Webb
- Department of Data Science and Artificial Intelligence, Monash University, Clayton, VIC 3800, Australia
| | - Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, VIC 3010, Australia
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
- Monash Biomedicine Discovery Institute-Wenzhou Medical University Alliance in Clinical and Experimental Biomedicine, Monash University, Clayton, VIC 3800, Australia
- Key Laboratory of Clinical Laboratory Diagnosis and Translational Research of Zhejiang Province, Department of Clinical Laboratory, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325015, China
| |
Collapse
|
2
|
Nana Teukam YG, Kwate Dassi L, Manica M, Probst D, Schwaller P, Laino T. Language models can identify enzymatic binding sites in protein sequences. Comput Struct Biotechnol J 2024; 23:1929-1937. [PMID: 38736695 PMCID: PMC11087710 DOI: 10.1016/j.csbj.2024.04.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Revised: 04/05/2024] [Accepted: 04/05/2024] [Indexed: 05/14/2024] Open
Abstract
Recent advances in language modeling have had a tremendous impact on how we handle sequential data in science. Language architectures have emerged as a hotbed of innovation and creativity in natural language processing over the last decade, and have since gained prominence in modeling proteins and chemical processes, elucidating structural relationships from textual/sequential data. Surprisingly, some of these relationships refer to three-dimensional structural features, raising important questions on the dimensionality of the information encoded within sequential data. Here, we demonstrate that the unsupervised use of a language model architecture to a language representation of bio-catalyzed chemical reactions can capture the signal at the base of the substrate-binding site atomic interactions. This allows us to identify the three-dimensional binding site position in unknown protein sequences. The language representation comprises a reaction-simplified molecular-input line-entry system (SMILES) for substrate and products, and amino acid sequence information for the enzyme. This approach can recover, with no supervision, 52.13% of the binding site when considering co-crystallized substrate-enzyme structures as ground truth, vastly outperforming other attention-based models.
Collapse
Affiliation(s)
| | - Loïc Kwate Dassi
- IBM Research Europe, Saümerstrasse 4, 8803 Rüschlikon, Switzerland
| | - Matteo Manica
- IBM Research Europe, Saümerstrasse 4, 8803 Rüschlikon, Switzerland
| | - Daniel Probst
- IBM Research Europe, Saümerstrasse 4, 8803 Rüschlikon, Switzerland
- National Center for Competence in Research-Catalysis (NCCR-Catalysis), Switzerland
| | - Philippe Schwaller
- IBM Research Europe, Saümerstrasse 4, 8803 Rüschlikon, Switzerland
- National Center for Competence in Research-Catalysis (NCCR-Catalysis), Switzerland
| | - Teodoro Laino
- IBM Research Europe, Saümerstrasse 4, 8803 Rüschlikon, Switzerland
- National Center for Competence in Research-Catalysis (NCCR-Catalysis), Switzerland
| |
Collapse
|
3
|
Song Y, Yuan Q, Chen S, Zeng Y, Zhao H, Yang Y. Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures. Nat Commun 2024; 15:8180. [PMID: 39294165 PMCID: PMC11411130 DOI: 10.1038/s41467-024-52533-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2024] [Accepted: 09/11/2024] [Indexed: 09/20/2024] Open
Abstract
Enzymes are crucial in numerous biological processes, with the Enzyme Commission (EC) number being a commonly used method for defining enzyme function. However, current EC number prediction technologies have not fully recognized the importance of enzyme active sites and structural characteristics. Here, we propose GraphEC, a geometric graph learning-based EC number predictor using the ESMFold-predicted structures and a pre-trained protein language model. Specifically, we first construct a model to predict the enzyme active sites, which is utilized to predict the EC number. The prediction is further improved through a label diffusion algorithm by incorporating homology information. In parallel, the optimum pH of enzymes is predicted to reflect the enzyme-catalyzed reactions. Experiments demonstrate the superior performance of our model in predicting active sites, EC numbers, and optimum pH compared to other state-of-the-art methods. Additional analysis reveals that GraphEC is capable of extracting functional information from protein structures, emphasizing the effectiveness of geometric graph learning. This technology can be used to identify unannotated enzyme functions, as well as to predict their active sites and optimum pH, with the potential to advance research in synthetic biology, genomics, and other fields.
Collapse
Affiliation(s)
- Yidong Song
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China
| | - Qianmu Yuan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China
- High Performance Computing Department, National Supercomputing Center in Shenzhen, Shenzhen, Guangdong, China
| | - Sheng Chen
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China
| | - Yuansong Zeng
- School of Big Data & Software Engineering, Chongqing University, Chongqing, China
| | - Huiying Zhao
- Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, Guangdong, China
| | - Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China.
- Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou, China.
| |
Collapse
|
4
|
Wang X, Yin X, Jiang D, Zhao H, Wu Z, Zhang O, Wang J, Li Y, Deng Y, Liu H, Luo P, Han Y, Hou T, Yao X, Hsieh CY. Multi-modal deep learning enables efficient and accurate annotation of enzymatic active sites. Nat Commun 2024; 15:7348. [PMID: 39187482 PMCID: PMC11347633 DOI: 10.1038/s41467-024-51511-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2024] [Accepted: 08/09/2024] [Indexed: 08/28/2024] Open
Abstract
Annotating active sites in enzymes is crucial for advancing multiple fields including drug discovery, disease research, enzyme engineering, and synthetic biology. Despite the development of numerous automated annotation algorithms, a significant trade-off between speed and accuracy limits their large-scale practical applications. We introduce EasIFA, an enzyme active site annotation algorithm that fuses latent enzyme representations from the Protein Language Model and 3D structural encoder, and then aligns protein-level information with the knowledge of enzymatic reactions using a multi-modal cross-attention framework. EasIFA outperforms BLASTp with a 10-fold speed increase and improved recall, precision, f1 score, and MCC by 7.57%, 13.08%, 9.68%, and 0.1012, respectively. It also surpasses empirical-rule-based algorithm and other state-of-the-art deep learning annotation method based on PSSM features, achieving a speed increase ranging from 650 to 1400 times while enhancing annotation quality. This makes EasIFA a suitable replacement for conventional tools in both industrial and academic settings. EasIFA can also effectively transfer knowledge gained from coarsely annotated enzyme databases to smaller, high-precision datasets, highlighting its ability to model sparse and high-quality databases. Additionally, EasIFA shows potential as a catalytic site monitoring tool for designing enzymes with desired functions beyond their natural distribution.
Collapse
Affiliation(s)
- Xiaorui Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
- Neher's Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
| | - Xiaodan Yin
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
- Neher's Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
| | - Dejun Jiang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Huifeng Zhao
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Zhenxing Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Odin Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Jike Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Yuquan Li
- College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, 730000, Gansu, China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, 310018, Zhejiang, China
| | - Huanxiang Liu
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China
| | - Pei Luo
- Neher's Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
| | - Yuqiang Han
- Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong, 999077, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China.
| | - Xiaojun Yao
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China.
| | - Chang-Yu Hsieh
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China.
| |
Collapse
|
5
|
Wang X, Quinn D, Moody TS, Huang M. ALDELE: All-Purpose Deep Learning Toolkits for Predicting the Biocatalytic Activities of Enzymes. J Chem Inf Model 2024; 64:3123-3139. [PMID: 38573056 PMCID: PMC11040732 DOI: 10.1021/acs.jcim.4c00058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2024] [Revised: 02/15/2024] [Accepted: 03/11/2024] [Indexed: 04/05/2024]
Abstract
Rapidly predicting enzyme properties for catalyzing specific substrates is essential for identifying potential enzymes for industrial transformations. The demand for sustainable production of valuable industry chemicals utilizing biological resources raised a pressing need to speed up biocatalyst screening using machine learning techniques. In this research, we developed an all-purpose deep-learning-based multiple-toolkit (ALDELE) workflow for screening enzyme catalysts. ALDELE incorporates both structural and sequence representations of proteins, alongside representations of ligands by subgraphs and overall physicochemical properties. Comprehensive evaluation demonstrated that ALDELE can predict the catalytic activities of enzymes, and particularly, it identifies residue-based hotspots to guide enzyme engineering and generates substrate heat maps to explore the substrate scope for a given biocatalyst. Moreover, our models notably match empirical data, reinforcing the practicality and reliability of our approach through the alignment with confirmed mutation sites. ALDELE offers a facile and comprehensive solution by integrating different toolkits tailored for different purposes at affordable computational cost and therefore would be valuable to speed up the discovery of new functional enzymes for their exploitation by the industry.
Collapse
Affiliation(s)
- Xiangwen Wang
- School
of Chemistry and Chemical Engineering, Queen’s
University Belfast, Belfast BT9 5AG, Northern Ireland, U.K.
- Department
of Biocatalysis and Isotope Chemistry, Almac
Sciences, Craigavon BT63 5QD, Northern Ireland, U.K.
| | - Derek Quinn
- Department
of Biocatalysis and Isotope Chemistry, Almac
Sciences, Craigavon BT63 5QD, Northern Ireland, U.K.
| | - Thomas S. Moody
- Department
of Biocatalysis and Isotope Chemistry, Almac
Sciences, Craigavon BT63 5QD, Northern Ireland, U.K.
- Arran
Chemical Company Limited, Unit 1 Monksland Industrial Estate, Athlone,
Co., Roscommon N37 DN24, Ireland
| | - Meilan Huang
- School
of Chemistry and Chemical Engineering, Queen’s
University Belfast, Belfast BT9 5AG, Northern Ireland, U.K.
| |
Collapse
|
6
|
Esmaili F, Pourmirzaei M, Ramazi S, Shojaeilangari S, Yavari E. A Review of Machine Learning and Algorithmic Methods for Protein Phosphorylation Site Prediction. GENOMICS, PROTEOMICS & BIOINFORMATICS 2023; 21:1266-1285. [PMID: 37863385 PMCID: PMC11082408 DOI: 10.1016/j.gpb.2023.03.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/16/2022] [Revised: 01/16/2023] [Accepted: 03/23/2023] [Indexed: 10/22/2023]
Abstract
Post-translational modifications (PTMs) have key roles in extending the functional diversity of proteins and, as a result, regulating diverse cellular processes in prokaryotic and eukaryotic organisms. Phosphorylation modification is a vital PTM that occurs in most proteins and plays a significant role in many biological processes. Disorders in the phosphorylation process lead to multiple diseases, including neurological disorders and cancers. The purpose of this review is to organize this body of knowledge associated with phosphorylation site (p-site) prediction to facilitate future research in this field. At first, we comprehensively review all related databases and introduce all steps regarding dataset creation, data preprocessing, and method evaluation in p-site prediction. Next, we investigate p-site prediction methods, which are divided into two computational groups: algorithmic and machine learning (ML). Additionally, it is shown that there are basically two main approaches for p-site prediction by ML: conventional and end-to-end deep learning methods, both of which are given an overview. Moreover, this review introduces the most important feature extraction techniques, which have mostly been used in p-site prediction. Finally, we create three test sets from new proteins related to the released version of the database of protein post-translational modifications (dbPTM) in 2022 based on general and human species. Evaluating online p-site prediction tools on newly added proteins introduced in the dbPTM 2022 release, distinct from those in the dbPTM 2019 release, reveals their limitations. In other words, the actual performance of these online p-site prediction tools on unseen proteins is notably lower than the results reported in their respective research papers.
Collapse
Affiliation(s)
- Farzaneh Esmaili
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| | - Mahdi Pourmirzaei
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| | - Shahin Ramazi
- Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran 14115-111, Iran.
| | - Seyedehsamaneh Shojaeilangari
- Biomedical Engineering Group, Department of Electrical Engineering and Information Technology, Iranian Research Organization for Science and Technology (IROST), Tehran 33535-111, Iran
| | - Elham Yavari
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| |
Collapse
|
7
|
Ribeiro AJM, Riziotis IG, Borkakoti N, Thornton JM. Enzyme function and evolution through the lens of bioinformatics. Biochem J 2023; 480:1845-1863. [PMID: 37991346 PMCID: PMC10754289 DOI: 10.1042/bcj20220405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2023] [Revised: 11/09/2023] [Accepted: 11/14/2023] [Indexed: 11/23/2023]
Abstract
Enzymes have been shaped by evolution over billions of years to catalyse the chemical reactions that support life on earth. Dispersed in the literature, or organised in online databases, knowledge about enzymes can be structured in distinct dimensions, either related to their quality as biological macromolecules, such as their sequence and structure, or related to their chemical functions, such as the catalytic site, kinetics, mechanism, and overall reaction. The evolution of enzymes can only be understood when each of these dimensions is considered. In addition, many of the properties of enzymes only make sense in the light of evolution. We start this review by outlining the main paradigms of enzyme evolution, including gene duplication and divergence, convergent evolution, and evolution by recombination of domains. In the second part, we overview the current collective knowledge about enzymes, as organised by different types of data and collected in several databases. We also highlight some increasingly powerful computational tools that can be used to close gaps in understanding, in particular for types of data that require laborious experimental protocols. We believe that recent advances in protein structure prediction will be a powerful catalyst for the prediction of binding, mechanism, and ultimately, chemical reactions. A comprehensive mapping of enzyme function and evolution may be attainable in the near future.
Collapse
Affiliation(s)
- Antonio J. M. Ribeiro
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K
| | - Ioannis G. Riziotis
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K
| | - Neera Borkakoti
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K
| | - Janet M. Thornton
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K
| |
Collapse
|
8
|
Shen X, Zhang S, Long J, Chen C, Wang M, Cui Z, Chen B, Tan T. A Highly Sensitive Model Based on Graph Neural Networks for Enzyme Key Catalytic Residue Prediction. J Chem Inf Model 2023; 63:4277-4290. [PMID: 37399293 DOI: 10.1021/acs.jcim.3c00273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/05/2023]
Abstract
Determining the catalytic site of enzymes is a great help for understanding the relationship between protein sequence, structure, and function, which provides the basis and targets for designing, modifying, and enhancing enzyme activity. The unique local spatial configuration bound to the substrate at the active center of the enzyme determines the catalytic ability of enzymes and plays an important role in the catalytic site prediction. As a suitable tool, the graph neural network can better understand and identify the residue sites with unique local spatial configurations due to its remarkable ability to characterize the three-dimensional structural features of proteins. Consequently, a novel model for predicting enzyme catalytic sites has been developed, which incorporates a uniquely designed adaptive edge-gated graph attention neural network (AEGAN). This model is capable of effectively handling sequential and structural characteristics of proteins at various levels, and the extracted features enable an accurate description of the local spatial configuration of the enzyme active site by sampling the local space around candidate residues and special design of amino acid physical and chemical properties. To evaluate its performance, the model was compared with existing catalytic site prediction models using different benchmark datasets and achieved the best results on each benchmark dataset. The model exhibited a sensitivity of 0.9659, accuracy of 0.9226, and area under the precision-recall curve (AUPRC) of 0.9241 on the independent test set constructed for evaluation. Furthermore, the F1-score of this model is nearly four times higher than that of the best-performing similar model in previous studies. This research can serve as a valuable tool to help researchers understand protein sequence-structure-function relationships while facilitating the characterization of novel enzymes of unknown function.
Collapse
Affiliation(s)
- Xiaowei Shen
- National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
| | - Shiding Zhang
- National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
| | - Jianyu Long
- National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
| | - Changjing Chen
- National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
| | - Meng Wang
- National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
| | - Ziheng Cui
- National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
| | - Biqiang Chen
- National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
| | - Tianwei Tan
- National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
| |
Collapse
|
9
|
Walther D. Specifics of Metabolite-Protein Interactions and Their Computational Analysis and Prediction. Methods Mol Biol 2023; 2554:179-197. [PMID: 36178627 DOI: 10.1007/978-1-0716-2624-5_12] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
Computational approaches to the characterization and prediction of compound-protein interactions have a long research history and are well established, driven primarily by the needs of drug development. While, in principle, many of the computational methods developed in the context of drug development can also be applied directly to the investigation of metabolite-protein interactions, the interactions of metabolites with proteins (enzymes) are characterized by a number of particularities that result from their natural evolutionary origin and their biological and biochemical roles, as well as from a different problem setting when investigating them. In this review, these special aspects will be highlighted and recent research on them and developed computational approaches presented, along with available resources. They concern, among others, binding promiscuity, allostery, the role of posttranslational modifications, molecular steering and crowding effects, and metabolic conversion rate predictions. Recent breakthroughs in the field of protein structure prediction and newly developed machine learning techniques are being discussed as a tremendous opportunity for developing a more detailed molecular understanding of metabolism.
Collapse
Affiliation(s)
- Dirk Walther
- Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany.
| |
Collapse
|
10
|
Comparative Analysis on Alignment-Based and Pretrained Feature Representations for the Identification of DNA-Binding Proteins. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:5847242. [PMID: 35799660 PMCID: PMC9256349 DOI: 10.1155/2022/5847242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/10/2022] [Accepted: 06/07/2022] [Indexed: 11/17/2022]
Abstract
The interaction between DNA and protein is vital for the development of a living body. Previous numerous studies on in silico identification of DNA-binding proteins (DBPs) usually include features extracted from the alignment-based (pseudo) position-specific scoring matrix (PSSM), leading to limited application due to its time-consuming generation. Few researchers have paid attention to the application of pretrained language models at the scale of evolution to the identification of DBPs. To this end, we present comprehensive insights into a comparison study on alignment-based PSSM and pretrained evolutionary scale modeling (ESM) representations in the field of DBP classification. The comparison is conducted by extracting information from PSSM and ESM representations using four unified averaging operations and by performing various feature selection (FS) methods. Experimental results demonstrate that the pretrained ESM representation outperforms the PSSM-derived features in a fair comparison perspective. The pretrained feature presentation deserves wide application to the area of in silico DBP identification as well as other function annotation issues. Finally, it is also confirmed that an ensemble scheme by aggregating various trained FS models can significantly improve the classification performance of DBPs.
Collapse
|
11
|
Paiva VDA, Gomes IDS, Monteiro CR, Mendonça MV, Martins PM, Santana CA, Gonçalves-Almeida V, Izidoro SC, Melo-Minardi RCD, Silveira SDA. Protein structural bioinformatics: An overview. Comput Biol Med 2022; 147:105695. [DOI: 10.1016/j.compbiomed.2022.105695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2021] [Revised: 06/01/2022] [Accepted: 06/02/2022] [Indexed: 11/27/2022]
|
12
|
Galeb HA, Lamantia A, Robson A, König K, Eichhorn J, Baldock SJ, Ashton MD, Baum JV, Mort RL, Robinson BJ, Schacher FH, Chechik V, Taylor AM, Hardy JG. The Polymerization of Homogentisic Acid in Vitro as a Model for Pyomelanin Formation. MACROMOL CHEM PHYS 2022. [DOI: 10.1002/macp.202100489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Affiliation(s)
- Hanaa A. Galeb
- Department of Chemistry Lancaster University Lancaster LA1 4YB United Kingdom
- Department of Chemistry Science and Arts College, Rabigh Campus King Abdulaziz University Jeddah 21577 Saudi Arabia
| | - Angelo Lamantia
- Department of Physics Lancaster University Lancaster LA1 4YW United Kingdom
| | - Alexander Robson
- Department of Chemistry Lancaster University Lancaster LA1 4YB United Kingdom
| | - Katja König
- Institut für Organische und Makromolekulare Chemie Friedrich‐Schiller‐Universität Jena Lessingstraße 8 Jena 07743 Germany
| | - Jonas Eichhorn
- Institut für Organische und Makromolekulare Chemie Friedrich‐Schiller‐Universität Jena Lessingstraße 8 Jena 07743 Germany
| | - Sara J. Baldock
- Department of Chemistry Lancaster University Lancaster LA1 4YB United Kingdom
| | - Mark D. Ashton
- Department of Chemistry Lancaster University Lancaster LA1 4YB United Kingdom
| | - John V. Baum
- Department of Chemistry Lancaster University Lancaster LA1 4YB United Kingdom
| | - Richard L. Mort
- Division of Biomedical and Life Sciences Lancaster University Lancaster LA1 4YG United Kingdom
| | - Benjamin J. Robinson
- Department of Physics Lancaster University Lancaster LA1 4YW United Kingdom
- Materials Science Institute Lancaster University Lancaster LA1 4YB United Kingdom
| | - Felix H. Schacher
- Institut für Organische und Makromolekulare Chemie Friedrich‐Schiller‐Universität Jena Lessingstraße 8 Jena 07743 Germany
| | - Victor Chechik
- Department of Chemistry University of York Heslington, York YO10 5DD United Kingdom
| | - Adam M. Taylor
- Lancaster Medical School Lancaster University Lancaster LA1 4YW United Kingdom
| | - John G. Hardy
- Department of Chemistry Lancaster University Lancaster LA1 4YB United Kingdom
- Materials Science Institute Lancaster University Lancaster LA1 4YB United Kingdom
| |
Collapse
|
13
|
Hot spots-making directed evolution easier. Biotechnol Adv 2022; 56:107926. [DOI: 10.1016/j.biotechadv.2022.107926] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2021] [Revised: 01/04/2022] [Accepted: 02/07/2022] [Indexed: 01/20/2023]
|
14
|
Feehan R, Franklin MW, Slusky JSG. Machine learning differentiates enzymatic and non-enzymatic metals in proteins. Nat Commun 2021; 12:3712. [PMID: 34140507 PMCID: PMC8211803 DOI: 10.1038/s41467-021-24070-3] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2021] [Accepted: 06/02/2021] [Indexed: 11/09/2022] Open
Abstract
Metalloenzymes are 40% of all enzymes and can perform all seven classes of enzyme reactions. Because of the physicochemical similarities between the active sites of metalloenzymes and inactive metal binding sites, it is challenging to differentiate between them. Yet distinguishing these two classes is critical for the identification of both native and designed enzymes. Because of similarities between catalytic and non-catalytic metal binding sites, finding physicochemical features that distinguish these two types of metal sites can indicate aspects that are critical to enzyme function. In this work, we develop the largest structural dataset of enzymatic and non-enzymatic metalloprotein sites to date. We then use a decision-tree ensemble machine learning model to classify metals bound to proteins as enzymatic or non-enzymatic with 92.2% precision and 90.1% recall. Our model scores electrostatic and pocket lining features as more important than pocket volume, despite the fact that volume is the most quantitatively different feature between enzyme and non-enzymatic sites. Finally, we find our model has overall better performance in a side-to-side comparison against other methods that differentiate enzymatic from non-enzymatic sequences. We anticipate that our model's ability to correctly identify which metal sites are responsible for enzymatic activity could enable identification of new enzymatic mechanisms and de novo enzyme design.
Collapse
Affiliation(s)
- Ryan Feehan
- Center for Computational Biology, The University of Kansas, Lawrence, KS, USA
| | - Meghan W Franklin
- Center for Computational Biology, The University of Kansas, Lawrence, KS, USA
| | - Joanna S G Slusky
- Center for Computational Biology, The University of Kansas, Lawrence, KS, USA.
- Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA.
| |
Collapse
|
15
|
Das S, Scholes HM, Sen N, Orengo C. CATH functional families predict functional sites in proteins. Bioinformatics 2021; 37:1099-1106. [PMID: 33135053 PMCID: PMC8150129 DOI: 10.1093/bioinformatics/btaa937] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2020] [Revised: 09/30/2020] [Accepted: 10/27/2020] [Indexed: 01/12/2023] Open
Abstract
MOTIVATION Identification of functional sites in proteins is essential for functional characterization, variant interpretation and drug design. Several methods are available for predicting either a generic functional site, or specific types of functional site. Here, we present FunSite, a machine learning predictor that identifies catalytic, ligand-binding and protein-protein interaction functional sites using features derived from protein sequence and structure, and evolutionary data from CATH functional families (FunFams). RESULTS FunSite's prediction performance was rigorously benchmarked using cross-validation and a holdout dataset. FunSite outperformed other publicly available functional site prediction methods. We show that conserved residues in FunFams are enriched in functional sites. We found FunSite's performance depends greatly on the quality of functional site annotations and the information content of FunFams in the training data. Finally, we analyze which structural and evolutionary features are most predictive for functional sites. AVAILABILITYAND IMPLEMENTATION https://github.com/UCL/cath-funsite-predictor. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sayoni Das
- PrecisionLife Ltd., Long Hanborough, OX29 8LJ Oxford, UK
| | - Harry M Scholes
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, London, UK
| | - Neeladri Sen
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, London, UK
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, London, UK
| |
Collapse
|
16
|
Feehan R, Montezano D, Slusky JSG. Machine learning for enzyme engineering, selection and design. Protein Eng Des Sel 2021; 34:gzab019. [PMID: 34296736 PMCID: PMC8299298 DOI: 10.1093/protein/gzab019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2020] [Revised: 06/18/2021] [Accepted: 06/23/2021] [Indexed: 11/15/2022] Open
Abstract
Machine learning is a useful computational tool for large and complex tasks such as those in the field of enzyme engineering, selection and design. In this review, we examine enzyme-related applications of machine learning. We start by comparing tools that can identify the function of an enzyme and the site responsible for that function. Then we detail methods for optimizing important experimental properties, such as the enzyme environment and enzyme reactants. We describe recent advances in enzyme systems design and enzyme design itself. Throughout we compare and contrast the data and algorithms used for these tasks to illustrate how the algorithms and data can be best used by future designers.
Collapse
Affiliation(s)
- Ryan Feehan
- Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
| | - Daniel Montezano
- Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
| | - Joanna S G Slusky
- Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
- Department of Molecular Biosciences, The University of Kansas, 1200 Sunnyside Ave. Lawrence, KS 66045-7600, USA
| |
Collapse
|
17
|
Gao J, Miao Z, Zhang Z, Wei H, Kurgan L. Prediction of Ion Channels and their Types from Protein Sequences: Comprehensive Review and Comparative Assessment. Curr Drug Targets 2020; 20:579-592. [PMID: 30360734 DOI: 10.2174/1389450119666181022153942] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2018] [Revised: 10/03/2018] [Accepted: 10/04/2018] [Indexed: 12/20/2022]
Abstract
BACKGROUND Ion channels are a large and growing protein family. Many of them are associated with diseases, and consequently, they are targets for over 700 drugs. Discovery of new ion channels is facilitated with computational methods that predict ion channels and their types from protein sequences. However, these methods were never comprehensively compared and evaluated. OBJECTIVE We offer first-of-its-kind comprehensive survey of the sequence-based predictors of ion channels. We describe eight predictors that include five methods that predict ion channels, their types, and four classes of the voltage-gated channels. We also develop and use a new benchmark dataset to perform comparative empirical analysis of the three currently available predictors. RESULTS While several methods that rely on different designs were published, only a few of them are currently available and offer a broad scope of predictions. Support and availability after publication should be required when new methods are considered for publication. Empirical analysis shows strong performance for the prediction of ion channels and modest performance for the prediction of ion channel types and voltage-gated channel classes. We identify a substantial weakness of current methods that cannot accurately predict ion channels that are categorized into multiple classes/types. CONCLUSION Several predictors of ion channels are available to the end users. They offer practical levels of predictive quality. Methods that rely on a larger and more diverse set of predictive inputs (such as PSIONplus) are more accurate. New tools that address multi-label prediction of ion channels should be developed.
Collapse
Affiliation(s)
- Jianzhao Gao
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
| | - Zhen Miao
- College of Life Sciences, Nankai University, Tianjin, China
| | - Zhaopeng Zhang
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
| | - Hong Wei
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, United States
| |
Collapse
|
18
|
Cheng G, Chen Q, Zhang R. Prediction of phosphorylation sites based on granular support vector machine. GRANULAR COMPUTING 2019. [DOI: 10.1007/s41066-019-00202-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
19
|
Gil N, Fiser A. The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis. Bioinformatics 2019; 35:12-19. [PMID: 29947739 PMCID: PMC6298051 DOI: 10.1093/bioinformatics/bty523] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2018] [Revised: 04/20/2018] [Accepted: 06/26/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation The analysis of sequence conservation patterns has been widely utilized to identify functionally important (catalytic and ligand-binding) protein residues for over a half-century. Despite decades of development, on average state-of-the-art non-template-based functional residue prediction methods must predict ∼25% of a protein's total residues to correctly identify half of the protein's functional site residues. The overwhelming proportion of false positives results in reported 'F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs). Results The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Nelson Gil
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA
| | - Andras Fiser
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA
| |
Collapse
|
20
|
Balanites aegyptiaca (L.) Del. for dermatophytoses: Ascertaining the efficacy and mode of action through experimental and computational approaches. INFORMATICS IN MEDICINE UNLOCKED 2019. [DOI: 10.1016/j.imu.2019.100177] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open
|
21
|
Liu Y, Wang X, Liu B. IDP⁻CRF: Intrinsically Disordered Protein/Region Identification Based on Conditional Random Fields. Int J Mol Sci 2018; 19:E2483. [PMID: 30135358 PMCID: PMC6164615 DOI: 10.3390/ijms19092483] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2018] [Revised: 08/14/2018] [Accepted: 08/18/2018] [Indexed: 12/16/2022] Open
Abstract
Accurate prediction of intrinsically disordered proteins/regions is one of the most important tasks in bioinformatics, and some computational predictors have been proposed to solve this problem. How to efficiently incorporate the sequence-order effect is critical for constructing an accurate predictor because disordered region distributions show global sequence patterns. In order to capture these sequence patterns, several sequence labelling models have been applied to this field, such as conditional random fields (CRFs). However, these methods suffer from certain disadvantages. In this study, we proposed a new computational predictor called IDP⁻CRF, which is trained on an updated benchmark dataset based on the MobiDB database and the DisProt database, and incorporates more comprehensive sequence-based features, including PSSMs (position-specific scoring matrices), kmer, predicted secondary structures, and relative solvent accessibilities. Experimental results on the benchmark dataset and two independent datasets show that IDP⁻CRF outperforms 25 existing state-of-the-art methods in this field, demonstrating that IDP⁻CRF is a very useful tool for identifying IDPs/IDRs (intrinsically disordered proteins/regions). We anticipate that IDP⁻CRF will facilitate the development of protein sequence analysis.
Collapse
Affiliation(s)
- Yumeng Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, Guangdong, China.
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, Guangdong, China.
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, Guangdong, China.
| |
Collapse
|
22
|
Song J, Li F, Takemoto K, Haffari G, Akutsu T, Chou KC, Webb GI. PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework. J Theor Biol 2018; 443:125-137. [DOI: 10.1016/j.jtbi.2018.01.023] [Citation(s) in RCA: 95] [Impact Index Per Article: 13.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2017] [Revised: 01/17/2018] [Accepted: 01/18/2018] [Indexed: 10/18/2022]
|
23
|
Prediction of Protein Phosphorylation Sites by Integrating Secondary Structure Information and Other One-Dimensional Structural Properties. Methods Mol Biol 2018; 1484:265-274. [PMID: 27787832 DOI: 10.1007/978-1-4939-6406-2_18] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
Studies on phosphorylation are important but challenging for both wet-bench experiments and computational studies, and accurate non-kinase-specific prediction tools are highly desirable for whole-genome annotation in a wide variety of species. Here, we describe a phosphorylation site prediction webserver, PhosphoSVM, that employs Support Vector Machine to combine protein secondary structure information and seven other one-dimensional structural properties, including Shannon entropy, relative entropy, predicted protein disorder information, predicted solvent accessible area, amino acid overlapping properties, averaged cumulative hydrophobicity, and subsequence k-nearest neighbor profiles. This method achieved AUC values of 0.8405/0.8183/0.7383 for serine (S), threonine (T), and tyrosine (Y) phosphorylation sites, respectively, in animals with a tenfold cross-validation. The model trained by the animal phosphorylation sites was also applied to a plant phosphorylation site dataset as an independent test. The AUC values for the independent test data set were 0.7761/0.6652/0.5958 for S/T/Y phosphorylation sites, respectively. This algorithm with the optimally trained model was implemented as a webserver. The webserver, trained model, and all datasets used in the current study are available at http://sysbio.unl.edu/PhosphoSVM .
Collapse
|
24
|
Choudhary P, Kumar S, Bachhawat AK, Pandit SB. CSmetaPred: a consensus method for prediction of catalytic residues. BMC Bioinformatics 2017; 18:583. [PMID: 29273005 PMCID: PMC5741869 DOI: 10.1186/s12859-017-1987-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2017] [Accepted: 12/05/2017] [Indexed: 01/27/2023] Open
Abstract
Background Knowledge of catalytic residues can play an essential role in elucidating mechanistic details of an enzyme. However, experimental identification of catalytic residues is a tedious and time-consuming task, which can be expedited by computational predictions. Despite significant development in active-site prediction methods, one of the remaining issues is ranked positions of putative catalytic residues among all ranked residues. In order to improve ranking of catalytic residues and their prediction accuracy, we have developed a meta-approach based method CSmetaPred. In this approach, residues are ranked based on the mean of normalized residue scores derived from four well-known catalytic residue predictors. The mean residue score of CSmetaPred is combined with predicted pocket information to improve prediction performance in meta-predictor, CSmetaPred_poc. Results Both meta-predictors are evaluated on two comprehensive benchmark datasets and three legacy datasets using Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves. The visual and quantitative analysis of ROC and PR curves shows that meta-predictors outperform their constituent methods and CSmetaPred_poc is the best of evaluated methods. For instance, on CSAMAC dataset CSmetaPred_poc (CSmetaPred) achieves highest Mean Average Specificity (MAS), a scalar measure for ROC curve, of 0.97 (0.96). Importantly, median predicted rank of catalytic residues is the lowest (best) for CSmetaPred_poc. Considering residues ranked ≤20 classified as true positive in binary classification, CSmetaPred_poc achieves prediction accuracy of 0.94 on CSAMAC dataset. Moreover, on the same dataset CSmetaPred_poc predicts all catalytic residues within top 20 ranks for ~73% of enzymes. Furthermore, benchmarking of prediction on comparative modelled structures showed that models result in better prediction than only sequence based predictions. These analyses suggest that CSmetaPred_poc is able to rank putative catalytic residues at lower (better) ranked positions, which can facilitate and expedite their experimental characterization. Conclusions The benchmarking studies showed that employing meta-approach in combining residue-level scores derived from well-known catalytic residue predictors can improve prediction accuracy as well as provide improved ranked positions of known catalytic residues. Hence, such predictions can assist experimentalist to prioritize residues for mutational studies in their efforts to characterize catalytic residues. Both meta-predictors are available as webserver at: http://14.139.227.206/csmetapred/. Electronic supplementary material The online version of this article (10.1186/s12859-017-1987-z) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Preeti Choudhary
- Department of Biological Sciences, Indian Institute of Science Education and Research, Mohali, Knowledge City, Sector 81, SAS Nagar, Manuali PO 140306, India
| | - Shailesh Kumar
- Department of Biological Sciences, Indian Institute of Science Education and Research, Mohali, Knowledge City, Sector 81, SAS Nagar, Manuali PO 140306, India.,Laboratory of Biochemistry and Genetics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Anand Kumar Bachhawat
- Department of Biological Sciences, Indian Institute of Science Education and Research, Mohali, Knowledge City, Sector 81, SAS Nagar, Manuali PO 140306, India
| | - Shashi Bhushan Pandit
- Department of Biological Sciences, Indian Institute of Science Education and Research, Mohali, Knowledge City, Sector 81, SAS Nagar, Manuali PO 140306, India.
| |
Collapse
|
25
|
CRHunter: integrating multifaceted information to predict catalytic residues in enzymes. Sci Rep 2016; 6:34044. [PMID: 27665935 PMCID: PMC5036049 DOI: 10.1038/srep34044] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2016] [Accepted: 09/07/2016] [Indexed: 11/08/2022] Open
Abstract
A variety of algorithms have been developed for catalytic residue prediction based on either feature- or template-based methodology. However, no studies have systematically compared these two strategies and further considered whether their combination could improve the prediction performance. Herein, we developed an integrative algorithm named CRHunter by simultaneously using the complementarity between feature- and template-based methodologies and that between structural and sequence information. Several novel structural features were generated by the Delaunay triangulation and Laplacian transformation of enzyme structures. Combining these features with traditional descriptors, we invented two support vector machine feature predictors based on both structural and sequence information. Furthermore, we established two template predictors using structure and profile alignments. Evaluated on datasets with different levels of homology, our feature predictors achieve relatively stable performance, whereas our template predictors yield poor results when the homological relationships become weak. Nevertheless, the hybrid algorithm CRHunter consistently achieves optimal performance among all our predictors. We also illustrate that our methodology can be applied to the predicted structures of enzymes. Compared with state-of-the-art methods, CRHunter yields comparable or better performance on various datasets. Finally, the application of this algorithm to structural genomics targets sheds light on solved protein structures with unknown functions.
Collapse
|
26
|
Santhosh R, Satheesh SN, Gurusaran M, Michael D, Sekar K, Jeyakanthan J. NIMS: a database on nucleobase compounds and their interactions in macromolecular structures. J Appl Crystallogr 2016. [DOI: 10.1107/s1600576716006208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
The intense exploration of nucleotide-binding protein structures has created a whirlwind in the field of structural biology and bioinformatics. This has led to the conception and birth of NIMS. This database is a collection of detailed data on the nucleobases, nucleosides and nucleotides, along with their analogues as well as the protein structures to which they bind. Interaction details such as the interacting residues and all associated values have been made available. As a pioneering step, the diffraction precision index for protein structures, the atomic uncertainty for each atom, and the computed errors on the interatomic distances and angles are available in the database. Apart from the above, provision has been made to visualize the three-dimensional structures of both ligands and protein–ligand structures and their interactions inJmolas well asJSmol. One of the salient features of NIMS is that it has been interfaced with a user-friendly and query-based efficient search engine. It was conceived and developed with the aim of serving a significant section of researchers working in the area of protein and nucleobase complexes. NIMS is freely available online at http://iris.physics.iisc.ernet.in/nims and it is hoped that it will prove to be an invaluable asset.
Collapse
|
27
|
Wei ZS, Han K, Yang JY, Shen HB, Yu DJ. Protein–protein interaction sites prediction by ensembling SVM and sample-weighted random forests. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.02.022] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
|
28
|
Heffernan R, Dehzangi A, Lyons J, Paliwal K, Sharma A, Wang J, Sattar A, Zhou Y, Yang Y. Highly accurate sequence-based prediction of half-sphere exposures of amino acid residues in proteins. Bioinformatics 2015; 32:843-9. [DOI: 10.1093/bioinformatics/btv665] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2015] [Accepted: 11/07/2015] [Indexed: 11/14/2022] Open
Abstract
Abstract
Motivation: Solvent exposure of amino acid residues of proteins plays an important role in understanding and predicting protein structure, function and interactions. Solvent exposure can be characterized by several measures including solvent accessible surface area (ASA), residue depth (RD) and contact numbers (CN). More recently, an orientation-dependent contact number called half-sphere exposure (HSE) was introduced by separating the contacts within upper and down half spheres defined according to the Cα-Cβ (HSEβ) vector or neighboring Cα-Cα vectors (HSEα). HSEα calculated from protein structures was found to better describe the solvent exposure over ASA, CN and RD in many applications. Thus, a sequence-based prediction is desirable, as most proteins do not have experimentally determined structures. To our best knowledge, there is no method to predict HSEα and only one method to predict HSEβ.
Results: This study developed a novel method for predicting both HSEα and HSEβ (SPIDER-HSE) that achieved a consistent performance for 10-fold cross validation and two independent tests. The correlation coefficients between predicted and measured HSEβ (0.73 for upper sphere, 0.69 for down sphere and 0.76 for contact numbers) for the independent test set of 1199 proteins are significantly higher than existing methods. Moreover, predicted HSEα has a higher correlation coefficient (0.46) to the stability change by residue mutants than predicted HSEβ (0.37) and ASA (0.43). The results, together with its easy Cα-atom-based calculation, highlight the potential usefulness of predicted HSEα for protein structure prediction and refinement as well as function prediction.
Availability and implementation: The method is available at http://sparks-lab.org.
Contact: yuedong.yang@griffith.edu.au or yaoqi.zhou@griffith.edu.au
Supplementary information: Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Rhys Heffernan
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, Australia,
| | - Abdollah Dehzangi
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, Australia,
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia,
- Medical Research Center (MRC), Department of Psychiatry, University of Iowa, Iowa City, USA,
| | - James Lyons
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, Australia,
| | - Kuldip Paliwal
- Signal Processing Laboratory, School of Engineering, Griffith University, Brisbane, Australia,
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia,
- School of Engineering and Physics, University of the South Pacific, Private Mail Bag, Laucala Campus, Suva, Fiji,
| | - Jihua Wang
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, Shandong 253023, China,
| | - Abdul Sattar
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia,
- National ICT Australia (NICTA), Brisbane, Australia and
| | - Yaoqi Zhou
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, Shandong 253023, China,
- Institute for Glycomics and School of Information and Communication Technique, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
| | - Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technique, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
| |
Collapse
|
29
|
Aubailly S, Piazza F. Cutoff lensing: predicting catalytic sites in enzymes. Sci Rep 2015; 5:14874. [PMID: 26445900 PMCID: PMC4597221 DOI: 10.1038/srep14874] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2015] [Accepted: 09/10/2015] [Indexed: 01/12/2023] Open
Abstract
Predicting function-related amino acids in proteins with unknown function or unknown allosteric binding sites in drug-targeted proteins is a task of paramount importance in molecular biomedicine. In this paper we introduce a simple, light and computationally inexpensive structure-based method to identify catalytic sites in enzymes. Our method, termed cutoff lensing, is a general procedure consisting in letting the cutoff used to build an elastic network model increase to large values. A validation of our method against a large database of annotated enzymes shows that optimal values of the cutoff exist such that three different structure-based indicators allow one to recover a maximum of the known catalytic sites. Interestingly, we find that the larger the structures the greater the predictive power afforded by our method. Possible ways to combine the three indicators into a single figure of merit and into a specific sequential analysis are suggested and discussed with reference to the classic case of HIV-protease. Our method could be used as a complement to other sequence- and/or structure-based methods to narrow the results of large-scale screenings.
Collapse
Affiliation(s)
- Simon Aubailly
- Université d'Orléans, Centre de Biophysique Moléculaire, CNRS-UPR4301, Rue C. Sadron, 45071, Orléans, France
| | - Francesco Piazza
- Université d'Orléans, Centre de Biophysique Moléculaire, CNRS-UPR4301, Rue C. Sadron, 45071, Orléans, France
| |
Collapse
|
30
|
Wei ZS, Yang JY, Shen HB, Yu DJ. A Cascade Random Forests Algorithm for Predicting Protein-Protein Interaction Sites. IEEE Trans Nanobioscience 2015; 14:746-60. [DOI: 10.1109/tnb.2015.2475359] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
|
31
|
PINGU: PredIction of eNzyme catalytic residues usinG seqUence information. PLoS One 2015; 10:e0135122. [PMID: 26261982 PMCID: PMC4532418 DOI: 10.1371/journal.pone.0135122] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2015] [Accepted: 07/17/2015] [Indexed: 11/19/2022] Open
Abstract
Identification of catalytic residues can help unveil interesting attributes of enzyme function for various therapeutic and industrial applications. Based on their biochemical roles, the number of catalytic residues and sequence lengths of enzymes vary. This article describes a prediction approach (PINGU) for such a scenario. It uses models trained using physicochemical properties and evolutionary information of 650 non-redundant enzymes (2136 catalytic residues) in a support vector machines architecture. Independent testing on 200 non-redundant enzymes (683 catalytic residues) in predefined prediction settings, i.e., with non-catalytic per catalytic residue ranging from 1 to 30, suggested that the prediction approach was highly sensitive and specific, i.e., 80% or above, over the incremental challenges. To learn more about the discriminatory power of PINGU in real scenarios, where the prediction challenge is variable and susceptible to high false positives, the best model from independent testing was used on 60 diverse enzymes. Results suggested that PINGU was able to identify most catalytic residues and non-catalytic residues properly with 80% or above accuracy, sensitivity and specificity. The effect of false positives on precision was addressed in this study by application of predicted ligand-binding residue information as a post-processing filter. An overall improvement of 20% in F-measure and 0.138 in Correlation Coefficient with 16% enhanced precision could be achieved. On account of its encouraging performance, PINGU is hoped to have eventual applications in boosting enzyme engineering and novel drug discovery.
Collapse
|
32
|
Fang C, Noguchi T, Yamana H. Analysis of evolutionary conservation patterns and their influence on identifying protein functional sites. J Bioinform Comput Biol 2015; 12:1440003. [PMID: 25362840 DOI: 10.1142/s0219720014400034] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Evolutionary conservation information included in position-specific scoring matrix (PSSM) has been widely adopted by sequence-based methods for identifying protein functional sites, because all functional sites, whether in ordered or disordered proteins, are found to be conserved at some extent. However, different functional sites have different conservation patterns, some of them are linear contextual, some of them are mingled with highly variable residues, and some others seem to be conserved independently. Every value in PSSMs is calculated independently of each other, without carrying the contextual information of residues in the sequence. Therefore, adopting the direct output of PSSM for prediction fails to consider the relationship between conservation patterns of residues and the distribution of conservation scores in PSSMs. In order to demonstrate the importance of combining PSSMs with the specific conservation patterns of functional sites for prediction, three different PSSM-based methods for identifying three kinds of functional sites have been analyzed. Results suggest that, different PSSM-based methods differ in their capability to identify different patterns of functional sites, and better combining PSSMs with the specific conservation patterns of residues would largely facilitate the prediction.
Collapse
Affiliation(s)
- Chun Fang
- Department of Computer Science and Engineering of Shandong, University of Technology, Shandong 255049, P. R. China
| | | | | |
Collapse
|
33
|
Xiao X, Hui MJ, Liu Z, Qiu WR. iCataly-PseAAC: Identification of Enzymes Catalytic Sites Using Sequence Evolution Information with Grey Model GM (2,1). J Membr Biol 2015; 248:1033-41. [DOI: 10.1007/s00232-015-9815-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2015] [Accepted: 06/06/2015] [Indexed: 11/25/2022]
|
34
|
EXIA2: web server of accurate and rapid protein catalytic residue prediction. BIOMED RESEARCH INTERNATIONAL 2014; 2014:807839. [PMID: 25295274 PMCID: PMC4177735 DOI: 10.1155/2014/807839] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/21/2014] [Revised: 05/27/2014] [Accepted: 06/11/2014] [Indexed: 11/18/2022]
Abstract
We propose a method (EXIA2) of catalytic residue prediction based on protein structure without needing homology information. The method is based on the special side chain orientation of catalytic residues. We found that the side chain of catalytic residues usually points to the center of the catalytic site. The special orientation is usually observed in catalytic residues but not in noncatalytic residues, which usually have random side chain orientation. The method is shown to be the most accurate catalytic residue prediction method currently when combined with PSI-Blast sequence conservation. It performs better than other competing methods on several benchmark datasets that include over 1,200 enzyme structures. The areas under the ROC curve (AUC) on these benchmark datasets are in the range from 0.934 to 0.968.
Collapse
|
35
|
Yang F, Xu YY, Shen HB. Many local pattern texture features: which is better for image-based multilabel human protein subcellular localization classification? ScientificWorldJournal 2014; 2014:429049. [PMID: 25050396 PMCID: PMC4094881 DOI: 10.1155/2014/429049] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2014] [Accepted: 05/22/2014] [Indexed: 01/14/2023] Open
Abstract
Human protein subcellular location prediction can provide critical knowledge for understanding a protein's function. Since significant progress has been made on digital microscopy, automated image-based protein subcellular location classification is urgently needed. In this paper, we aim to investigate more representative image features that can be effectively used for dealing with the multilabel subcellular image samples. We prepared a large multilabel immunohistochemistry (IHC) image benchmark from the Human Protein Atlas database and tested the performance of different local texture features, including completed local binary pattern, local tetra pattern, and the standard local binary pattern feature. According to our experimental results from binary relevance multilabel machine learning models, the completed local binary pattern, and local tetra pattern are more discriminative for describing IHC images when compared to the traditional local binary pattern descriptor. The combination of these two novel local pattern features and the conventional global texture features is also studied. The enhanced performance of final binary relevance classification model trained on the combined feature space demonstrates that different features are complementary to each other and thus capable of improving the accuracy of classification.
Collapse
Affiliation(s)
- Fan Yang
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
- Key Laboratory of Optic-Electronic and Communication, Jiangxi Science & Technology Normal University, Nanchang 330013, China
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Ying-Ying Xu
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| |
Collapse
|
36
|
Yang F, Xu YY, Wang ST, Shen HB. Image-based classification of protein subcellular location patterns in human reproductive tissue by ensemble learning global and local features. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2013.10.034] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
37
|
Dou Y, Yao B, Zhang C. PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine. Amino Acids 2014; 46:1459-69. [DOI: 10.1007/s00726-014-1711-5] [Citation(s) in RCA: 105] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2013] [Accepted: 02/21/2014] [Indexed: 02/01/2023]
|
38
|
Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes. PLoS One 2014; 9:e86703. [PMID: 24475169 PMCID: PMC3901691 DOI: 10.1371/journal.pone.0086703] [Citation(s) in RCA: 115] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2013] [Accepted: 12/10/2013] [Indexed: 11/22/2022] Open
Abstract
Developing an efficient method for determination of the DNA-binding proteins, due to their vital roles in gene regulation, is becoming highly desired since it would be invaluable to advance our understanding of protein functions. In this study, we proposed a new method for the prediction of the DNA-binding proteins, by performing the feature rank using random forest and the wrapper-based feature selection using forward best-first search strategy. The features comprise information from primary sequence, predicted secondary structure, predicted relative solvent accessibility, and position specific scoring matrix. The proposed method, called DBPPred, used Gaussian naïve Bayes as the underlying classifier since it outperformed five other classifiers, including decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. As a result, the proposed DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 according to the five-fold cross validation with ten runs on the training benchmark dataset PDB594. Subsequently, blind tests on the independent dataset PDB186 by the proposed model trained on the entire PDB594 dataset and by other five existing methods (including iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader) were performed, resulting in that the proposed DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. The independent tests performed by the proposed DBPPred on completely a large non-DNA binding protein dataset and two RNA binding protein datasets also showed improved or comparable quality when compared with the relevant prediction methods. Moreover, we observed that majority of the selected features by the proposed method are statistically significantly different between the mean feature values of the DNA-binding and the non DNA-binding proteins. All of the experimental results indicate that the proposed DBPPred can be an alternative perspective predictor for large-scale determination of DNA-binding proteins.
Collapse
|
39
|
Yang X, Guo Y, Luo J, Pu X, Li M. Effective identification of Gram-negative bacterial type III secreted effectors using position-specific residue conservation profiles. PLoS One 2013; 8:e84439. [PMID: 24391954 PMCID: PMC3877298 DOI: 10.1371/journal.pone.0084439] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2013] [Accepted: 11/07/2013] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Type III secretion systems (T3SSs) are central to the pathogenesis and specifically deliver their secreted substrates (type III secreted proteins, T3SPs) into host cells. Since T3SPs play a crucial role in pathogen-host interactions, identifying them is crucial to our understanding of the pathogenic mechanisms of T3SSs. This study reports a novel and effective method for identifying the distinctive residues which are conserved different from other SPs for T3SPs prediction. Moreover, the importance of several sequence features was evaluated and further, a promising prediction model was constructed. RESULTS Based on the conservation profiles constructed by a position-specific scoring matrix (PSSM), 52 distinctive residues were identified. To our knowledge, this is the first attempt to identify the distinct residues of T3SPs. Of the 52 distinct residues, the first 30 amino acid residues are all included, which is consistent with previous studies reporting that the secretion signal generally occurs within the first 30 residue positions. However, the remaining 22 positions span residues 30-100 were also proven by our method to contain important signal information for T3SP secretion because the translocation of many effectors also depends on the chaperone-binding residues that follow the secretion signal. For further feature optimisation and compression, permutation importance analysis was conducted to select 62 optimal sequence features. A prediction model across 16 species was developed using random forest to classify T3SPs and non-T3 SPs, with high receiver operating curve of 0.93 in the 10-fold cross validation and an accuracy of 94.29% for the test set. Moreover, when performing on a common independent dataset, the results demonstrate that our method outperforms all the others published to date. Finally, the novel, experimentally confirmed T3 effectors were used to further demonstrate the model's correct application. The model and all data used in this paper are freely available at http://cic.scu.edu.cn/bioinformatics/T3SPs.zip.
Collapse
Affiliation(s)
- Xiaojiao Yang
- College of Chemistry, Sichuan University, Chengdu, P.R.China
| | - Yanzhi Guo
- College of Chemistry, Sichuan University, Chengdu, P.R.China
| | - Jiesi Luo
- College of Chemistry, Sichuan University, Chengdu, P.R.China
| | - Xuemei Pu
- College of Chemistry, Sichuan University, Chengdu, P.R.China
| | - Menglong Li
- College of Chemistry, Sichuan University, Chengdu, P.R.China
| |
Collapse
|
40
|
An accurate method for prediction of protein-ligand binding site on protein surface using SVM and statistical depth function. BIOMED RESEARCH INTERNATIONAL 2013; 2013:409658. [PMID: 24195070 PMCID: PMC3806129 DOI: 10.1155/2013/409658] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/17/2013] [Revised: 08/15/2013] [Accepted: 08/29/2013] [Indexed: 11/17/2022]
Abstract
Since proteins carry out their functions through interactions with other molecules, accurately identifying the protein-ligand binding site plays an important role in protein functional annotation and rational drug discovery. In the past two decades, a lot of algorithms were present to predict the protein-ligand binding site. In this paper, we introduce statistical depth function to define negative samples and propose an SVM-based method which integrates sequence and structural information to predict binding site. The results show that the present method performs better than the existent ones. The accuracy, sensitivity, and specificity on training set are 77.55%, 56.15%, and 87.96%, respectively; on the independent test set, the accuracy, sensitivity, and specificity are 80.36%, 53.53%, and 92.38%, respectively.
Collapse
|
41
|
Protein structure based prediction of catalytic residues. BMC Bioinformatics 2013; 14:63. [PMID: 23433045 PMCID: PMC3598644 DOI: 10.1186/1471-2105-14-63] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2012] [Accepted: 02/17/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Worldwide structural genomics projects continue to release new protein structures at an unprecedented pace, so far nearly 6000, but only about 60% of these proteins have any sort of functional annotation. RESULTS We explored a range of features that can be used for the prediction of functional residues given a known three-dimensional structure. These features include various centrality measures of nodes in graphs of interacting residues: closeness, betweenness and page-rank centrality. We also analyzed the distance of functional amino acids to the general center of mass (GCM) of the structure, relative solvent accessibility (RSA), and the use of relative entropy as a measure of sequence conservation. From the selected features, neural networks were trained to identify catalytic residues. We found that using distance to the GCM together with amino acid type provide a good discriminant function, when combined independently with sequence conservation. Using an independent test set of 29 annotated protein structures, the method returned 411 of the initial 9262 residues as the most likely to be involved in function. The output 411 residues contain 70 of the annotated 111 catalytic residues. This represents an approximately 14-fold enrichment of catalytic residues on the entire input set (corresponding to a sensitivity of 63% and a precision of 17%), a performance competitive with that of other state-of-the-art methods. CONCLUSIONS We found that several of the graph based measures utilize the same underlying feature of protein structures, which can be simply and more effectively captured with the distance to GCM definition. This also has the added the advantage of simplicity and easy implementation. Meanwhile sequence conservation remains by far the most influential feature in identifying functional residues. We also found that due the rapid changes in size and composition of sequence databases, conservation calculations must be recalibrated for specific reference databases.
Collapse
|
42
|
Finding protein targets for small biologically relevant ligands across fold space using inverse ligand binding predictions. Structure 2013; 20:1815-22. [PMID: 23141694 DOI: 10.1016/j.str.2012.09.011] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2012] [Revised: 08/14/2012] [Accepted: 09/16/2012] [Indexed: 01/12/2023]
Abstract
Inverse ligand binding prediction utilizes a few protein-ligand (drug) complexes to predict other secondary therapeutic and off-targets of a given drug molecule on a proteomic scale. We adapt two binding site predictors, FINDSITE and SMAP, to perform the inverse predictions and evaluate them on over 30 representative ligands. Use of just one complex allows the identification of other protein targets; the availability of additional complexes improves the results. Both methods offer comparable quality when using three complexes with diverse proteins. SMAP is better when fewer complexes are available, while FINDSITE provides stronger predictions for smaller ligands. We propose a consensus that combines (and outperforms) the two complementary approaches implemented by FINDSITE and SMAP. Most importantly, we demonstrate that these methods successfully find distant targets that belong to structurally different folds compared to the proteins in the input complexes.
Collapse
|
43
|
Chen Z, Wang Y, Zhai YF, Song J, Zhang Z. ZincExplorer: an accurate hybrid method to improve the prediction of zinc-binding sites from protein sequences. MOLECULAR BIOSYSTEMS 2013; 9:2213-22. [DOI: 10.1039/c3mb70100j] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
|
44
|
An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins. PLoS One 2012; 7:e49716. [PMID: 23166753 PMCID: PMC3499040 DOI: 10.1371/journal.pone.0049716] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2012] [Accepted: 10/12/2012] [Indexed: 11/30/2022] Open
Abstract
Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank where the zinc ions usually have catalytic, regulatory or structural roles critical for the function of the protein. Accurate prediction of zinc-binding sites is not only useful for the inference of protein function but also important for the prediction of 3D structure. Here, we present a new integrative framework that combines multiple sequence and structural properties and graph-theoretic network features, followed by an efficient feature selection to improve prediction of zinc-binding sites. We investigate what information can be retrieved from the sequence, structure and network levels that is relevant to zinc-binding site prediction. We perform a two-step feature selection using random forest to remove redundant features and quantify the relative importance of the retrieved features. Benchmarking on a high-quality structural dataset containing 1,103 protein chains and 484 zinc-binding residues, our method achieved >80% recall at a precision of 75% for the zinc-binding residues Cys, His, Glu and Asp on 5-fold cross-validation tests, which is a 10%-28% higher recall at the 75% equal precision compared to SitePredict and zincfinder at residue level using the same dataset. The independent test also indicates that our method has achieved recall of 0.790 and 0.759 at residue and protein levels, respectively, which is a performance better than the other two methods. Moreover, AUC (the Area Under the Curve) and AURPC (the Area Under the Recall-Precision Curve) by our method are also respectively better than those of the other two methods. Our method can not only be applied to large-scale identification of zinc-binding sites when structural information of the target is available, but also give valuable insights into important features arising from different levels that collectively characterize the zinc-binding sites. The scripts and datasets are available at http://protein.cau.edu.cn/zincidentifier/.
Collapse
|
45
|
Accurate prediction of protein catalytic residues by side chain orientation and residue contact density. PLoS One 2012; 7:e47951. [PMID: 23110141 PMCID: PMC3480458 DOI: 10.1371/journal.pone.0047951] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2012] [Accepted: 09/18/2012] [Indexed: 11/19/2022] Open
Abstract
Prediction of protein catalytic residues provides useful information for the studies of protein functions. Most of the existing methods combine both structure and sequence information but heavily rely on sequence conservation from multiple sequence alignments. The contribution of structure information is usually less than that of sequence conservation in existing methods. We found a novel structure feature, residue side chain orientation, which is the first structure-based feature that achieves prediction results comparable to that of evolutionary sequence conservation. We developed a structure-based method, Enzyme Catalytic residue SIde-chain Arrangement (EXIA), which is based on residue side chain orientations and backbone flexibility of protein structure. The prediction that uses EXIA outperforms existing structure-based features. The prediction quality of combing EXIA and sequence conservation exceeds that of the state-of-the-art prediction methods. EXIA is designed to predict catalytic residues from single protein structure without needing sequence or structure alignments. It provides invaluable information when there is no sufficient or reliable homology information for target protein. We found that catalytic residues have very special side chain orientation and designed the EXIA method based on the newly discovered feature. It was also found that EXIA performs well for a dataset of enzymes without any bounded ligand in their crystallographic structures.
Collapse
|
46
|
Han L, Zhang YJ, Song J, Liu MS, Zhang Z. Identification of catalytic residues using a novel feature that integrates the microenvironment and geometrical location properties of residues. PLoS One 2012; 7:e41370. [PMID: 22829945 PMCID: PMC3400608 DOI: 10.1371/journal.pone.0041370] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2012] [Accepted: 06/20/2012] [Indexed: 11/18/2022] Open
Abstract
Enzymes play a fundamental role in almost all biological processes and identification of catalytic residues is a crucial step for deciphering the biological functions and understanding the underlying catalytic mechanisms. In this work, we developed a novel structural feature called MEDscore to identify catalytic residues, which integrated the microenvironment (ME) and geometrical properties of amino acid residues. Firstly, we converted a residue's ME into a series of spatially neighboring residue pairs, whose likelihood of being located in a catalytic ME was deduced from a benchmark enzyme dataset. We then calculated an ME-based score, termed as MEscore, by summing up the likelihood of all residue pairs. Secondly, we defined a parameter called Dscore to measure the relative distance of a residue to the center of the protein, provided that catalytic residues are typically located in the center of the protein structure. Finally, we defined the MEDscore feature based on an effective nonlinear integration of MEscore and Dscore. When evaluated on a well-prepared benchmark dataset using five-fold cross-validation tests, MEDscore achieved a robust performance in identifying catalytic residues with an AUC1.0 of 0.889. At a ≤ 10% false positive rate control, MEDscore correctly identified approximately 70% of the catalytic residues. Remarkably, MEDscore achieved a competitive performance compared with the residue conservation score (e.g. CONscore), the most informative singular feature predominantly employed to identify catalytic residues. To the best of our knowledge, MEDscore is the first singular structural feature exhibiting such an advantage. More importantly, we found that MEDscore is complementary with CONscore and a significantly improved performance can be achieved by combining CONscore with MEDscore in a linear manner. As an implementation of this work, MEDscore has been made freely accessible at http://protein.cau.edu.cn/mepi/.
Collapse
Affiliation(s)
- Lei Han
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, People's Republic of China
| | - Yong-Jun Zhang
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, People's Republic of China
| | - Jiangning Song
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, People's Republic of China
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
| | - Ming S. Liu
- CSIRO - Mathematics, Informatics and Statistics, Clayton, Victoria, Australia
- * E-mail: (MSL); (ZZ)
| | - Ziding Zhang
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, People's Republic of China
- * E-mail: (MSL); (ZZ)
| |
Collapse
|
47
|
Gao J, Faraggi E, Zhou Y, Ruan J, Kurgan L. BEST: improved prediction of B-cell epitopes from antigen sequences. PLoS One 2012; 7:e40104. [PMID: 22761950 PMCID: PMC3384636 DOI: 10.1371/journal.pone.0040104] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2012] [Accepted: 05/31/2012] [Indexed: 12/16/2022] Open
Abstract
Accurate identification of immunogenic regions in a given antigen chain is a difficult and actively pursued problem. Although accurate predictors for T-cell epitopes are already in place, the prediction of the B-cell epitopes requires further research. We overview the available approaches for the prediction of B-cell epitopes and propose a novel and accurate sequence-based solution. Our BEST (B-cell Epitope prediction using Support vector machine Tool) method predicts epitopes from antigen sequences, in contrast to some method that predict only from short sequence fragments, using a new architecture based on averaging selected scores generated from sliding 20-mers by a Support Vector Machine (SVM). The SVM predictor utilizes a comprehensive and custom designed set of inputs generated by combining information derived from the chain, sequence conservation, similarity to known (training) epitopes, and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors including ABCPred, method by Chen et al. (2007), BCPred, COBEpro, BayesB, and CBTOPE, when considering the predictions from antigen chains and from the chain fragments. Our method obtains a cross-validated area under the receiver operating characteristic curve (AUC) for the fragment-based prediction at 0.81 and 0.85, depending on the dataset. The AUCs of BEST on the benchmark sets of full antigen chains equal 0.57 and 0.6, which is significantly and slightly better than the next best method we tested. We also present case studies to contrast the propensity profiles generated by BEST and several other methods.
Collapse
Affiliation(s)
- Jianzhao Gao
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People's Republic of China
- * E-mail: (JG); (LK)
| | - Eshel Faraggi
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America
| | - Yaoqi Zhou
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America
| | - Jishou Ruan
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People's Republic of China
- State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, People's Republic of China
| | - Lukasz Kurgan
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada
- * E-mail: (JG); (LK)
| |
Collapse
|
48
|
Dou Y, Wang J, Yang J, Zhang C. L1pred: a sequence-based prediction tool for catalytic residues in enzymes with the L1-logreg classifier. PLoS One 2012; 7:e35666. [PMID: 22558194 PMCID: PMC3338704 DOI: 10.1371/journal.pone.0035666] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2012] [Accepted: 03/19/2012] [Indexed: 12/01/2022] Open
Abstract
To understand enzyme functions, identifying the catalytic residues is a usual first step. Moreover, knowledge about catalytic residues is also useful for protein engineering and drug-design. However, to experimentally identify catalytic residues remains challenging for reasons of time and cost. Therefore, computational methods have been explored to predict catalytic residues. Here, we developed a new algorithm, L1pred, for catalytic residue prediction, by using the L1-logreg classifier to integrate eight sequence-based scoring functions. We tested L1pred and compared it against several existing sequence-based methods on carefully designed datasets Data604 and Data63. With ten-fold cross-validation, L1pred showed the area under precision-recall curve (AUPR) and the area under ROC curve (AUC) of 0.2198 and 0.9494 on the training dataset, Data604, respectively. In addition, on the independent test dataset, Data63, it showed the AUPR and AUC values of 0.2636 and 0.9375, respectively. Compared with other sequence-based methods, L1pred showed the best performance on both datasets. We also analyzed the importance of each attribute in the algorithm, and found that all the scores contributed more or less equally to the L1pred performance.
Collapse
Affiliation(s)
- Yongchao Dou
- School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, Nebraska, United States of America
| | - Jun Wang
- Scientific Computing Key Laboratory of Shanghai Universities, Shanghai, People’s Republic of China
- Department of Mathematics, Shanghai Normal University, Shanghai, People’s Republic of China
| | - Jialiang Yang
- MPI-Institute of Computational Biology, Chinese Academy of Sciences, Shanghai, People’s Republic of China
| | - Chi Zhang
- School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, Nebraska, United States of America
- * E-mail:
| |
Collapse
|
49
|
Zhao X, Li J, Huang Y, Ma Z, Yin M. Prediction of bioluminescent proteins using auto covariance transformation of evolutional profiles. Int J Mol Sci 2012; 13:3650-3660. [PMID: 22489173 PMCID: PMC3317733 DOI: 10.3390/ijms13033650] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2012] [Revised: 02/21/2012] [Accepted: 03/05/2012] [Indexed: 12/21/2022] Open
Abstract
Bioluminescent proteins are important for various cellular processes, such as gene expression analysis, drug discovery, bioluminescent imaging, toxicity determination, and DNA sequencing studies. Hence, the correct identification of bioluminescent proteins is of great importance both for helping genome annotation and providing a supplementary role to experimental research to obtain insight into bioluminescent proteins' functions. However, few computational methods are available for identifying bioluminescent proteins. Therefore, in this paper we develop a new method to predict bioluminescent proteins using a model based on position specific scoring matrix and auto covariance. Tested by 10-fold cross-validation and independent test, the accuracy of the proposed model reaches 85.17% for the training dataset and 90.71% for the testing dataset respectively. These results indicate that our predictor is a useful tool to predict bioluminescent proteins. This is the first study in which evolutionary information and local sequence environment information have been successfully integrated for predicting bioluminescent proteins. A web server (BLPre) that implements the proposed predictor is freely available.
Collapse
Affiliation(s)
- Xiaowei Zhao
- School of Computer Science and Information Technology, Northeast Normal University, Changchun 130117, China; E-Mails: (X.Z.); (J.L.)
- School of Life Sciences, Northeast Normal University, Changchun 130024, China
| | - Jiakui Li
- School of Computer Science and Information Technology, Northeast Normal University, Changchun 130117, China; E-Mails: (X.Z.); (J.L.)
| | - Yanxin Huang
- National Engineering Laboratory for Druggable Gene and Protein Screening, Northeast Normal University, Changchun 130024, China
| | - Zhiqiang Ma
- School of Life Sciences, Northeast Normal University, Changchun 130024, China
| | - Minghao Yin
- School of Computer Science and Information Technology, Northeast Normal University, Changchun 130117, China; E-Mails: (X.Z.); (J.L.)
| |
Collapse
|
50
|
Song J, Tan H, Wang M, Webb GI, Akutsu T. TANGLE: two-level support vector regression approach for protein backbone torsion angle prediction from primary sequences. PLoS One 2012; 7:e30361. [PMID: 22319565 PMCID: PMC3271071 DOI: 10.1371/journal.pone.0030361] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2011] [Accepted: 12/14/2011] [Indexed: 12/29/2022] Open
Abstract
Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the Cα-N bond (Phi) and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angle from sequence information can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered region as well as other global sequence features. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins, the mean absolute errors (MAEs) of the Phi and Psi angle prediction are 27.8° and 44.6°, respectively, which are 1% and 3% respectively lower than that using one of the state-of-the-art prediction tools ANGLOR. Moreover, the prediction of TANGLE is significantly better than a random predictor that was built on the amino acid-specific basis, with the p-value<1.46e-147 and 7.97e-150, respectively by the Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/.
Collapse
Affiliation(s)
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
- * E-mail: (JS); (GIW); (TA)
| | - Hao Tan
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
| | - Mingjun Wang
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
| | - Geoffrey I. Webb
- Faculty of Information Technology, Monash University, Melbourne, Victoria, Australia
- * E-mail: (JS); (GIW); (TA)
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
- * E-mail: (JS); (GIW); (TA)
| |
Collapse
|