1
Chivot L, Mathieux N, Cosson A, Bridier-Nahmias A, Favennec L, Gelly JC, Clain J, Coppée R. CONSTRUCT: an algorithmic tool for identifying functional or structurally important regions in protein tertiary structure. Bioinformatics 2025; 41:btaf166. [PMID: 40220324 PMCID: PMC12034385 DOI: 10.1093/bioinformatics/btaf166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2024] [Revised: 04/02/2025] [Accepted: 04/10/2025] [Indexed: 04/14/2025] Open
Abstract
MOTIVATION Evolutionary rates in protein-coding genes vary widely, reflecting functional and/or structural constraints. Essential or highly expressed proteins tend to evolve more slowly, and within a protein, different amino acid sites experience distinct selective pressures. Accurately modeling this variation is critical for identifying functional and/or structurally important amino acid sites. Standard methods assume independent substitution rates across sites, and the most conserved ones are widely distributed in protein tertiary structure. This is biologically unrealistic, as functional sites tend to cluster in 3D space. RESULTS Here, we developed CONSTRUCT, an improved strategy for detecting functional and structurally important regions in protein tertiary structure. Given a set of orthologous sequences, CONSTRUCT first estimates site-specific substitution rates using the Rate4site model. These rates are then weighted by the rates of neighboring amino acid sites within an optimally defined window size, determined by the strongest spatial correlation. To refine clustering detection, CONSTRUCT can analyze either Cα atoms or the center of mass of amino acid sites, accounting for side chain orientation. Extensive simulations and validation on 14 functionally characterized proteins of diverse sizes, interspecies conservation levels, and taxonomic origins demonstrated the robustness of CONSTRUCT. The results highlight CONSTRUCT as a powerful tool for guiding site-directed mutagenesis experiments aimed at elucidating protein function. AVAILABILITY AND IMPLEMENTATION The CONSTRUCT program and documentation are freely available at https://github.com/Rcoppee/CONSTRUCT.
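The neighbor-weighting step described in this abstract can be illustrated with a minimal Python sketch. It assumes a plain unweighted mean over all sites whose Cα atoms lie within a fixed radius. The function name `neighbor_mean_rates`, the toy coordinates, and the 5 Å radius are hypothetical and are not taken from the CONSTRUCT code, which, per the abstract, selects its window size from the strongest spatial correlation.

```python
import math

def neighbor_mean_rates(coords, rates, radius):
    """Average each site's rate with the rates of all sites within `radius` (in angstroms)."""
    smoothed = []
    for ci in coords:
        # The site itself is at distance 0, so it is always included.
        nearby = [rates[j] for j, cj in enumerate(coords)
                  if math.dist(ci, cj) <= radius]
        smoothed.append(sum(nearby) / len(nearby))
    return smoothed

# Hypothetical C-alpha coordinates: three sites in a row and one isolated site.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
rates = [0.2, 0.4, 0.6, 2.0]  # toy per-site substitution rates
smoothed = neighbor_mean_rates(coords, rates, radius=5.0)
print(smoothed)
```

Swapping the Cα coordinates for side-chain centers of mass would leave the function unchanged, since only the input coordinates differ.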
Affiliation(s)
- Lucas Chivot
  - Université de Rouen Normandie, Laboratoire de Parasitologie-Mycologie, ESCAPE, F-76000 Rouen, France
- Noé Mathieux
  - Université de Rouen Normandie, Laboratoire de Parasitologie-Mycologie, ESCAPE, F-76000 Rouen, France
- Anna Cosson
  - Université de Rouen Normandie, Laboratoire de Parasitologie-Mycologie, ESCAPE, F-76000 Rouen, France
- Loïc Favennec
  - Université de Rouen Normandie, Laboratoire de Parasitologie-Mycologie, ESCAPE, F-76000 Rouen, France
- Jean-Christophe Gelly
  - Université Paris Cité et Université des Antilles et Université de la Réunion, Inserm, BIGR, F-75015 Paris, France
- Jérôme Clain
  - Université Paris Cité, IRD, Inserm, MERIT, F-75006 Paris, France
- Romain Coppée
  - Université de Rouen Normandie, Laboratoire de Parasitologie-Mycologie, ESCAPE, F-76000 Rouen, France
2
Snoeck S, Lee HK, Schmid MW, Bender KW, Neeracher MJ, Fernández-Fernández AD, Santiago J, Zipfel C. Leveraging coevolutionary insights and AI-based structural modeling to unravel receptor-peptide ligand-binding mechanisms. Proc Natl Acad Sci U S A 2024; 121:e2400862121. [PMID: 39106311 PMCID: PMC11331138 DOI: 10.1073/pnas.2400862121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Accepted: 07/05/2024] [Indexed: 08/09/2024] Open
Abstract
Secreted signaling peptides are central regulators of growth, development, and stress responses, but specific steps in the evolution of these peptides and their receptors are not well understood. Also, the molecular mechanisms of peptide-receptor binding are only known for a few examples, primarily owing to the limited availability of protein structural determination capabilities to few laboratories worldwide. Plants have evolved a multitude of secreted signaling peptides and corresponding transmembrane receptors. Stress-responsive SERINE RICH ENDOGENOUS PEPTIDES (SCOOPs) were recently identified. Bioactive SCOOPs are proteolytically processed by subtilases and are perceived by the leucine-rich repeat receptor kinase MALE DISCOVERER 1-INTERACTING RECEPTOR-LIKE KINASE 2 (MIK2) in the model plant Arabidopsis thaliana. How SCOOPs and MIK2 have (co)evolved, and how SCOOPs bind to MIK2 are unknown. Using in silico analysis of 350 plant genomes and subsequent functional testing, we revealed the conservation of MIK2 as SCOOP receptor within the plant order Brassicales. We then leveraged AI-based structural modeling and comparative genomics to identify two conserved putative SCOOP-MIK2 binding pockets across Brassicales MIK2 homologues predicted to interact with the "SxS" motif of otherwise sequence-divergent SCOOPs. Mutagenesis of both predicted binding pockets compromised SCOOP binding to MIK2, SCOOP-induced complex formation between MIK2 and its coreceptor BRASSINOSTEROID INSENSITIVE 1-ASSOCIATED KINASE 1, and SCOOP-induced reactive oxygen species production, thus, confirming our in silico predictions. Collectively, in addition to revealing the elusive SCOOP-MIK2 binding mechanism, our analytic pipeline combining phylogenomics, AI-based structural predictions, and experimental biochemical and physiological validation provides a blueprint for the elucidation of peptide ligand-receptor perception mechanisms.
Affiliation(s)
- Simon Snoeck
  - Department of Plant and Microbial Biology (IPMB), Zurich-Basel Plant Science Center, University of Zurich, Zurich 8008, Switzerland
- Hyun Kyung Lee
  - The Plant Signaling Mechanisms Laboratory, Department of Plant Molecular Biology, University of Lausanne, Lausanne 1015, Switzerland
- Kyle W. Bender
  - Department of Plant and Microbial Biology (IPMB), Zurich-Basel Plant Science Center, University of Zurich, Zurich 8008, Switzerland
- Matthias J. Neeracher
  - Department of Plant and Microbial Biology (IPMB), Zurich-Basel Plant Science Center, University of Zurich, Zurich 8008, Switzerland
- Alvaro D. Fernández-Fernández
  - Department of Plant and Microbial Biology (IPMB), Zurich-Basel Plant Science Center, University of Zurich, Zurich 8008, Switzerland
- Julia Santiago
  - The Plant Signaling Mechanisms Laboratory, Department of Plant Molecular Biology, University of Lausanne, Lausanne 1015, Switzerland
- Cyril Zipfel
  - Department of Plant and Microbial Biology (IPMB), Zurich-Basel Plant Science Center, University of Zurich, Zurich 8008, Switzerland
  - The Sainsbury Laboratory, University of East Anglia, Norwich Research Park, Norwich NR4 7UH, United Kingdom
3
Wang L, Guo S, Zeng B, Wang S, Chen Y, Cheng S, Liu B, Wang C, Wang Y, Meng Q. Draft Genome Assembly and Annotation for Cutaneotrichosporon dermatis NICC30027, an Oleaginous Yeast Capable of Simultaneous Glucose and Xylose Assimilation. Mycobiology 2022; 50:69-81. [PMID: 35291590 PMCID: PMC8890563 DOI: 10.1080/12298093.2022.2038844] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2021] [Revised: 01/10/2022] [Accepted: 02/02/2022] [Indexed: 06/14/2023]
Abstract
The identification of oleaginous yeast species capable of simultaneously utilizing xylose and glucose as substrates to generate value-added biological products is an area of key economic interest. We have previously demonstrated that the Cutaneotrichosporon dermatis NICC30027 yeast strain is capable of simultaneously assimilating both xylose and glucose, resulting in considerable lipid accumulation. However, as no high-quality genome sequencing data or associated annotations for this strain are available at present, it remains challenging to study the metabolic mechanisms underlying this phenotype. Herein, we report a 39,305,439 bp draft genome assembly for C. dermatis NICC30027 comprised of 37 scaffolds, with 60.15% GC content. Within this genome, we identified 524 tRNAs, 142 sRNAs, 53 miRNAs, 28 snRNAs, and eight rRNA clusters. Moreover, repeat sequences totaling 1,032,129 bp in length were identified (2.63% of the genome), as were 14,238 unigenes that were 1,789.35 bp in length on average (64.82% of the genome). The NCBI non-redundant protein sequences (NR) database was employed to successfully annotate 11,795 of these unigenes, while 3,621 and 11,902 were annotated with the Swiss-Prot and TrEMBL databases, respectively. Unigenes were additionally subjected to pathway enrichment analyses using the Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Cluster of Orthologous Groups of proteins (COG), Clusters of orthologous groups for eukaryotic complete genomes (KOG), and Non-supervised Orthologous Groups (eggNOG) databases. Together, these results provide a foundation for future studies aimed at clarifying the mechanistic basis for the ability of C. dermatis NICC30027 to simultaneously utilize glucose and xylose to synthesize lipids.
Affiliation(s)
- Laiyou Wang
  - School of Biological and Chemical Engineering, Nanyang Institute of Technology, Nanyang, China
  - Henan Key Laboratory of Industrial Microbial Resources and Fermentation Technology, Nanyang Institute of Technology, Nanyang, China
- Shuxian Guo
  - School of Biological and Chemical Engineering, Nanyang Institute of Technology, Nanyang, China
  - Henan Key Laboratory of Industrial Microbial Resources and Fermentation Technology, Nanyang Institute of Technology, Nanyang, China
- Bo Zeng
  - School of Biological and Chemical Engineering, Nanyang Institute of Technology, Nanyang, China
  - Henan Key Laboratory of Industrial Microbial Resources and Fermentation Technology, Nanyang Institute of Technology, Nanyang, China
- Shanshan Wang
  - School of Biological and Chemical Engineering, Nanyang Institute of Technology, Nanyang, China
  - Henan Key Laboratory of Industrial Microbial Resources and Fermentation Technology, Nanyang Institute of Technology, Nanyang, China
- Yan Chen
  - School of Biological and Chemical Engineering, Nanyang Institute of Technology, Nanyang, China
  - Henan Key Laboratory of Industrial Microbial Resources and Fermentation Technology, Nanyang Institute of Technology, Nanyang, China
- Shuang Cheng
  - School of Biological and Chemical Engineering, Nanyang Institute of Technology, Nanyang, China
  - Henan Key Laboratory of Industrial Microbial Resources and Fermentation Technology, Nanyang Institute of Technology, Nanyang, China
- Bingbing Liu
  - School of Biological and Chemical Engineering, Nanyang Institute of Technology, Nanyang, China
  - Henan Key Laboratory of Industrial Microbial Resources and Fermentation Technology, Nanyang Institute of Technology, Nanyang, China
- Chunyan Wang
  - School of Biological and Chemical Engineering, Nanyang Institute of Technology, Nanyang, China
  - Henan Key Laboratory of Industrial Microbial Resources and Fermentation Technology, Nanyang Institute of Technology, Nanyang, China
- Yu Wang
  - College of Biological Science and Engineering, Jiangxi Agricultural University, Nanchang, China
- Qingshan Meng
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
4
Das S, Scholes HM, Sen N, Orengo C. CATH functional families predict functional sites in proteins. Bioinformatics 2021; 37:1099-1106. [PMID: 33135053 PMCID: PMC8150129 DOI: 10.1093/bioinformatics/btaa937] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 05/05/2020] [Revised: 09/30/2020] [Accepted: 10/27/2020] [Indexed: 01/12/2023] Open
Abstract
MOTIVATION Identification of functional sites in proteins is essential for functional characterization, variant interpretation and drug design. Several methods are available for predicting either a generic functional site, or specific types of functional site. Here, we present FunSite, a machine learning predictor that identifies catalytic, ligand-binding and protein-protein interaction functional sites using features derived from protein sequence and structure, and evolutionary data from CATH functional families (FunFams). RESULTS FunSite's prediction performance was rigorously benchmarked using cross-validation and a holdout dataset. FunSite outperformed other publicly available functional site prediction methods. We show that conserved residues in FunFams are enriched in functional sites. We found FunSite's performance depends greatly on the quality of functional site annotations and the information content of FunFams in the training data. Finally, we analyze which structural and evolutionary features are most predictive for functional sites. AVAILABILITY AND IMPLEMENTATION https://github.com/UCL/cath-funsite-predictor. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sayoni Das
  - PrecisionLife Ltd., Long Hanborough, OX29 8LJ Oxford, UK
- Harry M Scholes
  - Institute of Structural and Molecular Biology, University College London, WC1E 6BT, London, UK
- Neeladri Sen
  - Institute of Structural and Molecular Biology, University College London, WC1E 6BT, London, UK
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, WC1E 6BT, London, UK
5
Aubailly S, Piazza F. Cutoff lensing: predicting catalytic sites in enzymes. Sci Rep 2015; 5:14874. [PMID: 26445900 PMCID: PMC4597221 DOI: 10.1038/srep14874] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 06/28/2015] [Accepted: 09/10/2015] [Indexed: 01/12/2023] Open
Abstract
Predicting function-related amino acids in proteins with unknown function or unknown allosteric binding sites in drug-targeted proteins is a task of paramount importance in molecular biomedicine. In this paper we introduce a simple, light and computationally inexpensive structure-based method to identify catalytic sites in enzymes. Our method, termed cutoff lensing, is a general procedure consisting in letting the cutoff used to build an elastic network model increase to large values. A validation of our method against a large database of annotated enzymes shows that optimal values of the cutoff exist such that three different structure-based indicators allow one to recover a maximum of the known catalytic sites. Interestingly, we find that the larger the structures the greater the predictive power afforded by our method. Possible ways to combine the three indicators into a single figure of merit and into a specific sequential analysis are suggested and discussed with reference to the classic case of HIV-protease. Our method could be used as a complement to other sequence- and/or structure-based methods to narrow the results of large-scale screenings.
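The role of the cutoff in an elastic network model can be made concrete with a short Python sketch. It assumes a Gaussian-network-style connectivity (Kirchhoff) matrix in which two residues are linked whenever their distance is at most the cutoff. The function name `kirchhoff` and the toy coordinates are hypothetical, and the three structure-based indicators used by cutoff lensing are not specified in the abstract, so they are not reproduced here.

```python
import math

def kirchhoff(coords, cutoff):
    """Connectivity matrix: -1 for residue pairs within `cutoff`, contact count on the diagonal."""
    n = len(coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma

# Three hypothetical residues spaced 4 angstroms apart along a line.
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
for cutoff in (5.0, 10.0):
    gamma = kirchhoff(coords, cutoff)
    degrees = [gamma[i][i] for i in range(len(coords))]
    print(cutoff, degrees)
```

As the cutoff grows, more pairs become connected and the network approaches a fully connected graph, which is the regime the abstract describes as letting the cutoff "increase to large values".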
Affiliation(s)
- Simon Aubailly
  - Université d'Orléans, Centre de Biophysique Moléculaire, CNRS-UPR4301, Rue C. Sadron, 45071, Orléans, France
- Francesco Piazza
  - Université d'Orléans, Centre de Biophysique Moléculaire, CNRS-UPR4301, Rue C. Sadron, 45071, Orléans, France
6
Abstract
Unravelling the genotype–phenotype relationship in humans remains a challenging task in genomics studies. Recent advances in sequencing technologies mean there are now thousands of sequenced human genomes, revealing millions of single nucleotide variants (SNVs). For non-synonymous SNVs present in proteins the difficulties of the problem lie in first identifying those nsSNVs that result in a functional change in the protein among the many non-functional variants and in turn linking this functional change to phenotype. Here we present VarMod (Variant Modeller) a method that utilises both protein sequence and structural features to predict nsSNVs that alter protein function. VarMod develops recent observations that functional nsSNVs are enriched at protein–protein interfaces and protein–ligand binding sites and uses these characteristics to make predictions. In benchmarking on a set of nearly 3000 nsSNVs VarMod performance is comparable to an existing state of the art method. The VarMod web server provides extensive resources to investigate the sequence and structural features associated with the predictions including visualisation of protein models and complexes via an interactive JSmol molecular viewer. VarMod is available for use at http://www.wasslab.org/varmod.
Affiliation(s)
- Morena Pappalardo
  - Centre for Molecular Processing, School of Biosciences, University of Kent, CT2 7NH, UK
- Mark N Wass
  - Centre for Molecular Processing, School of Biosciences, University of Kent, CT2 7NH, UK
7
Phylogenetic Gaussian process model for the inference of functionally important regions in protein tertiary structures. PLoS Comput Biol 2014; 10:e1003429. [PMID: 24453956 PMCID: PMC3894161 DOI: 10.1371/journal.pcbi.1003429] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 07/31/2013] [Accepted: 11/22/2013] [Indexed: 11/30/2022] Open
Abstract
A critical question in biology is the identification of functionally important amino acid sites in proteins. Because functionally important sites are under stronger purifying selection, site-specific substitution rates tend to be lower than usual at these sites. A large number of phylogenetic models have been developed to estimate site-specific substitution rates in proteins and the extraordinarily low substitution rates have been used as evidence of function. Most of the existing tools, e.g. Rate4Site, assume that site-specific substitution rates are independent across sites. However, site-specific substitution rates may be strongly correlated in the protein tertiary structure, since functionally important sites tend to be clustered together to form functional patches. We have developed a new model, GP4Rate, which incorporates the Gaussian process model with the standard phylogenetic model to identify slowly evolved regions in protein tertiary structures. GP4Rate uses the Gaussian process to define a nonparametric prior distribution of site-specific substitution rates, which naturally captures the spatial correlation of substitution rates. Simulations suggest that GP4Rate can potentially estimate site-specific substitution rates with a much higher accuracy than Rate4Site and tends to report slowly evolved regions rather than individual sites. In addition, GP4Rate can estimate the strength of the spatial correlation of substitution rates from the data. By applying GP4Rate to a set of mammalian B7-1 genes, we found a highly conserved region which coincides with experimental evidence. GP4Rate may be a useful tool for the in silico prediction of functionally important regions in the proteins with known structures.
To understand how a protein functions, a critical step is to know which regions in its protein tertiary structure may be functionally important. Functionally important protein regions are typically more conserved than other regions because mutations in these regions are more likely to be deleterious. A number of phylogenetic models have been developed to identify conserved sites or regions in proteins by comparing protein sequences from multiple species. However, most of these methods treat amino acid sites independently and do not consider the spatial clustering of conserved sites in the protein tertiary structure. Therefore, their power of identifying functional protein regions is limited. We develop a new statistical model, GP4Rate, which combines the information from the protein sequences and the protein tertiary structure to infer conserved regions. We demonstrate that GP4Rate outperforms Rate4Site, the most widely used phylogenetic software for inferring functional amino acid sites, via simulations with a case study of B7-1 genes. GP4Rate is a potentially useful tool for guiding mutagenesis experiments or providing insights on the relationship between protein structures and functions.
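How a Gaussian process prior can encode spatial correlation between site-specific rates is sketched below in Python. The abstract does not state which covariance function GP4Rate uses, so the squared-exponential kernel, the function name `rate_covariance`, and the `variance` and `length_scale` values are all assumptions chosen for illustration only.

```python
import math

def rate_covariance(coords, variance=1.0, length_scale=8.0):
    """Prior covariance between site rates, decaying with 3D distance.

    Assumed kernel (not from the paper): variance * exp(-d^2 / (2 * length_scale^2)).
    """
    return [[variance * math.exp(-math.dist(a, b) ** 2 / (2 * length_scale ** 2))
             for b in coords] for a in coords]

# Hypothetical residue positions: two close neighbors and one distant site.
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (40.0, 0.0, 0.0)]
cov = rate_covariance(coords)
for row in cov:
    print([round(c, 4) for c in row])
```

Under such a prior, nearby sites are pushed toward similar rates while distant sites stay nearly independent, which matches the abstract's description of reporting slowly evolved regions rather than individual sites.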
8
Nemoto W, Saito A, Oikawa H. Recent advances in functional region prediction by using structural and evolutionary information - Remaining problems and future extensions. Comput Struct Biotechnol J 2013; 8:e201308007. [PMID: 24688747 PMCID: PMC3962155 DOI: 10.5936/csbj.201308007] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 08/27/2013] [Revised: 11/12/2013] [Accepted: 11/13/2013] [Indexed: 11/22/2022] Open
Abstract
Structural genomics projects have solved many new structures with unknown functions. One strategy to investigate the function of a structure is to computationally find the functionally important residues or regions on it. Therefore, the development of functional region prediction methods has become an important research subject. An effective approach is to use a method employing structural and evolutionary information, such as the evolutionary trace (ET) method. ET ranks the residues of a protein structure by calculating the scores for relative evolutionary importance, and locates functionally important sites by identifying spatial clusters of highly ranked residues. After ET was developed, numerous ET-like methods were subsequently reported, and many of them are in practical use, although they require certain conditions. In this mini review, we first introduce the remaining problems and the recent improvements in the methods using structural and evolutionary information. We then summarize the recent developments of the methods. Finally, we conclude by describing possible extensions of the evolution- and structure-based methods.
Affiliation(s)
- Wataru Nemoto
  - Division of Life Science and Engineering, School of Science and Engineering, Tokyo Denki University (TDU), Ishizaka, Hatoyama-cho, Hiki-gun, Saitama, 350-0394, Japan
- Akira Saito
  - Division of Life Science and Engineering, School of Science and Engineering, Tokyo Denki University (TDU), Ishizaka, Hatoyama-cho, Hiki-gun, Saitama, 350-0394, Japan
- Hayato Oikawa
  - Division of Life Science and Engineering, School of Science and Engineering, Tokyo Denki University (TDU), Ishizaka, Hatoyama-cho, Hiki-gun, Saitama, 350-0394, Japan
9
Manoharan M, Sankar K, Offmann B, Ramanathan S. Association of Putative Members to Family of Mosquito Odorant Binding Proteins: Scoring Scheme Using Fuzzy Functional Templates and Cys Residue Positions. Bioinform Biol Insights 2013; 7:231-51. [PMID: 23908587 PMCID: PMC3728099 DOI: 10.4137/bbi.s11096] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/22/2022] Open
Abstract
Proteins may be related to each other very specifically as homologous subfamilies. Proteins can also be related to diverse proteins at the super family level. It has become highly important to characterize the existing sequence databases by their signatures to facilitate the function annotation of newly added sequences. The algorithm described here uses a scheme for the classification of odorant binding proteins on the basis of functional residues and Cys-pairing. The cysteine-based scoring scheme not only helps in unambiguously identifying families like odorant binding proteins (OBPs), but also aids in their classification at the subfamily level with reliable accuracy. The algorithm was also applied to yet another cysteine-rich family, where similar accuracy was observed that ensures the application of the protocol to other families.
Affiliation(s)
- Malini Manoharan
  - Université de La Reunion, DSIMB, INSERM UMR-S 665, La Reunion, France
  - National Centre for Biological Sciences, Tata Institute for Fundamental Research, GKVK campus, Bangalore, INDIA
  - Manipal University, Madhav Nagar, Manipal, Karnataka, India
- Kannan Sankar
  - National Centre for Biological Sciences, Tata Institute for Fundamental Research, GKVK campus, Bangalore, INDIA
  - Birla Institute of Technology, Pilani, Rajasthan, India
  - Current address: Iowa State University, Ames, IA, USA
- Bernard Offmann
  - Université de La Reunion, DSIMB, INSERM UMR-S 665, La Reunion, France
  - Université de Nantes, UFIP CNRS FRE 3478, Nantes, France
- Sowdhamini Ramanathan
  - National Centre for Biological Sciences, Tata Institute for Fundamental Research, GKVK campus, Bangalore, INDIA
10
Nemoto W, Toh H. Functional region prediction with a set of appropriate homologous sequences--an index for sequence selection by integrating structure and sequence information with spatial statistics. BMC Structural Biology 2012; 12:11. [PMID: 22643026 PMCID: PMC3533907 DOI: 10.1186/1472-6807-12-11] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 08/03/2011] [Accepted: 04/19/2012] [Indexed: 11/17/2022]
Abstract
Background The detection of conserved residue clusters on a protein structure is one of the effective strategies for the prediction of functional protein regions. Various methods, such as Evolutionary Trace, have been developed based on this strategy. In such approaches, the conserved residues are identified through comparisons of homologous amino acid sequences. Therefore, the selection of homologous sequences is a critical step. It is empirically known that a certain degree of sequence divergence in the set of homologous sequences is required for the identification of conserved residues. However, the development of a method to select homologous sequences appropriate for the identification of conserved residues has not been sufficiently addressed. An objective and general method to select appropriate homologous sequences is desired for the efficient prediction of functional regions. Results We have developed a novel index to select the sequences appropriate for the identification of conserved residues, and implemented the index within our method to predict the functional regions of a protein. The implementation of the index improved the performance of the functional region prediction. The index represents the degree of conserved residue clustering on the tertiary structure of the protein. For this purpose, the structure and sequence information were integrated within the index by the application of spatial statistics. Spatial statistics is a field of statistics in which not only the attributes but also the geometrical coordinates of the data are considered simultaneously. Higher degrees of clustering generate larger index scores. We adopted the set of homologous sequences with the highest index score, under the assumption that the best prediction accuracy is obtained when the degree of clustering is the maximum. 
The set of sequences selected by the index led to higher functional region prediction performance than the sets of sequences selected by other sequence-based methods. Conclusions Appropriate homologous sequences are selected automatically and objectively by the index. Such sequence selection improved the performance of functional region prediction. As far as we know, this is the first approach in which spatial statistics have been applied to protein analyses. Such integration of structure and sequence information would be useful for other bioinformatics problems.
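The idea of scoring how strongly conserved residues cluster in space can be illustrated with a standard spatial autocorrelation statistic. The abstract does not name the statistic behind the authors' index, so the Python sketch below substitutes Moran's I with binary distance-cutoff weights as a stand-in. The function name `morans_i`, the cutoff, and the toy conservation values are hypothetical.

```python
import math

def morans_i(coords, values, cutoff):
    """Moran's I with binary weights: w_ij = 1 if sites i and j (i != j) lie within `cutoff`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0       # sum over neighbor pairs of dev_i * dev_j
    weight_sum = 0    # total number of (ordered) neighbor pairs
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= cutoff:
                cross += dev[i] * dev[j]
                weight_sum += 1
    return (n / weight_sum) * cross / sum(d * d for d in dev)

# Four hypothetical sites on a line, 1 unit apart; 1 = conserved, 0 = variable.
coords = [(float(k), 0.0, 0.0) for k in range(4)]
clustered = morans_i(coords, [1, 1, 0, 0], cutoff=1.5)
alternating = morans_i(coords, [1, 0, 1, 0], cutoff=1.5)
print(clustered, alternating)
```

A sequence set whose conservation pattern yields the larger score would be preferred under the abstract's assumption that the best prediction accuracy coincides with maximal clustering.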
Affiliation(s)
- Wataru Nemoto
  - Computational Biology Research Center (CBRC), Advanced Industrial Science and Technology (AIST), AIST Tokyo Waterfront Bio-IT Research Building, 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan.
11
Dou Y, Wang J, Yang J, Zhang C. L1pred: a sequence-based prediction tool for catalytic residues in enzymes with the L1-logreg classifier. PLoS One 2012; 7:e35666. [PMID: 22558194 PMCID: PMC3338704 DOI: 10.1371/journal.pone.0035666] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 01/04/2012] [Accepted: 03/19/2012] [Indexed: 12/01/2022] Open
Abstract
To understand enzyme functions, identifying the catalytic residues is a usual first step. Moreover, knowledge about catalytic residues is also useful for protein engineering and drug-design. However, to experimentally identify catalytic residues remains challenging for reasons of time and cost. Therefore, computational methods have been explored to predict catalytic residues. Here, we developed a new algorithm, L1pred, for catalytic residue prediction, by using the L1-logreg classifier to integrate eight sequence-based scoring functions. We tested L1pred and compared it against several existing sequence-based methods on carefully designed datasets Data604 and Data63. With ten-fold cross-validation, L1pred showed the area under precision-recall curve (AUPR) and the area under ROC curve (AUC) of 0.2198 and 0.9494 on the training dataset, Data604, respectively. In addition, on the independent test dataset, Data63, it showed the AUPR and AUC values of 0.2636 and 0.9375, respectively. Compared with other sequence-based methods, L1pred showed the best performance on both datasets. We also analyzed the importance of each attribute in the algorithm, and found that all the scores contributed more or less equally to the L1pred performance.
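The two metrics quoted in this abstract, AUC and AUPR, can diverge sharply when positives are rare, as catalytic residues are. The Python sketch below computes both on toy data. Average precision is used here as a common estimate of the area under the precision-recall curve, which may differ from the exact procedure used in the paper. The function names and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive scores higher than a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy data: 2 catalytic residues (label 1) among 10 sites.
labels = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.1, 0.2, 0.4, 0.3, 0.7]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(auc, ap)
```

A single high-scoring negative barely moves the AUC but noticeably lowers the average precision, which is one reason both figures are worth reporting for imbalanced residue-level prediction.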
Affiliation(s)
- Yongchao Dou
  - School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, Nebraska, United States of America
- Jun Wang
  - Scientific Computing Key Laboratory of Shanghai Universities, Shanghai, People’s Republic of China
  - Department of Mathematics, Shanghai Normal University, Shanghai, People’s Republic of China
- Jialiang Yang
  - MPI-Institute of Computational Biology, Chinese Academy of Sciences, Shanghai, People’s Republic of China
- Chi Zhang
  - School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, Nebraska, United States of America
12
LRR conservation mapping to predict functional sites within protein leucine-rich repeat domains. PLoS One 2011; 6:e21614. [PMID: 21789174 PMCID: PMC3138743 DOI: 10.1371/journal.pone.0021614] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Received: 03/11/2011] [Accepted: 06/03/2011] [Indexed: 11/19/2022] Open
Abstract
Computational prediction of protein functional sites can be a critical first step for analysis of large or complex proteins. Contemporary methods often require several homologous sequences and/or a known protein structure, but these resources are not available for many proteins. Leucine-rich repeats (LRRs) are ligand interaction domains found in numerous proteins across all taxonomic kingdoms, including immune system receptors in plants and animals. We devised Repeat Conservation Mapping (RCM), a computational method that predicts functional sites of LRR domains. RCM utilizes two or more homologous sequences and a generic representation of the LRR structure to identify conserved or diversified patches of amino acids on the predicted surface of the LRR. RCM was validated using solved LRR+ligand structures from multiple taxa, identifying ligand interaction sites. RCM was then used for de novo dissection of two plant microbe-associated molecular pattern (MAMP) receptors, EF-TU RECEPTOR (EFR) and FLAGELLIN-SENSING 2 (FLS2). In vivo testing of Arabidopsis thaliana EFR and FLS2 receptors mutagenized at sites identified by RCM demonstrated previously unknown functional sites. The RCM predictions for EFR, FLS2 and a third plant LRR protein, PGIP, compared favorably to predictions from ODA (optimal docking area), Consurf, and PAML (positive selection) analyses, but RCM also made valid functional site predictions not available from these other bioinformatic approaches. RCM analyses can be conducted with any LRR-containing proteins at www.plantpath.wisc.edu/RCM, and the approach should be modifiable for use with other types of repeat protein domains.
|
13
|
Dou Y, Geng X, Gao H, Yang J, Zheng X, Wang J. Sequence Conservation in the Prediction of Catalytic Sites. Protein J 2011; 30:229-39. [PMID: 21465136 DOI: 10.1007/s10930-011-9324-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/18/2022]
|
14
|
Sonavane S, Chakrabarti P. Prediction of active site cleft using support vector machines. J Chem Inf Model 2010; 50:2266-73. [PMID: 21080689 DOI: 10.1021/ci1002922] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 01/28/2023]
Abstract
Computational tools are available today for the detection and delineation of the clefts and cavities in protein 3D structure and ranking them on the basis of probable binding site clefts. There is a need to improve the ranking of clefts and accuracy of predicting catalytic site clefts. Our results show that the distance of the clefts from protein centroid and sequence entropy of the lining residues, when used in conjunction with the volume, are valuable descriptors for predicting the catalytic site. We have applied the SVM approach for recognizing and ranking the active site clefts and tested its performance using different combinations of attributes. In both the ligand-bound and the unbound forms of structures, our method correctly predicts the active site clefts in 73% of cases at rank one. If we consider the results at rank 3 (i.e., the correct solution is among one of the top three solutions), the correctly predicted cases are 94% and 90% for the bound and the unbound forms of structures, respectively. Our approach improves the ranking of binding site clefts in comparison with CASTp and is comparable to other existing methods like Fpocket. Although the data set for training the SVM approach is rather small in size, the results are encouraging for the method to be used as complementary to other existing tools.
Affiliation(s)
- Shrihari Sonavane
- Department of Biochemistry and Bioinformatics Centre, Bose Institute, P-1/12 CIT Scheme VIIM, Kolkata 700 054, India
|
15
|
Networks of high mutual information define the structural proximity of catalytic sites: implications for catalytic residue identification. PLoS Comput Biol 2010; 6:e1000978. [PMID: 21079665 PMCID: PMC2973806 DOI: 10.1371/journal.pcbi.1000978] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.8] [Received: 03/19/2010] [Accepted: 09/27/2010] [Indexed: 11/19/2022]
Abstract
Identification of catalytic residues (CR) is essential for the characterization of enzyme function. CR are, in general, conserved and located in the functional site of a protein in order to attain their function. However, many non-catalytic residues are highly conserved and not all CR are conserved throughout a given protein family, making identification of CR a challenging task. Here, we put forward the hypothesis that CR carry a particular signature defined by networks of close proximity residues with high mutual information (MI), and that this signature can be applied to distinguish functional from other non-functional conserved residues. Using a data set of 434 Pfam families included in the Catalytic Site Atlas (CSA) database, we tested this hypothesis and demonstrated that MI can complement amino acid conservation scores to detect CR. The Kullback-Leibler (KL) conservation measurement was shown to significantly outperform both the Shannon entropy and maximal frequency measurements. Residues in the proximity of catalytic sites were shown to be rich in shared MI. A structural proximity MI average score (termed pMI) was demonstrated to be a strong predictor for CR, thus confirming the proposed hypothesis. A structural proximity conservation average score (termed pC) was also calculated and demonstrated to carry distinct information from pMI. A catalytic likeliness score (Cls), combining the KL, pC and pMI measures, was shown to lead to significantly improved prediction accuracy. At a specificity of 0.90, the Cls method was found to have a sensitivity of 0.816. In summary, we demonstrate that networks of residues with high MI provide a distinct signature on CR and propose that such a signature should be present in other classes of functional residues where the requirement to maintain a particular function places limitations on the diversification of the structural environment along the course of evolution.
|
16
|
Nagao C, Nagano N, Mizuguchi K. Relationships between functional subclasses and information contained in active-site and ligand-binding residues in diverse superfamilies. Proteins 2010; 78:2369-84. [PMID: 20544971 DOI: 10.1002/prot.22750] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/06/2022]
Abstract
To investigate the relationships between functional subclasses and sequence and structural information contained in the active-site and ligand-binding residues (LBRs), we performed a detailed analysis of seven diverse enzyme superfamilies: aldolase class I, TIM-barrel glycosidases, alpha/beta-hydrolases, P-loop containing nucleotide triphosphate hydrolases, collagenase, Zn peptidases, and glutamine phosphoribosylpyrophosphate, subunit 1, domain 1. These homologous superfamilies, as defined in CATH, were selected from the enzyme catalytic-mechanism database. We defined active-site and LBRs based solely on the literature information and complex structures in the Protein Data Bank. From a structure-based multiple sequence alignment for each CATH homologous superfamily, we extracted subsequences consisting of the aligned positions that were used as an active-site or a ligand-binding site by at least one sequence. Using both the subsequences and full-length alignments, we performed cluster analysis with three sequence distance measures. We showed that the cluster analysis using the subsequences was able to detect functional subclasses more accurately than the clustering using the full-length alignments. The subsequences determined by only the literature information and complex structures, thus, had sufficient information to detect the functional subclasses. Detailed examination of the clustering results provided new insights into the mechanism of functional diversification for these superfamilies.
Affiliation(s)
- Chioko Nagao
- National Institute of Biomedical Innovation, 7-6-8 Saito-Asagi, Ibaraki, Osaka 567-0085, Japan
|
17
|
Prediction of catalytic residues based on an overlapping amino acid classification. Amino Acids 2010; 39:1353-61. [DOI: 10.1007/s00726-010-0587-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 10/18/2009] [Accepted: 03/27/2010] [Indexed: 10/19/2022]
|
18
|
Sankararaman S, Sha F, Kirsch JF, Jordan MI, Sjölander K. Active site prediction using evolutionary and structural information. Bioinformatics 2010; 26:617-24. [PMID: 20080507 PMCID: PMC2828116 DOI: 10.1093/bioinformatics/btq008] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.7] [Indexed: 11/17/2022]
Abstract
Motivation: The identification of catalytic residues is a key step in understanding the function of enzymes. While a variety of computational methods have been developed for this task, accuracies have remained fairly low. The best existing method exploits information from sequence and structure to achieve a precision (the fraction of predicted catalytic residues that are catalytic) of 18.5% at a corresponding recall (the fraction of catalytic residues identified) of 57% on a standard benchmark. Here we present a new method, Discern, which provides a significant improvement over the state-of-the-art through the use of statistical techniques to derive a model with a small set of features that are jointly predictive of enzyme active sites. Results: In cross-validation experiments on two benchmark datasets from the Catalytic Site Atlas and CATRES resources containing a total of 437 manually curated enzymes spanning 487 SCOP families, Discern increases catalytic site recall between 12% and 20% over methods that combine information from both sequence and structure, and by ≥50% over methods that make use of sequence conservation signal only. Controlled experiments show that Discern's improvement in catalytic residue prediction is derived from the combination of three ingredients: the use of the INTREPID phylogenomic method to extract conservation information; the use of 3D structure data, including features computed for residues that are proximal in the structure; and a statistical regularization procedure to prevent overfitting. Contact: kimmen@berkeley.edu Supplementary information: Supplementary data are available at Bioinformatics online.
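The precision and recall definitions quoted in the abstract above can be written out directly. The following is a minimal sketch, not code from Discern; the function name and the residue numbers are hypothetical and chosen only for illustration.

```python
def precision_recall(predicted, catalytic):
    """Return (precision, recall) for two collections of residue identifiers.

    precision: fraction of predicted catalytic residues that are catalytic.
    recall:    fraction of catalytic residues that were identified.
    """
    predicted, catalytic = set(predicted), set(catalytic)
    true_positives = len(predicted & catalytic)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(catalytic) if catalytic else 0.0
    return precision, recall


# Hypothetical toy data: these residue numbers are invented for illustration.
predicted_sites = {12, 45, 57, 102, 195}
annotated_sites = {57, 102, 195, 214}
precision, recall = precision_recall(predicted_sites, annotated_sites)
```

In this toy case three of the five predictions are annotated, and three of the four annotated residues are recovered.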
|
19
|
Dou Y, Zheng X, Wang J. Several appropriate background distributions for entropy-based protein sequence conservation measures. J Theor Biol 2009; 262:317-22. [PMID: 19808039 DOI: 10.1016/j.jtbi.2009.09.030] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 06/01/2009] [Revised: 09/25/2009] [Accepted: 09/25/2009] [Indexed: 11/25/2022]
Abstract
Amino acid background distribution is an important factor for entropy-based methods which extract sequence conservation information from protein multiple sequence alignments (MSAs). However, MSAs are usually not large enough to allow a reliable observed background distribution. In this paper, we propose two new estimations of background distribution. One is an integration of the observed background distribution and the position-specific residue distribution, and the other is a normalized square root of observed background frequency. To validate these new background distributions, they are applied to the relative entropy model to find catalytic sites and ligand binding sites from protein MSAs. Experimental results show that they are superior to the observed background distribution in predicting functionally important residues.
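As a rough illustration of the second estimate mentioned in the abstract above (a normalized square root of the observed background frequency) plugged into a relative entropy score, consider the sketch below. It is an assumption-laden reading of the abstract rather than the authors' implementation: the function names, the pseudocount, and the toy alignment are all hypothetical, and the first estimate (the integration with the position-specific distribution) is not attempted because the abstract does not specify it.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def observed_background(msa):
    """Amino acid frequencies over the whole alignment (gaps ignored)."""
    counts = Counter(aa for seq in msa for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def sqrt_background(background):
    """Normalized square root of the observed background frequencies."""
    roots = {aa: math.sqrt(freq) for aa, freq in background.items()}
    norm = sum(roots.values())
    return {aa: root / norm for aa, root in roots.items()}


def relative_entropy(column, background, pseudocount=1e-6):
    """Relative entropy (KL divergence) of one alignment column from a background.

    The pseudocount is an assumption added here to avoid division by zero.
    """
    residues = [aa for aa in column if aa in AMINO_ACIDS]
    score = 0.0
    for aa, n in Counter(residues).items():
        p = n / len(residues)
        q = max(background[aa], pseudocount)
        score += p * math.log(p / q)
    return score


# Hypothetical toy alignment, invented for illustration only.
msa = ["MKHLA", "MRHLG", "MKHIA", "MQHLS"]
background = observed_background(msa)
flattened = sqrt_background(background)
scores = [
    relative_entropy([seq[i] for seq in msa], flattened)
    for i in range(len(msa[0]))
]
```

Taking the square root flattens the background, so frequent residues are penalized less strongly than under the raw observed frequencies; in the toy alignment the invariant first column still scores higher than the variable second column.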
Affiliation(s)
- Yongchao Dou
- School of Mathematical Science, Dalian University of Technology, Dalian 116024, PR China
|
20
|
Nimrod G, Schushan M, Steinberg DM, Ben-Tal N. Detection of functionally important regions in "hypothetical proteins" of known structure. Structure 2009; 16:1755-63. [PMID: 19081051 DOI: 10.1016/j.str.2008.10.017] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.6] [Received: 04/16/2008] [Revised: 10/16/2008] [Accepted: 10/19/2008] [Indexed: 10/21/2022]
Abstract
Structural genomics initiatives provide ample structures of "hypothetical proteins" (i.e., proteins of unknown function) at an ever increasing rate. However, without function annotation, this structural goldmine is of little use to biologists who are interested in particular molecular systems. To this end, we used (an improved version of) the PatchFinder algorithm for the detection of functional regions on the protein surface, which could mediate its interactions with, e.g., substrates, ligands, and other proteins. Examination, using a data set of annotated proteins, showed that PatchFinder outperforms similar methods. We collected 757 structures of hypothetical proteins and their predicted functional regions in the N-Func database. Inspection of several of these regions demonstrated that they are useful for function prediction. For example, we suggested an interprotein interface and a putative nucleotide-binding site. A web-server implementation of PatchFinder and the N-Func database are available at http://patchfinder.tau.ac.il/.
Affiliation(s)
- Guy Nimrod
- Department of Biochemistry, George S. Wise Faculty of Life Sciences, Tel Aviv University, 69978 Tel Aviv, Israel
|
21
|
Liu ZP, Wu LY, Wang Y, Zhang XS, Chen L. Bridging protein local structures and protein functions. Amino Acids 2008; 35:627-50. [PMID: 18421562 PMCID: PMC7088341 DOI: 10.1007/s00726-008-0088-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 02/21/2008] [Accepted: 03/10/2008] [Indexed: 12/11/2022]
Abstract
One of the major goals of molecular and evolutionary biology is to understand the functions of proteins by extracting functional information from protein sequences, structures and interactions. In this review, we summarize the repertoire of methods currently being applied and report recent progress in the field of in silico annotation of protein function based on the accumulation of vast amounts of sequence and structure data. In particular, we emphasize the newly developed structure-based methods, which are able to identify local structural motifs and reveal their relationship with protein functions. These methods include computational tools to identify the structural motifs and reveal the strong relationship between these pre-computed local structures and protein functions. We also discuss remaining problems and possible directions for this exciting and challenging area.
Affiliation(s)
- Zhi-Ping Liu
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100080, Beijing, China
|
22
|
Manikandan K, Pal D, Ramakumar S, Brener NE, Iyengar SS, Seetharaman G. Functionally important segments in proteins dissected using Gene Ontology and geometric clustering of peptide fragments. Genome Biol 2008; 9:R52. [PMID: 18331637 PMCID: PMC2397504 DOI: 10.1186/gb-2008-9-3-r52] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 11/30/2007] [Revised: 02/24/2008] [Accepted: 03/10/2008] [Indexed: 11/25/2022]
Abstract
A geometric clustering algorithm has been developed to dissect protein fragments based on their relevance to function. We have developed a geometric clustering algorithm using backbone φ,ψ angles to group conformationally similar peptide fragments of any length. By labeling each fragment in the cluster with the level-specific Gene Ontology 'molecular function' term of its protein, we are able to compute statistics for molecular function-propensity and p-value of individual fragments in the cluster. Clustering-cum-statistical analysis for peptide fragments 8 residues in length and with only trans peptide bonds shows that molecular function propensities ≥20 and p-values ≤0.05 can dissect fragments within a protein linked to the molecular function.
|
23
|
Tong W, Williams RJ, Wei Y, Murga LF, Ko J, Ondrechen MJ. Enhanced performance in prediction of protein active sites with THEMATICS and support vector machines. Protein Sci 2007; 17:333-41. [PMID: 18096640 DOI: 10.1110/ps.073213608] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Indexed: 10/22/2022]
Abstract
Theoretical microscopic titration curves (THEMATICS) is a computational method for the identification of active sites in proteins through deviations in computed titration behavior of ionizable residues. While the sensitivity to catalytic sites is high, the previously reported sensitivity to catalytic residues was not as high, about 50%. Here THEMATICS is combined with support vector machines (SVM) to improve sensitivity for catalytic residue prediction from protein 3D structure alone. For a test set of 64 proteins taken from the Catalytic Site Atlas (CSA), the average recall rate for annotated catalytic residues is 61%; good precision is maintained selecting only 4% of all residues. The average false positive rate, using the CSA annotations is only 3.2%, far lower than other 3D-structure-based methods. THEMATICS-SVM returns higher precision, lower false positive rate, and better overall performance, compared with other 3D-structure-based methods. Comparison is also made with the latest machine learning methods that are based on both sequence alignments and 3D structures. For annotated sets of well-characterized enzymes, THEMATICS-SVM performance compares very favorably with methods that utilize sequence homology. However, since THEMATICS depends only on the 3D structure of the query protein, no decline in performance is expected when applied to novel folds, proteins with few sequence homologues, or even orphan sequences. An extension of the method to predict non-ionizable catalytic residues is also presented. THEMATICS-SVM predicts a local network of ionizable residues with strong interactions between protonation events; this appears to be a special feature of enzyme active sites.
Affiliation(s)
- Wenxu Tong
- College of Computer and Information Science, Northeastern University, Boston, Massachusetts 02115, USA
|
24
|
Dunning FM, Sun W, Jansen KL, Helft L, Bent AF. Identification and mutational analysis of Arabidopsis FLS2 leucine-rich repeat domain residues that contribute to flagellin perception. Plant Cell 2007; 19:3297-313. [PMID: 17933906 PMCID: PMC2174712 DOI: 10.1105/tpc.106.048801] [Citation(s) in RCA: 84] [Impact Index Per Article: 4.7] [Received: 11/04/2006] [Revised: 09/13/2007] [Accepted: 09/19/2007] [Indexed: 05/19/2023]
Abstract
Mutational, phylogenetic, and structural modeling approaches were combined to develop a general method to study leucine-rich repeat (LRR) domains and were used to identify residues within the Arabidopsis thaliana FLAGELLIN-SENSING2 (FLS2) LRR that contribute to flagellin perception. FLS2 is a transmembrane receptor kinase that binds bacterial flagellin or a flagellin-based flg22 peptide through a presumed physical interaction within the FLS2 extracellular domain. Double-Ala scanning mutagenesis of solvent-exposed beta-strand/beta-turn residues across the FLS2 LRR domain identified LRRs 9 to 15 as contributors to flagellin responsiveness. FLS2 LRR-encoding domains from 15 Arabidopsis ecotypes and 20 diverse Brassicaceae accessions were isolated and sequenced. FLS2 is highly conserved across most Arabidopsis ecotypes, whereas more diversified functional FLS2 homologs were found in many but not all Brassicaceae accessions. flg22 responsiveness was correlated with conserved LRR regions using Conserved Functional Group software to analyze structural models of the LRR for diverse FLS2 proteins. This identified conserved spatial clusters of residues across the beta-strand/beta-turn residues of LRRs 12 to 14, the same area identified by the Ala scan, as well as other conserved sites. Site-directed randomizing mutagenesis of solvent-exposed beta-strand/beta-turn residues across LRRs 9 to 15 identified mutations that disrupt flg22 binding and showed that flagellin perception is dependent on a limited number of tightly constrained residues of LRRs 9 to 15 that make quantitative contributions to the overall phenotypic response.
Affiliation(s)
- F Mark Dunning
- Department of Plant Pathology, University of Wisconsin, Madison, Wisconsin 53706
|
25
|
Innis CA. siteFiNDER|3D: a web-based tool for predicting the location of functional sites in proteins. Nucleic Acids Res 2007; 35:W489-94. [PMID: 17553829 PMCID: PMC1933183 DOI: 10.1093/nar/gkm422] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Indexed: 11/14/2022]
Abstract
Although knowledge of a protein's functional site is a key requirement for understanding its mode of action at the molecular level, our ability to locate such sites experimentally is far exceeded by the rate at which sequence and structural information is being accumulated. siteFiNDER|3D is an online tool for the prediction of functionally important regions in proteins of known structure. At the core of the server lies the CFG analysis algorithm, which uses a moving 3D window to correlate patterns of functional/chemical group conservation in the query protein with the location of functional sites. Here, we give a general overview of the functionality offered by the siteFiNDER|3D server, along with general recommendations aimed at maximizing the accuracy and predictive value of this tool in a variety of contexts. siteFiNDER|3D can be accessed at: ‘http://sage.csb.yale.edu/sitefinder3d’ and requires, at a minimum, the atomic coordinates of a query protein in PDB format.
Affiliation(s)
- C Axel Innis
- Howard Hughes Medical Institute/Yale University, Department of Molecular Biophysics and Biochemistry, New Haven, CT 06520-8114, USA.
|
26
|
How accurate and statistically robust are catalytic site predictions based on closeness centrality? BMC Bioinformatics 2007; 8:153. [PMID: 17498304 PMCID: PMC1876251 DOI: 10.1186/1471-2105-8-153] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.9] [Received: 12/11/2006] [Accepted: 05/11/2007] [Indexed: 11/25/2022]
Abstract
Background We examine the accuracy of enzyme catalytic residue predictions from a network representation of protein structure. In this model, amino acid α-carbons specify vertices within a graph and edges connect vertices that are proximal in structure. Closeness centrality, which has shown promise in previous investigations, is used to identify important positions within the network. Closeness centrality, a global measure of network centrality, is calculated as the reciprocal of the average distance between vertex i and all other vertices. Results We benchmark the approach against 283 structurally unique proteins within the Catalytic Site Atlas. Our results, which are in line with previous investigations of smaller datasets, indicate closeness centrality predictions are statistically significant. However, unlike previous approaches, we specifically focus on residues with the very best scores. Over the top five closeness centrality scores, we observe an average true to false positive rate ratio of 6.8 to 1. As demonstrated previously, adding a solvent accessibility filter significantly improves predictive power; the average ratio is increased to 15.3 to 1. We also demonstrate (for the first time) that filtering the predictions by residue identity improves the results even more than accessibility filtering. Here, we simply eliminate residues with physicochemical properties unlikely to be compatible with catalytic requirements from consideration. Residue identity filtering improves the average true to false positive rate ratio to 26.3 to 1. Combining the two filters together has little effect on the results. Calculated p-values for the three prediction schemes range from 2.7E-9 to less than 8.8E-134. Finally, the sensitivity of the predictions to structure choice and slight perturbations is examined. Conclusion Our results resolutely confirm that closeness centrality is a viable prediction scheme whose predictions are statistically significant. Simple filtering schemes substantially improve the method's predictive power. Moreover, no clear effect on performance is observed when comparing ligated and unligated structures. Similarly, the CC prediction results are robust to slight structural perturbations from molecular dynamics simulation.
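The closeness centrality definition given in this abstract (the reciprocal of the average distance between vertex i and all other vertices, on a graph whose vertices are α-carbons) can be sketched as follows. This is an illustrative reading only: the distance cutoff, the use of unweighted shortest paths, the function names, and the toy coordinates are assumptions, not details taken from the paper.

```python
import math
from collections import deque


def contact_graph(coords, cutoff):
    """Connect residues whose alpha-carbons lie within `cutoff` of each other."""
    n = len(coords)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def closeness_centrality(neighbors, source):
    """Reciprocal of the mean shortest-path length from `source` to all other vertices."""
    distance = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search over unweighted edges
        v = queue.popleft()
        for w in neighbors[v]:
            if w not in distance:
                distance[w] = distance[v] + 1
                queue.append(w)
    others = [d for node, d in distance.items() if node != source]
    if not others or len(others) != len(neighbors) - 1:
        return 0.0  # single vertex or disconnected graph: treated as 0 in this sketch
    return len(others) / sum(others)


# Hypothetical toy "structure": five alpha-carbons on a line, 3.8 units apart.
coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
graph = contact_graph(coords, cutoff=4.0)  # the cutoff value is an assumption
scores = [closeness_centrality(graph, i) for i in range(len(coords))]
```

On this linear toy chain the middle residue receives the highest score, which matches the intuition that closeness centrality favors positions near the center of the network.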
|
27
|
Relating destabilizing regions to known functional sites in proteins. BMC Bioinformatics 2007; 8:141. [PMID: 17470296 PMCID: PMC1890302 DOI: 10.1186/1471-2105-8-141] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Received: 11/02/2006] [Accepted: 04/30/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND Most methods for predicting functional sites in protein 3D structures rely on information on related proteins and cannot be applied to proteins with no known relatives. Another limitation of these methods is the lack of a well annotated set of functional sites to use as benchmark for validating their predictions. Experimental findings and theoretical considerations suggest that residues involved in function often contribute unfavorably to the native state stability. We examine the possibility of systematically exploiting this intrinsic property to identify functional sites using an original procedure that detects destabilizing regions in protein structures. In addition, to relate destabilizing regions to known functional sites, a novel benchmark consisting of a diverse set of hand-curated protein functional sites is derived. RESULTS A procedure for detecting clusters of destabilizing residues in protein structures is presented. Individual residue contributions to protein stability are evaluated using detailed atomic models and a force-field successfully applied in computational protein design. The most destabilizing residues, and some of their closest neighbours, are clustered into destabilizing regions following a rigorous protocol. Our procedure is applied to high quality apo-structures of 63 unrelated proteins. The biologically relevant binding sites of these proteins were annotated using all available information, including structural data and literature curation, resulting in the largest hand-curated dataset of binding sites in proteins available to date. Comparing the destabilizing regions with the annotated binding sites in these proteins, we find that the overlap is on average limited, but significantly better than random. Results depend on the type of bound ligand. Significant overlap is obtained for most polysaccharide- and small ligand-binding sites, whereas no overlap is observed for most nucleic acid binding sites. These differences are rationalised in terms of the geometry and energetics of the binding site. CONCLUSION We find that although destabilizing regions as detected here can in general not be used to predict binding sites in protein structures, they can provide useful information, particularly on the location of functional sites that bind polysaccharides and small ligands. This information can be exploited in methods for predicting function in protein structures with no known relatives. Our publicly available benchmark of hand-curated functional sites in proteins should help other workers derive and validate new prediction methods.
|
28
|
Selective prediction of interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007; 8:119. [PMID: 17419878 PMCID: PMC1877815 DOI: 10.1186/1471-2105-8-119] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.5] [Received: 11/01/2006] [Accepted: 04/09/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND Methods are now available for the prediction of interaction sites in protein 3D structures. While many of these methods report high success rates for site prediction, often these predictions are not very selective and have low precision. Precision in site prediction is addressed using Theoretical Microscopic Titration Curves (THEMATICS), a simple computational method for the identification of active sites in enzymes. Recall and precision are measured and compared with other methods for the prediction of catalytic sites. RESULTS Using a test set of 169 enzymes from the original Catalytic Residue Dataset (CatRes) it is shown that THEMATICS can deliver precise, localised site predictions. Furthermore, adjustment of the cut-off criteria can improve the recall rates for catalytic residues with only a small sacrifice in precision. Recall rates for CatRes/CSA annotated catalytic residues are 41.1%, 50.4%, and 54.2% for Z score cut-off values of 1.00, 0.99, and 0.98, respectively. The corresponding precision rates are 19.4%, 17.9%, and 16.4%. The success rate for catalytic sites is higher, with correct or partially correct predictions for 77.5%, 85.8%, and 88.2% of the enzymes in the test set, corresponding to the same respective Z score cut-offs, if only the CatRes annotations are used as the reference set. Incorporation of additional literature annotations into the reference set gives total success rates of 89.9%, 92.9%, and 94.1%, again for corresponding cut-off values of 1.00, 0.99, and 0.98. False positive rates for a 75-protein test set are 1.95%, 2.60%, and 3.12% for Z score cut-offs of 1.00, 0.99, and 0.98, respectively. CONCLUSION With a preferred cut-off value of 0.99, THEMATICS achieves a high success rate of interaction site prediction, about 86% correct or partially correct using CatRes/CSA annotations only and about 93% with an expanded reference set. Success rates for catalytic residue prediction are similar to those of other structure-based methods, but with substantially better precision and lower false positive rates. THEMATICS performs well across the spectrum of E.C. classes. The method requires only the structure of the query protein as input. THEMATICS predictions may be obtained via the web from structures in PDB format at: http://pfweb.chem.neu.edu/thematics/submit.html.
|
29
|
Abstract
The rapidly increasing volume of sequence and structure information available for proteins poses the daunting task of determining their functional importance. Computational methods can prove to be very useful in understanding and characterizing the biochemical and evolutionary information contained in this wealth of data, particularly at functionally important sites. Therefore, we perform a detailed survey of compositional and evolutionary constraints at the molecular and biological function level for a large set of known functionally important sites extracted from a wide range of protein families. We compare the degree of conservation across different functional categories and provide detailed statistical insight to decipher the varying evolutionary constraints at functionally important sites. The compositional and evolutionary information at functionally important sites has been compiled into a library of functional templates. We developed a module that predicts functionally important columns (FIC) of an alignment based on the detection of a significant "template match score" to a library template. Our template match score measures an alignment column's similarity to a library template and combines a term explicitly representing a column's residue composition with various evolutionary conservation scores (information content and position-specific scoring matrix-derived statistics). Our benchmarking studies show good sensitivity/specificity for the prediction of functional sites and high accuracy in attributing correct molecular function type to the predicted sites. This prediction method is based on information derived from homologous sequences and no structural information is required. Therefore, this method could be extremely useful for large-scale functional annotation.
Affiliation(s)
- Saikat Chakrabarti
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
|
30
|
Affiliation(s)
- N Srinivasan
- Molecular Biophysics Unit; Indian Institute of Science; Bangalore 560 012; India
|
31
|
Mayer KM, McCorkle SR, Shanklin J. Linking enzyme sequence to function using Conserved Property Difference Locator to identify and annotate positions likely to control specific functionality. BMC Bioinformatics 2005; 6:284. [PMID: 16318626 PMCID: PMC1326233 DOI: 10.1186/1471-2105-6-284] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Received: 05/23/2005] [Accepted: 11/30/2005] [Indexed: 11/21/2022] Open
Abstract
Background: Families of homologous enzymes evolved from common progenitors. The availability of multiple sequences representing each activity presents an opportunity for extracting information specifying the functionality of individual homologs. We present a straightforward method for the identification of residues likely to determine class-specific functionality in which multiple sequence alignments are converted to an annotated graphical form by the Conserved Property Difference Locator (CPDL) program. Results: Three test cases, each comprising two groups of functionally distinct homologs, are presented. Of the test cases, one is a membrane and two are soluble enzyme families. The desaturase/hydroxylase data were used to design and test the CPDL algorithm because a comparative sequence approach had been successfully applied to manipulate the specificity of these enzymes. The other two cases, ATP/GTP cyclases and MurD/MurE synthases, were chosen because they are well characterized structurally and biochemically. For the desaturase/hydroxylase enzymes, the ATP/GTP cyclases and the MurD/MurE synthases, groups of 8 (of ~400), 4 (of ~150) and 10 (of >400) residues of interest, respectively, were identified that contain empirically defined specificity-determining positions. Conclusion: CPDL consistently identifies positions near enzyme active sites that include those predicted from structural and/or biochemical studies to be important for specificity and/or function. This suggests that CPDL will have broad utility for the identification of potential class-determining residues based on multiple sequence analysis of groups of homologous proteins. Because the method is sequence-based rather than structure-based, it is equally well suited for designing structure-function experiments to investigate membrane and soluble proteins.
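The idea behind this abstract, positions where a physicochemical property is conserved within each functional group but differs between the two groups, can be sketched as follows. The property classes in `PROPERTY_CLASS` and both function names are assumptions made for illustration; the actual CPDL program may group residues and score positions differently.

```python
# Hypothetical residue-to-property grouping (an assumption, not CPDL's table).
PROPERTY_CLASS = {}
for _cls, _residues in {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}.items():
    for _r in _residues:
        PROPERTY_CLASS[_r] = _cls


def conserved_class(column):
    """Return the shared property class of a column, or None if not conserved."""
    classes = {PROPERTY_CLASS.get(r) for r in column}
    if len(classes) == 1:
        return classes.pop()  # None when the column holds only gaps/unknowns
    return None


def property_difference_positions(group_a, group_b):
    """0-based alignment positions conserved within each group but differing between them."""
    hits = []
    for i in range(len(group_a[0])):
        a = conserved_class([seq[i] for seq in group_a])
        b = conserved_class([seq[i] for seq in group_b])
        if a is not None and b is not None and a != b:
            hits.append(i)
    return hits
```

Both groups are assumed to come from the same alignment, so that every sequence has the same length and position `i` refers to the same column in each group.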
Affiliation(s)
- Kimberly M Mayer
- Biology Department, Brookhaven National Laboratory, Upton, NY 11973, USA
- Sean R McCorkle
- Biology Department, Brookhaven National Laboratory, Upton, NY 11973, USA
- John Shanklin
- Biology Department, Brookhaven National Laboratory, Upton, NY 11973, USA
|
32
|
Glaser F, Morris RJ, Najmanovich RJ, Laskowski RA, Thornton JM. A method for localizing ligand binding pockets in protein structures. Proteins 2005; 62:479-88. [PMID: 16304646 DOI: 10.1002/prot.20769] [Citation(s) in RCA: 141] [Impact Index Per Article: 7.1] [Indexed: 11/09/2022]
Abstract
The accurate identification of ligand binding sites in protein structures can be valuable in determining protein function. Once the binding site is known, it becomes easier to perform in silico and experimental procedures that may allow the ligand type and the protein function to be determined. For example, binding pocket shape analysis relies heavily on the correct localization of the ligand binding site. We have developed SURFNET-ConSurf, a modular, two-stage method for identifying the location and shape of potential ligand binding pockets in protein structures. In the first stage, the SURFNET program identifies clefts in the protein surface that are potential binding sites. In the second stage, these clefts are trimmed in size by cutting away regions distant from highly conserved residues, as defined by the ConSurf-HSSP database. The largest clefts that remain tend to be those where ligands bind. To test the approach, we analyzed a nonredundant set of 244 protein structures from the PDB and found that SURFNET-ConSurf identifies a ligand binding pocket in 75% of them. The trimming procedure reduces the original cleft volumes by 30% on average, while still encompassing an average 87% of the ligand volume. From the analysis of the results we conclude that for those cases in which the ligands are found in large, highly conserved clefts, the combined SURFNET-ConSurf method gives pockets that are a better match to the ligand shape and location. We also show that this approach works better for enzymes than for nonenzyme proteins.
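The second stage described in this abstract, trimming a cleft by discarding regions far from highly conserved residues, might look roughly like the sketch below. It assumes the cleft is represented as a list of 3D points and the conserved residues as a list of 3D coordinates; the function name `trim_cleft` and the 8.0 Å default cutoff are hypothetical and are not taken from SURFNET-ConSurf.

```python
import math


def trim_cleft(cleft_points, conserved_coords, cutoff=8.0):
    """Keep only cleft points within `cutoff` of at least one conserved residue.

    `cutoff` is an illustrative value in angstroms, not the published one.
    """
    return [
        point
        for point in cleft_points
        if any(math.dist(point, coord) <= cutoff for coord in conserved_coords)
    ]
```

If no conserved residue lies near the cleft, the function returns an empty list, which would correspond to the cleft being discarded entirely.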
Affiliation(s)
- Fabian Glaser
- European Bioinformatics Institute, European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
|
33
|
Pei J, Cai W, Kinch LN, Grishin NV. Prediction of functional specificity determinants from protein sequences using log-likelihood ratios. Bioinformatics 2005; 22:164-71. [PMID: 16278237 DOI: 10.1093/bioinformatics/bti766] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.3] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: A number of methods have been developed to predict functional specificity determinants in protein families based on sequence information. Most of these methods rely on pre-defined functional subgroups. Manual subgroup definition is difficult because of the limited number of experimentally characterized subfamilies with differing specificity, while automatic subgroup partitioning using computational tools is a non-trivial task and does not always yield ideal results. RESULTS: We propose a new approach SPEL (specificity positions by evolutionary likelihood) to detect positions that are likely to be functional specificity determinants. SPEL, which does not require subgroup definition, takes a multiple sequence alignment of a protein family as the only input, and assigns a P-value to every position in the alignment. Positions with low P-values are likely to be important for functional specificity. An evolutionary tree is reconstructed during the calculation, and P-value estimation is based on a random model that involves evolutionary simulations. Evolutionary log-likelihood is chosen as a measure of amino acid distribution at a position. To illustrate the performance of the method, we carried out a detailed analysis of two protein families (LacI/PurR and G protein alpha subunit), and compared our method with two existing methods (evolutionary trace and mutual information based). All three methods were also compared on a set of protein families with known ligand-bound structures. AVAILABILITY: SPEL is freely available for non-commercial use. Its pre-compiled versions for several platforms and alignments used in this work are available at ftp://iole.swmed.edu/pub/SPEL/
Affiliation(s)
- Jimin Pei
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center 5323 Harry Hines Boulevard, Dallas, TX 75390-9050, USA
|
34
|
Minshull J, Ness JE, Gustafsson C, Govindarajan S. Predicting enzyme function from protein sequence. Curr Opin Chem Biol 2005; 9:202-9. [PMID: 15811806 DOI: 10.1016/j.cbpa.2005.02.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Indexed: 11/30/2022]
Abstract
There are two main reasons to try to predict an enzyme's function from its sequence. The first is to identify the components and thus the functional capabilities of an organism, the second is to create enzymes with specific properties. Genomics, expression analysis, proteomics and metabonomics are largely directed towards understanding how information flows from DNA sequence to protein functions within an organism. This review focuses on information flow in the opposite direction: the applicability of what is being learned from natural enzymes to improve methods for catalyst design.
|
35
|
Watson JD, Laskowski RA, Thornton JM. Predicting protein function from sequence and structural data. Curr Opin Struct Biol 2005; 15:275-84. [PMID: 15963890 DOI: 10.1016/j.sbi.2005.04.003] [Citation(s) in RCA: 203] [Impact Index Per Article: 10.2] [Received: 02/04/2005] [Revised: 02/04/2005] [Accepted: 04/18/2005] [Indexed: 10/25/2022]
Abstract
When a protein's function cannot be experimentally determined, it can often be inferred from sequence similarity. Should this process fail, analysis of the protein structure can provide functional clues or confirm tentative functional assignments inferred from the sequence. Many structure-based approaches exist (e.g. fold similarity, three-dimensional templates), but as no single method can be expected to be successful in all cases, a more prudent approach involves combining multiple methods. Several automated servers that integrate evidence from multiple sources have been released this year and particular improvements have been seen with methods utilizing the Gene Ontology functional annotation schema.
Affiliation(s)
- James D Watson
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
|
36
|
Greaves R, Warwicker J. Active site identification through geometry-based and sequence profile-based calculations: burial of catalytic clefts. J Mol Biol 2005; 349:547-57. [PMID: 15882869 DOI: 10.1016/j.jmb.2005.04.018] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Received: 01/14/2005] [Revised: 03/30/2005] [Accepted: 04/08/2005] [Indexed: 12/30/2022]
Abstract
Electrostatics calculations with proteins that are uniformly charged over volume can aid enzyme/non-enzyme discrimination. For known enzymes, such methods locate active sites to within 5% on the enzyme surface, in 77% of a test set. We now report that removing the dielectric boundary improves active site location to 80%, with optimal discrimination between enzymes and non-enzymes of around 80% specificity and 80% sensitivity. This calculation quantifies burial of solvent-accessible regions. Many of the true enzymes incorrectly assigned as non-enzymes have active sites at subunit boundaries. These are missed in monomer-based calculations. Catalytic and non-catalytic antibodies are studied in this context of active/binding site burial. Whilst catalytic antibodies, on average, have marginally higher active site burial than non-catalytic antibodies, these values are generally smaller than for non-antibody enzymes, possibly contributing to their relatively low turnover. Prediction of active site location improves further when sequence profile-based weights replace the uniform charge distribution, so that a combination of burial and amino acid conservation is assessed. Accuracy rises to 93% of active sites to within 5%, in the test set, for the optimal profile weights scheme. The equivalent value in a separate validation set is 89% to within 5%. Enzyme/non-enzyme and enzyme functional site predictions are made for structural genomics proteins, suggesting that a substantial majority of these are non-enzymes.
Affiliation(s)
- Richard Greaves
- Faculty of Life Sciences, Jackson's Mill, University of Manchester, P.O. Box 88, Sackville Street, Manchester M60 1QD, UK
|
37
|
Varrazzo D, Bernini A, Spiga O, Ciutti A, Chiellini S, Venditti V, Bracci L, Niccolai N. Three-dimensional computation of atom depth in complex molecular structures. Bioinformatics 2005; 21:2856-60. [PMID: 15827080 DOI: 10.1093/bioinformatics/bti444] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: For a complex molecular system, the delineation of atom-atom contacts, exposed surface and binding sites represents a fundamental step in predicting its interaction with solvent, ligands and other molecules. Recently, atom depth has also been considered as an additional structural descriptor to correlate protein structure with folding and functional properties. The distance between an atom and the nearest water molecule or the closest surface dot has been proposed as a measure of the atom depth, but, in both cases, the 3D character of depth is largely lost. In the present study, a new approach is proposed to calculate atom depths in a way that takes the molecular shape into account. RESULTS: An algorithm has been developed to calculate intersections between the molecular volume and spheres centered on the atoms whose depth has to be quantified. Many proteins of different sizes and shapes have been chosen to compare the results obtained from distance-based and volume-based depth calculations. From the wealth of experimental data available for hen egg white lysozyme, H/D exchange rates and TEMPOL-induced paramagnetic perturbations have been analyzed both in terms of depth indexes and of atom distances to the solvent-accessible surface. The algorithm proposed here yields better correlations between experimental data and atom depth, particularly for those atoms located near the protein surface. AVAILABILITY: Instructions to obtain source code and the executable program are available either from http://sienabiografix.com or http://sadic.sourceforge.net CONTACT: niccolai@unisi.it SUPPLEMENTARY INFORMATION: http://www.Sienabiogzefix.com/publication.
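The volume-based idea in this abstract, intersecting the molecular volume with a sphere centered on the atom of interest, can be approximated numerically as in the sketch below. It assumes the molecular volume is the union of equal-radius spheres around the atom centers and samples the probe sphere on a regular grid; the function name `occupied_fraction`, the radii, and the grid resolution are all illustrative assumptions, and this is not the published algorithm or its depth index.

```python
import math


def occupied_fraction(center, atom_centers, probe_radius=9.0, atom_radius=1.8, steps=15):
    """Fraction of a probe sphere around `center` that lies inside the molecule.

    The molecule is approximated as the union of spheres of `atom_radius`
    around `atom_centers`. Values near 1 suggest a deeply buried atom,
    values near 0 an exposed one.
    """
    inside_probe = 0
    inside_molecule = 0
    for i in range(steps):
        for j in range(steps):
            for k in range(steps):
                # Grid offsets spanning the cube that encloses the probe sphere.
                offset = [(-1.0 + 2.0 * t / (steps - 1)) * probe_radius for t in (i, j, k)]
                if math.hypot(*offset) > probe_radius:
                    continue  # grid point falls outside the probe sphere
                point = [c + o for c, o in zip(center, offset)]
                inside_probe += 1
                if any(math.dist(point, atom) <= atom_radius for atom in atom_centers):
                    inside_molecule += 1
    return inside_molecule / inside_probe
```

A grid is used here instead of random sampling so that repeated calls give identical results; a finer grid (larger `steps`) improves the estimate at cubic cost.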
Affiliation(s)
- Daniele Varrazzo
- Biomolecular Structure Research Center and Department of Molecular Biology, Università di Siena, I-53100 Siena, Italy
|
38
|
Ko J, Murga LF, André P, Yang H, Ondrechen MJ, Williams RJ, Agunwamba A, Budil DE. Statistical criteria for the identification of protein active sites using theoretical microscopic titration curves. Proteins 2005; 59:183-95. [PMID: 15739204 DOI: 10.1002/prot.20418] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.4] [Indexed: 11/07/2022]
Abstract
Theoretical Microscopic Titration Curves (THEMATICS) may be used to identify chemically important residues in active sites of enzymes by characteristic deviations from the normal, sigmoidal Henderson-Hasselbalch titration behavior. Clusters of such deviant residues in physical proximity constitute reliable predictors of the location of the active site. Originally the residues with deviant predicted behavior were identified by human observation of the computed titration curves. However, it is preferable to select the unusual residues by mathematically well-defined criteria, in order to reduce the chance of error, eliminate any possible biases, and substantially speed up the selection process. Here we present some simple statistical tests that constitute such selection criteria. The first derivatives of the predicted titration curves resemble distribution functions and are normalized. The moments of these first derivative functions are computed. It is shown that the third and fourth moments, measures of asymmetry and kurtosis, respectively, are good measures of the deviations from normal behavior. Results are presented for 44 different enzymes. Detailed results are given for 4 enzymes with 4 different types of chemistry: arginine kinase from Limulus polyphemus (horseshoe crab); beta-lactamase from Escherichia coli; glutamate racemase from Aquifex pyrophilus; and 3-isopropylmalate dehydrogenase from Thiobacillus ferrooxidans. The relationship between the statistical measures of nonsigmoidal behavior in the predicted titration curves and the catalytic activity of the residue is discussed.
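The statistical test outlined in this abstract (normalize the first derivative of a titration curve, then compute its moments) can be illustrated with the sketch below. It uses a plain Henderson-Hasselbalch curve as the "normal" case and a two-step curve as a stand-in for a perturbed residue; both function names and the two-step example are assumptions made for illustration, and the paper's exact normalization and thresholds are not reproduced.

```python
def titration_curve(ph, pka):
    """Henderson-Hasselbalch protonated fraction at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def derivative_moments(ph_values, occupancy):
    """Mean and 2nd to 4th central moments of the normalized -d(occupancy)/d(pH).

    Each interval's drop in occupancy, divided by the total drop, acts as
    the probability mass located at the interval midpoint.
    """
    mids = [(a + b) / 2.0 for a, b in zip(ph_values, ph_values[1:])]
    drops = [y0 - y1 for y0, y1 in zip(occupancy, occupancy[1:])]
    total = sum(drops)
    weights = [d / total for d in drops]
    mean = sum(x * w for x, w in zip(mids, weights))

    def central(k):
        return sum(((x - mean) ** k) * w for x, w in zip(mids, weights))

    return mean, central(2), central(3), central(4)


# pH grid symmetric around 7, wide enough to capture both tails.
ph_grid = [-3.0 + 0.01 * i for i in range(2001)]
normal = [titration_curve(ph, 7.0) for ph in ph_grid]
# Hypothetical perturbed residue: an uneven two-step titration.
perturbed = [
    0.7 * titration_curve(ph, 5.0) + 0.3 * titration_curve(ph, 9.0)
    for ph in ph_grid
]
```

For the symmetric sigmoidal curve the third central moment is close to zero, while the uneven two-step curve gives a clearly positive third moment and a larger second moment, which is the kind of deviation such moment-based criteria are meant to flag.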
Affiliation(s)
- Jaeju Ko
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, Massachusetts, USA
|
39
|
Pazos F, Sternberg MJE. Automated prediction of protein function and detection of functional sites from structure. Proc Natl Acad Sci U S A 2004; 101:14754-9. [PMID: 15456910 PMCID: PMC522026 DOI: 10.1073/pnas.0404569101] [Citation(s) in RCA: 122] [Impact Index Per Article: 5.8] [Received: 06/25/2004] [Indexed: 11/18/2022] Open
Abstract
Current structural genomics projects are yielding structures for proteins whose functions are unknown. Accordingly, there is a pressing requirement for computational methods for function prediction. Here we present PHUNCTIONER, an automatic method for structure-based function prediction using automatically extracted functional sites (residues associated to functions). The method relates proteins with the same function through structural alignments and extracts 3D profiles of conserved residues. Functional features to train the method are extracted from the Gene Ontology (GO) database. The method extracts these features from the entire GO hierarchy and hence is applicable across the whole range of function specificity. 3D profiles associated with 121 GO annotations were extracted. We tested the power of the method both for the prediction of function and for the extraction of functional sites. The success of function prediction by our method was compared with the standard homology-based method. In the zone of low sequence similarity (approximately 15%), our method assigns the correct GO annotation in 90% of the protein structures considered, approximately 20% higher than inheritance of function from the closest homologue.
Affiliation(s)
- Florencio Pazos
- Structural Bioinformatics Group, Biochemistry Building, Department of Biological Sciences, Imperial College London, London SW7 2AZ, UK
|