1
Gu L, Chen T, Li J, Huang YA, Du Z, Leung VCM, Chen J. Hybrid Bayesian Optimization-Based Graphical Discovery for Methylation Sites Prediction. IEEE J Biomed Health Inform 2024; 28:1917-1926. [PMID: 37801389 DOI: 10.1109/jbhi.2023.3322560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/08/2023]
Abstract
Protein methylation is one of the most important reversible post-translational modifications (PTMs), playing a vital role in the regulation of gene expression. Protein methylation sites serve as biomarkers in cardiovascular and pulmonary diseases, influencing various aspects of normal cell biology and pathogenesis. Nonetheless, the majority of existing computational methods for predicting protein methylation sites (PMSP) have been constructed based on protein sequences, with few methods leveraging the topological information of proteins. To address this issue, we propose an innovative framework for predicting Methylation Sites using Graphs (GraphMethySite) that employs a graph convolutional network in conjunction with Bayesian Optimization (BO) to automatically discover the graphical structure surrounding a candidate site and improve the predictive accuracy. To extract the optimal subgraphs associated with methylation sites, we extend GraphMethySite by coupling it with a hybrid Bayesian optimization (together named GraphMethySite+) to determine and visualize the topological relevance among amino-acid residues. We evaluated our framework on two extended protein methylation datasets, and empirical results demonstrate that it outperforms existing state-of-the-art methylation prediction methods.
2
Zhang J, Chen Q, Liu B. iNucRes-ASSH: Identifying nucleic acid-binding residues in proteins by using self-attention-based structure-sequence hybrid neural network. Proteins 2024; 92:395-410. [PMID: 37915276 DOI: 10.1002/prot.26626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Revised: 09/27/2023] [Accepted: 10/17/2023] [Indexed: 11/03/2023]
Abstract
Interaction between proteins and nucleic acids is crucial to many cellular activities. Accurately detecting nucleic acid-binding residues (NABRs) in proteins can help researchers better understand the interaction mechanism between proteins and nucleic acids. Structure-based methods can generally make more accurate predictions than sequence-based methods. However, the existing structure-based methods are sensitive to protein conformational changes, causing limited generalizability. More effective and robust approaches should be further explored. In this study, we propose iNucRes-ASSH to identify nucleic acid-binding residues with a self-attention-based structure-sequence hybrid neural network. It improves the generalizability and robustness of NABR prediction from two levels: residue representation and prediction model. Experimental results show that iNucRes-ASSH can predict the nucleic acid-binding residues even when the experimentally validated structures are unavailable and outperforms five competing methods on a recent benchmark dataset and a widely used test dataset.
Affiliation(s)
- Jun Zhang
- National Engineering Laboratory for Big Data System Computing Technology, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong, China
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Qingcai Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
3
Xia Y, Xia CQ, Pan X, Shen HB. GraphBind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic Acids Res 2021; 49:e51. [PMID: 33577689 PMCID: PMC8136796 DOI: 10.1093/nar/gkab044] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 12/16/2020] [Accepted: 02/09/2021] [Indexed: 11/24/2022] Open
Abstract
Knowledge of the interactions between proteins and nucleic acids is the basis of understanding various biological activities and designing new drugs. How to accurately identify the nucleic-acid-binding residues remains a challenging task. In this paper, we propose an accurate predictor, GraphBind, for identifying nucleic-acid-binding residues on proteins based on an end-to-end graph neural network. Considering that binding sites often behave in highly conservative patterns on local tertiary structures, we first construct graphs based on the structural contexts of target residues and their spatial neighborhood. Then, hierarchical graph neural networks (HGNNs) are used to embed the latent local patterns of structural and bio-physicochemical characteristics for binding residue recognition. We comprehensively evaluate GraphBind on DNA/RNA benchmark datasets. The results demonstrate the superior performance of GraphBind over state-of-the-art methods. Moreover, GraphBind is extended to other ligand-binding residue prediction tasks to verify its generalization capability. The GraphBind web server is freely available at http://www.csbio.sjtu.edu.cn/bioinf/GraphBind/.
Affiliation(s)
- Ying Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Chun-Qiu Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China; School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
4
Oldfield T. Mathematical Data Mining. Crystallogr Rev 2014. [DOI: 10.1080/08893110410001664909] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
5
Nagel K, Jimeno-Yepes A, Rebholz-Schuhmann D. Annotation of protein residues based on a literature analysis: cross-validation against UniProtKb. BMC Bioinformatics 2009; 10 Suppl 8:S4. [PMID: 19758468 PMCID: PMC2745586 DOI: 10.1186/1471-2105-10-s8-s4] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/29/2022] Open
Abstract
Background: A protein annotation database, such as the Universal Protein Resource knowledge base (UniProtKb), is a valuable resource for the validation and interpretation of predicted 3D structure patterns in proteins. Existing studies have focussed on point mutation extraction methods from biomedical literature which can be used to support the time-consuming work of manual database curation. However, these methods were limited to point mutation extraction and do not extract features for the annotation of proteins at the residue level. Results: This work introduces a system that identifies protein residues in MEDLINE abstracts and annotates them with features extracted from the context written in the surrounding text. MEDLINE abstract texts have been processed to identify protein mentions in combination with taxonomic species and protein residues (F1-measure 0.52). The identified protein-species-residue triplets have been validated and benchmarked against reference data resources (UniProtKb, average F1-measure of 0.54). Then, contextual features were extracted through shallow and deep parsing and the features have been classified into predefined categories (F1-measure ranges from 0.15 to 0.67). Furthermore, the feature sets have been aligned with annotation types in UniProtKb to assess the relevance of the annotations for ongoing curation projects. Altogether, the annotations have been assessed automatically and manually against reference data resources. Conclusion: This work proposes a solution for the automatic extraction of functional annotation for protein residues from biomedical articles. The presented approach is an extension to other existing systems in that a wider range of residue entities are considered and that features of residues are extracted as annotations.
Affiliation(s)
- Kevin Nagel
- European Bioinformatics Institute, Hinxton, Cambridge, UK.
6
Skolnick J, Brylinski M. FINDSITE: a combined evolution/structure-based approach to protein function prediction. Brief Bioinform 2009; 10:378-91. [PMID: 19324930 DOI: 10.1093/bib/bbp017] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.2] [Indexed: 11/13/2022] Open
Abstract
A key challenge of the post-genomic era is the identification of the function(s) of all the molecules in a given organism. Here, we review the status of sequence and structure-based approaches to protein function inference and ligand screening that can provide functional insights for a significant fraction of the approximately 50% of ORFs of unassigned function in an average proteome. We then describe FINDSITE, a recently developed algorithm for ligand binding site prediction, ligand screening and molecular function prediction, which is based on binding site conservation across evolutionarily distant proteins identified by threading. Importantly, FINDSITE gives comparable results whether high-resolution experimental structures or predicted protein models are used.
Affiliation(s)
- Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology 250 14th St NW, Atlanta, GA 30318, USA.
7
Gherardini PF, Helmer-Citterich M. Structure-based function prediction: approaches and applications. Briefings in Functional Genomics and Proteomics 2008; 7:291-302. [PMID: 18599513 DOI: 10.1093/bfgp/eln030] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.6] [Indexed: 11/12/2022]
Abstract
The ever increasing number of protein structures determined by structural genomic projects has spurred much interest in the development of methods for structure-based function prediction. Existing methods can be roughly classified in two groups: some use a comparative approach looking for the presence of structural motifs possibly associated with a known biochemical function. Other methods try to identify functional patches on the surface of a protein using only its physicochemical characteristics. This review will cover both kinds of approaches to structure-based function prediction as well as their use in real-world cases. The main issues and limitations in using protein structure to predict function will also be discussed. These are mainly: the assessment of the statistical significance of structural similarities and the extent to which these methods depend on the accuracy and availability of structural data.
Affiliation(s)
- Pier Federico Gherardini
- Department of Biology, Centre for Molecular Bioinformatics, University of Tor Vergata, Rome, Italy.
8
Rossi A, Marti-Renom MA, Sali A. Localization of binding sites in protein structures by optimization of a composite scoring function. Protein Sci 2006; 15:2366-80. [PMID: 16963645 PMCID: PMC2242385 DOI: 10.1110/ps.062247506] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 10/24/2022]
Abstract
The rise in the number of functionally uncharacterized protein structures is increasing the demand for structure-based methods for functional annotation. Here, we describe a method for predicting the location of a binding site of a given type on a target protein structure. The method begins by constructing a scoring function, followed by a Monte Carlo optimization, to find a good scoring patch on the protein surface. The scoring function is a weighted linear combination of the z-scores of various properties of protein structure and sequence, including amino acid residue conservation, compactness, protrusion, convexity, rigidity, hydrophobicity, and charge density; the weights are calculated from a set of previously identified instances of the binding-site type on known protein structures. The scoring function can easily incorporate different types of information useful in localization, thus increasing the applicability and accuracy of the approach. To test the method, 1008 known protein structures were split into 20 different groups according to the type of the bound ligand. For nonsugar ligands, such as various nucleotides, binding sites were correctly identified in 55%-73% of the cases. The method is completely automated (http://salilab.org/patcher) and can be applied on a large scale in a structural genomics setting.
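The composite score described in this abstract (a weighted linear combination of per-property z-scores) can be illustrated with a minimal sketch. The property names, values, and weights below are hypothetical and not taken from the paper, and the Monte Carlo patch search is replaced here by a plain maximum over three pre-enumerated candidate patches.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize raw property values across candidate patches."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def composite_scores(properties, weights):
    """Weighted sum of z-scores; one total per candidate patch.

    properties: dict mapping property name -> list of raw values (one per patch)
    weights:    dict mapping property name -> weight (hypothetical values here)
    """
    n_patches = len(next(iter(properties.values())))
    totals = [0.0] * n_patches
    for name, values in properties.items():
        for i, z in enumerate(z_scores(values)):
            totals[i] += weights[name] * z
    return totals

# Hypothetical toy data: two properties measured on three candidate patches.
properties = {
    "conservation": [0.2, 0.9, 0.5],
    "hydrophobicity": [0.1, 0.4, 0.7],
}
weights = {"conservation": 0.7, "hydrophobicity": 0.3}

scores = composite_scores(properties, weights)
# Exhaustive maximum stands in for the Monte Carlo optimization.
best_patch = max(range(len(scores)), key=scores.__getitem__)
```

Because each property is standardized before weighting, properties measured on different scales contribute comparably, and the weights alone determine their relative influence.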
Affiliation(s)
- Andrea Rossi
- Department of Biopharmaceutical Sciences and Pharmaceutical Chemistry, California Institute for Quantitative Biomedical Research, University of California, San Francisco, California 94143-2552, USA.
9
Abstract
MOTIVATION: With the increasing availability of protein structures, the generation of biologically meaningful 3D patterns from the simultaneous alignment of several protein structures is an exciting prospect: active sites could be better understood, and protein functions and protein 3D structures could be predicted more accurately. Although patterns can already be generated at the fold and topological levels, no system produces high-resolution 3D patterns including atom and cavity positions. To address this challenge, our research focuses on generating patterns from proteins with rigid prosthetic groups. Since these groups are key elements of protein active sites, the generated 3D patterns are expected to be biologically meaningful. RESULTS: In this paper, we present a new approach which allows the generation of 3D patterns from proteins with rigid prosthetic groups. Using 237 protein chains representing proteins containing porphyrin rings, our method was validated by comparing 3D templates generated from homologues with the 3D structure of the proteins they model. Atom positions were predicted reliably: 93% of them had an accuracy of 1.00 Å or less. Moreover, similar results were obtained regarding chemical group and cavity positions. Results also suggested our system could contribute to the validation of 3D protein models. Finally, a 3D template was generated for the active site of human cytochrome P450 CYP17, the 3D structure of which is unknown. Its analysis showed that it is biologically meaningful: our method detected the main patterns of the cytochrome P450 superfamily and the motifs linked to catalytic reactions. The 3D template also suggested the position of a residue, which could be involved in a hydrogen bond with CYP17 substrates, and the shape and location of a cavity. Comparisons with independently generated 3D models supported these hypotheses. AVAILABILITY: Alignment software (Nestor3D) is available at http://www.kingston.ac.uk/~ku33185/Nestor3D.html
Affiliation(s)
- Jean-Christophe Nebel
- Faculty of Computing, Information Systems & Mathematics, Kingston University Kingston-upon-Thames, Surrey KT1 2EE, UK.
10
Golovin A, Dimitropoulos D, Oldfield T, Rachedi A, Henrick K. MSDsite: a database search and retrieval system for the analysis and viewing of bound ligands and active sites. Proteins 2006; 58:190-9. [PMID: 15468317 DOI: 10.1002/prot.20288] [Citation(s) in RCA: 81] [Impact Index Per Article: 4.5] [Indexed: 11/12/2022]
Abstract
The three-dimensional environments of ligand binding sites have been derived from the parsing and loading of the PDB entries into a relational database. For each bound molecule the biological assembly of the quaternary structure has been used to determine all contact residues and a fast interactive search and retrieval system has been developed. Prosite pattern and short sequence search options are available together with a novel graphical query generator for inter-residue contacts. The database and its query interface are accessible from the Internet through a web server located at: http://www.ebi.ac.uk/msd-srv/msdsite.
Affiliation(s)
- Adel Golovin
- EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
11
Abstract
MOTIVATION: Function inference from structure is facilitated by the use of patterns of residues (3D motifs), normally identified by expert knowledge, that correlate with function. As an alternative to often limited expert knowledge, we use machine-learning techniques to identify patterns of 3-10 residues that maximize function prediction. This approach allows us to test the assumption that residues that provide function are the most informative for predicting function. RESULTS: We apply our method, GASPS, to the haloacid dehalogenase, enolase, amidohydrolase and crotonase superfamilies and to the serine proteases. The motifs found by GASPS are as good at function prediction as 3D motifs based on expert knowledge. The GASPS motifs with the greatest ability to predict protein function consist mainly of known functional residues. However, several residues with no known functional role are equally predictive. For four groups, we show that the predictive power of our 3D motifs is comparable with or better than approaches that use the entire fold (Combinatorial-Extension) or sequence profiles (PSI-BLAST). AVAILABILITY: Source code is freely available for academic use by contacting the authors. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Benjamin J Polacco
- Department of Biopharmaceutical Sciences, University of California, San Francisco, 94143-2250, USA
12
Laskowski RA, Watson JD, Thornton JM. Protein function prediction using local 3D templates. J Mol Biol 2005; 351:614-26. [PMID: 16019027 DOI: 10.1016/j.jmb.2005.05.067] [Citation(s) in RCA: 119] [Impact Index Per Article: 6.3] [Received: 02/16/2005] [Revised: 05/26/2005] [Accepted: 05/30/2005] [Indexed: 10/25/2022]
Abstract
The prediction of a protein's function from its 3D structure is becoming more and more important as the worldwide structural genomics initiatives gather pace and continue to solve 3D structures, many of which are of proteins of unknown function. Here, we present a methodology for predicting function from structure that shows great promise. It is based on 3D templates that are defined as specific 3D conformations of small numbers of residues. We use four types of template, covering enzyme active sites, ligand-binding residues, DNA-binding residues and reverse templates. The latter are templates generated from the target structure itself and scanned against a representative subset of all known protein structures. Together, the templates provide a fairly thorough coverage of the known structures and ensure that if there is a match to a known structure it is unlikely to be missed. A new scoring scheme provides a highly sensitive means of discriminating between true positive and false positive template matches. In all, the methodology provides a powerful new tool for function prediction to complement those already in use.
Affiliation(s)
- Roman A Laskowski
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
13
Watson JD, Laskowski RA, Thornton JM. Predicting protein function from sequence and structural data. Curr Opin Struct Biol 2005; 15:275-84. [PMID: 15963890 DOI: 10.1016/j.sbi.2005.04.003] [Citation(s) in RCA: 198] [Impact Index Per Article: 10.4] [Received: 02/04/2005] [Revised: 02/04/2005] [Accepted: 04/18/2005] [Indexed: 10/25/2022]
Abstract
When a protein's function cannot be experimentally determined, it can often be inferred from sequence similarity. Should this process fail, analysis of the protein structure can provide functional clues or confirm tentative functional assignments inferred from the sequence. Many structure-based approaches exist (e.g. fold similarity, three-dimensional templates), but as no single method can be expected to be successful in all cases, a more prudent approach involves combining multiple methods. Several automated servers that integrate evidence from multiple sources have been released this year and particular improvements have been seen with methods utilizing the Gene Ontology functional annotation schema.
Affiliation(s)
- James D Watson
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
14
Torrance JW, Bartlett GJ, Porter CT, Thornton JM. Using a library of structural templates to recognise catalytic sites and explore their evolution in homologous families. J Mol Biol 2005; 347:565-81. [PMID: 15755451 DOI: 10.1016/j.jmb.2005.01.044] [Citation(s) in RCA: 101] [Impact Index Per Article: 5.3] [Received: 12/14/2004] [Revised: 01/13/2005] [Accepted: 01/19/2005] [Indexed: 11/20/2022]
Abstract
Catalytic site structure is normally highly conserved between distantly related enzymes. As a consequence, templates representing catalytic sites have the potential to succeed at function prediction in cases where methods based on sequence or overall structure fail. There are many methods for searching protein structures for matches to structural templates, but few validated template libraries to use with these methods. We present a library of structural templates representing catalytic sites, based on information from the scientific literature. Furthermore, we analyse homologous template families to discover the diversity within families and the utility of templates for active site recognition. Templates representing the catalytic sites of homologous proteins mostly differ by less than 1 Å root mean square deviation, even when the sequence similarity between the two proteins is low. Within these sets of homologues there is usually no discernible relationship between catalytic site structure similarity and sequence similarity. Because of this structural conservation of catalytic sites, the templates can discriminate between matches to related proteins and random matches with over 85% sensitivity and predictive accuracy. Templates based on protein backbone positions are more discriminating than those based on side-chain atoms. These analyses show encouraging prospects for prediction of functional sites in structural genomics structures of unknown function, and will be of use in analyses of convergent evolution and exploring relationships between active site geometry and chemistry. The template library can be queried via a web server at and is available for download.
Affiliation(s)
- James W Torrance
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
15
Huang HC, Nagaswamy U, Fox GE. The application of cluster analysis in the intercomparison of loop structures in RNA. RNA (New York, N.Y.) 2005; 11:412-423. [PMID: 15769871 PMCID: PMC1370731 DOI: 10.1261/rna.7104605] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Received: 06/15/2004] [Accepted: 12/18/2004] [Indexed: 05/24/2023]
Abstract
We have developed a computational approach for the comparison and classification of RNA loop structures. Hairpin or interior loops identified in atomic resolution RNA structures were intercompared by conformational matching. The root-mean-square deviation (RMSD) values between all pairs of RNA fragments of interest, even if from different molecules, are calculated. Subsequently, cluster analysis is performed on the resulting matrix of RMSD distances using the unweighted pair group method with arithmetic mean (UPGMA). The cluster analysis objectively reveals groups of folds that resemble one another. To demonstrate the utility of the approach, a comprehensive analysis of all the terminal hairpin tetraloops that have been observed in 15 RNA structures that have been determined by X-ray crystallography was undertaken. The method found major clusters corresponding to the well-known GNRA and UNCG types. In addition, two tetraloops with the unusual primary sequence UMAC (M is A or C) were successfully assigned to the GNRA cluster. Larger loop structures were also examined and the clustering results confirmed the occurrence of variations of the GNRA and UNCG tetraloops in these loops and provided a systematic means for locating them. Nineteen examples of larger loops that closely resemble either the GNRA or UNCG tetraloop were found in the large ribosomal RNAs. When the clustering approach was extended to include all structures in the SCOR database, novel relationships were detected including one between the ANYA motif and a less common folding of the GAAA tetraloop sequence.
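The two steps named in this abstract (an all-pairs RMSD matrix followed by UPGMA clustering) can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' pipeline: the coordinates are invented, the fragments are assumed to be already superposed (no conformational matching or optimal superposition is performed), and the function names are hypothetical.

```python
import math
from itertools import combinations

def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) points.
    Assumes the fragments are already superposed."""
    sq = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(sq / len(a))

def upgma(dist, n):
    """Average-linkage (UPGMA) clustering of n items.

    dist maps frozenset({i, j}) -> distance.
    Returns merges as (members_a, members_b, merge_distance).
    """
    clusters = {i: (i,) for i in range(n)}
    d = dict(dist)
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))]
        members_a, members_b = clusters.pop(a), clusters.pop(b)
        # Size-weighted average gives the unweighted mean over all member pairs.
        for c in clusters:
            d[frozenset((next_id, c))] = (
                len(members_a) * d[frozenset((a, c))]
                + len(members_b) * d[frozenset((b, c))]
            ) / (len(members_a) + len(members_b))
        clusters[next_id] = members_a + members_b
        merges.append((members_a, members_b, height))
        next_id += 1
    return merges

# Hypothetical toy "loops": two points each, forming two obvious groups.
loops = [
    [(0, 0, 0), (1, 0, 0)],
    [(0, 0, 0), (1, 0.1, 0)],
    [(0, 0, 0), (0, 3, 0)],
    [(0, 0, 0), (0, 3.2, 0)],
]
dist = {frozenset((i, j)): rmsd(loops[i], loops[j])
        for i, j in combinations(range(len(loops)), 2)}
merges = upgma(dist, len(loops))
```

Reading `merges` from first to last gives the dendrogram from the leaves upward; cutting it at a chosen RMSD threshold yields the groups of similar folds.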
Affiliation(s)
- Hung-Chung Huang
- Department of Biology and Biochemistry, Houston Science Center, Room 402, 3201 Cullen Blvd., University of Houston, Houston, TX 77204, USA
16
Brakoulias A, Jackson RM. Towards a structural classification of phosphate binding sites in protein-nucleotide complexes: an automated all-against-all structural comparison using geometric matching. Proteins 2004; 56:250-60. [PMID: 15211509 DOI: 10.1002/prot.20123] [Citation(s) in RCA: 93] [Impact Index Per Article: 4.7] [Indexed: 11/11/2022]
Abstract
A method is described for the rapid comparison of protein binding sites using geometric matching to detect similar three-dimensional structure. The geometric matching detects common atomic features through identification of the maximum common sub-graph or clique. These features are not necessarily evident from sequence or from global structural similarity, giving additional insight into molecular recognition not evident from current sequence or structural classification schemes. Here we use the method to produce an all-against-all comparison of phosphate binding sites in a number of different nucleotide phosphate-binding proteins. The similarity search is combined with clustering of similar sites to allow a preliminary structural classification. Clustering by site similarity produces a classification of binding sites for the 476 representative local environments, producing ten main clusters representing half of the representative environments. The similarities make sense in terms of both structural and functional classification schemes. The ten main clusters represent a very limited number of unique structural binding motifs for phosphate. These are the structural P-loop, di-nucleotide binding motif [FAD/NAD(P)-binding and Rossmann-like fold] and FAD-binding motif. Similar classification schemes for nucleotide binding proteins have also been arrived at independently by others using different methods.
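Clique-based geometric matching of the kind mentioned in this abstract is commonly set up as a correspondence graph whose maximum clique gives the largest mutually consistent set of atom pairings. The sketch below is a generic textbook formulation using a basic Bron-Kerbosch search, not the authors' implementation; the atom labels, coordinates, distance tolerance, and function names are all hypothetical.

```python
import math
from itertools import combinations

def build_correspondence_graph(site_a, site_b, tol=0.5):
    """Nodes pair same-label atoms (i from A, j from B); edges join pairings
    whose intra-site distances agree within tol."""
    nodes = [(i, j)
             for i, (label_a, _) in enumerate(site_a)
             for j, (label_b, _) in enumerate(site_b)
             if label_a == label_b]
    adj = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # an atom may be matched at most once
        dist_a = math.dist(site_a[i][1], site_a[k][1])
        dist_b = math.dist(site_b[j][1], site_b[l][1])
        if abs(dist_a - dist_b) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj

def max_clique(adj):
    """Basic Bron-Kerbosch enumeration; returns one largest clique."""
    best = []

    def expand(current, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(current) > len(best):
                best = list(current)
            return
        for v in list(candidates):
            expand(current + [v], candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand([], set(adj), set())
    return best

# Hypothetical sites: site_b is site_a shifted by 10 along x, plus one extra atom.
site_a = [("N", (0, 0, 0)), ("O", (3, 0, 0)), ("N", (0, 4, 0))]
site_b = [("N", (10, 0, 0)), ("O", (13, 0, 0)), ("N", (10, 4, 0)),
          ("C", (20, 20, 20))]

adj = build_correspondence_graph(site_a, site_b)
matching = sorted(max_clique(adj))
```

Because only internal distances are compared, the match is independent of how the two sites are positioned or oriented in space. Plain Bron-Kerbosch is exponential in the worst case, so this form is only suitable for small sites.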
Affiliation(s)
- Andreas Brakoulias
- Department of Biochemistry & Molecular Biology, University College London, Gower Street, London, England
17
Lütteke T, Frank M, von der Lieth CW. Data mining the protein data bank: automatic detection and assignment of carbohydrate structures. Carbohydr Res 2004; 339:1015-20. [PMID: 15010309 DOI: 10.1016/j.carres.2003.09.038] [Citation(s) in RCA: 91] [Impact Index Per Article: 4.6] [Received: 07/06/2003] [Accepted: 09/15/2003] [Indexed: 11/26/2022]
Abstract
Knowledge of the 3D structure of glycans is a prerequisite for a complete understanding of the biological processes glycoproteins are involved in. However, due to a lack of standardised nomenclature, carbohydrate compounds are difficult to locate within the Protein Data Bank (PDB). Using an algorithm that detects carbohydrate structures only requiring element types and atom coordinates, we were able to detect 1663 entries containing a total of 5647 carbohydrate chains. The majority of chains are found to be N-glycosidically bound. Noncovalently bound ligands are also frequent, while O-glycans form a minority. About 30% of all carbohydrate containing PDB entries comprise one or several errors. The automatic assignment of carbohydrate structures in PDB entries will improve the cross-linking of glycobiology resources with genomic and proteomic data collections, which will be an important issue of the upcoming glycomics projects. By aiding in detection of erroneous annotations and structures, the algorithm might also help to increase database quality.
Affiliation(s)
- Thomas Lütteke
- Central Spectroscopic Department, German Cancer Research Center, INF 280, D-69120 Heidelberg, Germany.
18
Abstract
Here we describe various methods currently under development aimed at identifying a protein's function from its three-dimensional structure. We are combining a number of these methods to create a pipeline of applications, called ProFunc, which will take a given 3D structure, run all the applications on it and compile and summarise the results obtained. The aim is to provide a best guess as to the protein's function from the evidence provided by the different methods. Here we present three examples, using structures solved by the Midwest Center for Structural Genomics consortium, illustrating the strengths and weaknesses of current approaches.
Affiliation(s)
- Roman A Laskowski
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
19
Chelli R, Gervasio FL, Procacci P, Schettino V. Inter-residue and solvent-residue interactions in proteins: a statistical study on experimental structures. Proteins 2004; 55:139-51. [PMID: 14997548 DOI: 10.1002/prot.20030] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
Abstract
A large set of protein structures resolved by X-ray or NMR techniques has been extracted from the Protein Data Bank and analyzed using statistical methods. In particular, we investigate the interactions between side chains and the interactions between solvent and side chains, pointing out the possibility of including the solvent as part of a knowledge-based potential. The solvent-residue contacts are accounted for on the basis of Voronoi polyhedron analysis. Our investigation confirms the importance of hydrophobic residues in determining protein stability. We observe that, in general, hydrophobic-hydrophobic interactions and, more specifically, aromatic-aromatic contacts tend to involve residues that are far apart in the primary sequence, thus connecting distinct secondary structure elements. A simple relation expressing the dependence of the protein free energy on the number of residues is proposed. This relation includes both the residue-residue and the solvent-residue contributions. The former is dominant for large proteins, whereas for small proteins (fewer than 100 residues) the two terms are comparable. Gapless threading experiments show that the solvent-residue knowledge-based potential contributes significantly to discriminating the native structure of proteins. This contribution is especially important for small proteins and is similar to that given by the most favorable residue-residue knowledge-based potential for hydrophobic-hydrophobic interactions, such as isoleucine-leucine. In general, including the solvent-residue interaction produces a relevant increase in the free energy gap between the native structures and decoys.
Affiliation(s)
- Riccardo Chelli
- Dipartimento di Chimica, Università di Firenze, Via della Lastruccia 3, 50019 Sesto Fiorentino, Italy
|
20
|
Tendulkar AV, Wangikar PP, Sohoni MA, Samant VV, Mone CY. Parameterization and Classification of the Protein Universe via Geometric Techniques. J Mol Biol 2003; 334:157-72. [PMID: 14596807 DOI: 10.1016/j.jmb.2003.09.021] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
We present a scheme for the classification of 3487 non-redundant protein structures into 1207 non-hierarchical clusters by using recurring structural patterns of three to six amino acids as keys of classification. This results in several signature patterns, which seem to decide membership of a protein in a functional category. The patterns provide clues to the key residues involved in functional sites as well as in protein-protein interaction. The discovered patterns include a "glutamate double bridge" of superoxide dismutase, the functional interface of the serine protease and its inhibitor, the interface of homo/hetero dimers, and functional sites of several enzyme families. We use geometric invariants to decide superimposability of structural patterns. This allows the parameterization of patterns and the discovery of recurring patterns via clustering. The geometric invariant-based approach eliminates the computationally explosive step of pair-wise comparison of structures. The results provide a vast resource for biologists for experimental validation of the proposed functional sites, and for the design of synthetic enzymes, inhibitors and drugs.
Affiliation(s)
- Ashish V Tendulkar
- Kanwal Rekhi School of Information Technology, Indian Institute of Technology, Bombay, Powai, Mumbai 400 076, India
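The abstract above says that geometric invariants are used to decide whether structural patterns are superimposable, so that recurring patterns can be found by clustering rather than by pair-wise structure comparison. The abstract does not say which invariants are used. As a minimal sketch only, the Python below assumes one simple choice: the sorted list of pairwise distances between the points of a pattern, quantised into bins. The function names, the `bin_width` parameter, and the coordinates are hypothetical illustrations, not the authors' method.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def distance_signature(points, bin_width=0.5):
    """Return a key that is unchanged by rotating or translating the points.

    Assumption for illustration: the key is the sorted tuple of pairwise
    distances, each quantised to a multiple of bin_width. This does not
    distinguish mirror images and is not a complete shape descriptor.
    """
    distances = sorted(dist(a, b) for a, b in combinations(points, 2))
    return tuple(round(d / bin_width) for d in distances)


def group_by_signature(patterns, bin_width=0.5):
    """Bucket named patterns by signature, avoiding all-against-all comparison."""
    groups = defaultdict(list)
    for name, points in patterns.items():
        groups[distance_signature(points, bin_width)].append(name)
    return dict(groups)


# Hypothetical three-point patterns (coordinates in angstroms).
patterns = {
    "a": [(0, 0, 0), (3, 0, 0), (0, 4, 0)],
    # "b" is "a" rotated 90 degrees about the z axis and shifted by (10, 10, 10).
    "b": [(10, 10, 10), (10, 13, 10), (6, 10, 10)],
    # "c" has a different shape.
    "c": [(0, 0, 0), (3, 0, 0), (0, 6, 0)],
}
groups = group_by_signature(patterns)
```

Each pattern is reduced to a key once, and patterns that share a key land in the same bucket, which is what lets the all-against-all comparison be skipped. Quantisation is a crude tolerance: two near-identical patterns can still fall on opposite sides of a bin boundary.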
|
21
|
Gutteridge A, Bartlett GJ, Thornton JM. Using a neural network and spatial clustering to predict the location of active sites in enzymes. J Mol Biol 2003; 330:719-34. [PMID: 12850142 DOI: 10.1016/s0022-2836(03)00515-1] [Citation(s) in RCA: 136] [Impact Index Per Article: 6.5] [Indexed: 01/13/2023]
Abstract
Structural genomics projects aim to provide a sharp increase in the number of structures of functionally unannotated, and largely unstudied, proteins. Algorithms and tools capable of deriving information about the nature and location of functional sites within a structure are therefore increasingly useful. Here, a neural network is trained to identify the catalytic residues found in enzymes, based on an analysis of the structure and sequence. The neural network output and spatial clustering of the highly scoring residues are then used to predict the location of the active site. A comparison of the performance of differently trained neural networks is presented that shows how information from sequence and structure combines to improve the prediction accuracy of the network. Spatial clustering of the network results provides a reliable way of finding likely active sites. In over 69% of the test cases the active site is correctly predicted, and a further 25% are partially correctly predicted. The failures are generally due to the poor quality of the automatically generated sequence alignments. We also present predictions identifying the active site and potential functional residues in five recently solved enzyme structures not used in developing the method. The method correctly identifies the putative active site in each case. In most cases the likely functional residues are identified correctly, as well as some potentially novel functional groups.
Affiliation(s)
- Alex Gutteridge
- EBI, Wellcome Trust Genome Campus, EMBL Outstation, Hinxton, Cambridgeshire CB10 1SD, UK
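The abstract above describes spatially clustering the residues that the neural network scores highly and using the clusters to locate the active site. The abstract gives no algorithmic details, so the Python below is only a rough sketch under stated assumptions: residues above a score cut-off are linked when they lie within a distance cut-off (single linkage), and clusters are ranked by summed score. The function name, both cut-off values, the ranking rule, and the residue scores and coordinates are hypothetical and are not taken from the paper.

```python
from math import dist


def cluster_high_scoring(residues, score_cutoff=0.5, distance_cutoff=8.0):
    """Group high-scoring residues into spatial clusters (single linkage).

    residues: list of (name, score, (x, y, z)) tuples.
    Returns a list of clusters, ranked by summed score, highest first.
    Both cut-offs are illustrative assumptions.
    """
    # Keep only residues scored at or above the cut-off.
    selected = [r for r in residues if r[1] >= score_cutoff]
    unvisited = set(range(len(selected)))
    clusters = []
    while unvisited:
        # Start a new cluster and grow it through neighbours within range.
        seed = unvisited.pop()
        members, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [j for j in unvisited
                    if dist(selected[current][2], selected[j][2]) <= distance_cutoff]
            for j in near:
                unvisited.remove(j)
                members.append(j)
                frontier.append(j)
        clusters.append([selected[i] for i in members])
    return sorted(clusters, key=lambda c: sum(r[1] for r in c), reverse=True)


# Hypothetical scores and coordinates (angstroms), for illustration only.
example = [
    ("HIS57", 0.90, (0.0, 0.0, 0.0)),
    ("ASP102", 0.80, (5.0, 0.0, 0.0)),
    ("SER195", 0.95, (5.0, 5.0, 0.0)),
    ("GLY10", 0.20, (1.0, 1.0, 1.0)),      # low score, filtered out
    ("LYS200", 0.70, (40.0, 40.0, 40.0)),  # high score but spatially isolated
]
ranked = cluster_high_scoring(example)
predicted_site = sorted(name for name, _, _ in ranked[0])
```

In this toy input the three neighbouring high-scoring residues form the top-ranked cluster, while the isolated high-scoring residue ends up in a cluster of its own and the low-scoring residue is ignored.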
|
22
|
Abstract
The study of structural genomics and structural proteomics has determined the tertiary structures of many hypothetical proteins, whose molecular functions could not be understood using conventional methods. In order to infer the geometrical location of the functional site, the biochemical function and the biological function of the hypothetical protein, much effort has been made in protein informatics. The importance of heterogeneous databases and various descriptors of amino acid sequences, tertiary structures and pathways on the proteome scale has been emphasised.
Affiliation(s)
- Kengo Kinoshita
- Graduate School of Integrated Science, Yokohama City University, 1-7-29 Suehiro-cho, Turumi-ku, 230-0045, Yokohama, Japan.
|