1
Mizianty MJ, Kurgan L. Sequence-based prediction of protein crystallization, purification and production propensity. Bioinformatics 2011; 27:i24-33. [PMID: 21685077 PMCID: PMC3117383 DOI: 10.1093/bioinformatics/btr229] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.5] [Indexed: 11/18/2022] Open
Abstract
MOTIVATION X-ray crystallography-based protein structure determination, which accounts for the majority of solved structures, is characterized by relatively low success rates. One solution is to build tools which support selection of targets that are more likely to crystallize. Several in silico methods that predict the propensity of diffraction-quality crystallization from protein chains were developed. We show that the quality of their predictions drops when applied to more recent crystallization trials, which calls for new solutions. We propose a novel approach that alleviates drawbacks of the existing methods by using a recent dataset and an improved protocol to annotate progress along the crystallization process, by predicting the success of the entire process and the steps which result in the failed attempts, and by utilizing a compact and comprehensive set of sequence-derived inputs to generate accurate predictions. RESULTS The proposed PPCpred (predictor of protein Production, Purification and Crystallization) predicts the propensity for production of diffraction-quality crystals, production of crystals, purification and production of the protein material. PPCpred utilizes a comprehensive set of inputs based on energy and hydrophobicity indices, composition of certain amino acid types, predicted disorder, secondary structure and solvent accessibility, and content of certain buried and exposed residues. Our method significantly outperforms alignment-based predictions and several modern crystallization propensity predictors. Receiver operating characteristic (ROC) curves show that PPCpred is particularly useful for users who desire high true positive (TP) rates, i.e. a low rate of mispredictions for solvable chains. Our model reveals several intuitive factors that influence the success of individual steps and the entire crystallization process, including the content of Cys, buried His and Ser, hydrophobic/hydrophilic segments and the number of predicted disordered segments.
AVAILABILITY http://biomine.ece.ualberta.ca/PPCpred/. CONTACT lkurgan@ece.ualberta.ca.
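The ROC analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not PPCpred code: the function names `roc_points` and `area_under_curve` are hypothetical, and the scores and labels in the usage note are invented purely to show how true positive and false positive rates are traced out as the decision threshold is lowered.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    scores: predicted propensities (higher means more likely to succeed)
    labels: 1 for chains that succeeded, 0 for chains that failed
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Rank chains from the highest to the lowest predicted propensity.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Chains with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

For example, `roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])` walks from (0, 0) to (1, 1), and passing the result to `area_under_curve` summarizes the curve as a single number. A user who needs a high TP rate would read off the FPR at the right-hand end of such a curve.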
Affiliation(s)
- Marcin J Mizianty
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada
2
Linial M. Fishing with (Proto)Net-a principled approach to protein target selection. Comp Funct Genomics 2008; 4:542-8. [PMID: 18629007 PMCID: PMC2447289 DOI: 10.1002/cfg.328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2003] [Revised: 08/05/2003] [Accepted: 08/05/2003] [Indexed: 12/02/2022] Open
Abstract
Structural genomics strives to represent the entire protein space. The first step towards achieving this goal is by rationally selecting proteins whose structures have not been determined, but that represent an as yet unknown structural superfamily or fold. Once such a structure is solved, it can be used as a template for modelling homologous proteins. This will aid in unveiling the structural diversity of the protein space. Currently, no reliable method for accurate 3D structural prediction is available when a sequence or a structure homologue is not available. Here we present a systematic methodology for selecting target proteins whose structure is likely to adopt a new, as yet unknown superfamily or fold. Our method takes advantage of a global classification of the sequence space as presented by ProtoNet-3D, which is a hierarchical agglomerative clustering of the proteins of interest (the proteins in Swiss-Prot) along with all solved structures (taken from the PDB). By navigating in the scaffold of ProtoNet-3D, we yield a prioritized list of proteins that are not yet structurally solved, along with the probability of each of the proteins belonging to a new superfamily or fold. The sorted list has been self-validated against real structural data that was not available when the predictions were made. The practical application of using our computational–statistical method to determine novel superfamilies for structural genomics projects is also discussed.
Affiliation(s)
- Michal Linial
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Jerusalem 91904, Israel.
3
Abstract
Many classification schemes for proteins and domains are either hierarchical or semi-hierarchical yet most databases, especially those offering genome-wide analysis, only provide assignments to sequences at one level of their hierarchy. Given an established hierarchy, the problem of assigning new sequences to lower levels of that existing hierarchy is less hard (but no less important) than the initial top level assignment which requires the detection of the most distant relationships. A solution to this problem is described here in the form of a new procedure which can be thought of as a hybrid between pairwise and profile methods. The hybrid method is a general procedure that can be applied to any pre-defined hierarchy, at any level, including in principle multiple sub-levels. It has been tested on the SCOP classification via the SUPERFAMILY database and performs significantly better than either pairwise or profile methods alone. Perhaps the greatest advantage of the hybrid method over other possible approaches to the problem is that within the framework of an existing profile library, the assignments are fully automatic and come at almost no additional computational cost. Hence it has already been applied at the SCOP family level to all genomes in the SUPERFAMILY database, providing a wealth of new data to the biological and bioinformatics communities.
Affiliation(s)
- Julian Gough
- Unite de Bioinformatique Structurale, Institut Pasteur, 25-28 Rue du Docteur Roux, 75724 Paris Cedex 15, Paris, France.
4
Miyazaki S, Kuroda Y, Yokoyama S. Identification of putative domain linkers by a neural network - application to a large sequence database. BMC Bioinformatics 2006; 7:323. [PMID: 16800897 PMCID: PMC1538634 DOI: 10.1186/1471-2105-7-323] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 02/24/2006] [Accepted: 06/27/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The reliable dissection of large proteins into structural domains represents an important issue for structural genomics/proteomics projects. To provide a practical approach to this issue, we tested the ability of a neural network to identify domain linkers from the SWISSPROT database (101602 sequences). RESULTS Our search detected 3009 putative domain linkers adjacent to or overlapping with domains, as defined by sequence similarity to either Protein Data Bank (PDB) or Conserved Domain Database (CDD) sequences. Among these putative linkers, 75% were "correctly" located within 20 residues of a domain terminus, and the remaining 25% were found in the middle of a domain, and probably represented failed predictions. Moreover, our neural network predicted 5124 putative domain linkers in structurally un-annotated regions without sequence similarity to PDB or CDD sequences, which suggests the possible existence of novel structural domains. As a comparison, we performed the same analysis by identifying low-complexity regions (LCR), which are known to encode unstructured polypeptide segments, and observed that the fraction of LCRs that correlate with domain termini is similar to that of domain linkers. However, domain linkers and LCRs appeared to identify different types of domain boundary regions, as only 32% of the putative domain linkers overlapped with LCRs. CONCLUSION Overall, our study indicates that the two methods detect independent and complementary regions, and that the combination of these methods can substantially improve the sensitivity of the domain boundary prediction. This finding should enable the identification of novel structural domains, yielding new targets for large scale protein analyses.
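The "within 20 residues of a domain terminus" criterion in the abstract can be sketched as a small check. The abstract does not say exactly how the distance was measured, so the definition below (gap between the linker segment and the nearest domain start or end, zero if the terminus lies inside the linker) is an assumption, and the names `near_terminus` and `fraction_near_terminus` are hypothetical.

```python
def near_terminus(linker, domains, tolerance=20):
    """Return True if the linker lies within `tolerance` residues of any domain terminus.

    linker:  (start, end) residue positions of a predicted linker, inclusive
    domains: list of (start, end) residue positions of annotated domains
    The distance definition is an assumption made for this sketch.
    """
    l_start, l_end = linker
    for d_start, d_end in domains:
        for terminus in (d_start, d_end):
            if terminus < l_start:
                distance = l_start - terminus
            elif terminus > l_end:
                distance = terminus - l_end
            else:
                distance = 0  # the terminus falls inside the linker
            if distance <= tolerance:
                return True
    return False


def fraction_near_terminus(linkers, domains, tolerance=20):
    """Fraction of predicted linkers that satisfy the terminus criterion."""
    if not linkers:
        raise ValueError("no linkers given")
    hits = sum(near_terminus(linker, domains, tolerance) for linker in linkers)
    return hits / len(linkers)
```

With invented coordinates, `fraction_near_terminus([(100, 110), (150, 160)], [(1, 95), (120, 300)])` counts the first linker as a hit (5 residues from the end of the first domain) and the second as a miss (30 residues from the nearest terminus).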
Affiliation(s)
- Satoshi Miyazaki
- Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
- RIKEN Genomic Sciences Center, 1-7-22, Suehiro-cho, Tsurumi, Yokohama 230-0045, Japan
- Yutaka Kuroda
- Department of Biotechnology and Life Science, Graduate School of Technology, Tokyo University of Agriculture and Technology, 2-24-16, Nakamachi, Koganei, 184-8588, Tokyo, Japan
- Shigeyuki Yokoyama
- Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
- RIKEN Genomic Sciences Center, 1-7-22, Suehiro-cho, Tsurumi, Yokohama 230-0045, Japan
5
Abstract
The ability to form tenable hypotheses regarding the neurobiological basis of normative functions as well as mechanisms underlying neurodegenerative and neuropsychiatric disorders is often limited by the highly complex brain circuitry and the cellular and molecular mosaics therein. The brain is an intricate structure with heterogeneous neuronal and nonneuronal cell populations dispersed throughout the central nervous system. Varied and diverse brain functions are mediated through gene expression, and ultimately protein expression, within these cell types and interconnected circuits. Large-scale high-throughput analysis of gene expression in brain regions and individual cell populations using modern functional genomics technologies has enabled the simultaneous quantitative assessment of dozens to hundreds to thousands of genes. Technical and experimental advances in the accession of tissues, RNA amplification technologies, and the refinement of downstream genetic methodologies including microarray analysis and real-time quantitative PCR have generated a wellspring of informative studies pertinent to understanding brain structure and function. In this review, we outline the advantages as well as some of the potential challenges of applying high throughput functional genomics technologies toward a better understanding of brain tissues and diseases using animal models as well as human postmortem tissues.
6
Todd AE, Marsden RL, Thornton JM, Orengo CA. Progress of Structural Genomics Initiatives: An Analysis of Solved Target Structures. J Mol Biol 2005; 348:1235-60. [PMID: 15854658 DOI: 10.1016/j.jmb.2005.03.037] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.2] [Received: 12/23/2004] [Revised: 02/28/2005] [Accepted: 03/15/2005] [Indexed: 11/27/2022]
Abstract
The explosion in gene sequence data and technological breakthroughs in protein structure determination inspired the launch of structural genomics (SG) initiatives. An often-stated goal of structural genomics is the high-throughput structural characterisation of all protein sequence families, with the long-term hope of significantly impacting on the life sciences, biotechnology and drug discovery. Here, we present a comprehensive analysis of solved SG targets to assess progress of these initiatives. Eleven consortia have contributed 316 non-redundant entries and 323 protein chains to the Protein Data Bank (PDB), and 459 and 393 domains to the CATH and SCOP structure classifications, respectively. The quality and size of these proteins are comparable to those solved in traditional structural biology and, despite huge scope for duplicated efforts, only 14% of targets have a close homologue (≥30% sequence identity) solved by another consortium. Analysis of CATH and SCOP revealed the significant contribution that structural genomics is making to the coverage of superfamilies and folds. A total of 67% of SG domains in CATH are unique, lacking an already characterised close homologue in the PDB, whereas only 21% of non-SG domains are unique. For 29% of domains, structure determination revealed a remote evolutionary relationship not apparent from sequence, and 19% and 11% contributed new superfamilies and folds. The secondary structure class, fold and superfamily distributions of this dataset reflect those of the genomes. The domains fall into 172 different folds and 259 superfamilies in CATH but the distribution is highly skewed. The most populous of these are those that recur most frequently in the genomes. Whilst 11% of superfamilies are bacteria-specific, most are common to all three superkingdoms of life and together the 316 PDB entries have provided new and reliable homology models for 9287 non-redundant gene sequences in 206 completely sequenced genomes.
From the perspective of this analysis, it appears that structural genomics is on track to be a success, and it is hoped that this work will inform future directions of the field.
Affiliation(s)
- Annabel E Todd
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK.
7
Liu J, Hegyi H, Acton TB, Montelione GT, Rost B. Automatic target selection for structural genomics on eukaryotes. Proteins 2004; 56:188-200. [PMID: 15211504 DOI: 10.1002/prot.20012] [Citation(s) in RCA: 60] [Impact Index Per Article: 2.9] [Indexed: 11/06/2022]
Abstract
A central goal of structural genomics is to experimentally determine representative structures for all protein families. At least 14 structural genomics pilot projects are currently investigating the feasibility of high-throughput structure determination; the National Institutes of Health funded nine of these in the United States. Initiatives differ in the particular subset of "all families" on which they focus. At the NorthEast Structural Genomics consortium (NESG), we target eukaryotic protein domain families. The automatic target selection procedure has three aims: 1) identify all protein domain families from currently five entirely sequenced eukaryotic target organisms based on their sequence homology, 2) discard those families that can be modeled on the basis of structural information already present in the PDB, and 3) target representatives of the remaining families for structure determination. To guarantee that all members of one family share a common foldlike region, we had to begin by dissecting proteins into structural domain-like regions before clustering. Our hierarchical approach, CHOP, utilizing homology to PrISM, Pfam-A, and SWISS-PROT chopped the 103,796 eukaryotic proteins/ORFs into 247,222 fragments. Of these fragments, 122,999 appeared suitable targets that were grouped into >27,000 singletons and >18,000 multifragment clusters. Thus, our results suggested that it might be necessary to determine >40,000 structures to minimally cover the subset of five eukaryotic proteomes.
Affiliation(s)
- Jinfeng Liu
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
8
Abstract
Guessing the boundaries of structural domains has been an important and challenging problem in experimental and computational structural biology. Predictions were based on intuition, biochemical properties, statistics, sequence homology and other aspects of predicted protein structure. Here, we introduced CHOPnet, a de novo method that predicts structural domains in the absence of homology to known domains. Our method was based on neural networks and relied exclusively on information available for all proteins. Evaluating sustained performance through rigorous cross-validation on proteins of known structure, we correctly predicted the number of domains in 69% of all proteins. For 50% of the two-domain proteins the centre of the predicted boundary was closer than 20 residues to the boundary assigned from three-dimensional (3D) structures; this was about eight percentage points better than predictions by 'equal split'. Our results appeared to compare favourably with those from previously published methods. CHOPnet may be useful to restrict the experimental testing of different fragments for structure determination in the context of structural genomics.
Affiliation(s)
- Jinfeng Liu
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
9
Marti‐Renom MA, Madhusudhan M, Eswar N, Pieper U, Shen M, Sali A, Fiser A, Mirkovic N, John B, Stuart A. Modeling Protein Structure from its Sequence. Curr Protoc Bioinformatics 2003. [DOI: 10.1002/0471250953.bi0501s03] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Affiliation(s)
- Marc A. Marti‐Renom
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research University of California at San Francisco San Francisco California
- M.S. Madhusudhan
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research University of California at San Francisco San Francisco California
- Narayanan Eswar
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research University of California at San Francisco San Francisco California
- Ursula Pieper
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research University of California at San Francisco San Francisco California
- Min‐yi Shen
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research University of California at San Francisco San Francisco California
- Andrej Sali
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research University of California at San Francisco San Francisco California
- Andras Fiser
- Department of Biochemistry and Seaver Foundation Center for Bioinformatics Albert Einstein College of Medicine Bronx New York
- Nebojsa Mirkovic
- Laboratory of Molecular Biophysics The Rockefeller University New York New York
- Bino John
- Laboratory of Molecular Biophysics The Rockefeller University New York New York
- Ashley Stuart
- Laboratory of Molecular Biophysics The Rockefeller University New York New York
10
Goh CS, Lan N, Echols N, Douglas SM, Milburn D, Bertone P, Xiao R, Ma LC, Zheng D, Wunderlich Z, Acton T, Montelione GT, Gerstein M. SPINE 2: a system for collaborative structural proteomics within a federated database framework. Nucleic Acids Res 2003; 31:2833-8. [PMID: 12771210 PMCID: PMC156730 DOI: 10.1093/nar/gkg397] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.0] [Indexed: 11/14/2022] Open
Abstract
We present version 2 of the SPINE system for structural proteomics. SPINE is available over the web at http://nesg.org. It serves as the central hub for the Northeast Structural Genomics Consortium, allowing collaborative structural proteomics to be carried out in a distributed fashion. The core of SPINE is a laboratory information management system (LIMS) for key bits of information related to the progress of the consortium in cloning, expressing and purifying proteins and then solving their structures by NMR or X-ray crystallography. Originally, SPINE focused on tracking constructs, but, in its current form, it is able to track target sample tubes and store detailed sample histories. The core database comprises a set of standard relational tables and a data dictionary that form an initial ontology for proteomic properties and provide a framework for large-scale data mining. Moreover, SPINE sits at the center of a federation of interoperable information resources. These can be divided into (i) local resources closely coupled with SPINE that enable it to handle less standardized information (e.g. integrated mailing and publication lists), (ii) other information resources in the NESG consortium that are inter-linked with SPINE (e.g. crystallization LIMS local to particular laboratories) and (iii) international archival resources that SPINE links to and passes on information to (e.g. TargetDB at the PDB).
Affiliation(s)
- Chern-Sing Goh
- Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA
11
Jackson DB, Minch E, Munro RE. Bioinformatics. EXS 2003:31-69. [PMID: 12613171 DOI: 10.1007/978-3-0348-7997-2_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2023]
12
Lan N, Montelione GT, Gerstein M. Ontologies for proteomics: towards a systematic definition of structure and function that scales to the genome level. Curr Opin Chem Biol 2003; 7:44-54. [PMID: 12547426 DOI: 10.1016/s1367-5931(02)00020-0] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.0] [Indexed: 11/20/2022]
Abstract
A principal aim of post-genomic biology is elucidating the structures, functions and biochemical properties of all gene products in a genome. However, to adequately comprehend such a large amount of information we need new descriptions of proteins that scale to the genomic level. In short, we need a unified ontology for proteomics. Much progress has been made towards this end, including a variety of approaches to systematic structural and functional classification and initial work towards developing standardized, unified descriptions for protein properties. In relation to function, there is a particularly great diversity of approaches, involving placing a protein in structured hierarchies or more-generalized networks and a recent approach based on circumscribing a protein's function through systematic enumeration of molecular interactions.
Affiliation(s)
- Ning Lan
- Department of Molecular Biophysics, New Haven, CT 06520, USA.
13
Westbrook J, Feng Z, Chen L, Yang H, Berman HM. The Protein Data Bank and structural genomics. Nucleic Acids Res 2003; 31:489-91. [PMID: 12520059 PMCID: PMC165515 DOI: 10.1093/nar/gkg068] [Citation(s) in RCA: 261] [Impact Index Per Article: 11.9] [Indexed: 11/12/2022] Open
Abstract
The Protein Data Bank (PDB; http://www.pdb.org/) continues to be actively involved in various aspects of the informatics of structural genomics projects--developing and maintaining the Target Registration Database (TargetDB), organizing data dictionaries that will define the specification for the exchange and deposition of data with the structural genomics centers and creating software tools to capture data from standard structure determination applications.
Affiliation(s)
- John Westbrook
- Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
14
Abstract
High-throughput sequencing of human genomes and those of important model organisms (mouse, Drosophila melanogaster, Caenorhabditis elegans, fungi, archaea) and bacterial pathogens has laid the foundation for another "big science" initiative in biology. Together, X-ray crystallographers, nuclear magnetic resonance (NMR) spectroscopists, and computational biologists are pursuing high-throughput structural studies aimed at developing a comprehensive three-dimensional view of the protein structure universe. The new science of structural genomics promises more than 10,000 experimental protein structures and millions of calculated homology models of related proteins. The evolutionary underpinnings and technological challenges of automating target selection, protein expression and purification, sample preparation, NMR and X-ray data measurement/analysis, homology modeling, and structure/function annotation are discussed in detail. An informative case study from one of the structural genomics centers funded by the National Institutes of Health and the National Institute of General Medical Sciences (NIH/NIGMS) demonstrates how this experimental/computational pipeline will reveal important links between form and function in biology and provide new insights into evolution and human health and disease.
Affiliation(s)
- Stephen K Burley
- Howard Hughes Medical Institute, Laboratories of Molecular Biophysics, The Rockefeller University, New York New York 10021, USA.
15
Pandit SB, Gosar D, Abhiman S, Sujatha S, Dixit SS, Mhatre NS, Sowdhamini R, Srinivasan N. SUPFAM--a database of potential protein superfamily relationships derived by comparing sequence-based and structure-based families: implications for structural genomics and function annotation in genomes. Nucleic Acids Res 2002; 30:289-93. [PMID: 11752317 PMCID: PMC99061 DOI: 10.1093/nar/30.1.289] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.6] [Indexed: 11/15/2022] Open
Abstract
Members of a superfamily of proteins could result from divergent evolution of homologues with insignificant similarity in the amino acid sequences. A superfamily relationship is detected commonly after the three-dimensional structures of the proteins are determined using X-ray analysis or NMR. The SUPFAM database described here relates two homologous protein families in a multiple sequence alignment database of either known or unknown structure. The present release (1.1), which is the first version of the SUPFAM database, has been derived by analysing Pfam, which is one of the commonly used databases of multiple sequence alignments of homologous proteins. The first step in establishing SUPFAM is to relate Pfam families with the families in PALI, which is an alignment database of homologous proteins of known structure that is derived largely from SCOP. The second step involves relating Pfam families which could not be associated reliably with a protein superfamily of known structure. The profile matching procedure, IMPALA, has been used in these steps. The first step resulted in identification of 1280 Pfam families (out of 2697, i.e. 47%) which are related, either by close homologous connection to a SCOP family or by distant relationship to a SCOP family, potentially forming new superfamily connections. Using the profiles of 1417 Pfam families with apparently no structural information, an all-against-all comparison involving a sequence-profile match using IMPALA resulted in clustering of 67 homologous protein families of Pfam into 28 potential new superfamilies. Expansion of groups of related proteins of yet unknown structural information, as proposed in SUPFAM, should help in identifying 'priority proteins' for structure determination in structural genomics initiatives to expand the coverage of structural information in the protein sequence space. 
For example, we could assign 858 distinct Pfam domains in 2203 of the gene products in the genome of Mycobacterium tuberculosis. Fifty-one of these Pfam families of unknown structure could be clustered into 17 potentially new superfamilies forming good targets for structural genomics. The SUPFAM database can be accessed at http://pauling.mbu.iisc.ernet.in/~supfam.
Affiliation(s)
- Shashi B Pandit
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
16
Lo Conte L, Brenner SE, Hubbard TJP, Chothia C, Murzin AG. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res 2002; 30:264-7. [PMID: 11752311 PMCID: PMC99154 DOI: 10.1093/nar/30.1.264] [Citation(s) in RCA: 355] [Impact Index Per Article: 15.4] [Indexed: 11/13/2022] Open
Abstract
The SCOP (Structural Classification of Proteins) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. Protein domains in SCOP are grouped into species and hierarchically classified into families, superfamilies, folds and classes. Recently, we introduced a new set of features with the aim of standardizing access to the database, and providing a solid basis to manage the increasing number of experimental structures expected from structural genomics projects. These features include: a new set of identifiers, which uniquely identify each entry in the hierarchy; a compact representation of protein domain classification; a new set of parseable files, which fully describe all domains in SCOP and the hierarchy itself. These new features are reflected in the ASTRAL compendium. The SCOP search engine has also been updated, and a set of links to external resources added at the level of domain entries. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.
Affiliation(s)
- Loredana Lo Conte
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK.
17
Abstract
Structural genomics projects aim to provide an experimental or computational three-dimensional model structure for all of the tractable macromolecules that are encoded by complete genomes. To this end, pilot centres worldwide are now exploring the feasibility of large-scale structure determination. Their experimental structures and computational models are expected to yield insight into the molecular function and mechanism of thousands of proteins. The pervasiveness of this information is likely to change the use of structure in molecular biology and biochemistry.
Affiliation(s)
- S E Brenner
- Department of Plant and Microbial Biology, University of California, 461A Koshland Hall, Berkeley, California 94720-3102, USA.
18
Bertone P, Kluger Y, Lan N, Zheng D, Christendat D, Yee A, Edwards AM, Arrowsmith CH, Montelione GT, Gerstein M. SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics. Nucleic Acids Res 2001; 29:2884-98. [PMID: 11433035 PMCID: PMC55760 DOI: 10.1093/nar/29.13.2884] [Citation(s) in RCA: 95] [Impact Index Per Article: 4.0] [Indexed: 11/14/2022] Open
Abstract
High-throughput structural proteomics is expected to generate considerable amounts of data on the progress of structure determination for many proteins. For each protein this includes information about cloning, expression, purification, biophysical characterization and structure determination via NMR spectroscopy or X-ray crystallography. It will be essential to develop specifications and ontologies for standardizing this information to make it amenable to retrospective analysis. To this end we created the SPINE database and analysis system for the Northeast Structural Genomics Consortium. SPINE, which is available at bioinfo.mbb.yale.edu/nesg or nesg.org, is specifically designed to enable distributed scientific collaboration via the Internet. It was designed not just as an information repository but as an active vehicle to standardize proteomics data in a form that would enable systematic data mining. The system features an intuitive user interface for interactive retrieval and modification of expression construct data, query forms designed to track global project progress and external links to many other resources. Currently the database contains experimental data on 985 constructs, of which 740 are drawn from Methanobacterium thermoautotrophicum, 123 from Saccharomyces cerevisiae, 93 from Caenorhabditis elegans and the remainder from other organisms. We developed a comprehensive set of data mining features for each protein, including several related to experimental progress (e.g. expression level, solubility and crystallization) and 42 based on the underlying protein sequence (e.g. amino acid composition, secondary structure and occurrence of low complexity regions). We demonstrate in detail the application of a particular machine learning approach, decision trees, to the tasks of predicting a protein's solubility and propensity to crystallize based on sequence features. 
We are able to extract a number of key rules from our trees, in particular that soluble proteins tend to have significantly more acidic residues and fewer hydrophobic stretches than insoluble ones. One of the characteristics of proteomics data sets, currently and in the foreseeable future, is their intermediate size (approximately 500-5000 data points). This creates a number of issues in relation to error estimation. Initially we estimate the overall error in our trees based on standard cross-validation. However, this leaves out a significant fraction of the data in model construction and does not give error estimates on individual rules. Therefore, we present alternative methods to estimate the error in particular rules.
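The two sequence properties named in the rule above (acidic residue content and hydrophobic stretches) can be computed directly from a protein chain. The following is a minimal sketch, not the SPINE feature set itself: the choice of hydrophobic residues and the function names are assumptions made for illustration.

```python
ACIDIC = set("DE")
# Assumption: a simple hydrophobic alphabet; the abstract does not define one.
HYDROPHOBIC = set("AVLIMFWC")


def acidic_fraction(seq: str) -> float:
    """Fraction of residues that are Asp or Glu (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(res in ACIDIC for res in seq) / len(seq)


def longest_hydrophobic_stretch(seq: str) -> int:
    """Length of the longest run of consecutive hydrophobic residues."""
    best = current = 0
    for res in seq.upper():
        current = current + 1 if res in HYDROPHOBIC else 0
        best = max(best, current)
    return best


features = {
    "acidic_fraction": acidic_fraction("MKVLLIAEDD"),
    "longest_hydrophobic_stretch": longest_hydrophobic_stretch("MKVLLIAEDD"),
}
```

Features of this kind could then be passed to any decision tree learner; the thresholds a tree would place on them depend on the training data and are not given in the abstract.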
Affiliation(s)
- P Bertone
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520, USA
|
19
|
Abstract
The exponentially increased sequence information on major histocompatibility complex (MHC) alleles points to the existence of a high degree of polymorphism within them. To understand the functional consequences of MHC alleles, 36 nonredundant MHC-peptide complexes in the protein data bank (PDB) were examined. Induced fit molecular recognition patterns such as those in MHC-peptide complexes are governed by numerous rules. The 36 complexes were clustered into 19 subgroups based on allele specificity and peptide length. The subgroups were further analyzed for identifying common features in MHC-peptide binding pattern. The four major observations made during the investigation were: (1) the positional preference of peptide residues defined by percentage burial upon complex formation is shown for all the 19 subgroups and the burial profiles within entries in a given subgroup are found to be similar; (2) in class I specific 8- and 9-mer peptides, the fourth residue is consistently solvent exposed, however this observation is not consistent in class I specific 10-mer peptides; (3) an anchor-shift in positional preference is observed towards the C terminal as the peptide length increases in class II specific peptides; and (4) peptide backbone atoms are proportionately dominant at the MHC-peptide interface.
Affiliation(s)
- P Kangueane
- BioInformatics Centre, National University of Singapore, Singapore.
|
20
|
Yokoyama S, Matsuo Y, Hirota H, Kigawa T, Shirouzu M, Kuroda Y, Kurumizaka H, Kawaguchi S, Ito Y, Shibata T, Kainosho M, Nishimura Y, Inoue Y, Kuramitsu S. Structural genomics projects in Japan. Progress in Biophysics and Molecular Biology 2001; 73:363-76. [PMID: 11063781 DOI: 10.1016/s0079-6107(00)00012-2] [Citation(s) in RCA: 46] [Impact Index Per Article: 1.9] [Indexed: 10/17/2022]
Affiliation(s)
- S Yokoyama
- RIKEN Genomic Sciences Center, 1-7-22 Suehiro-cho, Tsurumi-ku, 230-0045, Yokohama, Japan.
|
21
|
Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annual Review of Biophysics and Biomolecular Structure 2001; 29:291-325. [PMID: 10940251 DOI: 10.1146/annurev.biophys.29.1.291] [Citation(s) in RCA: 2376] [Impact Index Per Article: 99.0] [Indexed: 11/09/2022]
Abstract
Comparative modeling predicts the three-dimensional structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. The number of protein sequences that can be modeled and the accuracy of the predictions are increasing steadily because of the growth in the number of known protein structures and because of the improvements in the modeling software. Further advances are necessary in recognizing weak sequence-structure similarities, aligning sequences with structures, modeling of rigid body shifts, distortions, loops and side chains, as well as detecting errors in a model. Despite these problems, it is currently possible to model with useful accuracy significant parts of approximately one third of all known protein sequences. The use of individual comparative models in biology is already rewarding and increasingly widespread. A major new challenge for comparative modeling is the integration of it with the torrents of data from genome sequencing projects as well as from functional and structural genomics. In particular, there is a need to develop an automated, rapid, robust, sensitive, and accurate comparative modeling pipeline applicable to whole genomes. Such large-scale modeling is likely to encourage new kinds of applications for the many resulting models, based on their large number and completeness at the level of the family, organism, or functional network.
Affiliation(s)
- M A Martí-Renom
- Laboratories of Molecular Biophysics, Pels Family Center for Biochemistry and Structural Biology, Rockefeller University, New York, NY 10021, USA
|
22
|
Linial M, Yona G. Methodologies for target selection in structural genomics. Progress in Biophysics and Molecular Biology 2001; 73:297-320. [PMID: 11063777 DOI: 10.1016/s0079-6107(00)00011-0] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.3] [Indexed: 11/22/2022]
Abstract
As the number of complete genomes that have been sequenced keeps growing, unknown areas of the protein space are revealed and new horizons open up. Most of this information will be fully appreciated only when the structural information about the encoded proteins becomes available. The goal of structural genomics is to direct large-scale efforts of protein structure determination, so as to increase the impact of these efforts. This review focuses on current approaches in structural genomics aimed at selecting representative proteins as targets for structure determination. We will discuss the concept of representative structures/folds, the current methodologies for identifying those proteins, and computational techniques for identifying proteins which are expected to adopt new structural folds.
Affiliation(s)
- M Linial
- Department of Biological Chemistry, Institute of Life Sciences, Hebrew University, 91904, Jerusalem, Israel.
|
23
|
Kuroda Y, Tani K, Matsuo Y, Yokoyama S. Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics. Protein Sci 2000; 9:2313-21. [PMID: 11206052 PMCID: PMC2144534 DOI: 10.1110/ps.9.12.2313] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.0] [Indexed: 10/21/2022]
Abstract
Structural genomic projects envision almost routine protein structure determinations, which are currently imaginable only for small proteins with molecular weights below 25,000 Da. For larger proteins, structural insight can be obtained by breaking them into small segments of amino acid sequences that can fold into native structures, even when isolated from the rest of the protein. Such segments are autonomously folding units (AFU) and have sizes suitable for fast structural analyses. Here, we propose to expand an intuitive procedure often employed for identifying biologically important domains to an automatic method for detecting putative folded protein fragments. The procedure is based on the recognition that large proteins can be regarded as a combination of independent domains conserved among diverse organisms. We thus have developed a program that reorganizes the output of BLAST searches and detects regions with a large number of similar sequences. To automate the detection process, it is reduced to a simple geometrical problem of recognizing rectangular shaped elevations in a graph that plots the number of similar sequences at each residue of a query sequence. We used our program to quantitatively corroborate the premise that segments with conserved sequences correspond to domains that fold into native structures. We applied our program to a test data set composed of 99 amino acid sequences containing 150 segments with structures listed in the Protein Data Bank, and thus known to fold into native structures. Overall, the fragments identified by our program have an almost 50% probability of forming a native structure, and comparable results are observed with sequences containing domain linkers classified in SCOP. Furthermore, we verified that our program identifies AFU in libraries from various organisms, and we found a significant number of AFU candidates for structural analysis, covering an estimated 5 to 20% of the genomic databases. 
Altogether, these results argue that methods based on sequence similarity can be useful for dissecting large proteins into small autonomously folding domains, and such methods may provide an efficient support to structural genomics projects.
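The detection step described above, finding rectangular elevations in a plot of the number of similar sequences at each residue, can be sketched as follows. This is an illustrative simplification rather than the authors' program: it assumes BLAST hits have already been reduced to 1-based inclusive `(start, end)` ranges on the query, and it treats an elevation as a contiguous run of residues covered by at least `min_hits` sequences over at least `min_len` residues. Both thresholds and all function names are hypothetical.

```python
from typing import List, Tuple


def coverage_profile(query_len: int, hits: List[Tuple[int, int]]) -> List[int]:
    """Count how many hit ranges cover each residue of the query."""
    profile = [0] * query_len
    for start, end in hits:
        # Clip each range to the query and convert to 0-based indices.
        for pos in range(max(start, 1), min(end, query_len) + 1):
            profile[pos - 1] += 1
    return profile


def find_plateaus(profile: List[int], min_hits: int, min_len: int) -> List[Tuple[int, int]]:
    """Return 1-based inclusive segments where coverage stays >= min_hits."""
    segments = []
    run_start = None
    for i, count in enumerate(profile):
        if count >= min_hits:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_len:
                segments.append((run_start + 1, i))
            run_start = None
    # Close a run that extends to the end of the query.
    if run_start is not None and len(profile) - run_start >= min_len:
        segments.append((run_start + 1, len(profile)))
    return segments


hits = [(3, 10), (4, 10), (3, 9), (15, 18)]
profile = coverage_profile(20, hits)
candidates = find_plateaus(profile, min_hits=2, min_len=5)
```

A fixed threshold is the simplest possible rule; it ignores the sharpness of the plateau edges, which a method based on rectangle recognition would presumably take into account.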
Affiliation(s)
- Y Kuroda
- Protein Research Group, Genomic Sciences Center, The Institute of Physical and Chemical Research (RIKEN), Yokohama, Kanagawa, Japan.
|
24
|
Abstract
We describe a genome annotation service provided by the Entrez browser, http://www.ncbi.nlm.nih.gov/entrez. All protein products identified in fully sequenced microbial genomes have been compared with proteins with known 3-D structure by use of the BLAST sequence comparison algorithm. For the approximately 20% of genome proteins in which unambiguous sequence similarity is detected, Entrez provides a link from the gene product to its predicted structure. The service uses the Cn3D molecular graphics viewer to present a 3-D view of the known structure, together with an alignment display mapping conserved residues from the genome protein onto the known structure. Using an example from Aeropyrum pernix, we illustrate how mapping to a 3-D structure can confirm predictions of biological function.
Affiliation(s)
- Y Wang
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
25
|
Abstract
As the number of completely sequenced genomes rapidly increases, the postgenomic problem of gene function identification becomes ever more pressing. Predicting the structures of proteins encoded by genes of interest is one possible means to glean subtle clues as to the functions of these proteins. There are limitations to this approach to gene identification and a survey of the expected reliability of different protein structure prediction techniques has been undertaken.
Affiliation(s)
- D T Jones
- Department of Biological Sciences, Brunel University, Uxbridge, UB8 3PH, UK.
|
26
|
Pawłowski K, Rychlewski L, Reed JC, Godzik A. From fold to function predictions: an apoptosis regulator protein BID. Computers & Chemistry 2000; 24:511-7. [PMID: 10816020 DOI: 10.1016/s0097-8485(99)00081-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/20/2022]
Abstract
With the rapidly increasing pace of genome sequencing projects and the resulting flood of predicted amino acid sequences of uncharacterized proteins, protein sequence analysis, and in particular, protein structure prediction is quickly gaining in importance. Prediction algorithms can be used for preliminary annotation of newly sequenced proteins and, at least in some cases, provide insights into their function and specific mode of action. Such annotations for several microbial genomes were performed by several groups and placed in public domain for evaluation. An example presented in this work comes from a related project of structural and functional predictions for proteins involved in the process of controlled cell death (apoptosis). The BID protein belongs to an important class of regulators of apoptosis identified by short sequence motifs. Here, several fold prediction methods are used to build a series of three-dimensional models. Structure analysis of the models with reference to the biological data available allows selection of the most appropriate model. It is found that the most likely structural model of BID is built on the structure of Bcl-X(L). The model is discussed in terms of experimental data on specific proteolytic cleavage of BID and its effect on BID interactions with other proteins and membranes.
Affiliation(s)
- K Pawłowski
- The Burnham Institute, La Jolla, CA 92307, USA
|
27
|
Skolnick J, Fetrow JS, Kolinski A. Structural genomics and its importance for gene function analysis. Nat Biotechnol 2000; 18:283-7. [PMID: 10700142 DOI: 10.1038/73723] [Citation(s) in RCA: 161] [Impact Index Per Article: 6.4] [Indexed: 11/08/2022]
Abstract
Structural genomics projects aim to solve the experimental structures of all possible protein folds. Such projects entail a conceptual shift from traditional structural biology in which structural information is obtained on known proteins to one in which the structure of a protein is determined first and the function assigned only later. Whereas the goal of converting protein structure into function can be accomplished by traditional sequence motif-based approaches, recent studies have shown that assignment of a protein's biochemical function can also be achieved by scanning its structure for a match to the geometry and chemical identity of a known active site. Importantly, this approach can use low-resolution structures provided by contemporary structure prediction methods. When applied to genomes, structural information (either experimental or predicted) is likely to play an important role in high-throughput function assignment.
Affiliation(s)
- J Skolnick
- Laboratory of Computational Genomics, The Danforth Plant Science Center, 893 N. Warson Rd., St. Louis, MO 63141, USA.
|
28
|
Reichert J, Jabs A, Slickers P, Sühnel J. The IMB Jena Image Library of biological macromolecules. Nucleic Acids Res 2000; 28:246-9. [PMID: 10592237 PMCID: PMC102466 DOI: 10.1093/nar/28.1.246] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.5] [Received: 08/26/1999] [Revised: 10/18/1999] [Accepted: 10/18/1999] [Indexed: 11/13/2022]
Abstract
The IMB Jena Image Library of Biological Macromolecules (http://www.imb-jena.de/IMAGE.html) is aimed at a better dissemination of information on three-dimensional biopolymer structures with an emphasis on visualization and analysis. It provides access to all structure entries deposited at the Protein Data Bank (PDB) and Nucleic Acid Database (NDB). By combining automatic and manual processing it is possible to keep pace with the rapidly growing number of known biopolymer structures and to provide, for selected entries, information not available from automatic procedures. Each entry page contains basic information on the structure, various visualization and analysis tools as well as links to other databases. The visualization techniques adopted include static mono/stereo raster or vector graphics representations, virtual reality modeling (VRML), RasMol/Chime scripts and Java applets. A helix and bending analysis tool provides consistent information on about 750 DNA and RNA duplex structures. Access to metal-containing PDB entries is possible via the Periodic Table of Elements. Finally, general information on amino acids, cis-peptide bonds, structural elements in proteins, base pairs, nucleic acid model conformations and experimental methods for biopolymer structure determination is provided.
Affiliation(s)
- J Reichert
- Biocomputing, Institut für Molekulare Biotechnologie, Postfach 100813, D-07708 Jena, Germany
|
29
|
Abstract
Protein crystallography has become a major technique for understanding cellular processes. This has come about through great advances in the technology of data collection and interpretation, particularly the use of synchrotron radiation. The ability to express eukaryotic genes in Escherichia coli is also important. Analysis of known structures shows that all proteins are built from about 1000 primeval folds. The collection of all primeval folds provides a basis for predicting structure from sequence. At present about 450 are known. Of the presently sequenced genomes only a fraction can be related to known proteins on the basis of sequence alone. Attempts are being made to determine all (or as many as possible) of the structures from some bacterial genomes in the expectation that structure will point to function more reliably than does sequence. Membrane proteins present a special problem. The next 20 years may see the experimental determination of another 40,000 protein structures. This will make considerable demands on synchrotron sources and will require many more biochemists than are currently available. The availability of massive structure databases will alter the way biochemistry is done.
Affiliation(s)
- K C Holmes
- Max-Planck-Institut für medizinische Forschung, Heidelberg, Germany.
|
30
|
|
31
|
Pawłowski K, Zhang B, Rychlewski L, Godzik A. The Helicobacter pylori genome: From sequence analysis to structural and functional predictions. Proteins 1999. [DOI: 10.1002/(sici)1097-0134(19990701)36:1<20::aid-prot2>3.0.co;2-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
|
32
|
Abstract
New computational techniques have allowed protein folds to be assigned to all or parts of between a quarter (Caenorhabditis elegans) and a half (Mycoplasma genitalium) of the individual protein sequences in different genomes. These assignments give a new perspective on domain structures, gene duplications, protein families and protein folds in genome sequences.
Affiliation(s)
- S A Teichmann
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK.
|