151
Abstract
We identify and describe a set of tools readily available for integral membrane protein prediction. These tools address two problems: finding potential transmembrane proteins in a pool of new sequences, and identifying their transmembrane regions. All methods involve comparing the query protein against one or more target models. In the simplest of these, the target "model" is another protein sequence, while the more elaborate methods group together the entire set of transmembrane helical or transmembrane beta-barrel proteins. In general, prediction accuracy, whether in identifying new integral membrane proteins or in locating the transmembrane regions of known ones, depends strongly on how closely the query fits the model. Because of this, the best approach is an opportunistic one: submit the protein of interest to all methods and choose the results with the highest confidence scores.
Affiliation(s)
- Henry Bigelow
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
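The opportunistic strategy in the abstract above (run every available predictor, keep the most confident result) can be sketched as follows. This is only an illustration, not the interface of any tool from the chapter: the method names, the result fields, and the assumption that confidence scores are already on a comparable 0 to 1 scale are all hypothetical.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    method: str        # hypothetical predictor name
    segments: list     # predicted transmembrane segments as (start, end) pairs
    confidence: float  # assumed to be normalized to 0..1 across methods


def most_confident(predictions):
    """Return the prediction with the highest confidence score."""
    if not predictions:
        raise ValueError("no predictions to choose from")
    return max(predictions, key=lambda p: p.confidence)


# Invented results for a single query protein.
results = [
    Prediction("helix_predictor_a", [(12, 34), (50, 72)], 0.81),
    Prediction("helix_predictor_b", [(10, 33)], 0.64),
    Prediction("barrel_predictor", [], 0.22),
]
best = most_confident(results)
print(best.method, best.segments)
```

Raw scores from different tools may not be directly comparable, so a real pipeline would likely need a calibration step before taking the maximum.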
152
Liu G, Forouhar F, Eletsky A, Atreya HS, Aramini JM, Xiao R, Huang YJ, Abashidze M, Seetharaman J, Liu J, Rost B, Acton T, Montelione GT, Hunt JF, Szyperski T. NMR and X-RAY structures of human E2-like ubiquitin-fold modifier conjugating enzyme 1 (UFC1) reveal structural and functional conservation in the metazoan UFM1-UBA5-UFC1 ubiquination pathway. ACTA ACUST UNITED AC 2008; 10:127-36. [PMID: 19101823 DOI: 10.1007/s10969-008-9054-7] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Received: 04/22/2008] [Accepted: 11/28/2008] [Indexed: 11/25/2022]
Abstract
For cell regulation, E2-like ubiquitin-fold modifier conjugating enzyme 1 (Ufc1) is involved in the transfer of ubiquitin-fold modifier 1 (Ufm1), a ubiquitin-like protein which is activated by E1-like enzyme Uba5, to various target proteins. Thereby, Ufc1 participates in the very recently discovered Ufm1-Uba5-Ufc1 ubiquination pathway which is found in metazoan organisms. The structure of human Ufc1 was solved by using both NMR spectroscopy and X-ray crystallography. The complementary insights obtained with the two techniques provided a unique basis for understanding the function of Ufc1 at atomic resolution. The Ufc1 structure consists of the catalytic core domain conserved in all E2-like enzymes and an additional N-terminal helix. The active site Cys(116), which forms a thio-ester bond with Ufm1, is located in a flexible loop that is highly solvent accessible. Based on the Ufc1 and Ufm1 NMR structures, a model could be derived for the Ufc1-Ufm1 complex in which the C-terminal Gly(83) of Ufm1 may well form the expected thio-ester with Cys(116), suggesting that Ufm1-Ufc1 functions as described for other E1-E2-E3 machineries. Alpha-helix 1 of Ufc1 adopts different conformations in the crystal and in solution, suggesting that this helix plays a key role in mediating specificity.
Affiliation(s)
- Gaohua Liu
- Department of Chemistry, Northeast Structural Genomics Consortium, The State University of New York at Buffalo, Buffalo, NY 14260, USA
153
Ofran Y, Schlessinger A, Rost B. Automated Identification of Complementarity Determining Regions (CDRs) Reveals Peculiar Characteristics of CDRs and B Cell Epitopes. J Immunol 2008; 181:6230-5. [DOI: 10.4049/jimmunol.181.9.6230] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.7] [Indexed: 11/19/2022]
154
Abstract
Many non-synonymous single nucleotide polymorphisms (nsSNPs) in humans are suspected to impact protein function. Here, we present a publicly available server implementation of the method SNAP (screening for non-acceptable polymorphisms) that predicts the functional effects of single amino acid substitutions. SNAP identifies over 80% of the non-neutral mutations at 77% accuracy and over 76% of the neutral mutations at 80% accuracy at its default threshold. Each prediction is associated with a reliability index that correlates with accuracy and thereby enables experimentalists to zoom into the most promising predictions.
Affiliation(s)
- Yana Bromberg
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
155
Abstract
Motivation: Microarray expression data reveal functionally associated proteins. However, most proteins that are associated are not actually in direct physical contact. Predicting physical interactions directly from microarrays is both a challenging and important task that we addressed by developing a novel machine learning method optimized for this task. Results: We validated our support vector machine-based method on several independent datasets. At the same levels of accuracy, our method recovered more experimentally observed physical interactions than a conventional correlation-based approach. Pairs predicted by our method to very likely interact were close in the overall network of interaction, suggesting our method as an aid for functional annotation. We applied the method to predict interactions in yeast (Saccharomyces cerevisiae). A Gene Ontology function annotation analysis and literature search revealed several probable and novel predictions worthy of future experimental validation. We therefore hope our new method will improve the annotation of interactions as one component of multi-source integrated systems. Contact: ts2186@columbia.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ta-Tsen Soong
- Columbia University Center for Computational Biology and Bioinformatics, Columbia University, New York, NY, USA.
156
Forouhar F, Neely H, Hussain M, Xiao R, Liu J, Acton T, Rost B, Montelione G, Hunt J. Crystal structure of RimO from Thermotoga maritima. Acta Crystallogr A 2008. [DOI: 10.1107/s0108767308088399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
157
Aramini JM, Rossi P, Huang YJ, Zhao L, Jiang M, Maglaqui M, Xiao R, Locke J, Nair R, Rost B, Acton TB, Inouye M, Montelione GT. Solution NMR Structure of the NlpC/P60 Domain of Lipoprotein Spr from Escherichia coli: Structural Evidence for a Novel Cysteine Peptidase Catalytic Triad. Biochemistry 2008; 47:9715-7. [DOI: 10.1021/bi8010779] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.0] [Indexed: 11/30/2022]
Affiliation(s)
- James M. Aramini, Paolo Rossi, Yuanpeng J. Huang, Li Zhao, Mei Jiang, Melissa Maglaqui, Rong Xiao, Jessica Locke, Rajesh Nair, Burkhard Rost, Thomas B. Acton, Masayori Inouye, Gaetano T. Montelione
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, Department of Biochemistry, Robert Wood Johnson Medical School, University of Medicine and Dentistry of New Jersey, Piscataway, New Jersey 08854, and Northeast Structural Genomics Consortium
158
Abstract
MOTIVATION Mutating residues into alanine (alanine scanning) is one of the fastest experimental means of probing hypotheses about protein function. Alanine scans can reveal functional hot spots, i.e. residues that alter function upon mutation. In vitro mutagenesis is cumbersome and costly: probing all residues in a protein is typically as impossible as substituting by all non-native amino acids. In contrast, such exhaustive mutagenesis is feasible in silico. RESULTS Previously, we developed SNAP to predict functional changes due to non-synonymous single nucleotide polymorphisms. Here, we applied SNAP to all experimental mutations in the ASEdb database of alanine scans; we identified 70% of the hot spots (≥1 kcal/mol change in binding energy); more severe changes were predicted more accurately. Encouraged, we carried out a complete all-against-all in silico mutagenesis for human glucokinase. Many of the residues predicted as functionally important have indeed been confirmed in the literature, others await experimental verification, and our method is ready to aid in the design of in vitro mutagenesis. AVAILABILITY ASEdb and glucokinase scores are available at http://www.rostlab.org/services/SNAP. For submissions of large/whole proteins for processing please contact the author.
Affiliation(s)
- Yana Bromberg
- Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th St, New York, NY 10032, USA.
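To make the idea of exhaustive in silico mutagenesis from the abstract above concrete, here is a minimal sketch that only enumerates the variants: an alanine scan and an all-against-all substitution list. It does not call SNAP or score anything; the function names and the toy sequence are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def alanine_scan(sequence):
    """Yield (position, native, mutant_sequence) for each residue mutated to alanine."""
    for i, native in enumerate(sequence):
        if native != "A":  # alanine to alanine is not a substitution
            yield i + 1, native, sequence[:i] + "A" + sequence[i + 1:]


def all_substitutions(sequence):
    """Yield (position, native, replacement) for every non-native substitution."""
    for i, native in enumerate(sequence):
        for replacement in AMINO_ACIDS:
            if replacement != native:
                yield i + 1, native, replacement


toy = "MAKL"  # toy sequence, not glucokinase
print(list(alanine_scan(toy)))
print(sum(1 for _ in all_substitutions(toy)))  # 19 substitutions per position
```

Each enumerated variant would then be passed to whatever predictor is in use; positions are reported 1-based, as is conventional for protein sequences.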
159
Abstract
Motivation: A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences. Results: We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a very popular and effective method could be used to identify significantly more relevant similarities among protein sequences. Availability: http://www.rostlab.org/services/consensus/ Contact: dariusz@mit.edu
Affiliation(s)
- Dariusz Przybylski
- Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA.
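As a rough illustration of what a consensus sequence is, the sketch below builds one by majority vote over the columns of a toy alignment. This is a deliberately simple stand-in: the abstract's specialized, profile-based consensus protocol is not described here and is not reproduced. The function name and the toy alignment are assumptions.

```python
from collections import Counter


def majority_consensus(aligned, gap="-"):
    """Build a consensus from the most common non-gap residue in each column."""
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(residue for residue in column if residue != gap)
        # Ties resolve to the residue encountered first in the column.
        consensus.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(consensus)


alignment = ["ACDE-", "ACDFK", "GCDEK"]  # invented aligned sequences
print(majority_consensus(alignment))
```

A profile-derived consensus would weight residues by substitution scores rather than raw counts, which is one reason its score statistics can differ from those of native sequences.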
160
Abstract
Natively unstructured or disordered protein regions may increase the functional complexity of an organism; they are particularly abundant in eukaryotes and often evade structure determination. Many computational methods predict unstructured regions by training on outliers in otherwise well-ordered structures. Here, we introduce an approach that uses a neural network in a very different and novel way. We hypothesize that very long contiguous segments with nonregular secondary structure (NORS regions) differ significantly from regular, well-structured loops, and that a method detecting such features could predict natively unstructured regions. Training our new method, NORSnet, on predicted information rather than on experimental data yielded three major advantages: it removed the overlap between testing and training, it systematically covered entire proteomes, and it explicitly focused on one particular aspect of unstructured regions with a simple structural interpretation, namely that they are loops. Our hypothesis was correct: well-structured and unstructured loops differ so substantially that NORSnet succeeded in their distinction. Benchmarks on previously used and new experimental data of unstructured regions revealed that NORSnet performed very well. Although it was not the best single prediction method, NORSnet was sufficiently accurate to flag unstructured regions in proteins that were previously not annotated. In one application, NORSnet revealed previously undetected unstructured regions in putative targets for structural genomics and may thereby contribute to increasing structural coverage of large eukaryotic families. NORSnet found unstructured regions more often in domain boundaries than expected at random. In another application, we estimated that 50%–70% of all worm proteins observed to have more than seven protein–protein interaction partners have unstructured regions. 
The comparative analysis between NORSnet and DISOPRED2 suggested that long unstructured loops are a major part of unstructured regions in molecular networks. The details of protein structures are important for function. Regions that do not adopt any regular structure in isolation (natively unstructured or disordered regions) initially appeared as a curious exception to this structure–function paradigm. It has become increasingly clear that unstructured regions are fundamental to many roles and that they are particularly important for multicellular organisms. Structural biology is just beginning to apprehend the stunning diversity of these roles. Here, we focused on unstructured regions dominated by a particular type of loop, namely the natively unstructured one. We developed a method that succeeded in the distinction between well-structured and natively unstructured loops. For the development, we did not use any experimental data for unstructured regions; when tested on experimental data, the method performed surprisingly well. Due to its different premises, the method captured very different aspects of unstructured regions than other methods that we tested. We applied the new method to two different problems. The first was the identification of proteins that may be difficult targets for structure determination. The second was the identification of worm proteins that have many interaction partners (more than seven) and unstructured regions. Surprisingly, we found unstructured regions of the loopy type in more than 50% of all the promiscuous worm proteins.
Affiliation(s)
- Avner Schlessinger
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, USA.
161
Abstract
Protein–protein interactions, a key to almost any biological process, are mediated by molecular mechanisms that are not entirely clear. The study of these mechanisms often focuses on all residues at protein–protein interfaces. However, only a small subset of all interface residues is actually essential for recognition or binding. Commonly referred to as “hotspots,” these essential residues are defined as residues that impede protein–protein interactions if mutated. While no in silico tool identifies hotspots in unbound chains, numerous prediction methods were designed to identify all the residues in a protein that are likely to be a part of protein–protein interfaces. These methods typically identify successfully only a small fraction of all interface residues. Here, we analyzed the hypothesis that the two subsets correspond (i.e., that in silico methods may predict few residues because they preferentially predict hotspots). We demonstrate that this is indeed the case and that we can therefore predict directly from the sequence of a single protein which residues are interaction hotspots (without knowledge of the interaction partner). Our results suggested that most protein complexes are stabilized by similar basic principles. The ability to accurately and efficiently identify hotspots from sequence enables the annotation and analysis of protein–protein interaction hotspots in entire organisms and thus may benefit function prediction and drug development. The server for prediction is available at http://www.rostlab.org/services/isis. Interactions between proteins underlie all biological processes. Hence, to fully understand or to control biological processes we need to unravel the principles of protein interactions. The quest for these principles has focused predominantly on the entire interfaces between two interacting proteins. However, it has been shown that only few of the interface residues are essential for the recognition and binding to other proteins. 
The identification of these residues, commonly referred to as binding “hotspots,” is a first step toward understanding the function of proteins and studying their interactions. Experimentally, hotspots could be identified by mutating single residues—an expensive and laborious procedure that is not applicable on a large scale. Here, we show that it is possible to identify protein interaction hotspots computationally on a large scale based on the amino acid sequence of a single protein, without requiring the knowledge of its interaction partner. Our results suggest that most protein complexes are stabilized by similar basic principles. The ability to accurately and efficiently identify hotspots from sequence enables the annotation and analysis of protein–protein interaction hotspots in an entire organism and thus may benefit function prediction and drug development.
Affiliation(s)
- Yanay Ofran
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, USA.
162
Aramini JM, Sharma S, Huang YJ, Swapna GVT, Ho CK, Shetty K, Cunningham K, Ma LC, Zhao L, Owens LA, Jiang M, Xiao R, Liu J, Baran MC, Acton TB, Rost B, Montelione GT. Solution NMR structure of the SOS response protein YnzC from Bacillus subtilis. Proteins 2008; 72:526-30. [PMID: 18431750 DOI: 10.1002/prot.22064] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/08/2022]
Affiliation(s)
- James M Aramini
- Department of Molecular Biology and Biochemistry, Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA.
163
Lippi M, Passerini A, Punta M, Rost B, Frasconi P. MetalDetector: a web server for predicting metal-binding sites and disulfide bridges in proteins from sequence. Bioinformatics 2008; 24:2094-5. [PMID: 18635571 DOI: 10.1093/bioinformatics/btn371] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.3] [Indexed: 11/14/2022] Open
Abstract
The web server MetalDetector classifies histidine residues in proteins into one of two states (free or metal bound) and cysteines into one of three states (free, metal bound or disulfide bridged). A decision tree integrates predictions from two previously developed methods (DISULFIND and Metal Ligand Predictor). Cross-validated performance assessment indicates that our server predicts disulfide bonding state at 88.6% precision and 85.1% recall, while it identifies cysteines and histidines in transition metal-binding sites at 79.9% precision and 76.8% recall, and at 60.8% precision and 40.7% recall, respectively. AVAILABILITY Freely available at http://metaldetector.dsi.unifi.it. SUPPLEMENTARY INFORMATION Details and data can be found at http://metaldetector.dsi.unifi.it/help.php.
Affiliation(s)
- Marco Lippi
- Dipartimento di Sistemi e Informatica, Machine Learning and Neural Networks Group, Università degli Studi di Firenze, Via di Santa Marta 3, 50139 Firenze, Italy
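For readers less familiar with the two measures quoted in the abstract above, this short sketch shows how precision and recall follow from raw counts. The counts are invented for illustration and are not taken from the MetalDetector evaluation; the function name is an assumption.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from prediction counts."""
    predicted = true_positives + false_positives  # everything the method flagged
    actual = true_positives + false_negatives     # everything that is truly positive
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Invented counts: 80 correct calls, 20 spurious calls, 20 missed sites.
precision, recall = precision_recall(80, 20, 20)
print(f"precision={precision:.1%} recall={recall:.1%}")
```

Precision answers "how many predicted sites are real", while recall answers "how many real sites were found", which is why the abstract reports both for each residue class.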
164
Linial M, Mesirov JP, Morrison McKay BJ, Rost B. ISMB 2008 Toronto. PLoS Comput Biol 2008; 4:e1000094. [PMID: 18584023 PMCID: PMC2427177 DOI: 10.1371/journal.pcbi.1000094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022] Open
Affiliation(s)
- Michal Linial
- International Society for Computational Biology (ISCB), University of California San Diego, La Jolla, California, United States of America
- Sudarsky Center, The Hebrew University of Jerusalem, Jerusalem, Israel
- Jill P. Mesirov
- International Society for Computational Biology (ISCB), University of California San Diego, La Jolla, California, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- B. J. Morrison McKay
- International Society for Computational Biology (ISCB), University of California San Diego, La Jolla, California, United States of America
- * E-mail:
- Burkhard Rost
- International Society for Computational Biology (ISCB), University of California San Diego, La Jolla, California, United States of America
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, United States of America
165
Singarapu KK, Xiao R, Acton T, Rost B, Montelione GT, Szyperski T. NMR structure of the peptidyl-tRNA hydrolase domain from Pseudomonas syringae expands the structural coverage of the hydrolysis domains of class 1 peptide chain release factors. Proteins 2008; 71:1027-31. [PMID: 18247350 DOI: 10.1002/prot.21947] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/07/2022]
Affiliation(s)
- Kiran Kumar Singarapu
- Department of Chemistry, State University of New York at Buffalo, Buffalo, New York 14260-3000, USA
166
Trott O, Siggers K, Rost B, Palmer AG. Protein conformational flexibility prediction using machine learning. J Magn Reson 2008; 192:37-47. [PMID: 18313957 PMCID: PMC2413295 DOI: 10.1016/j.jmr.2008.01.011] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 07/24/2006] [Revised: 12/15/2007] [Accepted: 01/26/2008] [Indexed: 05/26/2023]
Abstract
Using a data set of 16 proteins, a neural network has been trained to predict backbone 15N generalized order parameters from the three-dimensional structures of proteins. The final network parameterization contains six input features. The average prediction accuracy, as measured by Pearson's correlation coefficient between experimental and predicted values of the square of the generalized order parameter, is >0.70. Predicted order parameters for non-terminal amino acid residues depend most strongly on the local packing density and the probability that the residue is located in regular secondary structure.
Affiliation(s)
- Arthur G. Palmer
- * Corresponding author. Fax: (212) 305-6949. E-mail address: (A. G. Palmer)
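The accuracy measure named in the abstract above, Pearson's correlation coefficient between experimental and predicted values, can be computed as in this minimal sketch. The order-parameter values are made up for illustration and do not come from the paper; the function name is an assumption.

```python
import math


def pearson(xs, ys):
    """Return Pearson's correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)


# Made-up squared order parameters for four residues.
experimental = [0.85, 0.90, 0.60, 0.75]
predicted = [0.80, 0.88, 0.65, 0.70]
print(round(pearson(experimental, predicted), 3))
```

A value near 1 means the predictions rise and fall with the measurements; it does not by itself show that the predicted magnitudes are correct.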
167
Abstract
This paper is an introduction to the supplemental issue of the journal PROTEINS, dedicated to the seventh CASP experiment to assess the state of the art in protein structure prediction. The paper describes the conduct of the experiment and the categories of prediction included, and outlines the evaluation and assessment procedures. Highlights are improvements in model accuracy relative to that obtainable from knowledge of a single best template structure; convergence of the accuracy of models produced by automatic servers toward that produced by human modeling teams; the emergence of methods for predicting the quality of models; and rapidly increasing practical applications of the methods.
Affiliation(s)
- John Moult
- Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, Maryland 20850, USA.
168
Abstract
Proteins perform many important tasks in living organisms, such as catalysis of biochemical reactions, transport of nutrients, and recognition and transmission of signals. The plethora of aspects of the role of any particular protein is referred to as its "function." One aspect of protein function that has been the target of intensive research by computational biologists is its subcellular localization. Proteins must be localized in the same subcellular compartment to cooperate toward a common physiological function. Aberrant subcellular localization of proteins can result in several diseases, including kidney stones, cancer, and Alzheimer's disease. To date, sequence homology remains the most widely used method for inferring the function of a protein. However, the application of advanced artificial intelligence (AI)-based techniques in recent years has resulted in significant improvements in our ability to predict the subcellular localization of a protein. The prediction accuracy has risen steadily over the years, in large part due to the application of AI-based methods such as hidden Markov models (HMMs), neural networks (NNs), and support vector machines (SVMs), although the availability of larger experimental datasets has also played a role. Automatic methods that mine textual information from the biological literature and molecular biology databases have considerably sped up the process of annotation for proteins for which some information regarding function is available in the literature. State-of-the-art methods based on NNs and HMMs can predict the presence of N-terminal sorting signals extremely accurately. Ab initio methods that predict subcellular localization for any protein sequence using only the native amino acid sequence and features predicted from the native sequence have shown the most remarkable improvements. The prediction accuracy of these methods has increased by over 30% in the past decade. 
The accuracy of these methods is now on par with high-throughput methods for predicting localization, and they are beginning to play an important role in directing experimental research. In this chapter, we review some of the most important methods for the prediction of subcellular localization.
Affiliation(s)
- Rajesh Nair
- CUBIC Department of Biochemistry and Molecular Biophysics and Center for Computational Biology and Bioinformatics, Columbia University, New York, NY, USA
|
169
|
Aramini JM, Huang YJ, Swapna GVT, Cort JR, Rajan PK, Xiao R, Shastry R, Acton TB, Liu J, Rost B, Kennedy MA, Montelione GT. Solution NMR structure of Escherichia coli ytfP expands the structural coverage of the UPF0131 protein domain family. Proteins 2007; 68:789-95. [PMID: 17523190 DOI: 10.1002/prot.21450] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
Affiliation(s)
- James M Aramini
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers University, Piscataway, New Jersey 08854, USA.
|
170
|
Abstract
MOTIVATION Thousands of proteins are known to bind to DNA; for most of them the mechanism of action and the residues that bind to DNA, i.e. the binding sites, are still unknown. Experimental identification of binding sites requires expensive and laborious methods such as mutagenesis and binding assays. Hence, such studies are not applicable on a large scale. If the 3D structure of a protein is known, it is often possible to predict DNA-binding sites in silico. However, for most proteins, such knowledge is not available. RESULTS It has been shown that DNA-binding residues have distinct biophysical characteristics. Here we demonstrate that these characteristics are so distinct that they enable accurate prediction of the residues that bind DNA directly from amino acid sequence, without requiring any additional experimental or structural information. In a cross-validation based on the largest non-redundant dataset of high-resolution protein-DNA complexes available today, we found that 89% of our predictions are confirmed by experimental data. Thus, it is now possible to identify DNA-binding sites on a proteomic scale even in the absence of any experimental data or 3D-structural information. AVAILABILITY http://cubic.bioc.columbia.edu/services/disis.
Affiliation(s)
- Yanay Ofran
- Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA.
|
171
|
Abstract
MOTIVATION Automatically identifying protein names from the scientific literature is a prerequisite for meeting the increasing demand for data-mining this wealth of information. Existing approaches are based on dictionaries, rules and machine-learning. Here, we introduced a novel system that combines a pre-processing dictionary- and rule-based filtering step with several separately trained support vector machines (SVMs) to identify protein names in MEDLINE abstracts. RESULTS Our new tagging system NLProt is capable of extracting protein names with a precision (accuracy) of 75% at a recall (coverage) of 76% after training on a corpus that was used before by other groups and contains 200 annotated abstracts. For our estimate of sustained performance, we considered partially identified names as false positives. One important issue frequently ignored in the literature is the redundancy in evaluation sets. We suggested some guidelines for removing inadequate overlaps between training and testing sets. Applying these new guidelines, our program appeared to significantly out-perform other methods tagging protein names. NLProt owed its success to the SVM building blocks, which succeeded in utilizing the local context of protein names in the scientific literature. We contend that our system may constitute the most general and precise method for tagging protein names. AVAILABILITY http://cubic.bioc.columbia.edu/services/nlprot/
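The precision and recall figures quoted in this abstract follow the standard definitions, which can be sketched as below. This is a minimal illustration, not the NLProt evaluation code: the function name and the counts are hypothetical, and the only detail taken from the abstract is that partially identified names are counted as false positives.

```python
def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) for a name tagger.

    Assumption for this sketch: a partially identified name has already been
    counted in false_positives, as the abstract describes.
    """
    predicted = true_positives + false_positives   # names the tagger reported
    actual = true_positives + false_negatives      # names annotated in the corpus
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts chosen for illustration only (not data from the paper).
precision, recall = precision_recall(true_positives=30,
                                     false_positives=10,
                                     false_negatives=20)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "how many reported names were right", while recall answers "how many annotated names were found"; reporting both avoids rewarding a tagger that simply reports very few or very many names.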
Affiliation(s)
- Sven Mika
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
|
172
|
Abstract
MOTIVATION Natively unstructured (also dubbed intrinsically disordered) regions in proteins lack a defined 3D structure under physiological conditions and often adopt regular structures under particular conditions. Proteins with such regions are abundant in eukaryotes; they may increase the functional complexity of organisms, and they usually evade structure determination in the unbound form. Low propensity for the formation of internal residue contacts has been previously used to predict natively unstructured regions. RESULTS We combined PROFcon predictions for protein-specific contacts with a generic pairwise potential to predict unstructured regions. This novel method, Ucon, outperformed the best available methods in predicting proteins with long unstructured regions. Furthermore, Ucon correctly identified cases missed by other methods. By computing the difference between predictions based on specific contacts (approach introduced here) and those based on generic potentials (realized in other methods), we might identify unstructured regions that are involved in protein-protein binding. We discussed one example to illustrate this ambitious aim. Overall, Ucon added quality and an orthogonal aspect that may help in the experimental study of unstructured regions in network hubs. AVAILABILITY http://www.predictprotein.org/submit_ucon.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Avner Schlessinger
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA.
|
173
|
|
174
|
Abstract
Many genetic variations are single nucleotide polymorphisms (SNPs). Non-synonymous SNPs are ‘neutral’ if the resulting point-mutated protein is not functionally discernible from the wild type and ‘non-neutral’ otherwise. The ability to identify non-neutral substitutions could significantly aid targeting disease-causing detrimental mutations, as well as SNPs that increase the fitness of particular phenotypes. Here, we introduced comprehensive data sets to assess the performance of methods that predict SNP effects. Alongside, we introduced SNAP (screening for non-acceptable polymorphisms), a neural network-based method for the prediction of the functional effects of non-synonymous SNPs. SNAP needs only sequence information as input, but benefits from functional and structural annotations, if available. In a cross-validation test on over 80 000 mutants, SNAP identified 80% of the non-neutral substitutions at 77% accuracy and 76% of the neutral substitutions at 80% accuracy. This constituted an important improvement over other methods; the improvement rose to over ten percentage points for mutants for which existing methods disagreed. Possibly even more importantly, SNAP introduced a well-calibrated measure for the reliability of each prediction. This measure will allow users to focus on the most accurate predictions and/or the most severe effects. Available at http://www.rostlab.org/services/SNAP
Affiliation(s)
- Yana Bromberg
- Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th St., New York, NY 10032, USA.
|
175
|
Abstract
We survey computational approaches that tackle membrane protein structure and function prediction. While describing the main ideas that have led to the development of the most relevant and novel methods, we also discuss pitfalls, provide practical hints and highlight the challenges that remain. The methods covered include: sequence alignment, motif search, functional residue identification, transmembrane segment and protein topology predictions, homology and ab initio modeling. In general, predictions of functional and structural features of membrane proteins are improving, although progress is hampered by the limited amount of high-resolution experimental information available. While predictions of transmembrane segments and protein topology rank among the most accurate methods in computational biology, more attention and effort will be required in the future to ameliorate database search, homology and ab initio modeling.
Affiliation(s)
- Marco Punta
- Department of Biochemistry and Molecular Biophysics, Columbia University, 1130 St. Nicholas Ave., New York, NY 10032, USA
|
176
|
Abstract
Sequence alignments may be the most fundamental computational resource for molecular biology. The best methods that identify sequence relatedness through profile–profile comparisons are much slower and more complex than sequence–sequence and sequence–profile comparisons such as, respectively, BLAST and PSI-BLAST. Families of related genes and gene products (proteins) can be represented by consensus sequences that list the nucleic/amino acid most frequent at each sequence position in that family. Here, we propose a novel approach for consensus-sequence-based comparisons. This approach improved searches and alignments as a standard add-on to PSI-BLAST without any changes of code. Improvements were particularly significant for more difficult tasks such as the identification of distant structural relations between proteins and their corresponding alignments. Despite the fact that the improvements were higher for more divergent relations, they were consistent even at high accuracy/low error rates for non-trivially related proteins. The improvements were very easy to achieve; no parameter used by PSI-BLAST was altered and no single line of code changed. Furthermore, the consensus sequence add-on required relatively little additional CPU time. We discuss how advanced users of PSI-BLAST can immediately benefit from using consensus sequences on their local computers. We have also made the method available through the Internet (http://www.rostlab.org/services/consensus/).
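The consensus sequence described in this abstract, which lists the most frequent residue at each position of a family, can be sketched as a per-column majority vote. This is only an illustration under simple assumptions: the input is a set of pre-aligned, equal-length sequences, gaps are written as "-", and ties are broken alphabetically. The abstract does not describe how the authors derive their consensus from PSI-BLAST output, so the function name and these choices are hypothetical.

```python
from collections import Counter


def consensus_sequence(aligned: list[str], gap: str = "-") -> str:
    """Majority-vote consensus over the columns of an alignment."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(res for res in column if res != gap)
        if not counts:
            # Column contains only gaps: keep a gap in the consensus.
            consensus.append(gap)
            continue
        # Most frequent residue; ties are broken alphabetically (an assumption).
        best = min(counts, key=lambda res: (-counts[res], res))
        consensus.append(best)
    return "".join(consensus)


# Toy family invented for illustration.
family = ["MKTAY", "MKSAY", "MRTA-", "MKTGY"]
print(consensus_sequence(family))  # prints MKTAY
```

The resulting string can be used anywhere a single sequence is expected, which is consistent with the abstract's point that the approach works as an add-on without changes to the search program itself.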
Affiliation(s)
- Dariusz Przybylski
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
|
177
|
Abstract
MOTIVATION Large-scale experiments reveal pairs of interacting proteins but leave the residues involved in the interactions unknown. These interface residues are essential for understanding the mechanism of interaction and are often desired drug targets. Reliable identification of residues that reside in protein-protein interfaces typically requires analysis of protein structure. Therefore, for the vast majority of proteins, for which there is no high-resolution structure, there is no effective way of identifying interface residues. RESULTS Here we present a machine learning-based method that identifies interacting residues from sequence alone. Although the method is developed using transient protein-protein interfaces from complexes of experimentally known 3D structures, it never explicitly uses 3D information. Instead, we combine predicted structural features with evolutionary information. The strongest predictions of the method reached over 90% accuracy in a cross-validation experiment. Our results suggest that despite the significant diversity in the nature of protein-protein interactions, they all share common basic principles and that these principles are identifiable from sequence alone.
Affiliation(s)
- Yanay Ofran
- CUBIC & North-East Structural Genomics Consortium, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
|
178
|
Berman HM, Burley SK, Chiu W, Sali A, Adzhubei A, Bourne PE, Bryant SH, Dunbrack RL, Fidelis K, Frank J, Godzik A, Henrick K, Joachimiak A, Heymann B, Jones D, Markley JL, Moult J, Montelione GT, Orengo C, Rossmann MG, Rost B, Saibil H, Schwede T, Standley DM, Westbrook JD. Outcome of a workshop on archiving structural models of biological macromolecules. Structure 2006; 14:1211-7. [PMID: 16955948 DOI: 10.1016/j.str.2006.06.005] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.7] [Indexed: 11/21/2022]
Affiliation(s)
- Helen M Berman
- The Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA.
|
179
|
Abstract
MOTIVATION The study of biological systems, pathways and processes relies increasingly on analyses of networks. Most often, such analyses focus on network topology, thereby treating all proteins or genes as identical, featureless nodes. Integrating molecular data and insights about the qualities of individual proteins into the analysis may enhance our ability to decipher biological pathways and processes. RESULTS Here, we introduce a novel platform for data integration that generates networks on the macro system-level, analyzes the molecular characteristics of each protein on the micro level, and then combines the two levels by using the molecular characteristics to assess networks. It also annotates the function and subcellular localization of each protein and displays the process on an image of a cell, rendering each protein in its respective cellular compartment. By thus visualizing the network in a cellular context we are able to analyze pathways and processes in a novel way. As an example, we use the system to analyze proteins implicated in Alzheimer's disease and show how the integrated view corroborates previous observations and how it helps in the formulation of new hypotheses regarding the molecular underpinnings of the disease. AVAILABILITY http://www.rostlab.org/services/pinat.
Affiliation(s)
- Yanay Ofran
- Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street, New York, NY 10032, USA.
|
180
|
Abstract
PROFtmb predicts transmembrane beta-barrel (TMB) proteins in Gram-negative bacteria. For each query protein, PROFtmb provides both a Z-value indicating that the protein actually contains a membrane barrel, and a four-state per-residue labeling of upward- and downward-facing strands, periplasmic hairpins and extracellular loops. While most users submit individual proteins known to contain TMBs, some groups submit entire proteomes to screen for potential TMBs. Response time is about 4 min for a 500-residue protein. PROFtmb is a profile-based Hidden Markov Model (HMM) with an architecture mirroring the structure of TMBs. The per-residue accuracy on the 8-fold cross-validated testing set is 86%, while whole-protein discrimination accuracy is 70% at 60% coverage. The PROFtmb web server includes all source code, training data and whole-proteome predictions from 78 Gram-negative bacterial genomes and is available freely and without registration at .
Affiliation(s)
- Henry Bigelow
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA.
|
181
|
Passerini A, Punta M, Ceroni A, Rost B, Frasconi P. Identifying cysteines and histidines in transition-metal-binding sites using support vector machines and neural networks. Proteins 2006; 65:305-16. [PMID: 16927295 DOI: 10.1002/prot.21135] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.1] [Indexed: 11/11/2022]
Abstract
Accurate predictions of metal-binding sites in proteins by using sequence as the only source of information can significantly help in the prediction of protein structure and function, genome annotation, and in the experimental determination of protein structure. Here, we introduce a method for identifying histidines and cysteines that participate in binding of several transition metals and iron complexes. The method predicts histidines as being in either of two states (free or metal bound) and cysteines in one of three states (free, metal bound, or in disulfide bridges). The method uses only sequence information by utilizing position-specific evolutionary profiles as well as more global descriptors such as protein length and amino acid composition. Our solution is based on a two-stage machine-learning approach. The first stage consists of a support vector machine trained to locally classify the binding state of single histidines and cysteines. The second stage consists of a bidirectional recurrent neural network trained to refine local predictions by taking into account dependencies among residues within the same protein. A simple finite state automaton is employed as a postprocessing step in the second stage in order to enforce an even number of disulfide-bonded cysteines. We predict histidines and cysteines in transition-metal-binding sites at 73% precision and 61% recall. We observe significant differences in performance depending on the ligand (histidine or cysteine) and on the metal bound. We also predict cysteines participating in disulfide bridges at 86% precision and 87% recall. Results are compared to those that would be obtained by using expert information as represented by PROSITE motifs and, for disulfide bonds, to state-of-the-art methods.
Affiliation(s)
- Andrea Passerini
- Università degli Studi di Firenze, Dipartimento di Sistemi e Informatica Via di Santa Marta 3, 50139 Firenze, Italy.
|
182
|
Abstract
Here we present the evaluation results of the Critical Assessment of Protein Structure Prediction (CASP6) contact prediction category. Contact prediction was assessed with standard measures well known in the field and the performance of specialist groups was evaluated alongside groups that submitted models with 3D coordinates. The evaluation was mainly focused on long range contact predictions for the set of new fold targets, although we analyzed predictions for all targets. Three groups with similar levels of accuracy and coverage performed a little better than the others. Comparisons of the predictions of the three best methods with those of CASP5/CAFASP3 suggested some improvement, although there were not enough targets in the comparisons to make this statistically significant.
Affiliation(s)
- Osvaldo Graña
- Protein Design Group, Centro Nacional de Biotecnologia (CNB-CSIC), C/Darwin 3, Cantoblanco, Madrid, Spain
|
183
|
Abstract
This article is an introduction to the special issue of the journal Proteins, dedicated to the sixth CASP experiment to assess the state of the art in protein structure prediction. The article describes the conduct of the experiment and the categories of prediction included, and outlines the evaluation and assessment procedures. A brief summary of progress over the decade of CASP experiments is also provided.
Affiliation(s)
- John Moult
- Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, Maryland 20850, USA.
|
184
|
Abstract
Experimental high-throughput studies of protein–protein interactions are beginning to provide enough data for comprehensive computational studies. Today, about ten large data sets, each with thousands of interacting pairs, coarsely sample the interactions in fly, human, worm, and yeast. About 55,000 additional pairs of interacting proteins have been identified by more careful, detailed biochemical experiments. Most interactions are experimentally observed in prokaryotes and simple eukaryotes; very few interactions are observed in higher eukaryotes such as mammals. It is commonly assumed that pathways in mammals can be inferred through homology to model organisms, e.g. the experimental observation that two yeast proteins interact is transferred to infer that the two corresponding proteins in human also interact. Two pairs for which the interaction is conserved are often described as interologs. The goal of this investigation was a large-scale comprehensive analysis of such inferences, i.e. of the evolutionary conservation of interologs. Here, we introduced a novel score for measuring the overlap between protein–protein interaction data sets. This measure appeared to reflect the overall quality of the data and was the basis for the two surprising results of our large-scale analysis. Firstly, homology-based inferences of physical protein–protein interactions appeared far less successful than expected. In fact, such inferences were accurate only for extremely high levels of sequence similarity. Secondly, and most surprisingly, the identification of interacting partners through sequence similarity was significantly more reliable for protein pairs within the same organism than for pairs between species. Our analysis underlined that the discrepancies between different datasets are large, even when using the same type of experiment on the same organism. This reality considerably constrains the power of homology-based transfer of interactions. 
In particular, the experimental probing of interactions in distant model organisms has to be undertaken with some caution. More comprehensive images of protein–protein networks will require the combination of many high-throughput methods, including in silico inferences and predictions. http://www.rostlab.org/results/2006/ppi_homology/ The IntAct database contains about ten large-scale data sets of protein–protein interactions. Each set contains thousands of experimentally observed pair interactions. Most pairs were observed in yeast (Saccharomyces cerevisiae), fly (Drosophila melanogaster), and worm (Caenorhabditis elegans). These organisms are often perceived as model organisms in the sense that one can infer that two mouse proteins interact if one experimentally observes that the two corresponding proteins in worm interact. Here, the authors analyzed in detail how the sequence signals of physical protein–protein interactions are conserved. It is a common assumption that protein–protein interactions can easily be inferred through homology transfer from one model organism to another organism of interest. Here, the authors demonstrated that such homology transfers are only accurate at unexpectedly high levels of sequence identity. Even more surprisingly, homology transfers of protein–protein interactions are significantly more reliable for protein pairs from the same species than for two protein pairs from different organisms. The observation that interactions were much more conserved within than across species was valid for all levels of sequence similarity, i.e. for very similar as well as for more diverged interologs.
Affiliation(s)
- Sven Mika
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, USA.
|
185
|
Abstract
Structural flexibility has been associated with various biological processes such as molecular recognition and catalytic activity. In silico studies of protein flexibility have attempted to characterize and predict flexible regions based on simple principles. B-values derived from experimental data are widely used to measure residue flexibility. Here, we present the most comprehensive large-scale analysis of B-values. We used this analysis to develop a neural network-based method that predicts flexible-rigid residues from amino acid sequence. The system uses both global and local information (i.e., features from the entire protein such as secondary structure composition, protein length, and fraction of surface residues, and features from a local window of sequence-consecutive residues). The most important local feature was the evolutionary exchange profile reflecting sequence conservation in a family of related proteins. To illustrate its potential, we applied our method to 4 different case studies, each of which related our predictions to aspects of function. The first 2 were the prediction of regions that undergo conformational switches upon environmental changes (switch II region in Ras) and the prediction of surface regions, the rigidity of which is crucial for their function (tunnel in propeller folds). Both were correctly captured by our method. The third study established that residues in active sites of enzymes are predicted by our method to have unexpectedly low B-values. The final study demonstrated how well our predictions correlated with NMR order parameters to reflect motion. Our method had not been set up to address any of the tasks in those 4 case studies. Therefore, we expect that this method will assist in many attempts at inferring aspects of function.
Affiliation(s)
- Avner Schlessinger
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
|
186
|
Abstract
RIKEN's FANTOM project has revealed many previously unknown coding sequences, as well as an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. Ever more transcripts that do not code for proteins have been identified by transcriptome studies, in general. Increasing evidence points to the important cellular roles of such non-coding RNAs (ncRNAs). The distinction of protein-coding RNA transcripts from ncRNA transcripts is therefore an important problem in understanding the transcriptome and carrying out its annotation. Very few in silico methods have specifically addressed this problem. Here, we introduce CONC (for “coding or non-coding”), a novel method based on support vector machines that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologs from database searches, and alignment entropy. Nucleotide frequencies are also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss-Prot database constituted the set of true positives, ncRNAs from RNAdb and NONCODE the true negatives. Ten-fold cross-validation suggested that CONC distinguished coding RNAs from ncRNAs at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM3 dataset, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000. There are two types of RNA: messenger RNAs (mRNAs), which are translated into proteins, and non-coding RNAs (ncRNAs), which function as RNA molecules. Besides textbook examples such as tRNAs and rRNAs, non-coding RNAs have been found to carry out very diverse functions, from mRNA splicing and RNA modification to translational regulation. 
It has been estimated that non-coding RNAs make up the vast majority of transcription output of higher eukaryotes. Discriminating mRNA from ncRNA has become an important biological and computational problem. The authors describe a computational method based on a machine learning algorithm known as a support vector machine (SVM) that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, secondary structure content, and protein alignment information. The method is applied to the dataset from the FANTOM3 large-scale mouse cDNA sequencing project; it identifies over 14,000 ncRNAs in mouse and estimates the total number of ncRNAs in the FANTOM3 data to be about 28,000.
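Two of the features listed in this abstract, amino acid composition and compositional entropy, can be sketched as follows. This is a minimal illustration that assumes "compositional entropy" means the Shannon entropy (in bits) of the residue frequencies of the translated peptide; the abstract does not give the exact definition used in CONC, and the function names are hypothetical.

```python
import math
from collections import Counter


def composition(peptide: str) -> dict[str, float]:
    """Fraction of each residue type in the peptide."""
    counts = Counter(peptide)
    total = len(peptide)
    return {res: n / total for res, n in counts.items()}


def compositional_entropy(peptide: str) -> float:
    """Shannon entropy (bits) of the residue composition.

    Assumption: this is one plausible reading of 'compositional entropy';
    the definition used by CONC may differ.
    """
    if not peptide:
        return 0.0
    return -sum(f * math.log2(f) for f in composition(peptide).values())


# Toy peptides invented for illustration.
print(compositional_entropy("AAAAAAAA"))  # a single residue type: no diversity
print(compositional_entropy("ACDEFGHI"))  # eight equally frequent residue types
```

Under this reading, a low-complexity peptide scores near zero while a peptide using many residue types evenly scores higher, so the value could serve as one numeric input among the others a classifier receives.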
Affiliation(s)
- Jinfeng Liu
- Columbia University Bioinformatics Center, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, United States of America.
|
187
|
Abstract
Every entirely sequenced genome reveals 100s to 1000s of protein sequences for which the only annotation available is 'hypothetical protein'. Thus, in the human genome and in the genomes of pathogenic agents there could be 1000s of potential, unexplored drug targets. Computational prediction of protein function can play a role in studying these targets. We shall review the challenges, research approaches and recently developed tools in the field of computational function-prediction and we will discuss the ways these issues can change the process of drug discovery.
Affiliation(s)
- Yanay Ofran
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
|
188
|
Abstract
Immunoglobulin molecules specifically recognize particular areas on the surface of proteins. These areas are commonly dubbed B-cell epitopes. The identification of epitopes in proteins is important both for the design of experiments and vaccines. Additionally, the interactions between epitopes and antibodies have often served as a model for protein-protein interactions. One of the main obstacles in creating a database of antigen-antibody interactions is the difficulty in distinguishing between antigenic and non-antigenic interactions. Antigenic interactions involve specific recognition sites on the antibody's surface, while non-antigenic interactions are between a protein and any other site on the antibody. To solve this problem, we performed a comparative analysis of all protein-antibody complexes for which structures have been experimentally determined. Additionally, we developed a semi-automated tool that identified the antigenic interactions within the known antigen-antibody complex structures. We compiled those interactions into Epitome, a database of structure-inferred antigenic residues in proteins. Epitome consists of all known antigen/antibody complex structures, a detailed description of the residues that are involved in the interactions, and their sequence/structure environments. Interactions can be visualized using an interface to Jmol. The database is available at http://www.rostlab.org/services/epitome/.
Affiliation(s)
- Avner Schlessinger
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 1130 St Nicholas Avenue, room 804, New York, NY 10032, USA.
|
189
|
Abstract
The mobility of a residue on the protein surface is closely linked to its function. The identification of extremely rigid or flexible surface residues can therefore contribute information crucial for solving the complex problem of identifying functionally important residues in proteins. Mobility is commonly measured by B-value data from high-resolution three-dimensional X-ray structures. Few methods predict B-values from sequence. Here, we present PROFbval, the first web server to predict normalized B-values from amino acid sequence. The server handles amino acid sequences (or alignments) as input and outputs normalized B-value and two-state (flexible/rigid) predictions. The server also assigns a reliability index for each prediction. For example, PROFbval correctly identifies residues in active sites on the surface of enzymes as particularly rigid. AVAILABILITY http://www.rostlab.org/services/profbval CONTACT profbval@rostlab.org SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
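The idea of normalized B-values with a two-state (flexible/rigid) label can be illustrated on experimental values as follows. This is a sketch under stated assumptions, not the PROFbval procedure: it assumes normalization means a per-chain z-score and uses a hypothetical threshold of 0.0 to split the two states, neither of which is specified in the abstract.

```python
from statistics import mean, pstdev


def normalize_b_values(b_values: list[float]) -> list[float]:
    """Z-score normalization of B-values within one chain (an assumption)."""
    mu = mean(b_values)
    sigma = pstdev(b_values)
    if sigma == 0:
        # All residues have the same B-value: nothing stands out.
        return [0.0 for _ in b_values]
    return [(b - mu) / sigma for b in b_values]


def two_state(normalized: list[float], threshold: float = 0.0) -> list[str]:
    """Label residues above the (hypothetical) threshold as flexible."""
    return ["flexible" if z > threshold else "rigid" for z in normalized]


# Toy B-values invented for illustration.
b_values = [12.0, 15.0, 40.0, 18.0, 55.0]
labels = two_state(normalize_b_values(b_values))
print(list(zip(b_values, labels)))
```

Normalizing within a chain makes values comparable across structures that were solved at different resolutions, which is one reason a predictor might target normalized rather than raw B-values.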
Affiliation(s)
- Avner Schlessinger
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
|
190
|
Powers R, Mirkovic N, Goldsmith-Fischman S, Acton TB, Chiang Y, Huang YJ, Ma L, Rajan PK, Cort JR, Kennedy MA, Liu J, Rost B, Honig B, Murray D, Montelione GT. Solution structure of Archaeglobus fulgidis peptidyl-tRNA hydrolase (Pth2) provides evidence for an extensive conserved family of Pth2 enzymes in archea, bacteria, and eukaryotes. Protein Sci 2005; 14:2849-61. [PMID: 16251366 PMCID: PMC2253226 DOI: 10.1110/ps.051666705] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Indexed: 10/25/2022]
Abstract
The solution structure of protein AF2095 from the thermophilic archaeon Archaeoglobus fulgidus, a 123-residue (13.6-kDa) protein, has been determined by NMR methods. The structure of AF2095 comprises four alpha-helices and a mixed beta-sheet consisting of four parallel and anti-parallel beta-strands, where the alpha-helices sandwich the beta-sheet. Sequence and structural comparison of AF2095 with proteins from Homo sapiens, Methanocaldococcus jannaschii, and Sulfolobus solfataricus reveals that AF2095 is a peptidyl-tRNA hydrolase (Pth2). This structural comparison also identifies putative catalytic residues and a tRNA interaction region for AF2095. The structure of AF2095 is also similar to the structure of protein TA0108 from the archaeon Thermoplasma acidophilum, which is deposited in the Protein Data Bank but not functionally annotated. The NMR structure of AF2095 has been further leveraged to obtain good-quality structural models for 55 other proteins. Although earlier studies have proposed that the Pth2 protein family is restricted to archaeal and eukaryotic organisms, the similarity of the AF2095 structure to human Pth2, the conservation of key active-site residues, and the good quality of the resulting homology models demonstrate a large family of homologous Pth2 proteins that are conserved in eukaryotic, archaeal, and bacterial organisms, providing novel insights into the evolution of the Pth and Pth2 enzyme families.
Affiliation(s)
- Robert Powers
- Department of Chemistry, University of Nebraska-Lincoln, Lincoln, NE 68588, USA.
|
191
|
Snyder DA, Chen Y, Denissova NG, Acton T, Aramini JM, Ciano M, Karlin R, Liu J, Manor P, Rajan PA, Rossi P, Swapna GVT, Xiao R, Rost B, Hunt J, Montelione GT. Comparisons of NMR Spectral Quality and Success in Crystallization Demonstrate that NMR and X-ray Crystallography Are Complementary Methods for Small Protein Structure Determination. J Am Chem Soc 2005; 127:16505-11. [PMID: 16305237 DOI: 10.1021/ja053564h] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.2] [Indexed: 11/28/2022]
Abstract
X-ray crystallography and NMR spectroscopy provide the only sources of experimental data from which protein structures can be analyzed at high or even atomic resolution. The degree to which these methods complement each other as sources of structural knowledge is a matter of debate; it is often proposed that small proteins yielding high quality, readily analyzed NMR spectra are a subset of those that readily yield strongly diffracting crystals. We have examined the correlation between NMR spectral quality and success in structure determination by X-ray crystallography for 159 prokaryotic and eukaryotic proteins, prescreened to avoid proteins providing polydisperse and/or aggregated samples. This study demonstrates that, across this protein sample set, the quality of a protein's [15N-1H]-heteronuclear correlation (HSQC) spectrum recorded under conditions generally suitable for 3D structure determination by NMR, a key predictor of the ability to determine a structure by NMR, is not correlated with successful crystallization and structure determination by X-ray crystallography. These results, together with similar results of an independent study presented in the accompanying paper (Yee, et al., J. Am. Chem. Soc., accompanying paper), demonstrate that X-ray crystallography and NMR often provide complementary sources of structural data and that both methods are required in order to optimize success for as many targets as possible in large-scale structural proteomics efforts.
Affiliation(s)
- David A Snyder
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers University, Piscataway, New Jersey 08854, USA
|
192
|
Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, Kodzius R, Shimokawa K, Bajic VB, Brenner SE, Batalov S, Forrest ARR, Zavolan M, Davis MJ, Wilming LG, Aidinis V, Allen JE, Ambesi-Impiombato A, Apweiler R, Aturaliya RN, Bailey TL, Bansal M, Baxter L, Beisel KW, Bersano T, Bono H, Chalk AM, Chiu KP, Choudhary V, Christoffels A, Clutterbuck DR, Crowe ML, Dalla E, Dalrymple BP, de Bono B, Della Gatta G, di Bernardo D, Down T, Engstrom P, Fagiolini M, Faulkner G, Fletcher CF, Fukushima T, Furuno M, Futaki S, Gariboldi M, Georgii-Hemming P, Gingeras TR, Gojobori T, Green RE, Gustincich S, Harbers M, Hayashi Y, Hensch TK, Hirokawa N, Hill D, Huminiecki L, Iacono M, Ikeo K, Iwama A, Ishikawa T, Jakt M, Kanapin A, Katoh M, Kawasawa Y, Kelso J, Kitamura H, Kitano H, Kollias G, Krishnan SPT, Kruger A, Kummerfeld SK, Kurochkin IV, Lareau LF, Lazarevic D, Lipovich L, Liu J, Liuni S, McWilliam S, Madan Babu M, Madera M, Marchionni L, Matsuda H, Matsuzawa S, Miki H, Mignone F, Miyake S, Morris K, Mottagui-Tabar S, Mulder N, Nakano N, Nakauchi H, Ng P, Nilsson R, Nishiguchi S, Nishikawa S, Nori F, Ohara O, Okazaki Y, Orlando V, Pang KC, Pavan WJ, Pavesi G, Pesole G, Petrovsky N, Piazza S, Reed J, Reid JF, Ring BZ, Ringwald M, Rost B, Ruan Y, Salzberg SL, Sandelin A, Schneider C, Schönbach C, Sekiguchi K, Semple CAM, Seno S, Sessa L, Sheng Y, Shibata Y, Shimada H, Shimada K, Silva D, Sinclair B, Sperling S, Stupka E, Sugiura K, Sultana R, Takenaka Y, Taki K, Tammoja K, Tan SL, Tang S, Taylor MS, Tegner J, Teichmann SA, Ueda HR, van Nimwegen E, Verardo R, Wei CL, Yagi K, Yamanishi H, Zabarovsky E, Zhu S, Zimmer A, Hide W, Bult C, Grimmond SM, Teasdale RD, Liu ET, Brusic V, Quackenbush J, Wahlestedt C, Mattick JS, Hume DA, Kai C, Sasaki D, Tomaru Y, Fukuda S, Kanamori-Katayama M, Suzuki M, Aoki J, Arakawa T, Iida J, Imamura K, Itoh M, Kato T, Kawaji H, Kawagashira N, Kawashima T, Kojima M, Kondo S, Konno H, Nakano K, Ninomiya N, 
Nishio T, Okada M, Plessy C, Shibata K, Shiraki T, Suzuki S, Tagami M, Waki K, Watahiki A, Okamura-Oho Y, Suzuki H, Kawai J, Hayashizaki Y. The transcriptional landscape of the mammalian genome. Science 2005; 309:1559-63. [PMID: 16141072 DOI: 10.1126/science.1112014] [Citation(s) in RCA: 2607] [Impact Index Per Article: 137.2] [Indexed: 12/29/2022]
Abstract
This study describes comprehensive polling of transcription start and termination sites and analysis of previously unidentified full-length complementary DNAs derived from the mouse genome. We identify the 5' and 3' boundaries of 181,047 transcripts with extensive variation in transcripts arising from alternative promoter usage, splicing, and polyadenylation. There are 16,247 new mouse protein-coding transcripts, including 5154 encoding previously unidentified proteins. Genomic mapping of the transcriptome reveals transcriptional forests, with overlapping transcription on both strands, separated by deserts in which few transcripts are observed. The data provide a comprehensive platform for the comparative analysis of mammalian transcriptional regulation in differentiation and development.
|
193
|
Abstract
Here we introduce EVAcon, an automated web service that evaluates the performance of contact prediction servers. Currently, EVAcon is monitoring nine servers: four specialize in contact prediction and five are general structure prediction servers. Results are compared for all newly determined experimental structures deposited into PDB (∼5–50 per week). EVAcon allows for a precise comparison of the results based on a system of common protein subsets and the commonly accepted evaluation criteria that are also used in the corresponding category of the CASP assessment. EVAcon is a new service added to the functionality of the EVA system for the continuous evaluation of protein structure prediction servers. The new service is accessible from any of the three EVA mirrors: PDG (CNB-CSIC, Madrid); CUBIC (Columbia University, NYC); and Sali Lab (UCSF, San Francisco).
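As an illustration of the kind of criterion used to score contact predictions, the sketch below computes the accuracy of the top-k ranked predicted contacts against contacts observed in an experimental structure. The abstract does not list EVAcon's exact measures, so this particular score, the choice of k, and the function name `top_k_accuracy` are assumptions, not a description of the service.

```python
def top_k_accuracy(predicted, true_contacts, k):
    """Fraction of the k highest-scoring predicted contacts that are real.

    predicted: list of (i, j, score) tuples, residue indices plus confidence.
    true_contacts: set of (i, j) pairs with i < j, taken from the structure.
    k: number of top-ranked predictions to evaluate (hypothetical choice,
       often tied to protein length in practice).
    """
    ranked = sorted(predicted, key=lambda p: p[2], reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(
        1 for i, j, _ in ranked
        if (min(i, j), max(i, j)) in true_contacts
    )
    return hits / len(ranked)


# Toy data (made-up numbers): four predictions, two true contacts.
predicted = [(1, 30, 0.9), (5, 50, 0.8), (2, 40, 0.4), (7, 60, 0.7)]
true_contacts = {(1, 30), (7, 60)}

print(top_k_accuracy(predicted, true_contacts, k=2))
```

Evaluating only the highest-ranked predictions reflects how such predictions tend to be used: a few confident contacts are more useful as restraints than a long, noisy list.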
Affiliation(s)
- Volker A. Eyrich
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
- Burkhard Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
- Alfonso Valencia
- To whom correspondence should be addressed. Tel: +34 91 585 4570; Fax: +34 91 585 4506;
|
194
|
|
195
|
Abstract
Motivation: Despite the continuing advance in the experimental determination of protein structures, the gap between the number of known protein sequences and structures continues to increase. Prediction methods can bridge this sequence-structure gap only partially. Better predictions of non-local contacts between residues could improve comparative modeling and fold recognition, and could assist in experimental structure determination. Results: Here, we introduced PROFcon, a novel contact prediction method that combines information from alignments, from predictions of secondary structure and solvent accessibility, from the region between two residues and from the average properties of the entire protein. In contrast to some other methods, PROFcon predicted short and long proteins at similar levels of accuracy. As expected, PROFcon was clearly less accurate when tested on sparse evolutionary profiles, that is, on families with few homologs. Prediction accuracy was highest for proteins belonging to the SCOP alpha/beta class. PROFcon compared favorably with state-of-the-art prediction methods at the CASP6 meeting. While the performance may still be perceived as low, our method clearly pushed the mark higher. Furthermore, predictions are already accurate enough to seed predictions of global features of protein structure.
Affiliation(s)
- Marco Punta
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
|
196
|
Abstract
Folding rates of small single-domain proteins that fold through simple two-state kinetics can be estimated from details of the three-dimensional protein structure. Previously, predictions of secondary structure had been exploited to predict folding rates from sequence. Here, we estimate two-state folding rates from predictions of internal residue-residue contacts in proteins of unknown structure. Our estimate is based on the correlation between the folding rate and the number of predicted long-range contacts normalized by the square of the protein length. It is well known that long-range order derived from known structures correlates with folding rates. The surprise was that estimates based on very noisy contact predictions were almost as accurate as the estimates based on known contacts. On average, our estimates were similar to those previously published from secondary structure predictions. The combination of these methods that exploit different sources of information improved performance. It appeared that the combined method reliably distinguished fast from slow two-state folders.
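The quantity the abstract correlates with folding rate, the number of predicted long-range contacts normalized by the square of the protein length, can be sketched as follows. The sequence-separation cutoff that defines "long-range" is not given in the abstract, so `min_separation=12` is an assumed placeholder, and the function name is hypothetical.

```python
def long_range_contact_density(contacts, length, min_separation=12):
    """Number of long-range contacts divided by the squared protein length.

    contacts: iterable of (i, j) residue-index pairs predicted to be in contact.
    length: number of residues in the protein.
    min_separation: sequence separation |i - j| from which a contact counts
        as long-range (assumed value, not taken from the abstract).
    """
    if length <= 0:
        raise ValueError("length must be positive")
    n_long = sum(1 for i, j in contacts if abs(i - j) >= min_separation)
    return n_long / (length ** 2)


# Toy example (made-up contacts) for an 80-residue protein:
# (3, 40) and (10, 60) are long-range, the other two are not.
contacts = [(1, 5), (3, 40), (10, 60), (20, 25)]
density = long_range_contact_density(contacts, length=80)
print(density)
```

A folding-rate estimate would then come from fitting this density against measured rates for two-state folders; that fit is not reproduced here because the abstract gives no coefficients.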
Affiliation(s)
- Marco Punta
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
|
197
|
Benach J, Edstrom WC, Lee I, Das K, Cooper B, Xiao R, Liu J, Rost B, Acton TB, Montelione GT, Hunt JF. The 2.35 Å structure of the TenA homolog from Pyrococcus furiosus supports an enzymatic function in thiamine metabolism. Acta Crystallogr D Biol Crystallogr 2005; 61:589-98. [PMID: 15858269 DOI: 10.1107/s0907444905005147] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Received: 11/22/2004] [Accepted: 02/15/2005] [Indexed: 11/11/2022]
Abstract
TenA (transcriptional enhancer A) has been proposed to function as a transcriptional regulator based on observed changes in gene-expression patterns when overexpressed in Bacillus subtilis. However, studies of the distribution of proteins involved in thiamine biosynthesis in different fully sequenced genomes have suggested that TenA may be an enzyme involved in thiamine biosynthesis, with a function related to that of the ThiC protein. The crystal structure of PF1337, the TenA homolog from Pyrococcus furiosus, is presented here. The protomer comprises a bundle of alpha-helices with a similar tertiary structure and topology to that of human heme oxygenase-1, even though there is no significant sequence homology. A solvent-sequestered cavity lined by phylogenetically conserved residues is found at the core of this bundle in PF1337 and this cavity is observed to contain electron density for 4-amino-5-hydroxymethyl-2-methylpyrimidine phosphate, the product of the ThiC enzyme. In contrast, the modestly acidic surface of PF1337 shows minimal levels of sequence conservation and a dearth of the basic residues that are typically involved in DNA binding in transcription factors. Without significant conservation of its surface properties, TenA is unlikely to mediate functionally important protein-protein or protein-DNA interactions. Therefore, the crystal structure of PF1337 supports the hypothesis that TenA homologs have an indirect effect in altering gene-expression patterns and function instead as enzymes involved in thiamine metabolism.
Affiliation(s)
- Jordi Benach
- Department of Biological Sciences and Northeast Structural Genomics Consortium, 702A Fairchild Center, MC2434, Columbia University, New York, NY 10027, USA
|
198
|
Abstract
The nuclear matrix (NM) is a structure resulting from the aggregation of proteins and RNA in the nucleus of eukaryotic cells; it is the 'sticky bit' that remains after aggressive DNase digestion and salt extraction protocols. Owing to the important role of the NM in DNA replication, DNA transcription and RNA splicing, the expression pattern of NM proteins has become an important early indicator for numerous cancers/tumors. Recent descriptions of the NM structure distinguish between a network-like 'internal nuclear matrix' (INM) and a 'nuclear shell' that connects the INM to the inner and outer nuclear membranes. A cautious NM preparation protocol reveals a coat of proteins on top of the INM; these proteins are usually referred to as the 'nuclear matrix-associated proteins'. Here, we describe a new database (NMPdb at http://www.rostlab.org/db/NMPdb/) that currently contains details of 398 NM proteins. We collected these data through a semi-automated analysis of over 3000 scientific articles in PubMed. We could match these 398 proteins to 302 protein sequences in UniProt or GenBank. Our NMPdb repository stores these links along with the following annotations: organism, cell type, PubMed identifier, sequence-based predictions of structural and functional features and, for some entries, the explicit sequence segment that is responsible for localization (nuclear matrix targeting signal).
Affiliation(s)
- Sven Mika
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
|
199
|
Nair R, Rost B. Mimicking Cellular Sorting Improves Prediction of Subcellular Localization. J Mol Biol 2005; 348:85-100. [PMID: 15808855 DOI: 10.1016/j.jmb.2005.02.025] [Citation(s) in RCA: 237] [Impact Index Per Article: 12.5] [Received: 12/27/2004] [Revised: 02/08/2005] [Accepted: 02/09/2005] [Indexed: 11/24/2022]
Abstract
Predicting the native subcellular compartment of a protein is an important step toward elucidating its function. Here we introduce LOCtree, a hierarchical system combining support vector machines (SVMs) and other prediction methods. LOCtree predicts the subcellular compartment of a protein by mimicking the mechanism of cellular sorting and exploiting a variety of sequence and predicted structural features in its input. Currently LOCtree does not predict localization for membrane proteins, since the compositional properties of membrane proteins significantly differ from those of non-membrane proteins. While any information about function can be used by the system, we present estimates of performance that are valid when only the amino acid sequence of a protein is known. When evaluated on a non-redundant test set, LOCtree achieved sustained levels of 74% accuracy for non-plant eukaryotes, 70% for plants, and 84% for prokaryotes. We rigorously benchmarked LOCtree in comparison to the best alternative methods for localization prediction. LOCtree outperformed all other methods in nearly all benchmarks. Localization assignments using LOCtree agreed quite well with data from recent large-scale experiments. Our preliminary analysis of a few entirely sequenced organisms, namely human (Homo sapiens), yeast (Saccharomyces cerevisiae), and weed (Arabidopsis thaliana) suggested that over 35% of all non-membrane proteins are nuclear, about 20% are retained in the cytosol, and that every fifth protein in the weed resides in the chloroplast.
Affiliation(s)
- Rajesh Nair
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
|
200
|
|