1
Methods for the Refinement of Protein Structure 3D Models. Int J Mol Sci 2019; 20:ijms20092301. [PMID: 31075942 PMCID: PMC6539982 DOI: 10.3390/ijms20092301] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 04/02/2019] [Revised: 04/24/2019] [Accepted: 05/07/2019] [Indexed: 12/25/2022]
Abstract
The refinement of predicted 3D protein models is crucial in bringing them closer to experimental accuracy for further computational studies. Refinement approaches can be divided into two main stages: the sampling and scoring stages. Sampling strategies, such as the popular Molecular Dynamics (MD)-based protocols, aim to generate improved 3D models. However, generating 3D models that are closer to the native structure than the initial model remains challenging, as structural deviations from the native basin can be encountered due to force-field inaccuracies. Therefore, different restraint strategies have been applied in order to avoid deviations away from the native structure. For example, the accurate prediction of local errors and/or contacts in the initial models can be used to guide restraints. MD-based protocols, using physics-based force fields and smart restraints, have made significant progress towards a more consistent refinement of 3D models. In the scoring stage, energy functions and Model Quality Assessment Programs (MQAPs) are used to discriminate near-native conformations from non-native conformations. Nevertheless, there are often very small differences among generated 3D models in refinement pipelines, which makes model discrimination and selection problematic. For this reason, the identification of the most native-like conformations remains a major challenge.
2
Fukushima M. Constructing failure in big biology: The socio-technical anatomy of Japan's Protein 3000 Project. Social Studies of Science 2016; 46:7-33. [PMID: 26983170 DOI: 10.1177/0306312715612146] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
This study focuses on the 5-year Protein 3000 Project launched in 2002, the largest biological project in Japan. The project aimed to overcome Japan's alleged failure to contribute fully to the Human Genome Project, by determining 3000 protein structures, 30 percent of the global target. Despite its achievement of this goal, the project was fiercely criticized in various sectors of society and was often branded an awkward failure. This article tries to solve the mystery of why such failure discourse was prevalent. Three explanatory factors are offered: first, because some goals were excluded during project development, there was a dynamic of failed expectations; second, structural genomics, while promoting collaboration with the international community, became an 'anti-boundary object', only the absence of which bound heterogeneous domestic actors; third, there developed an urgent sense of international competition in order to obtain patents on such structural information.
3
Chandonia JM, Brenner S. Update on the Pfam5000 strategy for selection of structural genomics targets. Conference Proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference 2012; 2006:751-5. [PMID: 17282292 DOI: 10.1109/iembs.2005.1616523] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
Abstract
Structural Genomics is an international effort to determine the three-dimensional shapes of all important biological macromolecules, with a primary focus on proteins. Target proteins should be selected according to a strategy that is medically and biologically relevant, of good financial value, and tractable. In 2003, we presented the "Pfam5000" strategy, which involves selecting the 5,000 most important families from the Pfam database as sources for targets. In this update, we show that although both the Pfam database and the number of sequenced genomes have increased in size, the expected benefits of the Pfam5000 strategy have not changed substantially. Solving the structures of proteins from the 5,000 largest Pfam families would allow accurate fold assignment for approximately 65% of all prokaryotic proteins (covering 54% of residues) and 63% of eukaryotic proteins (42% of residues). Fewer than 2,300 of the largest families on this list remain to be solved, making the project feasible in the next five years given the expected throughput to be achieved in the production phase of the Protein Structure Initiative.
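The coverage figures quoted in this abstract come down to asking what share of proteins falls into the N largest families. The sketch below illustrates that arithmetic only; the function name `fold_coverage` and the toy family sizes are hypothetical, and it counts whole proteins rather than domains or residues, which is a simplification of the published analysis.

```python
def fold_coverage(family_sizes, total_proteins, n_families):
    """Percentage of proteins that belong to the n largest families.

    family_sizes: number of proteins in each family (hypothetical input).
    total_proteins: all proteins, including those in no family.
    """
    largest = sorted(family_sizes, reverse=True)[:n_families]
    return 100.0 * sum(largest) / total_proteins


# Toy data, not taken from Pfam: five families plus 20 unclassified proteins.
sizes = [50, 30, 10, 5, 5]
total = sum(sizes) + 20
print(f"{fold_coverage(sizes, total, 2):.1f}% of proteins covered by the 2 largest families")
```

Because family sizes are typically very uneven, most of the coverage comes from the first few large families, which is the intuition behind prioritising large families as targets.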
Affiliation(s)
- J-M Chandonia
- Berkeley Structural Genomics Center, Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
4
Ellingson L, Zhang J. Protein surface matching by combining local and global geometric information. PLoS One 2012; 7:e40540. [PMID: 22815760 PMCID: PMC3398928 DOI: 10.1371/journal.pone.0040540] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 11/30/2011] [Accepted: 06/12/2012] [Indexed: 01/01/2023]
Abstract
Comparison of the binding sites of proteins is an effective means for predicting protein functions based on their structure information. Despite the importance of this problem and much research in the past, it is still very challenging to predict the binding ligands from the atomic structures of protein binding sites. Here, we designed a new algorithm, TIPSA (Triangulation-based Iterative-closest-point for Protein Surface Alignment), based on the iterative closest point (ICP) algorithm. TIPSA aims to find the maximum number of atoms that can be superposed between two protein binding sites, where any pair of superposed atoms has a distance smaller than a given threshold. The search starts from similar tetrahedra between two binding sites obtained from 3D Delaunay triangulation and uses the Hungarian algorithm to find additional matched atoms. We found that, due to the plasticity of protein binding sites, matching the rigid body of point clouds of protein binding sites is not adequate for satisfactory binding ligand prediction. We further incorporated global geometric information, the radius of gyration of binding site atoms, and used nearest neighbor classification for binding site prediction. Tested on benchmark data, our method achieved a performance comparable to the best methods in the literature, while simultaneously providing the common atom set and atom correspondences.
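Two of the quantities named in this abstract can be made concrete with a short sketch: the radius of gyration of the binding-site atoms (the global descriptor) and the number of superposed atom pairs closer than a threshold (the quantity TIPSA maximises). This is not the TIPSA implementation; the function names are hypothetical, the radius of gyration is unweighted by mass, and the atom correspondences are assumed to be given rather than found by triangulation and the Hungarian algorithm.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) coordinates."""
    n = len(coords)
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    mean_sq = sum(
        sum((c[i] - centroid[i]) ** 2 for i in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)


def count_superposed(site_a, site_b, pairs, threshold):
    """Count matched atom pairs (i, j) whose distance is below the threshold.

    pairs is an assumed, precomputed correspondence between the two sites,
    already expressed in a common coordinate frame.
    """
    return sum(1 for i, j in pairs if math.dist(site_a[i], site_b[j]) < threshold)


# Toy coordinates, not real binding sites.
site_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
site_b = [(0.0, 0.0, 0.5), (5.0, 0.0, 0.0)]
print(radius_of_gyration(site_a))
print(count_superposed(site_a, site_b, [(0, 0), (1, 1)], threshold=1.0))
```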
Affiliation(s)
- Leif Ellingson
- Department of Mathematics and Statistics, Texas Tech University, Lubbock, Texas, United States of America
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, Florida, United States of America
5
NMR-based structural biology of proteins in supercooled water. J Struct Funct Genomics 2011; 12:1-7. [DOI: 10.1007/s10969-011-9111-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 02/22/2011] [Accepted: 04/20/2011] [Indexed: 10/18/2022]
6
Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W, Jaroszewski L, Cieplak P, Miller CS, Li H, Mashiyama ST, Joachimiak MP, van Belle C, Chandonia JM, Soergel DA, Zhai Y, Natarajan K, Lee S, Raphael BJ, Bafna V, Friedman R, Brenner SE, Godzik A, Eisenberg D, Dixon JE, Taylor SS, Strausberg RL, Frazier M, Venter JC. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 2007; 5:e16. [PMID: 17355171 PMCID: PMC1821046 DOI: 10.1371/journal.pbio.0050016] [Citation(s) in RCA: 534] [Impact Index Per Article: 31.4] [Received: 03/24/2006] [Accepted: 08/15/2006] [Indexed: 02/04/2023]
Abstract
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature. 
The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. Given the wide-ranging roles microbes play in many ecosystems, metagenomics studies of microbial communities will reveal insights into protein families and their evolution. Because most microbes will not grow in the laboratory using current cultivation techniques, scientists have turned to cultivation-independent techniques to study microbial diversity. One such technique—shotgun sequencing—allows random sampling of DNA sequences to examine the genomic material present in a microbial community. We used shotgun sequencing to examine microbial communities in water samples collected by the Sorcerer II Global Ocean Sampling (GOS) expedition. Our analysis predicted more than six million proteins in the GOS data—nearly twice the number of proteins present in current databases. These predictions add tremendous diversity to known protein families and cover nearly all known prokaryotic protein families. Some of the predicted proteins had no similarity to any currently known proteins and therefore represent new families. A higher than expected fraction of these novel families is predicted to be of viral origin. We also found that several protein domains that were previously thought to be kingdom specific have GOS examples in other kingdoms. Our analysis opens the door for a multitude of follow-up protein family analyses and indicates that we are a long way from sampling all the protein families that exist in nature. The GOS data identified 6.12 million predicted proteins covering nearly all known prokaryotic protein families, and several new families. This almost doubles the number of known proteins and shows that we are far from identifying all the proteins in nature.
Affiliation(s)
- Shibu Yooseph
- J. Craig Venter Institute, Rockville, Maryland, United States of America.
7
Stumpff-Kane AW, Maksimiak K, Lee MS, Feig M. Sampling of near-native protein conformations during protein structure refinement using a coarse-grained model, normal modes, and molecular dynamics simulations. Proteins 2007; 70:1345-56. [PMID: 17876825 DOI: 10.1002/prot.21674] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Indexed: 11/05/2022]
Abstract
Protein structure refinement from comparative models with the goal of predicting structures at near-experimental accuracy remains an unsolved problem. Structure refinement might be achieved with an iterative protocol where the most native-like structure from a set of decoys generated from an initial model in one cycle is used as the starting structure for the next cycle. Conformational sampling based on the coarse-grained SICHO model, atomic level of detail molecular dynamics simulations, and normal-mode analysis is compared in the context of such a protocol. All of the sampling methods can achieve significant refinement close to experimental structures, although the distribution of structures and the ability to reach native-like structures differ greatly. Implications for the practical application of such sampling methods and the requirements for scoring functions in an iterative refinement protocol are analyzed in the context of theoretical predictions for the distribution of protein-like conformations with a random sampling protocol.
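The iterative protocol described in this abstract (generate decoys from the current model, pick the most native-like one, restart from it) can be sketched as a simple loop. Everything below is a toy under stated assumptions: `iterative_refinement`, `toy_sample` and `toy_score` are hypothetical names, a "structure" is reduced to a single number, and the score stands in for an idealised measure of distance from the native state. Keeping the current model among the candidates is a choice made for this sketch and is not stated in the abstract.

```python
import random


def iterative_refinement(start, sample, score, cycles=5, decoys_per_cycle=20, seed=0):
    """Repeatedly sample decoys around the current model and keep the best one."""
    rng = random.Random(seed)
    current = start
    for _ in range(cycles):
        decoys = [sample(current, rng) for _ in range(decoys_per_cycle)]
        # Keep the current model as a candidate so the score never gets worse
        # (an assumption of this sketch).
        current = min(decoys + [current], key=score)
    return current


# Toy stand-ins: the native state is 0.0 and sampling is a random perturbation.
def toy_sample(model, rng):
    return model + rng.uniform(-1.0, 1.0)


def toy_score(model):
    return abs(model)


refined = iterative_refinement(10.0, toy_sample, toy_score)
print(toy_score(10.0), "->", toy_score(refined))
```

In practice the selection step is the hard part: the loop only converges towards the native state if the scoring function can actually rank decoys by native-likeness.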
Affiliation(s)
- Andrew W Stumpff-Kane
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824-1319, USA
8
Chandonia JM, Kim SH. Structural proteomics of minimal organisms: conservation of protein fold usage and evolutionary implications. BMC Structural Biology 2006; 6:7. [PMID: 16566839 PMCID: PMC1488858 DOI: 10.1186/1472-6807-6-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 10/20/2005] [Accepted: 03/28/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Determining the complete repertoire of protein structures for all soluble, globular proteins in a single organism has been one of the major goals of several structural genomics projects in recent years. RESULTS: We report that this goal has nearly been reached for several "minimal organisms" (parasites or symbionts with reduced genomes), for which over 95% of the soluble, globular proteins may now be assigned folds, that is, overall 3-D backbone structures. We analyze the structures of these proteins as they relate to cellular functions, and compare conservation of fold usage between functional categories. We also compare patterns in the conservation of folds among minimal organisms and those observed between minimal organisms and other bacteria. CONCLUSION: We find that proteins performing essential cellular functions closely related to transcription and translation exhibit a higher degree of conservation in fold usage than proteins in other functional categories. Folds related to transcription and translation functional categories were also overrepresented in minimal organisms compared to other bacteria.
Affiliation(s)
- John-Marc Chandonia
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Sung-Hou Kim
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Department of Chemistry, University of California, Berkeley, CA 94720, USA
9
Chandonia JM, Brenner SE. Implications of structural genomics target selection strategies: Pfam5000, whole genome, and random approaches. Proteins 2006; 58:166-79. [PMID: 15521074 DOI: 10.1002/prot.20298] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.1] [Indexed: 11/11/2022]
Abstract
Structural genomics is an international effort to determine the three-dimensional shapes of all important biological macromolecules, with a primary focus on proteins. Target proteins should be selected according to a strategy that is medically and biologically relevant, of good value, and tractable. As an option to consider, we present the "Pfam5000" strategy, which involves selecting the 5000 most important families from the Pfam database as sources for targets. We compare the Pfam5000 strategy to several other proposed strategies that would require similar numbers of targets. These strategies include complete solution of several small to moderately sized bacterial proteomes, partial coverage of the human proteome, and random selection of approximately 5000 targets from sequenced genomes. We measure the impact that successful implementation of these strategies would have upon structural interpretation of the proteins in Swiss-Prot, TrEMBL, and 131 complete proteomes (including 10 of eukaryotes) from the Proteome Analysis database at the European Bioinformatics Institute (EBI). Solving the structures of proteins from the 5000 largest Pfam families would allow accurate fold assignment for approximately 68% of all prokaryotic proteins (covering 59% of residues) and 61% of eukaryotic proteins (40% of residues). More fine-grained coverage that would allow accurate modeling of these proteins would require an order of magnitude more targets. The Pfam5000 strategy may be modified in several ways, for example, to focus on larger families, bacterial sequences, or eukaryotic sequences; as long as secondary consideration is given to large families within Pfam, coverage results vary only slightly. 
In contrast, focusing structural genomics on a single tractable genome would have only a limited impact in structural knowledge of other proteomes: A significant fraction (about 30-40% of the proteins and 40-60% of the residues) of each proteome is classified in small families, which may have little overlap with other species of interest. Random selection of targets from one or more genomes is similar to the Pfam5000 strategy in that proteins from larger families are more likely to be chosen, but substantial effort would be spent on small families.
Affiliation(s)
- John-Marc Chandonia
- Berkeley Structural Genomics Center, Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA
10
Estrada E. A protein folding degree measure and its dependence on crystal packing, protein size, secondary structure, and domain structural class. J Chem Inf Comput Sci 2005; 44:1238-50. [PMID: 15272831 DOI: 10.1021/ci034278x] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/30/2022]
Abstract
Comparing two or more protein structures with respect to their degree of folding is common practice in structural biology despite the fact that there is no scale for a folding degree. Here we introduce a formal definition of a folding degree, capable of quantitative characterization. This enables ordering among protein chains based on their degree of folding. The folding degree of a data set of 152 representative nonhomologous proteins is then studied. We demonstrate that the variation in the folding degree seen for this data set is not due to crystallization artifacts or experimental conditions, such as resolution, refinement protocol, pH, or temperature. A good linear relationship is observed between the folding degree and the percentages of secondary structures in the protein. The folding degree is able to account for the small changes produced in the structure due to crystal packing and temperature. Automating the classification of proteins into their respective structural domain classes, namely mainly-alpha, mainly-beta, and alpha-beta, is also possible.
Affiliation(s)
- Ernesto Estrada
- Molecular Informatics, X-ray Unit, RIAIDT, Edificio CACTUS, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain.
11
Szyperski T, Mills JL, Perl D, Balbach J. Combined NMR-observation of cold denaturation in supercooled water and heat denaturation enables accurate measurement of ΔCp of protein unfolding. European Biophysics Journal 2005; 35:363-6. [PMID: 16240113 DOI: 10.1007/s00249-005-0028-4] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.1] [Received: 07/27/2005] [Revised: 09/23/2005] [Accepted: 09/28/2005] [Indexed: 10/25/2022]
Abstract
Cold and heat denaturation of the double mutant Arg3→Glu/Leu66→Glu of cold shock protein Csp of Bacillus caldolyticus was monitored using 1D ¹H NMR spectroscopy in the temperature range from -12 °C in supercooled water up to +70 °C. The fraction of unfolded protein, f_u, was determined as a function of the temperature. The data characterizing the unfolding transitions could be consistently interpreted in the framework of two-state models: cold and heat denaturation temperatures were determined to be -11 °C and 39 °C, respectively. A joint fit to both cold and heat transition data enabled the accurate spectroscopic determination of the heat capacity difference between native and denatured state, ΔCp of unfolding. The approach described in this letter, or a variant thereof, is generally applicable and promises to be of value for routine studies of protein folding.
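A two-state model with a non-zero heat capacity difference predicts both a heat and a cold denaturation midpoint, which is why a joint fit constrains ΔCp. The sketch below uses the standard Gibbs-Helmholtz form with a temperature-independent ΔCp; the abstract does not give the fitting equation, so this form is an assumption, and the enthalpy and ΔCp values are illustrative rather than taken from the paper. Only the 39 °C heat denaturation midpoint comes from the abstract.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)


def delta_g_unfolding(t, t_m, dh_m, dcp):
    """Unfolding free energy (kJ/mol) at temperature t (K).

    t_m: heat denaturation midpoint (K); dh_m: unfolding enthalpy at t_m (kJ/mol);
    dcp: heat capacity change of unfolding (kJ/(mol K)), assumed constant.
    """
    return dh_m * (1.0 - t / t_m) + dcp * ((t - t_m) - t * math.log(t / t_m))


def fraction_unfolded(t, t_m, dh_m, dcp):
    """Two-state fraction of unfolded protein, f_u = K / (1 + K)."""
    dg = delta_g_unfolding(t, t_m, dh_m, dcp)
    return 1.0 / (1.0 + math.exp(dg / (R * t)))


T_M = 312.15  # 39 degrees C, the heat denaturation midpoint from the abstract
DH_M = 100.0  # illustrative value, kJ/mol
DCP = 4.0     # illustrative value, kJ/(mol K)

for celsius in (-18, -11, 17, 39, 60):
    t = celsius + 273.15
    print(celsius, round(fraction_unfolded(t, T_M, DH_M, DCP), 3))
```

With a positive ΔCp the free energy curve is concave, so it crosses zero twice: once at the heat denaturation midpoint and once at a lower temperature, where the protein unfolds again on cooling.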
Affiliation(s)
- Thomas Szyperski
- Department of Chemistry, The State University of New York, Buffalo, NY 14260, USA.
12
Czajlik A, Perczel A. Peptide models XXXII. Computed chemical shift analysis of penetratin fragments. J Mol Struct THEOCHEM 2004. [DOI: 10.1016/j.theochem.2003.12.036] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/26/2022]
13
Abstract
The rapid developments in biotechnology create a great demand for fluid handling systems on the nano- and picoliter scale. The characterization of minute quantities of DNA or protein samples requires highly integrated, automated, and miniaturized "total analysis systems" (μ-TAS). The small scales necessitate new concepts for devices both from a technological and from a fundamental physical point of view. Here, we describe recent trends in both areas. New technologies include soft lithography, chemical, and topographical structuring of surfaces in order to define pathways for liquids, as well as electro-wetting for manipulation purposes. Fundamentally, the interplay between geometric confinement and the size of biological macromolecules gives rise to complex dynamic behavior. The combination of both fluorescence imaging and scattering techniques allows for detailed insight into the dynamics of individual molecules and into their self-assembly into supramolecular aggregates.
Affiliation(s)
- Thomas Pfohl
- Universität Ulm, Albert-Einstein-Allee 11, 89069 Ulm, Germany
14
Abstract
Better mechanistic understanding of disease through mapping of the human and mouse genomes enables rethinking of human infirmity. In the case of cancer, for example, we may begin to associate disease states with their underlying genetic defects rather than with the organ system involved. That will enable more selective, nontoxic therapies in patients who are genetically predisposed to respond to them. Because one of the major goals of molecular imaging research is to interrogate gene expression noninvasively, it can impact greatly on that process. Most of molecular imaging research is undertaken in small animals, which provide a conduit between in vitro studies and human clinical imaging. We are fortunate to be able to manipulate small animals genetically, and to have increasingly better models of human disease. The ability to study those animals noninvasively and quantitatively with new, high-resolution imaging devices provides the most relevant milieu in which to find and examine new therapies.
Affiliation(s)
- Martin G Pomper
- Johns Hopkins University School of Medicine, Department of Radiology, Baltimore, Maryland 21287-2182, USA.
15
Bader GD, Heilbut A, Andrews B, Tyers M, Hughes T, Boone C. Functional genomics and proteomics: charting a multidimensional map of the yeast cell. Trends Cell Biol 2003; 13:344-56. [PMID: 12837605 DOI: 10.1016/s0962-8924(03)00127-2] [Citation(s) in RCA: 79] [Impact Index Per Article: 3.8] [Indexed: 02/02/2023]
Abstract
The challenge of large-scale functional genomics projects is to build a comprehensive map of the cell including genome sequence and gene expression data, information on protein localization, structure, function and expression, post-translational modifications, molecular and genetic interactions and phenotypic descriptions. Some of this broad set of functional genomics data has been already assembled for the budding yeast. Even though molecular cartography of the yeast cell is still far from comprehensive, functional genomics has begun to forge connections between disparate cellular events and to foster numerous hypotheses. Here we review several different genomics and proteomics technologies and describe bioinformatics methods for exploring these data to make new discoveries.
Affiliation(s)
- Gary D Bader
- Computational Biology Center, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, Box 460, 10021, New York, NY, USA
16
Abstract
Genome sequencing projects have provided a wealth of data, most notably the primary sequences of all the proteins that a given organism can produce. The understanding of this information at the functional level is still in the beginning stages. Three-dimensional structural information is necessary to unravel at the atomic level the mechanisms by which a protein carries out its function, and such information can often be very useful to predict at least gross functional features, even in the absence of biochemical data. An exhaustive structural characterization of the proteins encoded in the genomes is thus highly desirable. To enhance the functional insights provided by genome-scale structural determination, we have prioritized our research to target specific processes of the cell, i.e., those responsible for controlling metal homeostasis. In this Account, we present the results obtained by the Magnetic Resonance Center of the University of Florence on proteins involved in the homeostasis of copper. The general research strategy is presented, followed by a discussion focused on different key experimental aspects. An overview of the initial results and of their relevance to the understanding of molecular function and cellular processes is also given.
Affiliation(s)
- Lucia Banci
- Magnetic Resonance Center, University of Florence, Via Luigi Sacconi 6, 50019, Sesto Fiorentino, Italy.
17
Abstract
The rapid growth of bio-sequence information has resulted in an increasing demand for reliable methods that group proteins. A few databases with curated alignments of protein families have demonstrated that expert-driven repositories can keep up with the data deluge in the genome era. These original resources implicitly identify domain-like modules in proteins. An increasing number of automatic methods have sprouted over the past few years that cluster the protein universe. Many of these implicitly dissect proteins into structural domain-like fragments. In a very coarse-grained evaluation, some of the automatic methods appear to be on par with expert-driven approaches. However, neither automatic nor manual methods are currently entirely up to the challenges of tasks such as target selection in structural genomics. Thus, we urgently need refined and sustained automatic clustering tools.
Affiliation(s)
- Jinfeng Liu
- CUBIC and North East Structural Genomics Consortium, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
18
Lan N, Montelione GT, Gerstein M. Ontologies for proteomics: towards a systematic definition of structure and function that scales to the genome level. Curr Opin Chem Biol 2003; 7:44-54. [PMID: 12547426 DOI: 10.1016/s1367-5931(02)00020-0] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.1] [Indexed: 11/20/2022]
Abstract
A principal aim of post-genomic biology is elucidating the structures, functions and biochemical properties of all gene products in a genome. However, to adequately comprehend such a large amount of information we need new descriptions of proteins that scale to the genomic level. In short, we need a unified ontology for proteomics. Much progress has been made towards this end, including a variety of approaches to systematic structural and functional classification and initial work towards developing standardized, unified descriptions for protein properties. In relation to function, there is a particularly great diversity of approaches, involving placing a protein in structured hierarchies or more-generalized networks and a recent approach based on circumscribing a protein's function through systematic enumeration of molecular interactions.
Affiliation(s)
- Ning Lan
- Department of Molecular Biophysics, New Haven, CT 06520, USA.
19
Hung LH, Samudrala R. Accurate and automated classification of protein secondary structure with PsiCSI. Protein Sci 2003; 12:288-95. [PMID: 12538892 PMCID: PMC2312422 DOI: 10.1110/ps.0222303] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.0] [Indexed: 10/27/2022]
Abstract
PsiCSI is a highly accurate and automated method of assigning secondary structure from NMR data, which is a useful intermediate step in the determination of tertiary structures. The method combines information from chemical shifts and protein sequence using three layers of neural networks. Training and testing were performed on a suite of 92 proteins (9437 residues) with known secondary and tertiary structure. Using a stringent cross-validation procedure in which the target and homologous proteins were removed from the databases used for training the neural networks, an average 89% Q3 accuracy (per residue) was observed. This is an increase of 6.2% and 5.5% (representing 36% and 33% fewer errors) over methods that use chemical shifts (CSI) or sequence information (Psipred) alone. In addition, PsiCSI improves upon the translation of chemical shift information to secondary structure (Q3 = 87.4%) and is able to use sequence information as an effective substitute for sparse NMR data (Q3 = 86.9% without ¹³C shifts and Q3 = 86.8% with only Hα shifts available). Finally, errors made by PsiCSI almost exclusively involve the interchange of helix or strand with coil and not helix with strand (<2.5 occurrences per 10000 residues). The automation, increased accuracy, absence of gross errors, and robustness with regard to sparse data make PsiCSI ideal for high-throughput applications, and should improve the effectiveness of hybrid NMR/de novo structure determination methods. A Web server is available for users to submit data and have the assignment returned.
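The "fewer errors" figures in this abstract follow from the Q3 gains if those gains are read as absolute percentage points, which is an assumption of this sketch. Q3 is the percentage of residues whose three-state label (helix, strand or coil) is assigned correctly. The function names `q3` and `relative_error_reduction` below are hypothetical, and the example label strings are toy data.

```python
def q3(predicted, observed):
    """Per-residue three-state accuracy in percent (H = helix, E = strand, C = coil)."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have the same length")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(observed)


def relative_error_reduction(new_accuracy, gain):
    """Percentage of the baseline errors removed by a gain in accuracy.

    new_accuracy and gain are in percentage points; the baseline accuracy
    is taken to be new_accuracy - gain.
    """
    baseline_error = 100.0 - (new_accuracy - gain)
    return 100.0 * gain / baseline_error


print(q3("HHHEEC", "HHHECC"))
print(round(relative_error_reduction(89.0, 6.2)))  # gain over CSI
print(round(relative_error_reduction(89.0, 5.5)))  # gain over Psipred
```

Working backwards, a 6.2-point gain to 89% implies a baseline of 82.8% and a baseline error of 17.2%, of which 6.2 points is about 36%; the 5.5-point gain gives about 33% in the same way.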
Affiliation(s)
- Ling-Hong Hung
- Computational Genomics, Department of Microbiology, University of Washington, Seattle 98109, USA
|
20
|
Abstract
PEP is a database of Predictions for Entire Proteomes. The database contains summaries of analyses of protein sequences from a range of organisms representing all three major kingdoms of life: eukaryotes, prokaryotes and archaea. All proteins publicly available for organisms were aligned against SWISS-PROT, TrEMBL and PDB. Additionally, the following annotations are provided: secondary structure, transmembrane helices, coiled coils, regions of low complexity, signal peptides, PROSITE motifs, nuclear localization signals and classes of cellular function. Proteins that contain long regions without regular secondary structure are also identified. We have produced a related database of structural domain-like fragments derived from PEP and clusters based on homology between all fragments. The PEP database, fragments and clusters are distributed freely as a set of flat files and have been integrated into SRS. The PEP group of databases can be accessed from: http://cubic.bioc.columbia.edu/pep.
Affiliation(s)
- Phil Carter
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
|
21
|
Betz SF, Baxter SM, Fetrow JS. Function first: a powerful approach to post-genomic drug discovery. Drug Discov Today 2002; 7:865-71. [PMID: 12546953 DOI: 10.1016/s1359-6446(02)02398-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Abstract
In the post-genomic era, pharmaceutical researchers must evaluate vast numbers of protein sequences and formulate novel, intelligent strategies for identifying valid targets and discovering leads against them. The identification of small molecules that selectively target proteins or protein families will be aided by knowing the function and/or the structure of the target(s). By identifying protein function first, efficiencies are gained that allow subsequent focus of resources on particular protein families of interest. This article reviews current proteomic-scale approaches to identifying function as a way of accelerating lead discovery.
Affiliation(s)
- Stephen F Betz
- GeneFormatics, 5830 Oberlin Drive, Suite 200, San Diego, CA 92121, USA
|
22
|
Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, Piccirillo S, Umansky L, Drawid A, Jansen R, Liu Y, Cheung KH, Miller P, Gerstein M, Roeder GS, Snyder M. Subcellular localization of the yeast proteome. Genes Dev 2002; 16:707-19. [PMID: 11914276 PMCID: PMC155358 DOI: 10.1101/gad.970902] [Citation(s) in RCA: 558] [Impact Index Per Article: 25.4] [Indexed: 11/25/2022]
Abstract
Protein localization data are a valuable information resource helpful in elucidating eukaryotic protein function. Here, we report the first proteome-scale analysis of protein localization within any eukaryote. Using directed topoisomerase I-mediated cloning strategies and genome-wide transposon mutagenesis, we have epitope-tagged 60% of the Saccharomyces cerevisiae proteome. By high-throughput immunolocalization of tagged gene products, we have determined the subcellular localization of 2744 yeast proteins. Extrapolating these data through a computational algorithm employing Bayesian formalism, we define the yeast localizome (the subcellular distribution of all 6100 yeast proteins). We estimate the yeast proteome to encompass approximately 5100 soluble proteins and >1000 transmembrane proteins. Our results indicate that 47% of yeast proteins are cytoplasmic, 13% mitochondrial, 13% exocytic (including proteins of the endoplasmic reticulum and secretory vesicles), and 27% nuclear/nucleolar. A subset of nuclear proteins was further analyzed by immunolocalization using surface-spread preparations of meiotic chromosomes. Of these proteins, 38% were found associated with chromosomal DNA. As determined from phenotypic analyses of nuclear proteins, 34% are essential for spore viability, a percentage nearly twice as great as that observed for the proteome as a whole. In total, this study presents experimentally derived localization data for 955 proteins of previously unknown function: nearly half of all functionally uncharacterized proteins in yeast. To facilitate access to these data, we provide a searchable database featuring 2900 fluorescent micrographs at http://ygac.med.yale.edu.
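The compartment percentages above can be converted into approximate protein counts for the 6100-protein localizome. The following is a back-of-the-envelope sketch, not the authors' Bayesian algorithm: the variable names are hypothetical, and the counts are simple products of the quoted percentages and the quoted proteome size.

```python
TOTAL_PROTEINS = 6100        # proteome size quoted in the abstract
LOCALIZED_DIRECTLY = 2744    # proteins localized by immunolocalization

# Compartment shares quoted in the abstract, as whole percentages.
percent_by_compartment = {
    "cytoplasmic": 47,
    "mitochondrial": 13,
    "exocytic": 13,
    "nuclear/nucleolar": 27,
}

# Approximate number of proteins per compartment (our arithmetic,
# not figures reported by the study).
counts = {
    name: round(TOTAL_PROTEINS * pct / 100)
    for name, pct in percent_by_compartment.items()
}

# Share of the proteome localized experimentally before extrapolation.
experimental_share = LOCALIZED_DIRECTLY / TOTAL_PROTEINS

for name, n in counts.items():
    print(f"{name}: ~{n} proteins")
print(f"experimentally localized: {experimental_share:.0%} of the proteome")
```

The four shares sum to 100%, so the derived counts account for the full 6100 proteins; about 45% of the proteome was localized experimentally, with the remainder assigned by extrapolation.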
Affiliation(s)
- Anuj Kumar
- Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA
|