1
|
Abstract
MOTIVATION: Analysis of large biological data sets using a variety of parallel processor computer architectures is a common task in bioinformatics. The efficiency of the analysis can be significantly improved by properly handling the redundancy present in these data and by taking advantage of the unique features of these compute architectures. RESULTS: We describe a generalized approach to this analysis, but present specific results using the program CEPAR, an efficient implementation of the Combinatorial Extension algorithm in a massively parallel (PAR) mode for finding pairwise protein structure similarities and aligning protein structures from the Protein Data Bank. The design and implementation of CEPAR are described, and results are provided for the efficiency of the algorithm when run on a large number of processors. AVAILABILITY: Source code is available by contacting one of the authors.
Affiliation(s)
- D Pekurovsky
- San Diego Supercomputer Center, University of California San Diego, La Jolla 92093, USA
|
2
|
Bourne PE, Allerston CKJ, Krebs W, Li W, Shindyalov IN, Godzik A, Friedberg I, Liu T, Wild D, Hwang S, Ghahramani Z, Chen L, Westbrook J. The status of structural genomics defined through the analysis of current targets and structures. Pac Symp Biocomput 2004:375-86. [PMID: 14992518 DOI: 10.1142/9789812704856_0036]
Abstract
Structural genomics--large-scale macromolecular 3-dimensional structure determination--is unique in that major participants report scientific progress on a weekly basis. The target database (TargetDB) maintained by the Protein Data Bank (http://targetdb.pdb.org) reports this progress through the status of each protein sequence (target) under consideration by the major structural genomics centers worldwide. Hence, TargetDB provides a unique opportunity to analyze the potential impact that this major initiative provides to scientists interested in the sequence-structure-function-disease paradigm. Here we report such an analysis with a focus on: (i) temporal characteristics--how is the project doing and what can we expect in the future? (ii) target characteristics--what are the predicted functions of the proteins targeted by structural genomics and how biased is the target set when compared to the PDB and to predictions across complete genomes? (iii) structures solved--what are the characteristics of structures solved thus far and what do they contribute? The analysis required a more extensive database of structure predictions using different methods integrated with data from other sources. This database, associated tools and related data sources are available from http://spam.sdsc.edu.
Affiliation(s)
- P E Bourne
- The San Diego Supercomputer Center, The University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
|
3
|
Guda C, Scheeff ED, Bourne PE, Shindyalov IN. A new algorithm for the alignment of multiple protein structures using Monte Carlo optimization. Pac Symp Biocomput 2001:275-86. [PMID: 11262947 DOI: 10.1142/9789814447362_0028]
Abstract
We have developed a new algorithm for the alignment of multiple protein structures based on a Monte Carlo optimization technique. The algorithm uses pair-wise structural alignments as a starting point. Four different types of moves were designed to generate random changes in the alignment. A distance-based score is calculated for each trial move, and moves are accepted or rejected based on the improvement in the alignment score until the alignment has converged. Initial tests on 66 protein structural families show promising results: the score increases by 69% on average. The increase in score is accompanied by an increase (12%) in the number of residue positions incorporated into the alignment. Two specific families, protein kinases and aspartic proteinases, were tested and compared against curated alignments from HOMSTRAD and manual alignments. This algorithm has improved the overall number of aligned residues while preserving key catalytic residues. Further refinement of the method and its application to generate multiple alignments for all protein families in the PDB are currently in progress.
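The accept/reject loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the four move types and the distance-based score are not given here, so `toy_score` and `toy_move` are hypothetical stand-ins, the stopping rule (`patience` consecutive rejections) is an assumption, and only strictly improving moves are accepted.

```python
import random


def monte_carlo_optimize(state, score, propose, max_trials=10_000, patience=500, seed=0):
    """Try random moves and keep a move only when it improves the score.

    Stops after `patience` consecutive rejected moves (used here as a simple
    convergence test, an assumption) or after `max_trials` trial moves.
    """
    rng = random.Random(seed)
    best = score(state)
    rejected = 0
    for _ in range(max_trials):
        candidate = propose(state, rng)
        candidate_score = score(candidate)
        if candidate_score > best:
            state, best = candidate, candidate_score
            rejected = 0
        else:
            rejected += 1
            if rejected >= patience:
                break
    return state, best


# Hypothetical toy problem: one integer offset per structure; the score is
# highest (zero) when all offsets agree.
def toy_score(offsets):
    mean = sum(offsets) / len(offsets)
    return -sum((o - mean) ** 2 for o in offsets)


def toy_move(offsets, rng):
    """Shift one randomly chosen offset by +1 or -1, returning a new list."""
    new = list(offsets)
    i = rng.randrange(len(new))
    new[i] += rng.choice((-1, 1))
    return new
```

Calling `monte_carlo_optimize([0, 5, -3, 2], toy_score, toy_move)` returns the final offsets and their score. A real alignment optimizer would replace the toy state, score, and move functions with alignment-specific ones.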
Affiliation(s)
- C Guda
- San Diego Supercomputer Center, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
|
4
|
Abstract
An all-against-all protein structure comparison using the Combinatorial Extension (CE) algorithm applied to a representative set of PDB structures revealed a gallery of common substructures in proteins (http://cl.sdsc.edu/ce.html). These substructures represent commonly identified folds, domains, or components thereof. Most of the subsequences forming these similar substructures have no significant sequence similarity. We present a method to identify conserved amino acid positions and residue-dependent property clusters within these subsequences starting with structure alignments. Each of the subsequences is aligned to its homologues in SWALL, a nonredundant protein sequence database. The most similar sequences are purged into a common frequency matrix, and weighted homologues of each one of the subsequences are used in scoring for conserved key amino acid positions (CKAAPs). We have set the top 20% of the high-scoring positions in each substructure to be CKAAPs. It is hypothesized that CKAAPs may be responsible for the common folding patterns in either a local or global view of the protein-folding pathway. Where a significant number of structures exist, CKAAPs have also been identified in structure alignments of complete polypeptide chains from the same protein family or superfamily. Evidence to support the presence of CKAAPs comes from other computational approaches and experimental studies of mutation and protein-folding experiments, notably the Paracelsus challenge. Finally, the structural environment of CKAAPs versus non-CKAAPs is examined for solvent accessibility, hydrogen bonding, and secondary structure. The identification of CKAAPs has important implications for protein engineering, fold recognition, modeling, and structure prediction studies and is dependent on the availability of structures and an accurate structure alignment methodology. Proteins 2001;42:148-163.
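The selection step, taking the top 20% of high-scoring positions as CKAAPs, can be sketched as follows. The per-position scores are assumed to be already computed, the function name is hypothetical, and rounding the cut-off up is an assumption not stated in the abstract.

```python
import math


def top_fraction_positions(scores, fraction=0.20):
    """Return the indices of the highest-scoring positions, in sequence order.

    `scores` holds one conservation score per aligned position. The number of
    positions kept is rounded up (an assumption), with a minimum of one.
    """
    if not scores:
        return []
    n_keep = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

For ten positions with scores `[0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.4, 0.6, 0.7, 0.15]`, the function keeps two positions, indices 1 and 3.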
Affiliation(s)
- B V Reddy
- San Diego Supercomputer Center, University of California, San Diego, La Jolla, California 92093-0505, USA
|
5
|
Shindyalov IN, Bourne PE. A database and tools for 3-D protein structure comparison and alignment using the Combinatorial Extension (CE) algorithm. Nucleic Acids Res 2001; 29:228-9. [PMID: 11125099 PMCID: PMC29823 DOI: 10.1093/nar/29.1.228]
Abstract
The database reported here is derived using the Combinatorial Extension (CE) algorithm which compares pairs of protein polypeptide chains and provides a list of structurally similar proteins along with their structure alignments. Using CE, structure-structure alignments can provide insights into biological function. When a protein of known function is shown to be structurally similar to a protein of unknown function, a relationship might be inferred; a relationship not necessarily detectable from sequence comparison alone. Establishing structure-structure relationships in this way is of great importance as we enter an era of structural genomics where there is a likelihood of an increasing number of structures with unknown functions being determined. Thus the CE database is an example of a useful tool in the annotation of protein structures of unknown function. Comparisons can be performed on the complete PDB or on a structurally representative subset of proteins. The source protein(s) can be from the PDB (updated monthly) or uploaded by the user. CE provides sequence alignments resulting from structural alignments and Cartesian coordinates for the aligned structures, which may be analyzed using the supplied Compare3D Java applet, or downloaded for further local analysis. Searches can be run from the CE web site, http://cl.sdsc.edu/ce.html, or the database and software downloaded from the site for local use.
Affiliation(s)
- I N Shindyalov
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
|
6
|
Abstract
The Conserved Key Amino Acid Positions DataBase (CKAAPs DB) provides access to an analysis of structurally similar proteins with dissimilar sequences where key residues within a common fold are identified. The derivation and significance of CKAAPs starting from pairwise structure alignments is described fully in Reddy et al. [Reddy,B.V.B., Li,W.W., Shindyalov,I.N. and Bourne,P.E. (2000) PROTEINS:, in press]. The CKAAPs identified from this theoretical analysis are provided to experimentalists and theoreticians for potential use in protein engineering and modeling. It has been suggested that CKAAPs may be crucial features for protein folding, structural stability and function. Over 170 substructures, as defined by the Combinatorial Extension (CE) database, which are found in approximately 3000 representative polypeptide chains have been analyzed and are available in the CKAAPs DB. CKAAPs DB also provides CKAAPs of the representative set of proteins derived from the CE and FSSP databases. Thus the database contains over 5000 representative polypeptide chains, covering all known structures in the PDB. A web interface to a relational database permits fast retrieval of structure-sequence alignments, CKAAPs and associated statistics. Users may query by PDB ID, protein name, function and Enzyme Classification number. Users may also submit protein alignments of their own to obtain CKAAPs. An interface to display CKAAPs on each structure from a web browser is also being implemented. CKAAPs DB is maintained by the San Diego Supercomputer Center and accessible at the URL http://ckaaps.sdsc.edu.
Affiliation(s)
- W W Li
- San Diego Supercomputer Center and Department of Pharmacology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA
|
7
|
Shindyalov IN, Bourne PE. An alternative view of protein fold space. Proteins 2000; 38:247-60. [PMID: 10713986]
Abstract
Comparing and subsequently classifying protein structure information has received significant attention concurrent with the increase in the number of experimentally derived 3-dimensional structures. Classification schemes have focused on biological function found within protein domains and on structure classification based on topology. Here an alternative view is presented that groups substructures. Substructures are long (50-150 residue) highly repetitive near-contiguous pieces of polypeptide chain that occur frequently in a set of proteins from the PDB defined as structurally non-redundant over the complete polypeptide chain. The substructure classification is based on a previously reported Combinatorial Extension (CE) algorithm that provides a significantly different set of structure alignments than those previously described, having, for example, only a 40% overlap with FSSP. Qualitatively the algorithm provides longer contiguous aligned segments at the price of a slightly higher root-mean-square deviation (rmsd). Clustering these alignments gives a discrete and highly repetitive set of substructures not detectable by sequence similarity alone. In some cases different substructures represent all or different parts of well-known folds indicative of the Russian doll effect--the continuity of protein fold space. In other cases they fall into different structure and functional classifications. It is too early to determine whether these newly classified substructures represent new insights into the evolution of a structural framework important to many proteins. What is apparent from on-going work is that these substructures have the potential to be useful probes in finding remote sequence homology and in structure prediction studies. The characteristics of the complete all-by-all comparison of the polypeptide chains present in the PDB and details of the filtering procedure by pair-wise structure alignment that led to the emergent substructure gallery are discussed.
Substructure classification, alignments, and tools to analyze them are available at http://cl.sdsc.edu/ce.html.
|
8
|
|
9
|
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000; 28:235-42. [PMID: 10592235 PMCID: PMC102472 DOI: 10.1093/nar/28.1.235] [Received: 09/20/1999] [Revised: 10/17/1999] [Accepted: 10/17/1999]
Abstract
The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.
Affiliation(s)
- H M Berman
- Research Collaboratory for Structural Bioinformatics (RCSB), Rutgers University, Piscataway, NJ 08854-8087, USA.
|
10
|
Tsigelny I, Shindyalov IN, Bourne PE, Südhof TC, Taylor P. Common EF-hand motifs in cholinesterases and neuroligins suggest a role for Ca2+ binding in cell surface associations. Protein Sci 2000; 9:180-5. [PMID: 10739260 PMCID: PMC2144444 DOI: 10.1110/ps.9.1.180]
Abstract
Comparisons of protein sequence via cyclic training of Hidden Markov Models (HMMs) in conjunction with alignments of three-dimensional structure, using the Combinatorial Extension (CE) algorithm, reveal two putative EF-hand metal binding domains in acetylcholinesterase. Based on sequence similarity, putative EF-hands are also predicted for the neuroligin family of cell surface proteins. These predictions are supported by experimental evidence. In the acetylcholinesterase crystal structure from Torpedo californica, the first putative EF-hand region binds the Zn2+ found in the heavy metal replacement structure. Further, the interaction of neuroligin 1 with its cognate receptor neurexin depends on Ca2+. Thus, members of the alpha,beta hydrolase fold family of proteins contain potential Ca2+ binding sites, which in some family members may be critical for heterologous cell associations.
Affiliation(s)
- I Tsigelny
- Department of Pharmacology, University of California, San Diego, La Jolla 92093-0654, USA.
|
11
|
Weissig H, Shindyalov IN, Bourne PE. Macromolecular structure databases: past progress and future challenges. Acta Crystallogr D Biol Crystallogr 1998; 54:1085-94. [PMID: 10089484 DOI: 10.1107/s0907444998009846]
Abstract
Databases containing macromolecular structure data provide a crystallographer with important tools for use in solving, refining and understanding the functional significance of their protein structures. Given this importance, this paper briefly summarizes past progress by outlining the features of the significant number of relevant databases developed to date. One recent database, PDB+, containing all current and obsolete structures deposited with the Protein Data Bank (PDB) is discussed in more detail. PDB+ has been used to analyze the self-consistency of the current (1 January 1998) corpus of over 7000 structures. A summary of those findings is presented (a full discussion will appear elsewhere) in the form of global and temporal trends within the data. These trends indicate that challenges exist if crystallographers are to provide the community with complete and consistent structural results in the future. It is argued that better information management practices are required to meet these challenges.
Affiliation(s)
- H Weissig
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
|
12
|
Abstract
A new algorithm is reported which builds an alignment between two protein structures. The algorithm involves a combinatorial extension (CE) of an alignment path defined by aligned fragment pairs (AFPs) rather than the more conventional techniques using dynamic programming and Monte Carlo optimization. AFPs, as the name suggests, are pairs of fragments, one from each protein, which confer structure similarity. AFPs are based on local geometry, rather than global features such as orientation of secondary structures and overall topology. Combinations of AFPs that represent possible continuous alignment paths are selectively extended or discarded thereby leading to a single optimal alignment. The algorithm is fast and accurate in finding an optimal structure alignment and hence suitable for database scanning and detailed analysis of large protein families. The method has been tested and compared with results from Dali and VAST using a representative sample of similar structures. Several new structural similarities not detected by these other methods are reported. Specific one-on-one alignments and searches against all structures as found in the Protein Data Bank (PDB) can be performed via the Web at http://cl.sdsc.edu/ce.html.
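To make the idea of aligned fragment pairs based on local geometry concrete, the sketch below enumerates candidate AFPs by comparing the internal distances of two equal-length fragments. It is a simplified illustration, not the published CE similarity measure or its path-extension step: the function names, the mean absolute distance difference criterion, and the default `frag_len` and `max_diff` values are all assumptions.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3-D points (e.g. C-alpha atoms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def find_afps(coords_a, coords_b, frag_len=8, max_diff=1.0):
    """Return (i, j) start indices of fragment pairs with similar local geometry.

    Two fragments are paired when the mean absolute difference between their
    internal distances is at most `max_diff`. No superposition is needed
    because only intra-fragment distances are compared.
    """
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    afps = []
    for i in range(len(coords_a) - frag_len + 1):
        for j in range(len(coords_b) - frag_len + 1):
            total, count = 0.0, 0
            for k in range(frag_len):
                for m in range(k + 1, frag_len):
                    total += abs(da[i + k][i + m] - db[j + k][j + m])
                    count += 1
            if total / count <= max_diff:
                afps.append((i, j))
    return afps
```

A full aligner would then chain compatible pairs from this list into a continuous alignment path; that combinatorial extension step is omitted here.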
|
13
|
Affiliation(s)
- C M Smith
- San Diego Supercomputer Center (SDSC), CA 92186, USA
|
14
|
Abstract
MOTIVATION: To provide data management tools to maintain and query efficiently experimental and derived protein data with the goal of providing new insights into structure-function relationships. The tools should be portable, extensible, and accessible locally, or via the World Wide Web, providing data that would not otherwise be available. RESULTS: The initial phase of the work, the data representation and query of all available macromolecular structure data, including real-time access to complex property patterns based on the amino acid sequence, is reported. Protein structure data taken from the Protein Data Bank (PDB) are decomposed into native and derived elementary properties, and represented as compact indexed objects, minimizing storage requirements and query time for select types of query. In addition, collections of indices representing a particular property are maintained and can be queried for specific property patterns found across the whole database. The approach is proving applicable to a wide variety of data available on specific protein families.
Affiliation(s)
- I N Shindyalov
- San Diego Supercomputer Center, CA 92186-9784, USA. shindyal,
|
15
|
Bourne PE, Shindyalov IN. A local macromolecular structure database for crystallographic laboratories. Acta Crystallogr A 1996. [DOI: 10.1107/s0108767396095967]
|
16
|
|
17
|
Kolchanov NA, Vishnevsky OV, Babenko VN, Kel AE, Shindyalov IN. Identification of cDNA sequences by specific oligonucleotide sets. Computer tool and application. Proc Int Conf Intell Syst Mol Biol 1995; 3:206-214. [PMID: 7584438]
Abstract
A computer tool has been developed for revealing sets of oligonucleotides invariant for isofunctional families of DNA (RNA) and for using these in functional identification of nucleotide sequences. The tool allows one to: build up vocabularies of invariant oligonucleotides for the families of isofunctional nucleotide sequences; assess significance of the vocabularies; identify nucleotide sequences with the vocabularies of invariant oligonucleotides; determine the most effective identification parameters to minimize first and second type errors; assess the efficiency of identification of individual isofunctional families with the oligonucleotide vocabularies; determine the evolutionary characteristics of the families of isofunctional sequences on which vocabulary volume depends. Based on the system mentioned, we have analyzed a total of 322 protein-encoding gene families and have built up sets of invariant oligonucleotides, or again, oligonucleotide vocabularies that are characteristic of gene families and subfamilies. Identification of nucleotide sequences belonging to these families with the sets of invariant oligonucleotides revealed has been shown. Under the most effective identification parameters, the first type error (false negative) on control (independent) data was 10-15%, the second type error (false positive) was just 1-2 redundant sequences per sequence being examined. As has been shown, the volume of a vocabulary of invariant oligonucleotides depends on the percentage of variable positions in the multiple alignment within a family.
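The notion of a vocabulary of invariant oligonucleotides can be sketched in its strictest form: the k-mers present in every sequence of a family. The tool described here also assesses significance and tunes identification parameters, none of which is reproduced; the function names, the strict-intersection rule, and the `min_hits` threshold are assumptions for illustration.

```python
def kmers(seq, k):
    """Set of all overlapping length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def invariant_oligos(family, k):
    """Oligonucleotides of length k that occur in every sequence of the family."""
    if not family:
        return set()
    common = kmers(family[0], k)
    for seq in family[1:]:
        common &= kmers(seq, k)
    return common


def identify(seq, vocabulary, k, min_hits=1):
    """Assign a sequence to the family if it contains enough vocabulary oligos."""
    return len(kmers(seq, k) & vocabulary) >= min_hits
```

For the toy family `["ACGTACGT", "TTACGTAA", "GGACGTCC"]` with `k=4`, the only invariant oligonucleotide is `ACGT`, so `identify("CCACGTGG", {"ACGT"}, 4)` accepts the sequence. Raising `min_hits` trades false positives for false negatives, which is the kind of parameter choice the abstract refers to.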
Affiliation(s)
- N A Kolchanov
- Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, Russia
|
18
|
Abstract
PDBlib is an extensible object-oriented class library written in C++ for representing the three-dimensional structure of biological macromolecules. The software design strategy, features of many of the 129 classes currently distributed with the library, and two sample applications which use the library are described. Version 1.0 of the library represents the structural features of proteins, DNA, RNA and complexes thereof, at a level of detail on a par with that which can be parsed from a Protein Data Bank (PDB) entry. However, the memory-resident representation of the macromolecule is independent of the PDB entry and can be obtained from other sources, e.g. relational and object-oriented databases. PDBlib classes are organized into four categories: (i) classes that model the macromolecule; (ii) classes that enhance the extensibility of the library; (iii) classes that provide navigation facilities of the object-oriented macromolecular structure representation; and (iv) a class that loads a PDB file into the memory-resident object-oriented representation. A number of general-purpose procedures that return features of this representation and that are relevant to all biological disciplines are included in (i). The library has been used to develop PDBtool, a prototype structure verification tool, and PDBview, a structure rendering tool that requires no specialized graphics hardware and software. Current work centers on making the macromolecular structures represented by PDBlib persistent using a commercial object-oriented database and providing an additional class library, MMQLlib, to query those structures.
Affiliation(s)
- W Chang
- Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University, New York, NY 10032, USA
|
19
|
Abstract
A combinatorial sequence space (CSS) model was introduced to represent sequences as a set of overlapping k-tuples of some fixed length which correspond to points in the CSS. The aim was to analyze clusterization of protein sequences in the CSS and to test various hypotheses about the possible evolutionary basis of this clusterization. The authors developed an easy-to-use technique which can reveal and analyze such a clusterization in a multidimensional CSS. Application of the technique led to an unexpectedly high clusterization of points in the CSS corresponding to k-tuples from known proteins. The clusterization could not be inferred from nonuniform amino acid frequencies or be explained by the influence of homologous data. None of the tested possible evolutionary and structural factors could explain the clusterization observed either. It looked as if certain protein sequence variations occurred and were fixed in the early course of evolution. Subsequent evolution (predominantly neutral) allowed only a limited number of changes and permitted new variants which led to preservation of certain k-tuples during the course of evolution. This was consistent with the theory of exon shuffling and protein block structure evolution. Possible applications of sequence space features found were also discussed.
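The representation of a sequence as a set of overlapping k-tuples can be shown in a few lines. This sketch covers only the decomposition and a simple occurrence count; the clusterization analysis itself is not reproduced, and the function names are hypothetical.

```python
from collections import Counter


def overlapping_k_tuples(seq, k):
    """All overlapping length-k windows of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def k_tuple_counts(sequences, k):
    """Count how often each k-tuple (a point in the sequence space) occurs."""
    counts = Counter()
    for seq in sequences:
        counts.update(overlapping_k_tuples(seq, k))
    return counts
```

For example, `overlapping_k_tuples("MKVLA", 3)` yields `["MKV", "KVL", "VLA"]`. With 20 amino acids there are 20**k possible k-tuples, so comparing the observed counts against that space is one way to look for non-uniform occupancy.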
Affiliation(s)
- V B Strelets
- Supercomputer Computations Research Institute, Florida State University, Tallahassee 32306-4052
|
20
|
Abstract
Macromolecular query language (MMQL) is an extensible interpretive language in which to pose questions concerning the experimental or derived features of the 3-D structure of biological macromolecules. MMQL portends to be intuitive with a simple syntax, so that from a user's perspective complex queries are easily written. A number of basic queries and a more complex query--determination of structures containing a five-strand Greek key motif--are presented to illustrate the strengths and weaknesses of the language. The predominant features of MMQL are a filter and pattern grammar which are combined to express a wide range of interesting biological queries. Filters permit the selection of object attributes, for example, compound name and resolution, whereas the patterns currently implemented query primary sequence, close contacts, hydrogen bonding, secondary structure, conformation and amino acid properties (volume, polarity, isoelectric point, hydrophobicity and different forms of exposure). MMQL queries are processed by MMQLlib, a C++ class library, to which new query methods and pattern types are easily added. The prototype implementation described uses PDBlib, another C++-based class library for representing the features of biological macromolecules at the level of detail parsable from a PDB file. Since PDBlib can represent data stored in relational and object-oriented databases, as well as PDB files, once these data are loaded they too can be queried by MMQL. Performance metrics are given for queries of PDB files for which all derived data are calculated at run time and compared to a preliminary version of OOPDB, a prototype object-oriented database with a schema based on a persistent version of PDBlib which offers more efficient data access and the potential to maintain derived information. MMQLlib, PDBlib and associated software are available via anonymous ftp from cuhhca.hhmi.columbia.edu.
Affiliation(s)
- I N Shindyalov
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032
|
21
|
Shindyalov IN, Kolchanov NA, Sander C. Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 1994; 7:349-58. [PMID: 8177884 DOI: 10.1093/protein/7.3.349]
Abstract
A method has been developed to detect pairs of positions with correlated mutations in protein multiple sequence alignments. The method is based on reconstruction of the phylogenetic tree for a set of sequences and statistical analysis of the distribution of mutations in the branches of the tree. The database of homology-derived protein structures (HSSP) is used as the source of multiple sequence alignments for proteins of known three-dimensional structure. We analyse pairs of positions with correlated mutations in 67 protein families and show quantitatively that the presence of such positions is a typical feature of protein families. A significant but weak tendency is observed for correlated residue pairs to be close in the three-dimensional structure. With further improvements, methods of this type may be useful for the prediction of residue--residue contacts and subsequent prediction of protein structure using distance geometry algorithms. In conclusion, we suggest a new experimental approach to protein structure determination in which selection of functional mutants after random mutagenesis and analysis of correlated mutations provide sufficient proximity constraints for calculation of the protein fold.
Affiliation(s)
- I N Shindyalov
- Institute of Cytology and Genetics, Russian Academy of Sciences, Siberian Department, Novosibirsk
|
22
|
Streletc VB, Shindyalov IN, Kolchanov NA, Milanesi L. Fast, statistically based alignment of amino acid sequences on the base of diagonal fragments of DOT-matrices. Comput Appl Biosci 1992; 8:529-34. [PMID: 1468007 DOI: 10.1093/bioinformatics/8.6.529]
Abstract
We present a new pairwise alignment algorithm that uses iterative statistical analysis of homologous subsequences. Apart from the classical conversion of the DOT-matrix characteristic of the Needleman-Wunsch algorithm (NW), we used only those matrix elements that corresponded to the most non-random subsequence homologies. The most reliable elements of the DOT-matrix are written to the compact competition matrices. The algorithm then searches for an alignment on the basis of only these matrix elements. Our algorithm has low storage and memory requirements, but provides a reliable alignment for sequences of weak homology (or at least for the homology regions). In such cases classical NW algorithms often produce unreliable results on the level of statistical noise due to accumulation of random matchings throughout the aligned sequences.
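The idea of working only with diagonal fragments of the dot matrix can be illustrated with a much simpler stand-in: collecting runs of identical residues along each diagonal. The statistical selection of non-random homologies and the competition matrices described in the abstract are not reproduced; the identity criterion, the `min_len` cut-off, and the function name are assumptions.

```python
def diagonal_fragments(seq_a, seq_b, min_len=3):
    """Return (start_a, start_b, length) for runs of identical residues
    along the diagonals of the dot matrix of seq_a versus seq_b."""
    frags = []
    n, m = len(seq_a), len(seq_b)
    # Each diagonal is identified by its offset d = i - j.
    for d in range(-(m - 1), n):
        i, j = (d, 0) if d >= 0 else (0, -d)
        run = 0
        while i < n and j < m:
            if seq_a[i] == seq_b[j]:
                run += 1
            else:
                if run >= min_len:
                    frags.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A run may extend to the end of the diagonal.
        if run >= min_len:
            frags.append((i - run, j - run, run))
    return frags
```

Only the fragments that pass the cut-off are stored, which is what keeps the memory footprint small compared with filling the full dynamic programming matrix.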
Affiliation(s)
- V B Streletc
- Institute of Cytology and Genetics, Siberian Department of Russian Academy of Sciences, Novosibirsk
|
23
|
Kolchanov NA, Shindyalov IN. Single amino acid substitutions producing instability of globular proteins. Calculation of their frequencies in the entire mutational spectra of the alpha- and beta-subunits of human hemoglobin. J Mol Evol 1988; 27:154-62. [PMID: 3137354 DOI: 10.1007/bf02138376]
Abstract
The frequencies of substitutions resulting in protein instability were calculated by a method estimating changes in stability produced by amino acid substitutions. The method takes into account the accessibility of an amino acid position to a solvent and changes in the specificity of amino acid interactions. When tested on human mutant hemoglobins, the method yielded predictions with a preciseness of 80%. Including evolutionarily homologous proteins in the analysis allowed us to estimate the evolutionary constraints imposed on the stability of their spatial structure. With these limitations, approximately 50% of amino acid substitutions in the entire mutational spectra of the alpha- and beta-subunits of human hemoglobin were found to damage the spatial structure of the globular proteins.
Affiliation(s)
- N A Kolchanov
- Laboratory of Populational Genetics, USSR Academy of Sciences, Novosibirsk
|
24
|
Shindyalov IN, Kolchanov NA. Analysis of the factors and implications of an empirical method for estimating the stability of mutant human haemoglobins. J Theor Biol 1985; 117:19-46. [PMID: 3935879 DOI: 10.1016/s0022-5193(85)80163-6]
Abstract
An empirical method for estimating the effects of single amino acid substitutions on the structural stability of proteins with known spatial structure is developed. Twenty physical and chemical properties of amino acids and characteristics of protein tertiary structure were analysed to determine those most involved in producing instability. We employed data on 330 mutant variants of the alpha- and beta-subunits of human haemoglobin to choose the parameters of the method, which yielded a prediction accuracy of 81% for stability estimates of human mutant haemoglobins.
|