1
Dehghani T, Naghibzadeh M, Sadri J. Enhancement of Protein β-Sheet Topology Prediction Using Maximum Weight Disjoint Path Cover. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1936-1947. [PMID: 29994539 DOI: 10.1109/tcbb.2018.2837753] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 06/08/2023]
Abstract
Predicting β-sheet topology (β-topology) is one of the most critical intermediate steps towards protein structure and function prediction. The β-topology prediction problem is defined as the determination of the optimal arrangement of β-strand interactions within protein β-sheets. Significant efforts have been made to predict β-topologies. However, due to the inaccurate determination of interactions among β-strands and the huge topological space of proteins with a large number of β-strands, more efficient methods are required to improve both the accuracy and speed of β-topology prediction. In order to attain higher accuracy, the current paper introduces a bidirectional strand-strand interaction graph and considers all possible orientations (parallel and antiparallel) and orders of β-strand pairwise interactions. For the first time, the β-topology prediction is transformed into a maximum weight disjoint path cover solution by conserving all potential topologies. Moreover, to manage the computation time, a set of candidate β-sheets is generated and an optimization process is applied to select a subset of maximum score disjoint β-sheets as a predicted β-topology. The proposed method is comprehensively compared with state-of-the-art methods. The experimental results on the BetaSheet916 and BetaSheet1452 datasets reveal that the current study's approach enhances performance measurements as well as reduces the runtime.
2
Herndon N, Caragea D. A Study of Domain Adaptation Classifiers Derived From Logistic Regression for the Task of Splice Site Prediction. IEEE Trans Nanobioscience 2016; 15:75-83. [PMID: 26849871 DOI: 10.1109/tnb.2016.2522400] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 11/06/2022]
Abstract
Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use the abundant labeled data from a source domain together with the limited labeled data, and optionally the unlabeled data, from the target domain to train classifiers in a domain adaptation setting. We propose two such classifiers, based on logistic regression, and evaluate them for the task of splice site prediction, a difficult and essential step in gene prediction. Our classifiers achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
3
Subramani A, Floudas CA. β-sheet topology prediction with high precision and recall for β and mixed α/β proteins. PLoS One 2012; 7:e32461. [PMID: 22427840 PMCID: PMC3302896 DOI: 10.1371/journal.pone.0032461] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 11/16/2011] [Accepted: 01/26/2012] [Indexed: 11/19/2022] Open
Abstract
The prediction of the correct β-sheet topology for pure β and mixed α/β proteins is a critical intermediate step toward the three dimensional protein structure prediction. The predicted beta sheet topology provides distance constraints between sequentially separated residues, which reduces the three dimensional search space for a protein structure prediction algorithm. Here, we present a novel mixed integer linear optimization based framework for the prediction of β-sheet topology in β and mixed α/β proteins. The objective is to maximize the total strand-to-strand contact potential of the protein. A large number of physical constraints are applied to provide biologically meaningful topology results. The formulation permits the creation of a rank-ordered list of preferred β-sheet arrangements. Finally, the generated topologies are re-ranked using a fully atomistic approach involving torsion angle dynamics and clustering. For a large, non-redundant data set of 2102 β and mixed α/β proteins with at least 3 strands taken from the PDB, the proposed approach provides the top 5 solutions with average precision and recall greater than 78%. Consistent results are obtained in the β-sheet topology prediction for blind targets provided during the CASP8 and CASP9 experiments, as well as for actual and predicted secondary structures. The β-sheet topology prediction algorithm, BeST, is available to the scientific community at http://selene.princeton.edu/BeST/.
Affiliation(s)
- Christodoulos A. Floudas
- Department of Chemical and Biological Engineering, Princeton University, Princeton, New Jersey, United States of America
4
Aydin Z, Altunbasak Y, Erdogan H. Bayesian models and algorithms for protein β-sheet prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:395-409. [PMID: 21233522 DOI: 10.1109/tcbb.2008.140] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 05/30/2023]
Abstract
Prediction of the 3D structure greatly benefits from the information related to secondary structure, solvent accessibility, and nonlocal contacts that stabilize a protein's structure. We address the problem of β-sheet prediction defined as the prediction of β-strand pairings, interaction types (parallel or antiparallel), and β-residue interactions (or contact maps). We introduce a Bayesian approach for proteins with six or less β-strands in which we model the conformational features in a probabilistic framework by combining the amino acid pairing potentials with a priori knowledge of β-strand organizations. To select the optimum β-sheet architecture, we significantly reduce the search space by heuristics that enforce the amino acid pairs with strong interaction potentials. In addition, we find the optimum pairwise alignment between β-strands using dynamic programming in which we allow any number of gaps in an alignment to model β-bulges more effectively. For proteins with more than six β-strands, we first compute β-strand pairings using the BetaPro method. Then, we compute gapped alignments of the paired β-strands and choose the interaction types and β-residue pairings with maximum alignment scores. We performed a 10-fold cross-validation experiment on the BetaSheet916 set and obtained significant improvements in the prediction accuracy.
Affiliation(s)
- Zafer Aydin
- Department of Genome Sciences, University of Washington, Genome Sciences, Box 357456, 1705 NE Pacific St., Seattle, WA 98195-5065, USA.
5
Kumar A, Cowen L. Recognition of beta-structural motifs using hidden Markov models trained with simulated evolution. Bioinformatics 2010; 26:i287-93. [PMID: 20529918 PMCID: PMC2881384 DOI: 10.1093/bioinformatics/btq199] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/25/2022] Open
Abstract
Motivation: One of the most successful methods to date for recognizing protein sequences that are evolutionarily related has been profile hidden Markov models. However, these models do not capture pairwise statistical preferences of residues that are hydrogen bonded in β-sheets. We thus explore methods for incorporating pairwise dependencies into these models. Results: We consider the remote homology detection problem for β-structural motifs. In particular, we ask if a statistical model trained on members of only one family in a SCOP β-structural superfamily can recognize members of other families in that superfamily. We show that HMMs trained with our pairwise model of simulated evolution achieve nearly a median 5% improvement in AUC for β-structural motif recognition as compared to ordinary HMMs. Availability: All datasets and HMMs are available at: http://bcb.cs.tufts.edu/pairwise/ Contact: anoop.kumar@tufts.edu; lenore.cowen@tufts.edu
Affiliation(s)
- Anoop Kumar
- Department of Computer Science, Tufts University, Medford, MA, USA.
6
Jeong J, Berman P, Przytycka TM. Improving strand pairing prediction through exploring folding cooperativity. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2008; 5:484-491. [PMID: 18989036 PMCID: PMC2597093 DOI: 10.1109/tcbb.2008.88] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 05/27/2023]
Abstract
The topology of beta-sheets is defined by the pattern of hydrogen-bonded strand pairing. Therefore, predicting hydrogen bonded strand partners is a fundamental step towards predicting beta-sheet topology. At the same time, finding the correct partners is very difficult due to the long range interactions involved in strand pairing. Additionally, the patterns of amino acids involved in beta-sheet formation are very general and therefore difficult to use for computational recognition of specific contacts between strands. In this work, we report a new strand pairing algorithm. To address the above mentioned difficulties, our algorithm attempts to mimic elements of the folding process. Namely, in addition to ensuring that the predicted hydrogen bonded strand pairs satisfy basic global consistency constraints, it takes into account hypothetical folding pathways. Consistent with this view, introducing hydrogen bonds between a pair of strands changes the probabilities of forming hydrogen bonds between other pairs of strands. We demonstrate that this approach provides an improvement over previously proposed algorithms. We also compare the performance of this method to that of a global optimization algorithm that poses the problem as an integer linear programming optimization problem and solves it using the ILOG CPLEX package.
Affiliation(s)
- Jieun Jeong
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA
7
Abstract
The formation of beta-sheet domains in proteins involves five energetically important factors: the formation of networks of hydrogen bonds and hydrophobic faces, and the residue propensities, or preferences, to be found at the edges of the beta-sheet, to adopt the extended conformation, and to make contact with other residues. These relative energy contributions define a potential energy function. Here, we show how optimizing this potential energy function reveals the formation of hydrophobic faces as the utmost factor. The potential energy function was optimized to minimize the Z-scores of the native topologies among the exhaustive sets of over 400 different beta-sheets. These results corroborate with experimental data that showed the environment of a protein is an important modulator of beta-sheet folding. The contact propensities were found to be the least important, which could explain the poor predictive power of beta-strand alignment methods based on pair-wise contact matrices.
Affiliation(s)
- Marc Parisien
- Department of Computer Science and Operations Research, Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, Québec, Canada
8
González-Díaz H, Uriarte E. Biopolymer stochastic moments. I. Modeling human rhinovirus cellular recognition with protein surface electrostatic moments. Biopolymers 2006; 77:296-303. [PMID: 15648087 DOI: 10.1002/bip.20234] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Indexed: 11/11/2022]
Abstract
Stochastic moments may be applied as molecular descriptors in quantitative structure-activity relationship (QSAR) studies for small molecules (H. González-Díaz et al., Journal of Molecular Modeling, 2002, Vol. 8, pp. 237-245; 2003, Vol. 9, pp. 395-407). However, applications in the field of biopolymers are less known. Recently, the MARCH-INSIDE approach has been generalized to encode structural features of proteins and other biopolymers (H. González-Díaz et al., Bioinformatics, 2003, Vol. 19, pp. 2079-2087; Bioorganic & Medicinal Chemistry Letters, 2004, Vol. 14, pp. 4691-4695; Polymers, 2004, Vol. 45, pp. 3845-3853; Bioorganic & Medicinal Chemistry, 2005, Vol. 13, pp. 323-331). The present article attempts to extend this research by introducing for the first time stochastic moments for a surface road map of viral proteins. These moments are afterward used to seek a model that predicts the cellular receptor for human rhinoviruses. The model correctly classified 100% of 10 viruses binding to low-density lipoprotein receptor (LDLR) and 88.9% of 9 viruses binding to the intracellular adhesion molecule (ICAM) receptors in training. The same results have been obtained in four cross-validation experiments using a resubstitution technique. The present model favorably compares, in terms of complexity, with other previously reported based on entropy considerations, and offers a quantitative basis for the visual rule previously reported by Vlasak et al.
9
Cruz-Monteagudo M, González-Díaz H, Uriarte E. Simple Stochastic Fingerprints Towards Mathematical Modeling in Biology and Medicine 2. Unifying Markov Model for Drugs Side Effects. Bull Math Biol 2006; 68:1527-54. [PMID: 16847720 DOI: 10.1007/s11538-005-9013-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 01/17/2005] [Accepted: 05/09/2005] [Indexed: 10/24/2022]
Abstract
Most present mathematical models for biological activity consider just the molecular structure. In the present article we intend to extend the use of Markov chain models to define novel molecular descriptors, which consider in addition other parameters like target site or biological effect. Specifically, this mathematical model takes into consideration not only the molecular structure but also the specific biological system the drug affects. Herein, a general Markov model is developed that describes 19 different drug side effects grouped in eight affected biological systems for 178 drugs, giving 270 cases in total. The data was processed by linear discriminant analysis (LDA), classifying drugs according to their specific side effects; forward stepwise was fixed as the strategy for variable selection. The average percentage of good classification and number of compounds used in the training/predicting sets were 100/95.8% for endocrine manifestations, (18 out of 18)/(13 out of 14); 90.5/92.3% for gastrointestinal manifestations, (38 out of 42)/(30 out of 32); 88.5/86.5% for systemic phenomena, (23 out of 26)/(17 out of 20); 81.8/77.3% for neurological manifestations, (27 out of 33)/(19 out of 25); 81.6/86.2% for dermal manifestations, (31 out of 38)/(25 out of 29); 78.4/85.1% for cardiovascular manifestation, (29 out of 37)/(24 out of 28); 77.1/75.7% for breathing manifestations, (27 out of 35)/(20 out of 26) and 75.6/75% for psychiatric manifestations, (31 out of 41)/(23 out of 31). Additionally a back-projection analysis (BPA) was carried out for two ulcerogenic drugs to prove in structural terms the physical interpretation of the models obtained. This article develops, for the first time, a mathematical model that encompasses a large number of drug side effects grouped in specific biological systems using stochastic absolute probabilities of interaction ((A)pi(k)(j)).
Affiliation(s)
- Maykel Cruz-Monteagudo
- Applied Chemistry Research Center and Chemical Bioactives Center, Central University of Las Villas, Santa Clara, 54830, Cuba
10
Sun XD, Huang RB. Prediction of protein structural classes using support vector machines. Amino Acids 2006; 30:469-75. [PMID: 16622605 DOI: 10.1007/s00726-005-0239-0] [Citation(s) in RCA: 100] [Impact Index Per Article: 5.6] [Received: 06/01/2005] [Accepted: 07/12/2005] [Indexed: 11/24/2022]
Abstract
The support vector machine, a machine-learning method, is used to predict the four structural classes, i.e. mainly alpha, mainly beta, alpha-beta and fss, from the topology-level of CATH protein structure database. For the binary classification, any two structural classes which do not share any secondary structure such as alpha and beta elements could be classified with as high as 90% accuracy. The accuracy, however, will decrease to less than 70% if the structural classes to be classified contain structure elements in common. Our study also shows that the dimensions of feature space 20² = 400 (for dipeptide) and 20³ = 8000 (for tripeptide) give nearly the same prediction accuracy. Among these 4 structural classes, multi-class classification gives an overall accuracy of about 52%, indicating that the multi-class classification technique in support vector machines may still need to be further improved in future investigation.
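The feature-space dimensions quoted in this abstract (400 for dipeptides, 8000 for tripeptides) follow from counting all ordered k-mers over the 20 standard amino acids. A minimal sketch of such a composition vector is given below; the function name, the example sequence, and the choice to normalize counts to frequencies are assumptions for illustration and are not taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k=2):
    """Return the normalized k-mer frequency vector of a protein sequence.

    Hypothetical helper: the vector has 20**k entries, one per ordered k-mer.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in index:  # skip windows containing non-standard residues
            counts[index[window]] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


# Made-up example sequence, used only to show the vector sizes.
dipeptide = kmer_composition("MKVLAAGIVKV", k=2)
tripeptide = kmer_composition("MKVLAAGIVKV", k=3)
print(len(dipeptide), len(tripeptide))
```

Vectors of this kind could then be passed to any support vector machine implementation; which kernel and parameters the authors used is not stated in the abstract.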
Affiliation(s)
- X-D Sun
- College of Life Science and Biotechnology, Guangxi University, Nanning, Guangxi, China
11
Arunachalam J, Kanagasabai V, Gautham N. Protein structure prediction using mutually orthogonal Latin squares and a genetic algorithm. Biochem Biophys Res Commun 2006; 342:424-33. [PMID: 16487483 DOI: 10.1016/j.bbrc.2006.01.162] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 01/23/2006] [Accepted: 01/31/2006] [Indexed: 11/29/2022]
Abstract
We combine a new, extremely fast technique to generate a library of low energy structures of an oligopeptide (by using mutually orthogonal Latin squares to sample its conformational space) with a genetic algorithm to predict protein structures. The protein sequence is divided into oligopeptides, and a structure library is generated for each. These libraries are used in a newly defined mutation operator that, together with variation, crossover, and diversity operators, is used in a modified genetic algorithm to make the prediction. Application to five small proteins has yielded near native structures.
Affiliation(s)
- J Arunachalam
- Department of Crystallography and Biophysics, University of Madras, Chennai 600025, India
12
González-Díaz H, Uriarte E. Proteins QSAR with Markov average electrostatic potentials. Bioorg Med Chem Lett 2005; 15:5088-94. [PMID: 16169216 DOI: 10.1016/j.bmcl.2005.07.056] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Received: 03/10/2005] [Revised: 06/28/2005] [Accepted: 07/05/2005] [Indexed: 11/30/2022]
Abstract
Classic physicochemical and topological indices have been largely used in small molecules QSAR but less in proteins QSAR. In this study, a Markov model is used to calculate, for the first time, average electrostatic potentials ξ_k for an indirect interaction between amino acids placed at topologic distances k within a given protein backbone. The short-term average stochastic potential ξ_1 for 53 Arc repressor mutants was used to model the effect of alanine scanning on thermal stability. The Arc repressor is a model protein of relevance for biochemical studies on bioorganics and medicinal chemistry. A linear discriminant analysis model correctly classified 43 out of 53 (81.1%) proteins according to their thermal stability. More specifically, the model classified 20/28 (71.4%) proteins with near wild-type stability and 23/25 (92.0%) proteins with reduced stability. Moreover, predictability in cross-validation procedures was 81.0%. Expansion of the electrostatic potential in the series ξ_0, ξ_1, ξ_2, and ξ_3 justified the use of the abrupt truncation approach, the overall accuracy being >70.0% for ξ_0 but equal for ξ_1, ξ_2, and ξ_3. The ξ_1 model compared favorably with others based on D-Fire potential, surface area, volume, partition coefficient, and molar refractivity, which had less than 77.0% accuracy [Ramos de Armas, R.; González-Díaz, H.; Molina, R.; Uriarte, E. Protein Struct. Func. Bioinf. 2004, 56, 715]. The ξ_1 model also has a more tractable interpretation than others based on Markovian negentropies and stochastic moments. Finally, the model is notably simpler than the two models based on quadratic and linear indices. Both models, reported by Marrero-Ponce et al., use four to five times more descriptors. Introduction of average stochastic potentials may be useful for QSAR applications, as ξ_k has an amenable physical interpretation and is very effective.
Affiliation(s)
- Humberto González-Díaz
- Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela 15782, Spain.
13
Stochastic molecular descriptors for polymers. 3. Markov electrostatic moments as polymer 2D-folding descriptors: RNA–QSAR for mycobacterial promoters. Polymer 2005. [DOI: 10.1016/j.polymer.2005.04.104] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
14
González-Díaz H, Saíz-Urra L, Molina R, Uriarte E. Stochastic molecular descriptors for polymers. 2. Spherical truncation of electrostatic interactions on entropy based polymers 3D-QSAR. Polymer 2005. [DOI: 10.1016/j.polymer.2005.01.066] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022]
15
González-Díaz H, Cruz-Monteagudo M, Molina R, Tenorio E, Uriarte E. Predicting multiple drugs side effects with a general drug-target interaction thermodynamic Markov model. Bioorg Med Chem 2005; 13:1119-29. [PMID: 15670920 DOI: 10.1016/j.bmc.2004.11.030] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.7] [Received: 11/04/2004] [Revised: 11/09/2004] [Accepted: 11/12/2004] [Indexed: 10/26/2022]
Abstract
Most present molecular descriptors consider just the molecular structure. In the present article we intend to extend the use of Markov chain models to define novel molecular descriptors, which consider, in addition to molecular structure, other parameters like target site or toxic effect. Specifically, this molecular descriptor takes into consideration not only the molecular structure but also the specific system the drug affects. Herein, a general Markov model is developed that describes 39 different drug side effects grouped in 11 affected systems for 301 drugs, giving 686 cases in total. The data was processed by linear discriminant analysis (LDA), classifying drugs according to their specific side effects; forward stepwise was fixed as the strategy for variable selection. The average percentage of good classification and number of compounds used in the training/predicting sets were 100/100% for systemic phenomena (47 out of 47)/(12 out of 12) and metabolic (18 out of 18)/(5 out of 5), muscular-skeletal (23 out of 23)/(6 out of 6) and neurological manifestations (33 out of 33)/(8 out of 8); 97.6/96.7% for cardiovascular manifestation (122 out of 125)/(30 out of 31); 97.1/97.5% for breathing manifestations (34 out of 35)/(8 out of 9); 97/99.4% for gastrointestinal manifestations (159 out of 164)/(40 out of 41); 97/95% for endocrine manifestations (32 out of 33)/(7 out of 8); 96.4/94.6% for psychiatric manifestations (53 out of 55)/(13 out of 14); 95.1/99.1% for hematological manifestations (98 out of 103)/(25 out of 26) and 88/92.3% for dermal manifestations (44 out of 50)/(12 out of 13). In addition, we report a preliminary experimental reversible decrease of lymphocytes differential count after administration of the antibacterial drug G-1 in mice, which coincides with a posterior probability (P%=74.91) predicted by the model. This article develops a model that encompasses a large number of side effects grouped in specific organ systems in a single stochastic framework for the first time.
Affiliation(s)
- Humberto González-Díaz
- Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela 15782, Spain
16
de Armas RR, Díaz HG, Molina R, Uriarte E. Stochastic-based descriptors studying biopolymers biological properties: Extended MARCH-INSIDE methodology describing antibacterial activity of lactoferricin derivatives. Biopolymers 2005; 77:247-56. [PMID: 15682438 DOI: 10.1002/bip.20202] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Indexed: 11/08/2022]
Abstract
Lactoferricins are a number of related peptides derived from the enzymatic cleavage of lactoferrin, an iron-binding protein. These peptides, and other peptides derived from them by simple amino acid substitutions, have shown interesting antibacterial activity. In this paper we applied the MARCH-INSIDE methodology, extended to peptides and proteins, to a QSAR study of the antibacterial activity of 31 derivatives of lactoferricin against E. coli and S. aureus by means of Linear Discriminant Analysis (LDA) and Multiple Linear Regression Analysis (MLR). In the case of LDA we obtained models that classify correctly more than 80% of all cases (85.7% for E. coli antibacterial activity and 83.9% for S. aureus). With the application of a Leave-One-Out Cross Validation Procedure, the percentage of good classification of both classification models remained near the above reported values (87.1% for E. coli antibacterial activity and 83.9% for S. aureus). We obtained several linear regression models taking into account total and local descriptors. The inclusion of those local descriptors improved the correlation parameters, the statistical quality, and the predictive power of the former model obtained only with total descriptors. The best models explained more than 80% of the experimental variance in the antimicrobial activity of those compounds. These results are comparable with those reported previously by Strom (Strom, M. B.; Rekdal, O.; Svendesen, J. S. J Peptide Res 2001, 57, 127-139.) and Tore-Lejon (Lejon, T.; Strom, M.; Svendsen, S. J Protein Sci 2001, 7, 74-78.; Lejon, T.; Svendsen J. S.; Haug, B. E. J Peptide Sci 2002, 8, 302-306.) in a smaller dataset applying Z-scales and volume-based descriptors and PLS as statistical techniques.
17
González-Díaz H, Uriarte E, Ramos de Armas R. Predicting stability of Arc repressor mutants with protein stochastic moments. Bioorg Med Chem 2005; 13:323-31. [PMID: 15598555 DOI: 10.1016/j.bmc.2004.10.024] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.1] [Received: 09/15/2004] [Revised: 10/08/2004] [Accepted: 10/09/2004] [Indexed: 11/18/2022]
Abstract
As more and more protein structures are determined and applied to drug manufacture, there is increasing interest in studying their stability. In this study, the stochastic moments ((SR)pi(k)) of 53 Arc repressor mutants were introduced as molecular descriptors modeling protein stability. The Linear Discriminant Analysis model developed correctly classified 43 out of 53, 81.13% of proteins according to their thermal stability. More specifically, the model classified 20/28 (71.4%) proteins with near wild-type stability and 23/25 (92%) proteins with reduced stability. Moreover, validation of the model was carried out by re-substitution procedures (81.0%). In addition, the stochastic moments based model compared favorably with respect to others based on physicochemical and geometric parameters such as D-Fire potential, surface area, volume, partition coefficient, and molar refractivity, which presented less than 77% of accuracy. This result illustrates the possibilities of the stochastic moments' method for the study of bioorganic and medicinal chemistry relevant proteins.
Affiliation(s)
- Humberto González-Díaz
- Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela 15706, Spain.
18
Ramos de Armas R, González Díaz H, Molina R, Uriarte E. Markovian Backbone Negentropies: Molecular descriptors for protein research. I. Predicting protein stability in Arc repressor mutants. Proteins 2004; 56:715-23. [PMID: 15281125 DOI: 10.1002/prot.20159] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.2] [Indexed: 11/08/2022]
Abstract
As more and more protein structures are determined and applied to drug manufacture, there is increasing interest in studying their stability. In this sense, developing novel computational methods to predict and study protein stability in relation to their amino acid sequences has become a significant goal in applied Proteomics. In the study described here, Markovian Backbone Negentropies (MBN) have been introduced in order to model the effect on protein stability of a complete set of alanine substitutions in the Arc repressor. A total of 53 proteins were studied by means of Linear Discriminant Analysis using MBN as molecular descriptors. MBN are molecular descriptors based on a Markov chain model of electron delocalization throughout the protein backbone. The model correctly classified 43 out of 53 (81.13%) proteins according to their thermal stability. More specifically, the model classified 20/28 (71.4%) proteins with near wild-type stability and 23/25 (92%) proteins with reduced stability. Moreover, the model presented a good Mathew's regression coefficient of 0.643. Validation of the model was carried out by several Jackknife procedures. The method compares favorably with surface-dependent and thermodynamic parameter stability scoring functions. For instance, the D-FIRE potential classification function shows a level of good classification of 76.9%. On the other hand, surface, volume, logP, and molar refractivity show accuracies of 70.7, 62.3, 59.0, and 60.0%, respectively.
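The classification counts in this abstract are enough to check the reported summary figures by hand. The sketch below assumes that the quoted "regression coefficient" of 0.643 is the Matthews correlation coefficient of the two-class confusion matrix, and it treats "reduced stability" as the positive class; both are assumptions made for this illustration, although the coefficient itself does not depend on which class is called positive.

```python
from math import sqrt

# Counts taken from the abstract; labeling "reduced stability" as the
# positive class is an assumption made for this illustration.
tp, fn = 23, 2  # reduced-stability mutants: 23 of 25 classified correctly
tn, fp = 20, 8  # near wild-type mutants: 20 of 28 classified correctly

# Overall accuracy: correctly classified mutants over all 53 mutants.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Matthews correlation coefficient of the 2x2 confusion matrix.
mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy = {accuracy:.4f}")
print(f"MCC = {mcc:.3f}")
```

Under these assumptions the accuracy is 43/53, about 81.13%, and the coefficient rounds to 0.643, in agreement with the values stated in the abstract.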
19
Gonzáles-Díaz H, Gia O, Uriarte E, Hernádez I, Ramos R, Chaviano M, Seijo S, Castillo JA, Morales L, Santana L, Akpaloo D, Molina E, Cruz M, Torres LA, Cabrera MA. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer-aided molecular design I: discovery of anticancer compounds. J Mol Model 2003; 9:395-407. [PMID: 13680309 DOI: 10.1007/s00894-003-0148-7] [Citation(s) in RCA: 61] [Impact Index Per Article: 2.9] [Received: 03/20/2003] [Accepted: 07/07/2003] [Indexed: 10/26/2022]
Abstract
A simple stochastic approach, designed to model the movement of electrons throughout chemical bonds, is introduced. This model makes use of a Markov matrix to codify useful structural information in QSAR. The self-return probabilities of this matrix throughout time ((SR)pi(k)) are then used as molecular descriptors. Firstly, a calculation of (SR)pi(k) is made for a large series of anticancer and non-anticancer chemicals. Then, k-Means Cluster Analysis allows us to split the data series into clusters and ensure a representative design of training and predicting series. Next, we develop a classification function through Linear Discriminant Analysis (LDA). This QSAR discriminates between anticancer compounds and non-active compounds with a correct global classification of 90.5% in the training series. The model also correctly classified 86.07% of the compounds in the predicting series. This classification function is then used to perform a virtual screening of a combinatorial library of coumarins. In this connection, the biological assay of some furocoumarins, selected by virtual screening using the present model, gives good results. In particular, a tetracyclic derivative of 5-methoxypsoralen (5-MOP) has an IC50 against HL-60 tumoral line around 6 to 10 times lower than those for 8-MOP and 5-MOP (reference drugs), respectively. Finally, application of Iso-contribution Zone Analysis (IZA) provides structural interpretation of the biological activity predicted with this QSAR.
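The idea of using self-return probabilities of a Markov matrix as descriptors can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the abstract does not state how the matrix entries are defined, so this sketch lets each atom move to itself or to a bonded atom with probability proportional to a per-atom weight, and it sums the diagonal of the k-th matrix power to obtain one descriptor per step k. The function names and the three-atom example are hypothetical.

```python
def transition_matrix(adjacency, weights):
    """Row-stochastic matrix built from a bond adjacency matrix.

    Assumption: atom i moves to itself or to a bonded atom j with
    probability proportional to weights[j].
    """
    n = len(adjacency)
    matrix = []
    for i in range(n):
        row = [weights[j] if (i == j or adjacency[i][j]) else 0.0 for j in range(n)]
        total = sum(row)
        matrix.append([value / total for value in row])
    return matrix

def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def self_return_descriptors(matrix, max_k):
    """Sum of the diagonal of matrix**k for k = 0..max_k."""
    n = len(matrix)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    descriptors = []
    for _ in range(max_k + 1):
        descriptors.append(sum(power[i][i] for i in range(n)))
        power = matmul(power, matrix)
    return descriptors

# Hypothetical three-atom chain (0-1-2) with equal weights.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
matrix = transition_matrix(adjacency, weights=[1.0, 1.0, 1.0])
print(self_return_descriptors(matrix, max_k=3))
```

Descriptors of this kind could then be fed to a classifier such as Linear Discriminant Analysis, which is the role they play in the abstract.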
Affiliation(s)
- Humberto Gonzáles-Díaz
- Chemical Bioactives Center, Central University of Las Villas, 54830 Santa Clara, Villa Clara, Cuba.
|
20
|
Cowen L, Bradley P, Menke M, King J, Berger B. Predicting the beta-helix fold from protein sequence data. J Comput Biol 2002; 9:261-76. [PMID: 12015881 DOI: 10.1089/10665270252935458] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022] Open
Abstract
A method is presented that uses beta-strand interactions to predict the parallel right-handed beta-helix super-secondary structural motif in protein sequences. A program called BetaWrap implements this method and is shown to score known beta-helices above non-beta-helices in the Protein Data Bank in cross-validation. It is demonstrated that BetaWrap learns each of the seven known SCOP beta-helix families, when trained primarily on beta-structures that are not beta-helices, together with structural features of known beta-helices from outside the family. BetaWrap also predicts many bacterial proteins of unknown structure to be beta-helices; in particular, these proteins serve as virulence factors, adhesins, and toxins in bacterial pathogenesis and include cell surface proteins from Chlamydia and the intestinal bacterium Helicobacter pylori. The computational method used here may generalize to other beta-structures for which strand topology and profiles of residue accessibility are well conserved.
Affiliation(s)
- Lenore Cowen
- Department of EECS, Tufts University, Medford, MA 02155, USA
|
21
|
Steward RE, Thornton JM. Prediction of strand pairing in antiparallel and parallel beta-sheets using information theory. Proteins 2002; 48:178-91. [PMID: 12112687 DOI: 10.1002/prot.10152] [Citation(s) in RCA: 62] [Impact Index Per Article: 2.8] [Indexed: 12/22/2022]
Abstract
An information theory approach was developed to predict the alignment of interacting antiparallel and parallel beta-strands. Information scores were derived for the preference of a residue on a beta-strand to be opposite a sequence of residues on an adjacent beta-strand. These scores were used to predict the interstrand register of interacting beta-strands from 10 alternative offset positions either side of the experimentally observed beta-sheet register. The amino acid sequence of an internal beta-strand can be correctly aligned with two beta-strands in a fixed position either side of the strand in 45% of antiparallel and 48% of parallel arrangements. For comparison, when another beta-strand from a nonhomologous protein substitutes the internal beta-strand, the same register is predicted for only 24 and 36% of antiparallel and parallel arrangements. As expected, alignment of a single fixed strand with just a second beta-strand sequence was more difficult, and gave a correct register in 31 and 37% of antiparallel and parallel beta-pairs, respectively. These scores are 10% higher than for two randomly selected beta-strand sequences. In general, prediction accuracy was not improved by information tables that distinguished hydrogen-bonding patterns or beta-strand order. These results will contribute to predicting the arrangement of beta-strands in beta-pleated sheets and protein topology.
Affiliation(s)
- Robert E Steward
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
|
23
|
Bradley P, Cowen L, Menke M, King J, Berger B. BETAWRAP: successful prediction of parallel beta-helices from primary sequence reveals an association with many microbial pathogens. Proc Natl Acad Sci U S A 2001; 98:14819-24. [PMID: 11752429 PMCID: PMC64942 DOI: 10.1073/pnas.251267298] [Citation(s) in RCA: 86] [Impact Index Per Article: 3.7] [Received: 05/29/2001] [Indexed: 11/18/2022] Open
Abstract
The amino acid sequence rules that specify beta-sheet structure in proteins remain obscure. A subclass of beta-sheet proteins, parallel beta-helices, represent a processive folding of the chain into an elongated topologically simpler fold than globular beta-sheets. In this paper, we present a computational approach that predicts the right-handed parallel beta-helix supersecondary structural motif in primary amino acid sequences by using beta-strand interactions learned from non-beta-helix structures. A program called BETAWRAP (http://theory.lcs.mit.edu/betawrap) implements this method and recognizes each of the seven known parallel beta-helix families, when trained on the known parallel beta-helices from outside that family. BETAWRAP identifies 2,448 sequences among 595,890 screened from the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/) nonredundant protein database as likely parallel beta-helices. It identifies surprisingly many bacterial and fungal protein sequences that play a role in human infectious disease; these include toxins, virulence factors, adhesins, and surface proteins of Chlamydia, Helicobacteria, Bordetella, Leishmania, Borrelia, Rickettsia, Neisseria, and Bacillus anthracis. Also unexpected was the rarity of the parallel beta-helix fold and its predicted sequences among higher eukaryotes. The computational method introduced here can be called a three-dimensional dynamic profile method because it generates interstrand pairwise correlations from a processive sequence wrap. Such methods may be applicable to recognizing other beta structures for which strand topology and profiles of residue accessibility are well conserved.
Affiliation(s)
- P Bradley
- Mathematics Department and Laboratory for Computer Science, and Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
|
24
|
Abstract
Methods predicting protein secondary structure improved substantially in the 1990s through the use of evolutionary information taken from the divergence of proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current height of around 76% of all residues predicted correctly in one of the three states, helix, strand, and other. The past year also brought successful new concepts to the field. These new methods may be particularly interesting in light of the improvements achieved through simple combining of existing methods. Divergent evolutionary profiles contain enough information not only to substantially improve prediction accuracy, but also to correctly predict long stretches of identical residues observed in alternative secondary structure states depending on nonlocal conditions. An example is a method automatically identifying structural switches and thus finding a remarkable connection between predicted secondary structure and aspects of function. Secondary structure predictions are increasingly becoming the work horse for numerous methods aimed at predicting protein structure and function. Is the recent increase in accuracy significant enough to make predictions even more useful? Because the recent improvement yields a better prediction of segments, and in particular of beta strands, I believe the answer is affirmative. What is the limit of prediction accuracy? We shall see.
Affiliation(s)
- B Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, New York 10032, USA
|
25
|
Olmea O, Rost B, Valencia A. Effective use of sequence correlation and conservation in fold recognition. J Mol Biol 1999; 293:1221-39. [PMID: 10547297 DOI: 10.1006/jmbi.1999.3208] [Citation(s) in RCA: 131] [Impact Index Per Article: 5.2] [Indexed: 11/22/2022]
Abstract
Protein families are a rich source of information; sequence conservation and sequence correlation are two of the main properties that can be derived from the analysis of multiple sequence alignments. Sequence conservation is related to the direct evolutionary pressure to retain the chemical characteristics of some positions in order to maintain a given function. Sequence correlation is attributed to the small sequence adjustments needed to maintain protein stability against constant mutational drift. Here, we showed that sequence conservation and correlation were each frequently informative enough to detect incorrectly folded proteins. Furthermore, combining conservation, correlation, and polarity, we achieved an almost perfect discrimination between native and incorrectly folded proteins. Thus, we made use of this information for threading by evaluating the models suggested by a threading method according to the degree of proximity of the corresponding correlated, conserved, and apolar residues. The results showed that the fold recognition capacity of a given threading approach could be improved almost fourfold by selecting the alignments that score best under the three different sequence-based approaches.
Affiliation(s)
- O Olmea
- Protein Design Group, CNB-CSIC, Cantoblanco, Madrid, E-28049, Spain
|
27
|
Morea V, Leplae R, Tramontano A. Protein structure prediction and design. BIOTECHNOLOGY ANNUAL REVIEW 1999; 4:177-214. [PMID: 9890141 DOI: 10.1016/s1387-2656(08)70070-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 01/26/2023]
Abstract
Proteins have a unique native conformation, which can be proven in many instances to be determined by the amino acid sequence alone. The folding problem, that is the understanding of how the amino acid sequence directs folding, is still unsolved, despite more than 30 years of effort. However, many new methods have appeared in the past few years. This chapter describes the different principles underlying them and tries to give an overview of their successes and pitfalls.
Affiliation(s)
- V Morea
- IRBM P. Angeletti, Pomezia, Rome, Italy
|
28
|
Hutchinson EG, Sessions RB, Thornton JM, Woolfson DN. Determinants of strand register in antiparallel beta-sheets of proteins. Protein Sci 1998; 7:2287-300. [PMID: 9827995 PMCID: PMC2143855 DOI: 10.1002/pro.5560071106] [Citation(s) in RCA: 151] [Impact Index Per Article: 5.8] [Indexed: 11/11/2022]
Abstract
Antiparallel beta-sheets present two distinct environments to inter-strand residue pairs: beta(A,HB) sites have two backbone hydrogen bonds; whereas at beta(A,NHB) positions backbone hydrogen bonding is precluded. We used statistical methods to compare the frequencies of amino acid pairs at each site. Only approximately 10% of the 210 possible pairs showed occupancies that differed significantly between the two sites. Trends were clear in the preferred pairs, and these could be explained using stereochemical arguments. Cys-Cys, Aromatic-Pro, Thr-Thr, and Val-Val pairs all preferred the beta(A,NHB) site. In each case, the residues usually adopted sterically favored chi1 conformations, which facilitated intra-pair interactions: Cys-Cys pairs formed disulfide bonds; Thr-Thr pairs made hydrogen bonds; Aromatic-Pro and Val-Val pairs formed close van der Waals contacts. In contrast, to make intimate interactions at a beta(A,HB) site, one or both residues had to adopt less favored chi1 geometries. Nonetheless, pairs containing glycine and/or aromatic residues were favored at this site. Where glycine and aromatic side chains combined, the aromatic residue usually adopted the gauche conformation, which promoted novel aromatic ring-peptide interactions. This work provides rules that link protein sequence and tertiary structure, which will be useful in protein modeling, redesign, and de novo design. Our findings are discussed in light of previous analyses and experimental studies.
Affiliation(s)
- E G Hutchinson
- Department of Biochemistry and Molecular Biology, University College, London, United Kingdom
|
29
|
Ammendola S, Politi L, Scandurra R. Cloning and sequencing of ISC1041 from the archaeon Sulfolobus solfataricus MT-4, a new member of the IS30 family of insertion elements. FEBS Lett 1998; 428:217-23. [PMID: 9654137 DOI: 10.1016/s0014-5793(98)00530-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 02/08/2023]
Abstract
A genomic fragment containing the insertion sequence ISC1041 has been cloned by PCR from the archaeon Sulfolobus solfataricus MT-4, an extremophilic microorganism which grows at 87 °C. The 1038 bp ISC1041 element contains an imperfect 18 nt repeat and a long open reading frame which encodes a polypeptide of 311 amino acid residues. The translated amino acid sequence shows a significant similarity to IS30-like transposases. Structural analysis indicates that ISC1041 is a novel member of the IS30 family and displays the DDE motif not previously seen in Archaea. This motif is believed to be involved in the integration mechanism of many mobile elements. As this motif is present in several integrases and transposases which, despite the lack of overall protein homologies, share topological homologies to the DDE motif, a common ancestor has been proposed. The finding of an IS30-like transposase in the archaeal kingdom may have relevance for horizontal gene transfer.
Affiliation(s)
- S Ammendola
- Dipartimento di Scienze Biochimiche, Università di Roma La Sapienza, Rome, Italy
|
30
|
Ortiz AR, Kolinski A, Skolnick J. Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. J Mol Biol 1998; 277:419-48. [PMID: 9514747 DOI: 10.1006/jmbi.1997.1595] [Citation(s) in RCA: 73] [Impact Index Per Article: 2.8] [Indexed: 11/22/2022]
Abstract
The feasibility of predicting the global fold of small proteins by incorporating predicted secondary and tertiary restraints into ab initio folding simulations has been demonstrated on a test set comprised of 20 non-homologous proteins, of which one was a blind prediction of target 42 in the recent CASP2 contest. These proteins contain from 37 to 100 residues and represent all secondary structural classes and a representative variety of global topologies. Secondary structure restraints are provided by the PHD secondary structure prediction algorithm that incorporates multiple sequence information. Predicted tertiary restraints are derived from multiple sequence alignments via a two-step process. First, seed side-chain contacts are identified from correlated mutation analysis, and then a threading-based algorithm is used to expand the number of these seed contacts. A lattice-based reduced protein model and a folding algorithm designed to incorporate these predicted restraints are described. Depending upon fold complexity, it is possible to assemble native-like topologies whose coordinate root-mean-square deviation from native is between 3.0 Å and 6.5 Å. The requisite level of accuracy in side-chain contact map prediction can be roughly 25% on average, provided that about 60% of the contact predictions are correct within ±1 residue and 95% of the predictions are correct within ±4 residues. Precision in tertiary contact prediction is more critical than absolute accuracy. Furthermore, only a subset of the tertiary contacts, on the order of 25% of the total, is sufficient for successful topology assembly. Overall, this study suggests that the use of restraints derived from multiple sequence alignments combined with a fold assembly algorithm holds considerable promise for the prediction of the global topology of small proteins.
Affiliation(s)
- A R Ortiz
- TPC-5, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA
|
31
|
Abstract
The two-dimensional contact map of interresidue distances is a visual analysis technique for protein structures. We present two standalone software tools designed to be used in combination to increase the versatility of this simple yet powerful technique. First, the program Structer calculates contact maps from three-dimensional molecular structural data. The contact map matrix can then be viewed in the graphical matrix-visualization program Dotter. Instead of using a predefined distance cutoff, we exploit Dotter's dynamic rendering control, allowing interactive exploration at varying distance cutoffs after calculating the matrix once. Structer can use a number of distance measures, can incorporate multiple chains in one contact map, and allows masking of user-defined residue sets. It works either directly with PDB files, or can use the MMDB network API for reading structures.
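The point about computing the matrix once and exploring cutoffs afterwards can be sketched in a few lines. This is an illustrative sketch only and does not reproduce Structer or Dotter: it assumes one representative coordinate per residue (for example a C-alpha atom), uses plain Euclidean distance, and the function names and toy coordinates are hypothetical.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates (computed once)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def contact_map(distances, cutoff):
    """Boolean contact map derived from a precomputed distance matrix."""
    return [[d <= cutoff for d in row] for row in distances]

# Toy coordinates in angstroms, one point per residue (made-up values).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 5.0, 0.0)]
distances = distance_matrix(coords)

# The same distance matrix is reused for every cutoff, so changing the
# threshold needs no new pass over the structure.
for cutoff in (4.0, 6.0, 8.0):
    contacts = contact_map(distances, cutoff)
    print(cutoff, ["".join("#" if c else "." for c in row) for row in contacts])
```

Keeping the real-valued matrix and thresholding at display time is what makes interactive exploration cheap: each new cutoff costs only a comparison per cell.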
Affiliation(s)
- E L Sonnhammer
- Computational Biology Branch, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
32
|
Turcotte M, Muggleton SH, Sternberg MJE. Application of inductive logic programming to discover rules governing the three-dimensional topology of protein structure. INDUCTIVE LOGIC PROGRAMMING 1998. [DOI: 10.1007/bfb0027310] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.7] [Indexed: 02/08/2023]
|
33
|
Benner SA, Cannarozzi G, Gerloff D, Turcotte M, Chelvanayagam G. Bona Fide Predictions of Protein Secondary Structure Using Transparent Analyses of Multiple Sequence Alignments. Chem Rev 1997; 97:2725-2844. [PMID: 11851479 DOI: 10.1021/cr940469a] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
Affiliation(s)
- Steven A. Benner
- Department of Chemistry, University of Florida, Gainesville, Florida 32611-7200
|
34
|
Abstract
If protein structure prediction methods are to make any impact on the impending onerous task of analyzing the large numbers of unknown protein sequences generated by the ongoing genome-sequencing projects, it is vital that they make the difficult transition from computational 'gedankenexperiments' to practical software tools. This has already happened in the field of comparative modelling and is currently happening in the threading field. Unfortunately, there is little evidence of this transition happening in the field of ab initio tertiary-structure prediction.
Affiliation(s)
- D T Jones
- Department of Biological Sciences, University of Warwick, Coventry, UK.
|
35
|
Abstract
The accuracy of secondary structure prediction methods has been improved significantly by the use of aligned protein sequences. The PHD method and the NNSSP method reach 71 to 72% of sustained overall three-state accuracy when multiple sequence alignments are used with neural networks and nearest-neighbor algorithms, respectively. We introduce a variant of the nearest-neighbor approach that can achieve similar accuracy using a single sequence as the query input. We compute the 50 best non-intersecting local alignments of the query sequence with each sequence from a set of proteins with known 3D structures. Each position of the query sequence is aligned with the database amino acids in alpha-helical, beta-strand or coil states. The prediction type of secondary structure is selected as the type of aligned position with the maximal total score. On the dataset of 124 non-membrane non-homologous proteins, used earlier as a benchmark for secondary structure predictions, our method reaches an overall three-state accuracy of 71.2%. The performance accuracy is verified by an additional test on 461 non-homologous proteins giving an accuracy of 71.0%. The main strength of the method is the high level of prediction accuracy for proteins without any known homolog. Using multiple sequence alignments as input the method has a prediction accuracy of 73.5%. Prediction of secondary structure by the SSPAL method is available via Baylor College of Medicine World Wide Web server.
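The final selection step described in this abstract, choosing for each query position the state with the maximal total score, can be sketched as follows. This is a hedged illustration, not the SSPAL implementation: the input format (a list of position, state, score triples gathered from the local alignments), the function name, and the choice of coil as the default for unaligned positions are all assumptions.

```python
from collections import defaultdict

def predict_states(query_length, aligned_hits):
    """Pick, per query position, the state with the highest summed alignment score.

    aligned_hits: iterable of (query_position, state, alignment_score), where
    state is 'H' (helix), 'E' (strand) or 'C' (coil).
    Assumption: positions covered by no alignment default to coil.
    """
    totals = [defaultdict(float) for _ in range(query_length)]
    for position, state, score in aligned_hits:
        totals[position][state] += score
    return "".join(max(t, key=t.get) if t else "C" for t in totals)

# Made-up hits for a four-residue query.
hits = [
    (0, "H", 5.0), (0, "E", 3.0), (0, "E", 2.5),
    (1, "H", 4.0),
    (2, "E", 1.0), (2, "C", 1.5),
]
print(predict_states(4, hits))  # prints "EHCC"
```

At position 0 the two strand hits together (5.5) outweigh the single helix hit (5.0), which shows why summing scores over all alignments can differ from taking the single best alignment.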
Affiliation(s)
- A A Salamov
- Department of Cell Biology, Baylor College of Medicine, Houston, TX 77030, USA
|
36
|
Rice DW, Eisenberg D. A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J Mol Biol 1997; 267:1026-38. [PMID: 9135128 DOI: 10.1006/jmbi.1997.0924] [Citation(s) in RCA: 131] [Impact Index Per Article: 4.9] [Indexed: 02/04/2023]
Abstract
In protein fold recognition, a probe amino acid sequence is compared to a library of representative folds of known structure to identify a structural homolog. In cases where the probe and its homolog have clear sequence similarity, traditional residue substitution matrices have been used to predict the structural similarity. In cases where the probe is sequentially distant from its homolog, we have developed a (7 x 3 x 2 x 7 x 3) 3D-1D substitution matrix (called H3P2), calculated from a database of 119 structural pairs. Members of each pair share a similar fold, but have sequence identity less than 30%. Each probe sequence position is defined by one of seven residue classes and three secondary structure classes. Each homologous fold position is defined by one of seven residue classes, three secondary structure classes, and two burial classes. Thus the matrix is five-dimensional and contains 7 x 3 x 2 x 7 x 3 = 882 elements or 3D-1D scores. The first step in assigning a probe sequence to its homologous fold is the prediction of the three-state (helix, strand, coil) secondary structure of the probe; here we use the profile based neural network prediction of secondary structure (PHD) program. Then a dynamic programming algorithm uses the H3P2 matrix to align the probe sequence with structures in a representative fold library. To test the effectiveness of the H3P2 matrix a challenging, fold class diverse, and cross-validated benchmark assessment is used to compare the H3P2 matrix to the GONNET, PAM250, BLOSUM62 and a secondary structure only substitution matrix. For distantly related sequences the H3P2 matrix detects more homologous structures at higher reliabilities than do these other substitution matrices, based on sensitivity versus specificity plots (or SENS-SPEC plots). The added efficacy of the H3P2 matrix arises from its information on the statistical preferences for various sequence-structure environment combinations from very distantly related proteins. 
It introduces the predicted secondary structure information from a sequence into fold recognition in a statistical way that normalizes the inherent correlations between residue type, secondary structure and solvent accessibility.
Affiliation(s)
- D W Rice
- UCLA-DOE Laboratory of Structural Biology and Molecular Medicine, Molecular Biology Institute, UCLA, Los Angeles, CA 90095-1570, USA
|
37
|
Abstract
An ever increasing number of protein sequences are being compared, partly because of the availability of full sets of protein sequences from several completed genome-sequencing projects. The resulting problem of scale has shifted the emphasis of sequence analysis method development from sensitivity and flexibility, which relies on manual intervention and interpretation, to the automatic generation of results of known reliability.
Affiliation(s)
- T J Hubbard
- Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.
|
38
|
Di Francesco V, Garnier J, Munson PJ. Protein topology recognition from secondary structure sequences: application of the hidden Markov models to the alpha class proteins. J Mol Biol 1997; 267:446-63. [PMID: 9096237 DOI: 10.1006/jmbi.1996.0874] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.0] [Indexed: 02/04/2023]
Abstract
The three-dimensional fold of a protein is described by the organization of its secondary structure elements in 3D space, i.e. its "topology". We find that the protein topology can be recognized from the 1D sequence of secondary structure states of the residues alone. Automated recognition is facilitated by use of hidden Markov models (HMMs) to represent topology families of proteins. Such models can be trained on the experimentally observed secondary structure sequences of family members using well established algorithms. Here, we model various topology groups in the alpha class of proteins and identify, from a large database, those proteins having the topology described by each model. The correct topology family for protein secondary structure sequences could be recognized 12 out of 14 times. When the observed secondary structure sequences are replaced with predicted sequences recognition is still achievable 8 out of 14 times. The success rate for observed sequences indicates that our approach will become increasingly useful as the accuracy of secondary prediction algorithms is improved. Our study indicates that the HMMs are useful for protein topology recognition even when no detectable primary amino acid sequence similarity is present. To illustrate the potential utility of our method, protein topology recognition is attempted on leptin, the obese gene product, and the human interleukin-6 sequence, for which fold predictions have been previously published.
Affiliation(s)
- V Di Francesco
- Laboratory of Structural Biology, Division of Computer Research and Technology, National Institutes of Health, Bethesda, MD 20892-5626, USA.
|
39
|
Abstract
The computational techniques of sorting out protein folds (these techniques include dynamic programming, self-consistent field theory, etc.) have already ceased to be the bottleneck of predictions. The main problem is that all the methods of recognition and prediction of protein structure can actually use only some part of the interactions operating in the chain, and that even their energies are not known precisely. This is the principal source of errors now. The errors can be reduced by employment of many distant homologues, but this opens a possibility to predict a generalized folding pattern rather than a particular fold with all its details.
Affiliation(s)
- A V Finkelstein
- Institute of Protein Research, Russian Academy of Sciences, 142292 Pushchino, Moscow Region, Russia.
|
40
|
Bycroft M, Hubbard TJ, Proctor M, Freund SM, Murzin AG. The solution structure of the S1 RNA binding domain: a member of an ancient nucleic acid-binding fold. Cell 1997; 88:235-42. [PMID: 9008164 DOI: 10.1016/s0092-8674(00)81844-9] [Citation(s) in RCA: 336] [Impact Index Per Article: 12.4] [Indexed: 02/03/2023]
Abstract
The S1 domain, originally identified in ribosomal protein S1, is found in a large number of RNA-associated proteins. The structure of the S1 RNA-binding domain from the E. coli polynucleotide phosphorylase has been determined using NMR methods and consists of a five-stranded antiparallel beta barrel. Conserved residues on one face of the barrel and adjacent loops form the putative RNA-binding site. The structure of the S1 domain is very similar to that of cold shock protein, suggesting that they are both derived from an ancient nucleic acid-binding protein. Enhanced sequence searches reveal hitherto unidentified S1 domains in RNase E, RNase II, NusA, EMB-5, and other proteins.
Affiliation(s)
- M Bycroft
- Department of Chemistry, University of Cambridge, United Kingdom
|
41
|
Olmea O, Valencia A. Improving contact predictions by the combination of correlated mutations and other sources of sequence information. FOLDING & DESIGN 1997; 2:S25-32. [PMID: 9218963 DOI: 10.1016/s1359-0278(97)00060-6] [Citation(s) in RCA: 157] [Impact Index Per Article: 5.8] [Indexed: 02/04/2023]
Abstract
We have previously developed a method for predicting interresidue contacts using information about correlated mutations in multiple sequence alignments. The predictions generated with this method were clearly better than random but not enough for their use in de novo protein folding experiments. We assess the possibility of improving contact predictions combining information from the following variables: correlated mutations, sequence conservation, sequence separation along the chain, alignment stability, family size, residue-specific contact occupancy and formation of contact networks. The application of a protocol for combining these independent variables leads to contact predictions that are on average two times better than those obtained initially with correlated mutations. Correlated mutations can be effectively combined with other types of information derived from multiple sequence alignments. Among the different variables tried, sequence conservation and contact density are particularly relevant for the combination with correlated mutations.
Affiliation(s)
- O Olmea
- Protein Design Group, CNB-CSIC, Campus U Autonoma, Cantoblanco, Madrid, Spain
|
42
|
Bohr J, Bohr H, Brunak S. Protein folding and wring resonances. Biophys Chem 1997; 63:97-105. [PMID: 17029822 DOI: 10.1016/s0301-4622(96)02249-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.2] [Received: 04/11/1996] [Revised: 08/19/1996] [Accepted: 10/17/1996] [Indexed: 10/18/2022]
Abstract
The polypeptide chain of a protein is shown to obey topological constraints which enable long range excitations in the form of wring modes of the protein backbone. Wring modes of proteins of specific lengths can therefore resonate with molecular modes present in the cell. It is suggested that protein folding takes place when the amplitude of a wring excitation becomes so large that it is energetically favorable to bend the protein backbone. The condition under which such structural transformations can occur is found, and it is shown that both cold and hot denaturation (the unfolding of proteins) are natural consequences of the suggested wring mode model. Native (folded) proteins are found to possess an intrinsic standing wring mode.
Affiliation(s)
- J Bohr
- Physics Department, Building 307, The Technical University of Denmark, DK-2800 Lyngby, Denmark.
|
43
|
Di Francesco V, Geetha V, Garnier J, Munson PJ. Fold recognition using predicted secondary structure sequences and hidden Markov models of protein folds. Proteins 1997. [DOI: 10.1002/(sici)1097-0134(1997)1+<123::aid-prot16>3.0.co;2-q] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.6] [Indexed: 11/08/2022]
|
44
|
Abstract
Considerable progress has been made in understanding the relationship between local amino acid sequence and local protein structure. Recent highlights include numerous studies of the structures adopted by short peptides, new approaches to correlating sequence patterns with structure patterns, and folding simulations using simple potentials.
Affiliation(s)
- C Bystroff
- Department of Biochemistry, University of Washington, Seattle 98195, USA.
|
45
|
Protein structure prediction: playing the fold. Trends Biochem Sci 1996. [DOI: 10.1016/s0968-0004(96)20018-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
|
46
|
Abstract
The capabilities of current protein structure prediction methods have been assessed from the outcome of a set of blind tests. In comparative modeling, many of the numerical methods did not perform as well as expected, although the resulting structures are still of great practical use. The new methods of fold identification ('threading') were partially successful, and show considerable promise for the future. Except for secondary structure data, results from traditional ab initio methods were poor. A second blind prediction experiment is underway, and progress in all areas is expected.
Affiliation(s)
- J Moult
- Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville 20850, USA.
|
47
|
Abstract
Every sequence comparison method requires a set of scores. For aligning protein sequences, substitution scores are based on models of amino acid conservation and properties, and matrices of these scores have substantially improved in recent years. Position-specific scoring matrices provide representations of sequence families that are capable of detecting subtle similarities. Comprehensive evaluations can effectively guide the choice of scores for sequence alignment and searching applications, including those that aid in the prediction of protein structures.
Affiliation(s)
- S Henikoff
- Howard Hughes Medical Institute, Basic Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98104, USA.
|
48
|
Affiliation(s)
- B Rost
- Protein Design Group, European Molecular Biology Laboratory, Heidelberg, Germany
|
49
|
Hubbard T, Tramontano A. Update on protein structure prediction: results of the 1995 IRBM workshop. FOLDING & DESIGN 1996; 1:R55-63. [PMID: 9079378 DOI: 10.1016/s1359-0278(96)00028-4] [Citation(s) in RCA: 25] [Impact Index Per Article: 0.9] [Indexed: 02/04/2023]
Abstract
Computational tools for protein structure prediction are of great interest to molecular, structural and theoretical biologists due to a rapidly increasing number of protein sequences with no known structure. In October 1995, a workshop was held at IRBM to predict as much as possible about a number of proteins of biological interest using ab initio prediction or fold recognition methods. 112 protein sequences were collected via an open invitation for target submissions. 17 were selected for prediction during the workshop and for 11 of these a prediction of some reliability could be made. We believe that this was a worthwhile experiment showing that the use of a range of independent prediction methods and thorough use of existing databases can lead to credible and useful ab initio structure predictions.
Affiliation(s)
- T Hubbard
- Istituto di Ricerche di Biologia Molecolare (IRBM), Pomezia, Roma, Italy.
|