1
Yan K, Wen J, Liu JX, Xu Y, Liu B. Protein Fold Recognition by Combining Support Vector Machines and Pairwise Sequence Similarity Scores. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2008-2016. [PMID: 31940548 DOI: 10.1109/tcbb.2020.2966450] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 06/10/2023]
Abstract
Protein fold recognition is one of the most essential steps in protein structure prediction; it aims to classify proteins into known protein folds. There are two main computational approaches: template-based methods, which rely on the alignment scores between query-template protein pairs, and machine learning methods, which rely on a feature representation and a classifier. Each approach has its own advantages and disadvantages. Can we combine them to establish more accurate predictors for protein fold recognition? In this study, we made an initial attempt and proposed two novel algorithms: TSVM-fold and ESVM-fold. TSVM-fold is based on Support Vector Machines (SVMs) and uses a set of pairwise sequence similarity scores generated by three complementary template-based methods: HHblits, SPARKS-X, and DeepFR. These scores measure the global relationships between query sequences and templates. These comprehensive features of the sequences were fed into the SVMs for prediction. TSVM-fold was then further combined with the HHblits algorithm to improve its generalization ability; the combined method is called ESVM-fold. Experimental results on two rigorous benchmark datasets (LE and YK datasets) showed that the proposed methods outperform some state-of-the-art methods, indicating that TSVM-fold and ESVM-fold are efficient predictors for protein fold recognition.
2
Orlando G, Raimondi D, Vranken WF. Observation selection bias in contact prediction and its implications for structural bioinformatics. Sci Rep 2016; 6:36679. [PMID: 27857150 PMCID: PMC5114557 DOI: 10.1038/srep36679] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 07/27/2016] [Accepted: 10/18/2016] [Indexed: 01/14/2023]
Abstract
Next Generation Sequencing is dramatically increasing the number of known protein sequences, with related experimentally determined protein structures lagging behind. Structural bioinformatics is attempting to close this gap by developing approaches that predict structure-level characteristics for uncharacterized protein sequences, with most of the developed methods relying heavily on evolutionary information collected from homologous sequences. Here we show that there is a substantial observational selection bias in this approach: the predictions are validated on proteins with known structures from the PDB, but exactly for those proteins significantly more homologs are available compared to less studied sequences randomly extracted from Uniprot. Structural bioinformatics methods that were developed this way are thus likely to have over-estimated performances; we demonstrate this for two contact prediction methods, where performances drop up to 60% when taking into account a more realistic amount of evolutionary information. We provide a bias-free dataset for the validation for contact prediction methods called NOUMENON.
Affiliation(s)
- G Orlando
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, Belgium
- Structural Biology Research Center, VIB, 1050 Brussels, Belgium
- D Raimondi
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, Belgium
- Structural Biology Research Center, VIB, 1050 Brussels, Belgium
- W F Vranken
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, Belgium
- Structural Biology Research Center, VIB, 1050 Brussels, Belgium
3
Making sense of genomes of parasitic worms: Tackling bioinformatic challenges. Biotechnol Adv 2016; 34:663-686. [DOI: 10.1016/j.biotechadv.2016.03.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 11/05/2015] [Revised: 02/25/2016] [Accepted: 03/01/2016] [Indexed: 01/25/2023]
4
Cai XH, Jaroszewski L, Wooley J, Godzik A. Internal organization of large protein families: relationship between the sequence, structure, and function-based clustering. Proteins 2011; 79:2389-402. [PMID: 21671455 PMCID: PMC3132221 DOI: 10.1002/prot.23049] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 10/26/2010] [Revised: 02/12/2011] [Accepted: 03/13/2011] [Indexed: 12/14/2022]
Abstract
The protein universe can be organized in families that group proteins sharing common ancestry. Such families display variable levels of structural and functional divergence, from homogenous families, where all members have the same function and very similar structure, to very divergent families, where large variations in function and structure are observed. For practical purposes of structure and function prediction, it would be beneficial to identify sub-groups of proteins with highly similar structures (iso-structural) and/or functions (iso-functional) within divergent protein families. We compared three algorithms in their ability to cluster large protein families and discuss whether any of these methods could reliably identify such iso-structural or iso-functional groups. We show that clustering using profile-sequence and profile-profile comparison methods closely reproduces clusters based on similarities between 3D structures or clusters of proteins with similar biological functions. In contrast, the still commonly used sequence-based methods with fixed thresholds result in vast overestimates of structural and functional diversity in protein families. As a result, these methods also overestimate the number of protein structures that have to be determined to fully characterize structural space of such families. The fact that one can build reliable models based on apparently distantly related templates is crucial for extracting maximal amount of information from new sequencing projects.
Affiliation(s)
- Xiao-hui Cai
- Joint Center for Structural Genomics, Bioinformatics Core, Center for Research in Biological Systems, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093-0446, USA
- Lukasz Jaroszewski
- Joint Center for Structural Genomics, Bioinformatics Core, Sanford-Burnham Medical Research Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA
- Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA
- John Wooley
- Joint Center for Structural Genomics, Bioinformatics Core, Center for Research in Biological Systems, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093-0446, USA
- Adam Godzik
- Joint Center for Structural Genomics, Bioinformatics Core, Center for Research in Biological Systems, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093-0446, USA
- Joint Center for Structural Genomics, Bioinformatics Core, Sanford-Burnham Medical Research Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA
- Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA
5
Day R, Qu X, Swanson R, Bohannan Z, Bliss R, Tsai J. Relative Packing Groups in Template-Based Structure Prediction: Cooperative Effects of True Positive Constraints. J Comput Biol 2011; 18:17-26. [DOI: 10.1089/cmb.2010.0078] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Affiliation(s)
- Ryan Day
- Chemistry Department, University of the Pacific, Stockton, California
- Rosemarie Swanson
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, Texas
- Zach Bohannan
- Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California
- Robert Bliss
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, Texas
- Jerry Tsai
- Chemistry Department, University of the Pacific, Stockton, California
6
Bernsel A, Viklund H, Elofsson A. Remote homology detection of integral membrane proteins using conserved sequence features. Proteins 2008; 71:1387-99. [PMID: 18076048 DOI: 10.1002/prot.21825] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/05/2022]
Abstract
Compared with globular proteins, transmembrane proteins are surrounded by a more intricate environment and, consequently, amino acid composition varies between the different compartments. Existing algorithms for homology detection are generally developed with globular proteins in mind and may not be optimal to detect distant homology between transmembrane proteins. Here, we introduce a new profile-profile based alignment method for remote homology detection of transmembrane proteins in a hidden Markov model framework that takes advantage of the sequence constraints placed by the hydrophobic interior of the membrane. We expect that, for distant membrane protein homologs, even if the sequences have diverged too far to be recognized, the hydrophobicity pattern and the transmembrane topology are better conserved. By using this information in parallel with sequence information, we show that both sensitivity and specificity can be substantially improved for remote homology detection in two independent test sets. In addition, we show that alignment quality can be improved for the most distant homologs in a public dataset of membrane protein structures. Applying the method to the Pfam domain database, we are able to suggest new putative evolutionary relationships for a few relatively uncharacterized protein domain families, of which several are confirmed by other methods. The method is called Searcher for Homology Relationships of Integral Membrane Proteins (SHRIMP) and is available for download at http://www.sbc.su.se/shrimp/.
Affiliation(s)
- Andreas Bernsel
- Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden
7
Cheng J. A multi-template combination algorithm for protein comparative modeling. BMC Structural Biology 2008; 8:18. [PMID: 18366648 PMCID: PMC2311309 DOI: 10.1186/1472-6807-8-18] [Citation(s) in RCA: 74] [Impact Index Per Article: 4.4] [Received: 01/08/2008] [Accepted: 03/17/2008] [Indexed: 11/26/2022]
Abstract
BACKGROUND Multiple protein templates are commonly used in manual protein structure prediction. However, few automated algorithms of selecting and combining multiple templates are available. RESULTS Here we develop an effective multi-template combination algorithm for protein comparative modeling. The algorithm selects templates according to the similarity significance of the alignments between template and target proteins. It combines the whole template-target alignments whose similarity significance score is close to that of the top template-target alignment within a threshold, whereas it only takes alignment fragments from a less similar template-target alignment that align with a sizable uncovered region of the target. We compare the algorithm with the traditional method of using a single top template on the 45 comparative modeling targets (i.e. easy template-based modeling targets) used in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7). The multi-template combination algorithm improves the GDT-TS scores of predicted models by 6.8% on average. The statistical analysis shows that the improvement is significant (p-value < 10^-4). Compared with the ideal approach that always uses the best template, the multi-template approach yields only slightly better performance. During the CASP7 experiment, the preliminary implementation of the multi-template combination algorithm (FOLDpro) was ranked second among 67 servers in the category of high-accuracy structure prediction in terms of GDT-TS measure. CONCLUSION We have developed a novel multi-template algorithm to improve protein comparative modeling.
Affiliation(s)
- Jianlin Cheng
- Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO 65211-2060, USA.
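The selection rule in the abstract above (whole alignments close to the top score, fragments only where they cover a sizable uncovered region) can be sketched roughly as follows. This is an illustrative reading, not the published implementation: the `Alignment` class, the `combine_templates` function, the "higher score is more significant" scale, and the two thresholds are all assumptions, and the toy alignments are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Alignment:
    template: str
    score: float  # assumed scale: higher means more significant
    start: int    # first aligned target residue, inclusive
    end: int      # last aligned target residue, inclusive


def combine_templates(alignments: List[Alignment],
                      score_window: float,
                      min_new_residues: int) -> List[Tuple[str, str, List[int]]]:
    """Pick whole alignments near the top score, then gap-filling fragments."""
    ranked = sorted(alignments, key=lambda a: a.score, reverse=True)
    top_score = ranked[0].score
    covered = set()
    chosen = []
    for aln in ranked:
        residues = set(range(aln.start, aln.end + 1))
        if top_score - aln.score <= score_window:
            # Close to the top alignment: use the whole alignment.
            chosen.append((aln.template, "whole", sorted(residues)))
            covered |= residues
        else:
            # Less similar: keep only the part that covers new target residues,
            # and only if that part is large enough (hypothetical threshold).
            new = residues - covered
            if len(new) >= min_new_residues:
                chosen.append((aln.template, "fragment", sorted(new)))
                covered |= new
    return chosen


# Invented toy alignments for a 30-residue target.
toy = [
    Alignment("A", 100.0, 1, 20),
    Alignment("B", 95.0, 3, 22),
    Alignment("C", 60.0, 18, 30),
    Alignment("D", 55.0, 5, 15),
]
selection = combine_templates(toy, score_window=10.0, min_new_residues=5)
```

In this toy case templates A and B are used whole, C contributes only residues 23 to 30, and D is dropped because it covers nothing new.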
8
Lee M, Jeong CS, Kim D. Predicting and improving the protein sequence alignment quality by support vector regression. BMC Bioinformatics 2007; 8:471. [PMID: 18053160 PMCID: PMC2222655 DOI: 10.1186/1471-2105-8-471] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 04/25/2007] [Accepted: 12/03/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It is known that alignment accuracy can vary significantly depending on the choice of alignment parameters such as the gap opening penalty and the gap extension penalty. Because the accuracy of a sequence alignment is typically measured by comparing it with the corresponding structure alignment, there is no good way of evaluating alignment accuracy without knowing the structure of the query protein, which is not available at the time of structure prediction. Moreover, there is no universal alignment parameter option that always yields the optimal alignment. RESULTS In this work, we develop a method to predict the quality of the alignment between a query and a template. We train support vector regression (SVR) models to predict the MaxSub score as a measure of alignment quality. The alignment between a query protein and a template of length n is transformed into an (n + 1)-dimensional feature vector, which is then used as input to the trained SVR model to predict the alignment quality. The performance of our method is evaluated by various measures, including the Pearson correlation coefficient between the observed and predicted MaxSub scores. The results show a high correlation coefficient of 0.945. For each query-template pair, 48 alignments are generated by changing alignment options. Trained SVR models are then applied to predict the MaxSub scores of those alignments and to select the best alignment option specifically for that query-template pair. This adaptive selection procedure results in a 7.4% improvement in MaxSub scores compared with using the single best parameter option for all query-template pairs.
CONCLUSION The present work demonstrates that alignment quality can be predicted with reasonable accuracy. Our method is useful not only for selecting the optimal alignment parameters for a chosen template based on predicted alignment quality, but also for filtering out problematic templates that are not suitable for structure prediction due to poor alignment accuracy. It is implemented as part of FORECAST, a server for fold recognition, and is freely available on the web at http://pbil.kaist.ac.kr/forecast.
Affiliation(s)
- Minho Lee
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
- Chan-seok Jeong
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
- Dongsup Kim
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
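The difference between one fixed parameter option and the per-pair adaptive selection described in the abstract above can be illustrated with a small sketch. It assumes predicted quality scores are already available from some regression model; the function names, the option labels `o1` to `o3`, and all scores are invented for illustration, and the comparison here is made on predicted scores only, whereas the abstract reports the gain on MaxSub scores.

```python
from typing import Dict

# pair identifier -> alignment option -> predicted quality score (hypothetical)
Predictions = Dict[str, Dict[str, float]]


def best_single_option(predicted: Predictions) -> str:
    """Option with the highest mean predicted score over all pairs."""
    options = sorted(next(iter(predicted.values())))

    def mean_score(option: str) -> float:
        return sum(scores[option] for scores in predicted.values()) / len(predicted)

    return max(options, key=mean_score)


def adaptive_options(predicted: Predictions) -> Dict[str, str]:
    """Option with the highest predicted score, chosen separately per pair."""
    return {pair: max(sorted(scores), key=scores.get)
            for pair, scores in predicted.items()}


# Invented predictions for three query-template pairs and three options.
TOY: Predictions = {
    "p1": {"o1": 0.5, "o2": 0.7, "o3": 0.2},
    "p2": {"o1": 0.6, "o2": 0.4, "o3": 0.3},
    "p3": {"o1": 0.1, "o2": 0.3, "o3": 0.8},
}

fixed = best_single_option(TOY)
chosen = adaptive_options(TOY)
fixed_mean = sum(scores[fixed] for scores in TOY.values()) / len(TOY)
adaptive_mean = sum(TOY[pair][option] for pair, option in chosen.items()) / len(TOY)
```

On these toy numbers the adaptive choice differs from the fixed option for two of the three pairs, which is the situation in which per-pair selection can help.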
9
Almqvist J, Huang Y, Laaksonen A, Wang DN, Hovmöller S. Docking and homology modeling explain inhibition of the human vesicular glutamate transporters. Protein Sci 2007; 16:1819-29. [PMID: 17660252 PMCID: PMC2206968 DOI: 10.1110/ps.072944707] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.2] [Indexed: 10/23/2022]
Abstract
As membrane transporter proteins, VGLUT1-3 mediate the uptake of glutamate into synaptic vesicles at presynaptic nerve terminals of excitatory neural cells. This function is crucial for exocytosis and the role of glutamate as the major excitatory neurotransmitter in the central nervous system. The three transporters, sharing 76% amino acid sequence identity in humans, are highly homologous but differ in regional expression in the brain. Although little is known regarding their three-dimensional structures, hydropathy analysis on these proteins predicts 12 transmembrane segments connected by loops, a topology similar to other members in the major facilitator superfamily, where VGLUT1-3 have been phylogenetically classified. In this work, we present a three-dimensional model for the human VGLUT1 protein based on its distant bacterial homolog in the same superfamily, the glycerol-3-phosphate transporter from Escherichia coli. This structural model, stable during molecular dynamics simulations in phospholipid bilayers solvated by water, reveals amino acid residues that face its pore and are likely to affect substrate translocation. Docking of VGLUT1 substrates to this pore localizes two different binding sites, to which inhibitors also bind with an overall trend in binding affinity that is in agreement with previously published experimental data.
Affiliation(s)
- Jonas Almqvist
- Division of Structural Chemistry, Arrhenius Laboratory, Stockholm University, S-10691 Stockholm, Sweden.
10
Söding J, Remmert M, Biegert A, Lupas AN. HHsenser: exhaustive transitive profile search using HMM-HMM comparison. Nucleic Acids Res 2006; 34:W374-8. [PMID: 16845029 PMCID: PMC1538784 DOI: 10.1093/nar/gkl195] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.3] [Indexed: 12/03/2022]
Abstract
HHsenser is the first server to offer exhaustive intermediate profile searches, which it combines with pairwise comparison of hidden Markov models. Starting from a single protein sequence or a multiple alignment, it can iteratively explore whole superfamilies, producing few or no false positives. The output is a multiple alignment of all detected homologs. HHsenser's sensitivity should make it a useful tool for evolutionary studies. It may also aid applications that rely on diverse multiple sequence alignments as input, such as homology-based structure and function prediction, or the determination of functional residues by conservation scoring and functional subtyping. HHsenser can be accessed at . It has also been integrated into our structure and function prediction server HHpred () to improve predictions for near-singleton sequences.
Affiliation(s)
- Johannes Söding
- Department of Protein Evolution, Max-Planck-Institute for Developmental Biology, Spemannstrasse 35, 72076 Tübingen, Germany.
11
Ohlson T, Aggarwal V, Elofsson A, MacCallum RM. Improved alignment quality by combining evolutionary information, predicted secondary structure and self-organizing maps. BMC Bioinformatics 2006; 7:357. [PMID: 16869963 PMCID: PMC1562450 DOI: 10.1186/1471-2105-7-357] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 04/04/2006] [Accepted: 07/25/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Protein sequence alignment is one of the basic tools in bioinformatics. Correct alignments are required for a range of tasks including the derivation of phylogenetic trees and protein structure prediction. Numerous studies have shown that the incorporation of predicted secondary structure information into alignment algorithms improves their performance. Secondary structure predictors have to be trained on a set of somewhat arbitrarily defined states (e.g. helix, strand, coil), and it has been shown that the choice of these states has some effect on alignment quality. However, it is not unlikely that prediction of other structural features also could provide an improvement. In this study we use an unsupervised clustering method, the self-organizing map, to assign sequence profile windows to "structural states" and assess their use in sequence alignment. RESULTS The addition of self-organizing map locations as inputs to a profile-profile scoring function improves the alignment quality of distantly related proteins slightly. The improvement is slightly smaller than that gained from the inclusion of predicted secondary structure. However, the information seems to be complementary as the two prediction schemes can be combined to improve the alignment quality by a further small but significant amount. CONCLUSION It has been observed in many studies that predicted secondary structure significantly improves the alignments. Here we have shown that the addition of self-organizing map locations can further improve the alignments as the self-organizing map locations seem to contain some information that is not captured by the predicted secondary structure.
Affiliation(s)
- Tomas Ohlson
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
- Varun Aggarwal
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Arne Elofsson
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
- Center for Biomembrane Research, Stockholm University, SE-106 91 Stockholm, Sweden
- Robert M MacCallum
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
- Division of Cell and Molecular Biology, Imperial College London, London, UK
12
Cheng J, Baldi P. A machine learning information retrieval approach to protein fold recognition. Bioinformatics 2006; 22:1456-63. [PMID: 16547073 DOI: 10.1093/bioinformatics/btl102] [Citation(s) in RCA: 136] [Impact Index Per Article: 7.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION Recognizing proteins that have similar tertiary structure is the key step of template-based protein structure prediction methods. Traditionally, a variety of alignment methods are used to identify similar folds, based on sequence similarity and sequence-structure compatibility. Although these methods are complementary, their integration has not been thoroughly exploited. Statistical machine learning methods provide tools for integrating multiple features, but so far these methods have been used primarily for protein and fold classification, rather than addressing the retrieval problem of fold recognition: finding a proper template for a given query protein. RESULTS Here we present a two-stage machine learning, information retrieval, approach to fold recognition. First, we use alignment methods to derive pairwise similarity features for query-template protein pairs. We also use global profile-profile alignments in combination with predicted secondary structure, relative solvent accessibility, contact map and beta-strand pairing to extract pairwise structural compatibility features. Second, we apply support vector machines to these features to predict the structural relevance (i.e. in the same fold or not) of the query-template pairs. For each query, the continuous relevance scores are used to rank the templates. The FOLDpro approach is modular, scalable and effective. Compared with 11 other fold recognition methods, FOLDpro yields the best results in almost all standard categories on a comprehensive benchmark dataset. Using predictions of the top-ranked template, the sensitivity is approximately 85, 56, and 27% at the family, superfamily and fold levels respectively. Using the 5 top-ranked templates, the sensitivity increases to 90, 70, and 48%.
Affiliation(s)
- Jianlin Cheng
- Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California Irvine, CA, USA
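The ranking step and the top-ranked versus top-5 evaluation mentioned in the abstract above can be sketched as follows. This is a minimal illustration that assumes relevance scores have already been produced by some classifier, and that sensitivity is taken as the fraction of queries with at least one same-fold template among the top k; the function names, template names, fold labels and scores are all invented.

```python
from typing import Dict, List, Tuple


def rank_templates(scores: Dict[str, float]) -> List[str]:
    """Templates ordered by decreasing relevance score (ties broken by name)."""
    return sorted(scores, key=lambda template: (-scores[template], template))


def top_k_sensitivity(queries: List[Tuple[str, Dict[str, float]]],
                      template_fold: Dict[str, str],
                      k: int) -> float:
    """Fraction of queries with a same-fold template among the k best ranked."""
    hits = 0
    for query_fold, scores in queries:
        top = rank_templates(scores)[:k]
        if any(template_fold[template] == query_fold for template in top):
            hits += 1
    return hits / len(queries)


# Invented fold labels and relevance scores for four queries.
TEMPLATE_FOLD = {"t1": "a", "t2": "b", "t3": "c"}
QUERIES = [
    ("a", {"t1": 0.9, "t2": 0.1, "t3": 0.2}),
    ("b", {"t1": 0.8, "t2": 0.6, "t3": 0.1}),
    ("c", {"t1": 0.5, "t2": 0.4, "t3": 0.3}),
    ("a", {"t1": 0.7, "t2": 0.2, "t3": 0.1}),
]

top1 = top_k_sensitivity(QUERIES, TEMPLATE_FOLD, k=1)
top2 = top_k_sensitivity(QUERIES, TEMPLATE_FOLD, k=2)
```

Widening the cut-off from one template to two recovers the second query in this toy set, which mirrors why the reported sensitivities rise when five templates are considered instead of one.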
13
Sandhya S, Chakrabarti S, Abhinandan KR, Sowdhamini R, Srinivasan N. Assessment of a rigorous transitive profile based search method to detect remotely similar proteins. J Biomol Struct Dyn 2005; 23:283-98. [PMID: 16218755 DOI: 10.1080/07391102.2005.10507066] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 10/28/2022]
Abstract
Profile-based sequence search procedures are commonly employed to detect remote relationships between proteins. We provide an assessment of a Cascade PSI-BLAST protocol that rigorously employs intermediate sequences in detecting remote relationships between proteins. In this approach, a 'first generation' PSI-BLAST search, which itself involves multiple rounds of iteration, queries a database to detect an initial set of homologues for a protein. We then propagate a 'second generation' search in the database, involving multiple runs of PSI-BLAST that use each of the homologues identified in the previous generation as queries to recognize homologues not detected earlier. This non-directed search process can be viewed as an iteration of iterations that is continued to detect further homologues until no new hits are detectable. We present an assessment of the coverage of this 'cascaded' intermediate sequence search on diverse folds and find that searches for up to three generations detect most known homologues of a query. Our assessments show that this approach appears to perform better than the traditional use of PSI-BLAST by detecting 15% more relationships within a family and 35% more relationships within a superfamily. We show that such searches can be performed on generalized sequence databases and non-trivial relationships between proteins can be detected effectively. Such a propagation of searches maximizes the chances of detecting distant homologies by effectively scanning protein "fold space".
Affiliation(s)
- S Sandhya
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
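The generation-by-generation search described in the abstract above can be read as a breadth-first traversal over homology hits. The sketch below is only an illustration of that idea: `search` is a hypothetical callable standing in for one complete PSI-BLAST run (no real BLAST interface is used), and the toy hit table is invented rather than taken from the paper.

```python
from typing import Callable, Dict, Set


def cascade_search(seed: str,
                   search: Callable[[str], Set[str]],
                   max_generations: int = 3) -> Dict[str, int]:
    """Map each detected homologue to the generation in which it was found."""
    found = {seed: 0}
    frontier = {seed}
    for generation in range(1, max_generations + 1):
        new_hits = set()
        # Every hit from the previous generation becomes a query in this one.
        for query in sorted(frontier):
            for hit in search(query):
                if hit not in found:
                    found[hit] = generation
                    new_hits.add(hit)
        if not new_hits:
            break  # no new homologues detectable, so the cascade stops
        frontier = new_hits
    del found[seed]
    return found


# Invented hit table: which sequences a single search from each query returns.
TOY_HITS = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D"},
}


def toy_search(query: str) -> Set[str]:
    return TOY_HITS.get(query, set())


result = cascade_search("A", toy_search)
```

With this toy table, a single search from A finds only B and C, while D and E are reached through intermediate sequences in the second and third generations.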
14
Ohlson T, Elofsson A. ProfNet, a method to derive profile-profile alignment scoring functions that improves the alignments of distantly related proteins. BMC Bioinformatics 2005; 6:253. [PMID: 16225676 PMCID: PMC1274300 DOI: 10.1186/1471-2105-6-253] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 05/09/2005] [Accepted: 10/14/2005] [Indexed: 11/10/2022]
Abstract
Background Profile-profile methods have been used for some years now to detect and align homologous proteins. The best such methods use information from the background distribution of amino acids and substitution tables either when constructing the profiles or in the scoring. This makes the methods dependent on the quality and choice of substitution table as well as the construction of the profiles. Here, we introduce a novel method called ProfNet that is used to derive a profile-profile scoring function. The method optimizes the discrimination between scores of related and unrelated residues and it is fast and straightforward to use. This new method derives a scoring function that is mainly dependent on the actual alignment of residues from a training set, and it does not use any additional information about the background distribution. Results It is shown that ProfNet improves the discrimination of related and unrelated residues. Further it can be used to improve the alignment of distantly related proteins. Conclusion The best performance is obtained using superfamily related proteins in the training of ProfNet, and a classifier that is related to the distance between the structurally aligned residues. The main difference between the new scoring function and a traditional profile-profile scoring function is that conserved residues on average score higher with the new function.
Affiliation(s)
- Tomas Ohlson
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
- Arne Elofsson
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
15
Pearson WR, Sierk ML. The limits of protein sequence comparison? Curr Opin Struct Biol 2005; 15:254-60. [PMID: 15919194 PMCID: PMC2845305 DOI: 10.1016/j.sbi.2005.05.005] [Citation(s) in RCA: 58] [Impact Index Per Article: 2.9] [Received: 03/04/2005] [Revised: 04/30/2005] [Accepted: 05/05/2005] [Indexed: 11/29/2022]
Abstract
Modern sequence alignment algorithms are used routinely to identify homologous proteins, proteins that share a common ancestor. Homologous proteins always share similar structures and often have similar functions. Over the past 20 years, sequence comparison has become both more sensitive, largely because of profile-based methods, and more reliable, because of more accurate statistical estimates. As sequence and structure databases become larger, and comparison methods become more powerful, reliable statistical estimates will become even more important for distinguishing similarities that are due to homology from those that are due to analogy (convergence). The newest sequence alignment methods are more sensitive than older methods, but more accurate statistical estimates are needed for their full power to be realized.
Affiliation(s)
- William R Pearson
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA.
16
Moult J. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 2005; 15:285-9. [PMID: 15939584 DOI: 10.1016/j.sbi.2005.05.011] [Citation(s) in RCA: 310] [Impact Index Per Article: 15.5] [Received: 04/25/2005] [Revised: 04/29/2005] [Accepted: 05/09/2005] [Indexed: 10/25/2022]
Abstract
For the past ten years, CASP (Critical Assessment of Structure Prediction) has monitored the state of the art in modeling protein structure from sequence. During this period, there has been substantial progress in both comparative modeling of structure (using information from an evolutionarily related structural template) and template-free modeling. The quality of comparative models depends on the closeness of the evolutionary relationship on which they are based. Template-free modeling, although still very approximate, now produces topologically near correct models for some small proteins. Current major challenges are refining comparative models so that they match experimental accuracy, obtaining accurate sequence alignments for models based on remote evolutionary relationships, and extending template-free modeling methods so that they produce more accurate models, handle parts of comparative models not available from a template and deal with larger structures.
Affiliation(s)
- John Moult
- Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, 9600 Gudelsky Drive, Rockville, MD 20850, USA
|
17
|
Fariselli P, Finelli M, Rossi I, Amico M, Zauli A, Martelli PL, Casadio R. TRAMPLE: the transmembrane protein labelling environment. Nucleic Acids Res 2005; 33:W198-201. [PMID: 15980454 PMCID: PMC1160201 DOI: 10.1093/nar/gki440] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022] Open
Abstract
TRAMPLE is a web application server dedicated to the detection and the annotation of transmembrane protein sequences. TRAMPLE includes different state-of-the-art algorithms for the prediction of signal peptides, transmembrane segments (both beta-strands and alpha-helices), secondary structure and fast fold recognition. TRAMPLE also includes a complete content management system to manage the results of the predictions. Each user of the server has his/her own workplace, where the data can be stored, organized, accessed and annotated with documents through a simple web-based interface. In this manner, TRAMPLE significantly improves usability with respect to other more traditional web servers.
Affiliation(s)
- Ivan Rossi
- BioDec S.r.l., Almacube, via Fanin 48, I-40127 Bologna, Italy
- Mauro Amico
- BioDec S.r.l., Almacube, via Fanin 48, I-40127 Bologna, Italy
- Andrea Zauli
- BioDec S.r.l., Almacube, via Fanin 48, I-40127 Bologna, Italy
- Rita Casadio
- To whom correspondence should be addressed. Tel: +39 051 2094005; Fax: +39 051 242576;
|
18
|
Abstract
Modeling a protein structure based on a homologous structure is a standard method in structural biology today. In this process an alignment of a target protein sequence onto the structure of a template(s) is used as input to a program that constructs a 3D model. It has been shown that the most important factors in this process are the correctness of the alignment and the choice of the best template structure(s), while it is generally believed that there are no major differences between the best modeling programs. Therefore, a large number of studies to benchmark the alignment qualities and the selection process have been performed. However, to our knowledge no large-scale benchmark has been performed to evaluate the programs used to transform the alignment to a 3D model. In this study, a benchmark of six different homology modeling programs (Modeller, SegMod/ENCAD, SWISS-MODEL, 3D-JIGSAW, nest, and Builder) is presented. The performance of these programs is evaluated using physicochemical correctness and structural similarity to the correct structure. From our analysis it can be concluded that no single modeling program outperforms the others in all tests. However, it is quite clear that three modeling programs, Modeller, nest, and SegMod/ENCAD, perform better than the others. Interestingly, the fastest and oldest modeling program, SegMod/ENCAD, performs very well, although it was written more than 10 years ago and has not undergone any development since. It can also be observed that none of the homology modeling programs builds side chains as well as a specialized program (SCWRL), and therefore there should be room for improvement.
Affiliation(s)
- Björn Wallner
- Stockholm Bioinformatics Center, Albanova University Center, Stockholm University, Stockholm, Sweden.
|
19
|
Ekman D, Björklund AK, Frey-Skött J, Elofsson A. Multi-domain Proteins in the Three Kingdoms of Life: Orphan Domains and Other Unassigned Regions. J Mol Biol 2005; 348:231-43. [PMID: 15808866 DOI: 10.1016/j.jmb.2005.02.007] [Citation(s) in RCA: 169] [Impact Index Per Article: 8.5] [Received: 09/27/2004] [Revised: 01/31/2005] [Accepted: 02/02/2005] [Indexed: 11/17/2022]
Abstract
Comparative studies of the proteomes from different organisms have provided valuable information about protein domain distribution in the kingdoms of life. Earlier studies have been limited by the fact that only about 50% of the proteomes could be matched to a domain. Here, we have extended these studies by including less well-defined domain definitions, Pfam-B and clustered domains, MAS, in addition to Pfam-A and SCOP domains. It was found that a significant fraction of these domain families are homologous to Pfam-A or SCOP domains. Further, we show that all regions that do not match a Pfam-A or SCOP domain contain a significantly higher fraction of disordered structure. These unstructured regions may be contained within orphan domains or function as linkers between structured domains. Using several different definitions we have re-estimated the number of multi-domain proteins in different organisms and found that several methods all predict that eukaryotes have approximately 65% multi-domain proteins, while prokaryotes have approximately 40% multi-domain proteins. However, these numbers are strongly dependent on the exact choice of cut-off for domains in unassigned regions. In conclusion, all eukaryotes have similar fractions of multi-domain proteins and disorder, whereas a high fraction of repeating domains is found only in multicellular eukaryotes. This implies a role for repeats in cell-cell contacts while the other two features are important for intracellular functions.
Affiliation(s)
- Diana Ekman
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden
|
20
|
Han S, Lee BC, Yu ST, Jeong CS, Lee S, Kim D. Fold recognition by combining profile-profile alignment and support vector machine. Bioinformatics 2005; 21:2667-73. [PMID: 15769835 DOI: 10.1093/bioinformatics/bti384] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.4] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Currently, the most accurate fold-recognition method is to perform profile-profile alignments and estimate the statistical significance of those alignments by calculating a Z-score or E-value. Although this scheme is reliable in recognizing relatively close homologs related at the family level, it has difficulty in finding the remote homologs that are related at the superfamily or fold level. RESULTS: In this paper, we present an alternative method to estimate the significance of the alignments. The alignment between a query protein and a template of length n in the fold library is transformed into a feature vector of length n + 1, which is then evaluated by a support vector machine (SVM). The output from the SVM is converted to a posterior probability that a query sequence is related to a template, given the SVM output. Results show that the new method performs significantly better than PSI-BLAST and profile-profile alignment with the Z-score scheme. While PSI-BLAST and the Z-score scheme detect 16 and 20% of superfamily-related proteins, respectively, at 90% specificity, the new method detects 46% of these proteins, resulting in a more than 2-fold increase in sensitivity. More significantly, at the fold level, the new method can detect 14% of remotely related proteins at 90% specificity, a remarkable result considering the fact that the other methods can detect almost none at the same level of specificity.
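The pipeline described in this abstract (alignment, feature vector of length n + 1, SVM output, posterior probability) can be illustrated with a minimal Python sketch. The abstract does not say what the n + 1 features are or how the SVM output is mapped to a probability, so both choices here are assumptions: the encoding (n per-position scores plus their sum) is hypothetical, and the mapping uses a Platt-style sigmoid with made-up parameters `a` and `b` that would normally be fitted on held-out data.

```python
import math


def alignment_to_features(position_scores):
    """Hypothetical encoding: one score per template position (n values)
    plus the total alignment score, giving a vector of length n + 1."""
    scores = list(position_scores)
    return scores + [sum(scores)]


def svm_output_to_posterior(decision_value, a=-1.0, b=0.0):
    """Platt-style sigmoid (an assumption, not confirmed by the abstract):
    P(related | f) = 1 / (1 + exp(a * f + b)).
    With a < 0, larger SVM outputs give higher probabilities."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


if __name__ == "__main__":
    features = alignment_to_features([0.5, -0.2, 1.0])  # template length n = 3
    print(len(features))                  # n + 1 entries
    print(svm_output_to_posterior(2.0))   # a probability between 0 and 1
```

In practice the feature vector would be passed to a trained SVM to obtain the decision value; the training step is omitted because the abstract gives no details about it.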
Affiliation(s)
- Sangjo Han
- Department of Biosystems, Korea Advanced Institute of Science and Technology, Daejeon, 305-701, Korea
|
21
|
Ohlson T, Wallner B, Elofsson A. Profile-profile methods provide improved fold-recognition: A study of different profile-profile alignment methods. Proteins 2004; 57:188-97. [PMID: 15326603 DOI: 10.1002/prot.20184] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.3] [Indexed: 11/05/2022]
Abstract
To improve the detection of related proteins, it is often useful to include evolutionary information for both the query and target proteins. One method to include this information is by the use of profile-profile alignments, where a profile from the query protein is compared with the profiles from the target proteins. Profile-profile alignments can be implemented in several fundamentally different ways. The similarity between two positions can be calculated using a dot-product, a probabilistic model, or an information theoretical measure. Here, we present a large-scale comparison of different profile-profile alignment methods. We show that the profile-profile methods perform at least 30% better than standard sequence-profile methods both in their ability to recognize superfamily-related proteins and in the quality of the obtained alignments. Although the performance of all methods is quite similar, profile-profile methods that use a probabilistic scoring function have an advantage as they can create good alignments and show a good fold recognition capacity using the same gap-penalties, while the other methods need to use different parameters to obtain comparable performances.
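As a rough illustration of how two profile positions can be compared, the Python sketch below scores a pair of profile columns with a dot product and with an information-theoretic measure. The abstract does not name the specific measures the study evaluated; the Jensen-Shannon divergence is used here as an assumed example, and the function names and toy two-letter columns are hypothetical (real profile columns would hold 20 amino-acid frequencies).

```python
import math


def dot_product_score(p, q):
    """Similarity of two profile columns as the dot product of their
    frequency vectors; higher means more similar."""
    return sum(pi * qi for pi, qi in zip(p, q))


def jensen_shannon_divergence(p, q):
    """Information-theoretic distance between two columns, in bits.
    0 for identical columns, 1 for columns with no residue in common."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    col_a = [0.7, 0.3]  # toy frequencies over a two-letter alphabet
    col_b = [0.2, 0.8]
    print(dot_product_score(col_a, col_b))
    print(jensen_shannon_divergence(col_a, col_b))
```

A divergence has to be converted to a similarity (for example, by subtracting it from a constant) before it can serve as a substitution score in dynamic-programming alignment; that step and the gap penalties are left out of this sketch.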
Affiliation(s)
- Tomas Ohlson
- Stockholm Bioinformatics Center, Stockholm University, Stockholm, Sweden
|