51
|
Qiu P, Cai XY, Ding W, Zhang Q, Norris ED, Greene JR. HCV genotyping using statistical classification approach. J Biomed Sci 2009; 16:62. [PMID: 19586537 PMCID: PMC2720937 DOI: 10.1186/1423-0127-16-62] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2009] [Accepted: 07/08/2009] [Indexed: 01/24/2023] Open
Abstract
The genotype of Hepatitis C Virus (HCV) strains is an important determinant of the severity and aggressiveness of liver infection as well as patient response to antiviral therapy. Fast and accurate determination of viral genotype could provide direction in the clinical management of patients with chronic HCV infections. Using publicly available HCV nucleotide sequences, we built a global Position Weight Matrix (PWM) for the HCV genome. Based on the PWM, a set of genotype specific nucleotide sequence "signatures" were selected from the 5' NCR, CORE, E1, and NS5B regions of the HCV genome. We evaluated the predictive power of these signatures for predicting the most common HCV genotypes and subtypes. We observed that nucleotide sequence signatures selected from NS5B and E1 regions generally demonstrated stronger discriminant power in differentiating major HCV genotypes and subtypes than that from 5' NCR and CORE regions. Two discriminant methods were used to build predictive models. Through 10 fold cross validation, over 99% prediction accuracy was achieved using both support vector machine (SVM) and random forest based classification methods in a dataset of 1134 sequences for NS5B and 947 sequences for E1. Prediction accuracy for each genotype is also reported.
Collapse
Affiliation(s)
- Ping Qiu
- Molecular Design and Informatics, Schering-Plough Research Institute, 2015 Galloping Hill Road, Kenilworth, NJ 07033, USA.
| | | | | | | | | | | |
Collapse
|
52
|
Du X, Cheng J, Song J. Identifying protein-protein interaction sites using covering algorithm. Int J Mol Sci 2009; 10:2190-2202. [PMID: 19564948 PMCID: PMC2695276 DOI: 10.3390/ijms10052190] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2009] [Revised: 04/30/2009] [Accepted: 05/13/2009] [Indexed: 12/03/2022] Open
Abstract
Identification of protein-protein interface residues is crucial for structural biology. This paper proposes a covering algorithm for predicting protein-protein interface residues with features including protein sequence profile and residue accessible area. This method adequately utilizes the characters of a covering algorithm which have simple, lower complexity and high accuracy for high dimension data. The covering algorithm can achieve a comparable performance (69.62%, Complete dataset; 60.86%, Trim dataset with overall accuracy) to a support vector machine and maximum entropy on our dataset, a correlation coefficient (CC) of 0.2893, 58.83% specificity, 56.12% sensitivity on the Complete dataset and 0.2144 (CC), 53.34% (specificity), 65.59% (sensitivity) on the Trim dataset in identifying interface residues by 5-fold cross-validation on 61 protein chains. This result indicates that the covering algorithm is a powerful and robust protein-protein interaction site prediction method that can guide biologists to make specific experiments on proteins. Examination of the predictions in the context of the 3-dimensional structures of proteins demonstrates the effectiveness of this method.
Collapse
Affiliation(s)
- Xiuquan Du
- The Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Anhui, China; E-Mails:
(J.-X.C.);
(J.S.)
| | - Jiaxing Cheng
- The Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Anhui, China; E-Mails:
(J.-X.C.);
(J.S.)
| | - Jie Song
- The Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Anhui, China; E-Mails:
(J.-X.C.);
(J.S.)
| |
Collapse
|
53
|
Ezkurdia I, Bartoli L, Fariselli P, Casadio R, Valencia A, Tress ML. Progress and challenges in predicting protein-protein interaction sites. Brief Bioinform 2009; 10:233-46. [PMID: 19346321 DOI: 10.1093/bib/bbp021] [Citation(s) in RCA: 113] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
The identification of protein-protein interaction sites is an essential intermediate step for mutant design and the prediction of protein networks. In recent years a significant number of methods have been developed to predict these interface residues and here we review the current status of the field. Progress in this area requires a clear view of the methodology applied, the data sets used for training and testing the systems, and the evaluation procedures. We have analysed the impact of a representative set of features and algorithms and highlighted the problems inherent in generating reliable protein data sets and in the posterior analysis of the results. Although it is clear that there have been some improvements in methods for predicting interacting sites, several major bottlenecks remain. Proteins in complexes are still under-represented in the structural databases and in particular many proteins involved in transient complexes are still to be crystallized. We provide suggestions for effective feature selection, and make it clear that community standards for testing, training and performance measures are necessary for progress in the field.
Collapse
Affiliation(s)
- Iakes Ezkurdia
- Centro Nacional de Biotechnolgia, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | | | | | | | | | | |
Collapse
|
54
|
Ward RM, Venner E, Daines B, Murray S, Erdin S, Kristensen DM, Lichtarge O. Evolutionary Trace Annotation Server: automated enzyme function prediction in protein structures using 3D templates. Bioinformatics 2009; 25:1426-7. [PMID: 19307237 PMCID: PMC2682511 DOI: 10.1093/bioinformatics/btp160] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
SUMMARY The Evolutionary Trace Annotation (ETA) Server predicts enzymatic activity. ETA starts with a structure of unknown function, such as those from structural genomics, and with no prior knowledge of its mechanism uses the phylogenetic Evolutionary Trace (ET) method to extract key functional residues and propose a function-associated 3D motif, called a 3D template. ETA then searches previously annotated structures for geometric template matches that suggest molecular and thus functional mimicry. In order to maximize the predictive value of these matches, ETA next applies distinctive specificity filters -- evolutionary similarity, function plurality and match reciprocity. In large scale controls on enzymes, prediction coverage is 43% but the positive predictive value rises to 92%, thus minimizing false annotations. Users may modify any search parameter, including the template. ETA thus expands the ET suite for protein structure annotation, and can contribute to the annotation efforts of metaservers. AVAILABILITY The ETA Server is a web application available at (http://mammoth.bcm.tmc.edu/eta/).
Collapse
Affiliation(s)
- R Matthew Ward
- Department of Molecular and Human Genetics, Program in Structural and Computational Biology and Molecular, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | | | | | | | | | | | | |
Collapse
|
55
|
Tuncbag N, Kar G, Keskin O, Gursoy A, Nussinov R. A survey of available tools and web servers for analysis of protein-protein interactions and interfaces. Brief Bioinform 2009; 10:217-32. [PMID: 19240123 DOI: 10.1093/bib/bbp001] [Citation(s) in RCA: 98] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023] Open
Abstract
The unanimous agreement that cellular processes are (largely) governed by interactions between proteins has led to enormous community efforts culminating in overwhelming information relating to these proteins; to the regulation of their interactions, to the way in which they interact and to the function which is determined by these interactions. These data have been organized in databases and servers. However, to make these really useful, it is essential not only to be aware of these, but in particular to have a working knowledge of which tools to use for a given problem; what are the tool advantages and drawbacks; and no less important how to combine these for a particular goal since usually it is not one tool, but some combination of tool-modules that is needed. This is the goal of this review.
Collapse
Affiliation(s)
- Nurcan Tuncbag
- Computational Sciences and Engineering Program at Koc University, Istanbul, Turkey
| | | | | | | | | |
Collapse
|
56
|
Prediction of protein-protein interaction sites in sequences and 3D structures by random forests. PLoS Comput Biol 2009; 5:e1000278. [PMID: 19180183 PMCID: PMC2621338 DOI: 10.1371/journal.pcbi.1000278] [Citation(s) in RCA: 111] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2008] [Accepted: 12/16/2008] [Indexed: 11/19/2022] Open
Abstract
Identifying interaction sites in proteins provides important clues to the function of a protein and is becoming increasingly relevant in topics such as systems biology and drug discovery. Although there are numerous papers on the prediction of interaction sites using information derived from structure, there are only a few case reports on the prediction of interaction residues based solely on protein sequence. Here, a sliding window approach is combined with the Random Forests method to predict protein interaction sites using (i) a combination of sequence- and structure-derived parameters and (ii) sequence information alone. For sequence-based prediction we achieved a precision of 84% with a 26% recall and an F-measure of 40%. When combined with structural information, the prediction performance increases to a precision of 76% and a recall of 38% with an F-measure of 51%. We also present an attempt to rationalize the sliding window size and demonstrate that a nine-residue window is the most suitable for predictor construction. Finally, we demonstrate the applicability of our prediction methods by modeling the Ras–Raf complex using predicted interaction sites as target binding interfaces. Our results suggest that it is possible to predict protein interaction sites with quite a high accuracy using only sequence information. In their active state, proteins—the workhorses of a living cell—need to have a defined 3D structure. The majority of functions in the living cell are performed through protein interactions that occur through specific, often unknown, residues on their surfaces. We can study protein interactions either qualitatively (interaction: yes/no) using large-scale, high-throughput experiments or determine specific interaction sites by using biophysical techniques, such as, for example, X-ray crystallography, that are much more laborious and yet unable to provide us with a complete interaction map within the cell. This paper presents the machine learning classification method termed “Random Forests” in its application to predicting interaction sites. We use interaction data from available experimental evidence to train the classifier and predict the interacting residues on proteins with unknown 3D structures. Using this approach, we are able to predict many more interactions in greater detail (i.e., to accurately predict most of the binding site) and with that to infer knowledge about the functions of unknown proteins.
Collapse
|
57
|
del Sol A, Carbonell P. The modular organization of domain structures: insights into protein-protein binding. PLoS Comput Biol 2008; 3:e239. [PMID: 18069884 PMCID: PMC2134966 DOI: 10.1371/journal.pcbi.0030239] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2007] [Accepted: 10/17/2007] [Indexed: 12/21/2022] Open
Abstract
Domains are the building blocks of proteins and play a crucial role in protein–protein interactions. Here, we propose a new approach for the analysis and prediction of domain–domain interfaces. Our method, which relies on the representation of domains as residue-interacting networks, finds an optimal decomposition of domain structures into modules. The resulting modules comprise highly cooperative residues, which exhibit few connections with other modules. We found that non-overlapping binding sites in a domain, involved in different domain–domain interactions, are generally contained in different modules. This observation indicates that our modular decomposition is able to separate protein domains into regions with specialized functions. Our results show that modules with high modularity values identify binding site regions, demonstrating the predictive character of modularity. Furthermore, the combination of modularity with other characteristics, such as sequence conservation or surface patches, was found to improve our predictions. In an attempt to give a physical interpretation to the modular architecture of domains, we analyzed in detail six examples of protein domains with available experimental binding data. The modular configuration of the TEM1-β-lactamase binding site illustrates the energetic independence of hotspots located in different modules and the cooperativity of those sited within the same modules. The energetic and structural cooperativity between intramodular residues is also clearly shown in the example of the chymotrypsin inhibitor, where non–binding site residues have a synergistic effect on binding. Interestingly, the binding site of the T cell receptor β chain variable domain 2.1 is contained in one module, which includes structurally distant hot regions displaying positive cooperativity. These findings support the idea that modules possess certain functional and energetic independence. A modular organization of binding sites confers robustness and flexibility to the performance of the functional activity, and facilitates the evolution of protein interactions. Proteins are built by domains, which mediate protein–protein interactions involved in different biological activities. A challenging problem in computational biology is the understanding of the domain–domain interaction mechanism. Here, we propose a new approach for the analysis and prediction of domain–domain binding sites. Our computational approach, which relies on the modular division of 3-D domain structures, identifies modular regions involved in binding and can complement previously introduced predictive methods. Further results illustrate that binding sites display a modular configuration. A detailed analysis of protein domains with available experimental binding data revealed that modules are energetically independent from each other, whereas residues within modules contribute cooperatively to the binding energy. The modular composition of binding surfaces may generate high binding affinity and specificity, and facilitate the appearance of new domain binding partners. This advantageous organization of protein structures has been conserved by evolution and may be used to design an effective drug strategy.
Collapse
Affiliation(s)
- Antonio del Sol
- Bioinformatics Research Unit, Research and Development Division, Fujirebio, Tokyo, Japan.
| | | |
Collapse
|
58
|
Abstract
MOTIVATION Thousands of proteins are known to bind to DNA; for most of them the mechanism of action and the residues that bind to DNA, i.e. the binding sites, are yet unknown. Experimental identification of binding sites requires expensive and laborious methods such as mutagenesis and binding essays. Hence, such studies are not applicable on a large scale. If the 3D structure of a protein is known, it is often possible to predict DNA-binding sites in silico. However, for most proteins, such knowledge is not available. RESULTS It has been shown that DNA-binding residues have distinct biophysical characteristics. Here we demonstrate that these characteristics are so distinct that they enable accurate prediction of the residues that bind DNA directly from amino acid sequence, without requiring any additional experimental or structural information. In a cross-validation based on the largest non-redundant dataset of high-resolution protein-DNA complexes available today, we found that 89% of our predictions are confirmed by experimental data. Thus, it is now possible to identify DNA-binding sites on a proteomic scale even in the absence of any experimental data or 3D-structural information. AVAILABILITY http://cubic.bioc.columbia.edu/services/disis.
Collapse
Affiliation(s)
- Yanay Ofran
- Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA.
| | | | | |
Collapse
|
59
|
Sugaya N, Ikeda K, Tashiro T, Takeda S, Otomo J, Ishida Y, Shiratori A, Toyoda A, Noguchi H, Takeda T, Kuhara S, Sakaki Y, Iwayanagi T. An integrative in silico approach for discovering candidates for drug-targetable protein-protein interactions in interactome data. BMC Pharmacol 2007; 7:10. [PMID: 17705877 PMCID: PMC2045083 DOI: 10.1186/1471-2210-7-10] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2007] [Accepted: 08/20/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein-protein interactions (PPIs) are challenging but attractive targets for small chemical drugs. Whole PPIs, called the 'interactome', have been emerged in several organisms, including human, based on the recent development of high-throughput screening (HTS) technologies. Individual PPIs have been targeted by small drug-like chemicals (SDCs), however, interactome data have not been fully utilized for exploring drug targets due to the lack of comprehensive methodology for utilizing these data. Here we propose an integrative in silico approach for discovering candidates for drug-targetable PPIs in interactome data. RESULTS Our novel in silico screening system comprises three independent assessment procedures: i) detection of protein domains responsible for PPIs, ii) finding SDC-binding pockets on protein surfaces, and iii) evaluating similarities in the assignment of Gene Ontology (GO) terms between specific partner proteins. We discovered six candidates for drug-targetable PPIs by applying our in silico approach to original human PPI data composed of 770 binary interactions produced by our HTS yeast two-hybrid (HTS-Y2H) assays. Among them, we further examined two candidates, RXRA/NRIP1 and CDK2/CDKN1A, with respect to their biological roles, PPI network around each candidate, and tertiary structures of the interacting domains. CONCLUSION An integrative in silico approach for discovering candidates for drug-targetable PPIs was applied to original human PPIs data. The system excludes false positive interactions and selects reliable PPIs as drug targets. Its effectiveness was demonstrated by the discovery of the six promising candidate target PPIs. Inhibition or stabilization of the two interactions may have potential therapeutic effects against human diseases.
Collapse
Affiliation(s)
- Nobuyoshi Sugaya
- PharmaDesign, Inc., 2-19-8 Hatchobori, Chuo-ku, Tokyo, 104-0032, Japan
| | - Kazuyoshi Ikeda
- PharmaDesign, Inc., 2-19-8 Hatchobori, Chuo-ku, Tokyo, 104-0032, Japan
| | - Toshiyuki Tashiro
- PharmaDesign, Inc., 2-19-8 Hatchobori, Chuo-ku, Tokyo, 104-0032, Japan
| | - Shizu Takeda
- Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo, 185-8601, Japan
| | - Jun Otomo
- Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo, 185-8601, Japan
| | - Yoshiko Ishida
- Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo, 185-8601, Japan
| | - Akiko Shiratori
- Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo, 185-8601, Japan
| | - Atsushi Toyoda
- Genomic Sciences Center, RIKEN, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
| | - Hideki Noguchi
- Genomic Sciences Center, RIKEN, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
| | - Tadayuki Takeda
- Genomic Sciences Center, RIKEN, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
| | - Satoru Kuhara
- Graduate School of Genetic Resources Technology, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka, 812-8581, Japan
| | - Yoshiyuki Sakaki
- Genomic Sciences Center, RIKEN, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
| | - Takao Iwayanagi
- Research & Development Group, Hitachi, Ltd., 1-6-1 Marunouchi, Chiyoda-ku, Tokyo, 100-8220, Japan
| |
Collapse
|
60
|
Kundrotas P, Alexov E. Predicting interacting and interfacial residues using continuous sequence segments. Int J Biol Macromol 2007; 41:615-23. [PMID: 17850859 DOI: 10.1016/j.ijbiomac.2007.08.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2007] [Revised: 07/31/2007] [Accepted: 08/01/2007] [Indexed: 01/07/2023]
Abstract
Development of sequence-based methods for predicting putative interfacial residues is an extremely important task in modeling 3D structures of protein-protein complexes. In the present paper we used non-gapped sequence segments to predict both interacting and interfacial residues. We demonstrated that continuous sequence segments do occur at the protein-protein interfaces and showed that continuous interacting interfacial segments (CIIS) of length nine are presented on average, in approximately 37% of the complexes in our dataset. Our results indicate that CIIS consist mostly of interacting strands and/or loops, while the CIIS involving the helixes are scarce. We performed scoring of CIIS using four different scoring mechanisms and found that scores of CIIS differ significantly from the scores calculated for random stretches of residues. We argue that such statistical difference inferred thought the corresponding Z-scores could be used for detecting putative interfacial residue segments without using any structural information. This hypothesis was tested on our dataset and benchmarking resulted to 10-60% prediction accuracy depending on type of benchmarking and scoring scheme used in calculations. Such predictions that do not depend on the availability of the 3D structures of monomers can be quite valuable in modeling 3D structures of obligatory complexes, for which structures of separated monomers do not exist.
Collapse
Affiliation(s)
- Petras Kundrotas
- Computational Biophysics and Bioinformatics, Department of Physics, Clemson University, Clemson, SC 29634, United States
| | | |
Collapse
|
61
|
Ofran Y, Rost B. Protein-protein interaction hotspots carved into sequences. PLoS Comput Biol 2007; 3:e119. [PMID: 17630824 PMCID: PMC1914369 DOI: 10.1371/journal.pcbi.0030119] [Citation(s) in RCA: 179] [Impact Index Per Article: 9.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2006] [Accepted: 05/11/2007] [Indexed: 11/24/2022] Open
Abstract
Protein-protein interactions, a key to almost any biological process, are mediated by molecular mechanisms that are not entirely clear. The study of these mechanisms often focuses on all residues at protein-protein interfaces. However, only a small subset of all interface residues is actually essential for recognition or binding. Commonly referred to as "hotspots," these essential residues are defined as residues that impede protein-protein interactions if mutated. While no in silico tool identifies hotspots in unbound chains, numerous prediction methods were designed to identify all the residues in a protein that are likely to be a part of protein-protein interfaces. These methods typically identify successfully only a small fraction of all interface residues. Here, we analyzed the hypothesis that the two subsets correspond (i.e., that in silico methods may predict few residues because they preferentially predict hotspots). We demonstrate that this is indeed the case and that we can therefore predict directly from the sequence of a single protein which residues are interaction hotspots (without knowledge of the interaction partner). Our results suggested that most protein complexes are stabilized by similar basic principles. The ability to accurately and efficiently identify hotspots from sequence enables the annotation and analysis of protein-protein interaction hotspots in entire organisms and thus may benefit function prediction and drug development. The server for prediction is available at http://www.rostlab.org/services/isis.
Collapse
Affiliation(s)
- Yanay Ofran
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, USA.
| | | |
Collapse
|
62
|
Abstract
MOTIVATION Proteins function through interactions with other proteins and biomolecules. Protein-protein interfaces hold key information toward molecular understanding of protein function. In the past few years, there have been intensive efforts in developing methods for predicting protein interface residues. A review that presents the current status of interface prediction and an overview of its applications and project future developments is in order. SUMMARY Interface prediction methods rely on a wide range of sequence, structural and physical attributes that distinguish interface residues from non-interface surface residues. The input data are manipulated into either a numerical value or a probability representing the potential for a residue to be inside a protein interface. Predictions are now satisfactory for complex-forming proteins that are well represented in the Protein Data Bank, but less so for under-represented ones. Future developments will be directed at tackling problems such as building structural models for multi-component structural complexes.
Collapse
Affiliation(s)
- Huan-Xiang Zhou
- Institute of Molecular Biophysics, Florida State University, Tallahassee, Florida 32306, USA.
| | | |
Collapse
|
63
|
Hsu CM, Chen CY, Liu BJ, Huang CC, Laio MH, Lin CC, Wu TL. Identification of hot regions in protein-protein interactions by sequential pattern mining. BMC Bioinformatics 2007; 8 Suppl 5:S8. [PMID: 17570867 PMCID: PMC1892096 DOI: 10.1186/1471-2105-8-s5-s8] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Identification of protein interacting sites is an important task in computational molecular biology. As more and more protein sequences are deposited without available structural information, it is strongly desirable to predict protein binding regions by their sequences alone. This paper presents a pattern mining approach to tackle this problem. It is observed that a functional region of protein structures usually consists of several peptide segments linked with large wildcard regions. Thus, the proposed mining technology considers large irregular gaps when growing patterns, in order to find the residues that are simultaneously conserved but largely separated on the sequences. A derived pattern is called a cluster-like pattern since the discovered conserved residues are always grouped into several blocks, which each corresponds to a local conserved region on the protein sequence. RESULTS The experiments conducted in this work demonstrate that the derived long patterns automatically discover the important residues that form one or several hot regions of protein-protein interactions. The methodology is evaluated by conducting experiments on the web server MAGIIC-PRO based on a well known benchmark containing 220 protein chains from 72 distinct complexes. Among the tested 218 proteins, there are 900 sequential blocks discovered, 4.25 blocks per protein chain on average. About 92% of the derived blocks are observed to be clustered in space with at least one of the other blocks, and about 66% of the blocks are found to be near the interface of protein-protein interactions. It is summarized that for about 83% of the tested proteins, at least two interacting blocks can be discovered by this approach. CONCLUSION This work aims to demonstrate that the important residues associated with the interface of protein-protein interactions may be automatically discovered by sequential pattern mining. The detected regions possess high conservation and thus are considered as the computational hot regions. This information would be useful to characterizing protein sequences, predicting protein function, finding potential partners, and facilitating protein docking for drug discovery.
Collapse
Affiliation(s)
- Chen-Ming Hsu
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 320, Taiwan, R.O.C
| | - Chien-Yu Chen
- Department of Bio-industrial Mechatronics Engineering, National Taiwan University, Taipei, 106, Taiwan, R.O.C
| | - Baw-Jhiune Liu
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 320, Taiwan, R.O.C
| | - Chih-Chang Huang
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 320, Taiwan, R.O.C
| | - Min-Hung Laio
- Department of Bio-industrial Mechatronics Engineering, National Taiwan University, Taipei, 106, Taiwan, R.O.C
| | - Chien-Chieh Lin
- Department of Bio-industrial Mechatronics Engineering, National Taiwan University, Taipei, 106, Taiwan, R.O.C
| | - Tzung-Lin Wu
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 320, Taiwan, R.O.C
| |
Collapse
|
64
|
Abstract
Protein-protein interactions create the macromolecular assemblies and sequential signaling pathways essential for cell function. Their number far exceeds the number of proteins themselves and their experimental characterization, while improving, remains relatively slow. For these reasons, novel computational methods have important roles to play in understanding the physical basis of protein interactions, and in constraining the molecular basis of their specificity. This paper discusses methods based on multiple sequence alignments of protein homologues and phylogenetic trees.
Collapse
Affiliation(s)
- Ivica Res
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | | |
Collapse
|
65
|
Dong Q, Wang X, Lin L, Guan Y. Exploiting residue-level and profile-level interface propensities for usage in binding sites prediction of proteins. BMC Bioinformatics 2007; 8:147. [PMID: 17480235 PMCID: PMC1885810 DOI: 10.1186/1471-2105-8-147] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2007] [Accepted: 05/05/2007] [Indexed: 01/14/2023] Open
Abstract
Background Recognition of binding sites in proteins is a direct computational approach to the characterization of proteins in terms of biological and biochemical function. Residue preferences have been widely used in many studies but the results are often not satisfactory. Although different amino acid compositions among the interaction sites of different complexes have been observed, such differences have not been integrated into the prediction process. Furthermore, the evolution information has not been exploited to achieve a more powerful propensity. Result In this study, the residue interface propensities of four kinds of complexes (homo-permanent complexes, homo-transient complexes, hetero-permanent complexes and hetero-transient complexes) are investigated. These propensities, combined with sequence profiles and accessible surface areas, are inputted to the support vector machine for the prediction of protein binding sites. Such propensities are further improved by taking evolutional information into consideration, which results in a class of novel propensities at the profile level, i.e. the binary profiles interface propensities. Experiment is performed on the 1139 non-redundant protein chains. Although different residue interface propensities among different complexes are observed, the improvement of the classifier with residue interface propensities can be negligible in comparison with that without propensities. The binary profile interface propensities can significantly improve the performance of binding sites prediction by about ten percent in term of both precision and recall. Conclusion Although there are minor differences among the four kinds of complexes, the residue interface propensities cannot provide efficient discrimination for the complicated interfaces of proteins. The binary profile interface propensities can significantly improve the performance of binding sites prediction of protein, which indicates that the propensities at the profile level are more accurate than those at the residue level.
Collapse
Affiliation(s)
- Qiwen Dong
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Lei Lin
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Yi Guan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
Collapse
|
66
|
Abstract
MOTIVATION Large-scale experiments reveal pairs of interacting proteins but leave the residues involved in the interactions unknown. These interface residues are essential for understanding the mechanism of interaction and are often desired drug targets. Reliable identification of residues that reside in protein-protein interface typically requires analysis of protein structure. Therefore, for the vast majority of proteins, for which there is no high-resolution structure, there is no effective way of identifying interface residues. RESULTS Here we present a machine learning-based method that identifies interacting residues from sequence alone. Although the method is developed using transient protein-protein interfaces from complexes of experimentally known 3D structures, it never explicitly uses 3D information. Instead, we combine predicted structural features with evolutionary information. The strongest predictions of the method reached over 90% accuracy in a cross-validation experiment. Our results suggest that despite the significant diversity in the nature of protein-protein interactions, they all share common basic principles and that these principles are identifiable from sequence alone.
Collapse
Affiliation(s)
- Yanay Ofran
- CUBIC & North-East Structural Genomics Consortium, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
| | | |
Collapse
|
67
|
Li MH, Lin L, Wang XL, Liu T. Protein-protein interaction site prediction based on conditional random fields. Bioinformatics 2007; 23:597-604. [PMID: 17234636 DOI: 10.1093/bioinformatics/btl660] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION We are motivated by the fast-growing number of protein structures in the Protein Data Bank with necessary information for prediction of protein-protein interaction sites to develop methods for identification of residues participating in protein-protein interactions. We would like to compare conditional random fields (CRFs)-based method with conventional classification-based methods that omit the relation between two labels of neighboring residues to show the advantages of CRFs-based method in predicting protein-protein interaction sites. RESULTS The prediction of protein-protein interaction sites is solved as a sequential labeling problem by applying CRFs with features including protein sequence profile and residue accessible surface area. The CRFs-based method can achieve a comparable performance with state-of-the-art methods, when 1276 nonredundant hetero-complex protein chains are used as training and test set. Experimental result shows that CRFs-based method is a powerful and robust protein-protein interaction site prediction method and can be used to guide biologists to make specific experiments on proteins. AVAILABILITY http://www.insun.hit.edu.cn/~mhli/site_CRFs/index.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ming-Hui Li
- Bioinformatics Research Group, ITNLP Lab, Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
| | | | | | | |
Collapse
|
68
|
Bradford JR, Needham CJ, Bulpitt AJ, Westhead DR. Insights into protein-protein interfaces using a Bayesian network prediction method. J Mol Biol 2006; 362:365-86. [PMID: 16919296 DOI: 10.1016/j.jmb.2006.07.028] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2006] [Revised: 06/15/2006] [Accepted: 07/13/2006] [Indexed: 11/26/2022]
Abstract
Identifying the interface between two interacting proteins provides important clues to the function of a protein, and is becoming increasing relevant to drug discovery. Here, surface patch analysis was combined with a Bayesian network to predict protein-protein binding sites with a success rate of 82% on a benchmark dataset of 180 proteins, improving by 6% on previous work and well above the 36% that would be achieved by a random method. A comparable success rate was achieved even when evolutionary information was missing, a further improvement on our previous method which was unable to handle incomplete data automatically. In a case study of the Mog1p family, we showed that our Bayesian network method can aid the prediction of previously uncharacterised binding sites and provide important clues to protein function. On Mog1p itself a putative binding site involved in the SLN1-SKN7 signal transduction pathway was detected, as was a Ran binding site, previously characterized solely by conservation studies, even though our automated method operated without using homologous proteins. On the remaining members of the family (two structural genomics targets, and a protein involved in the photosystem II complex in higher plants) we identified novel binding sites with little correspondence to those on Mog1p. These results suggest that members of the Mog1p family bind to different proteins and probably have different functions despite sharing the same overall fold. We also demonstrated the applicability of our method to drug discovery efforts by successfully locating a number of binding sites involved in the protein-protein interaction network of papilloma virus infection. In a separate study, we attempted to distinguish between the two types of binding site, obligate and non-obligate, within our dataset using a second Bayesian network. This proved difficult although some separation was achieved on the basis of patch size, electrostatic potential and conservation. Such was the similarity between the two interacting patch types, we were able to use obligate binding site properties to predict the location of non-obligate binding sites and vice versa.
Collapse
Affiliation(s)
- James R Bradford
- Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, UK
| | | | | | | |
Collapse
|
69
|
Söding J, Remmert M, Biegert A, Lupas AN. HHsenser: exhaustive transitive profile search using HMM-HMM comparison. Nucleic Acids Res 2006; 34:W374-8. [PMID: 16845029 PMCID: PMC1538784 DOI: 10.1093/nar/gkl195] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open
Abstract
HHsenser is the first server to offer exhaustive intermediate profile searches, which it combines with pairwise comparison of hidden Markov models. Starting from a single protein sequence or a multiple alignment, it can iteratively explore whole superfamilies, producing few or no false positives. The output is a multiple alignment of all detected homologs. HHsenser's sensitivity should make it a useful tool for evolutionary studies. It may also aid applications that rely on diverse multiple sequence alignments as input, such as homology-based structure and function prediction, or the determination of functional residues by conservation scoring and functional subtyping. HHsenser can be accessed at . It has also been integrated into our structure and function prediction server HHpred () to improve predictions for near-singleton sequences.
Collapse
Affiliation(s)
- Johannes Söding
- Department of Protein Evolution, Max-Planck-Institute for Developmental Biology, Spemannstrasse 35, 72076 Tübingen, Germany.
| | | | | | | |
Collapse
|
70
|
Wang B, Chen P, Huang DS, Li JJ, Lok TM, Lyu MR. Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett 2005; 580:380-4. [PMID: 16376878 DOI: 10.1016/j.febslet.2005.11.081] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2005] [Revised: 11/29/2005] [Accepted: 11/30/2005] [Indexed: 12/01/2022]
Abstract
This paper proposes a novel method that can predict protein interaction sites in heterocomplexes using residue spatial sequence profile and evolution rate approaches. The former represents the information of multiple sequence alignments while the latter corresponds to a residue's evolutionary conservation score based on a phylogenetic tree. Three predictors using a support vector machines algorithm are constructed to predict whether a surface residue is a part of a protein-protein interface. The efficiency and the effectiveness of our proposed approach is verified by its better prediction performance compared with other models. The study is based on a non-redundant data set of heterodimers consisting of 69 protein chains.
Collapse
Affiliation(s)
- Bing Wang
- Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China
| | | | | | | | | | | |
Collapse
|
71
|
Park KJ, Gromiha MM, Horton P, Suwa M. Discrimination of outer membrane proteins using support vector machines. Bioinformatics 2005; 21:4223-9. [PMID: 16204348 DOI: 10.1093/bioinformatics/bti697] [Citation(s) in RCA: 58] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Discriminating outer membrane proteins from other folding types of globular and membrane proteins is an important task both for dissecting outer membrane proteins (OMPs) from genomic sequences and for the successful prediction of their secondary and tertiary structures. RESULTS We have developed a method based on support vector machines using amino acid composition and residue pair information. Our approach with amino acid composition has correctly predicted the OMPs with a cross-validated accuracy of 94% in a set of 208 proteins. Further, this method has successfully excluded 633 of 673 globular proteins and 191 of 206 alpha-helical membrane proteins. We obtained an overall accuracy of 92% for correctly picking up the OMPs from a dataset of 1087 proteins belonging to all different types of globular and membrane proteins. Furthermore, residue pair information improved the accuracy from 92 to 94%. This accuracy of discriminating OMPs is higher than that of other methods in the literature, which could be used for dissecting OMPs from genomic sequences. AVAILABILITY Discrimination results are available at http://tmbeta-svm.cbrc.jp.
Collapse
Affiliation(s)
- Keun-Joon Park
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), AIST Tokyo Waterfront Bio-IT Research Building, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
| | | | | | | |
Collapse
|