51
|
Miao Z, Westhof E. A Large-Scale Assessment of Nucleic Acids Binding Site Prediction Programs. PLoS Comput Biol 2015; 11:e1004639. [PMID: 26681179 PMCID: PMC4683125 DOI: 10.1371/journal.pcbi.1004639] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2015] [Accepted: 10/30/2015] [Indexed: 11/18/2022] Open
Abstract
Computational prediction of nucleic acid binding sites in proteins are necessary to disentangle functional mechanisms in most biological processes and to explore the binding mechanisms. Several strategies have been proposed, but the state-of-the-art approaches display a great diversity in i) the definition of nucleic acid binding sites; ii) the training and test datasets; iii) the algorithmic methods for the prediction strategies; iv) the performance measures and v) the distribution and availability of the prediction programs. Here we report a large-scale assessment of 19 web servers and 3 stand-alone programs on 41 datasets including more than 5000 proteins derived from 3D structures of protein-nucleic acid complexes. Well-defined binary assessment criteria (specificity, sensitivity, precision, accuracy…) are applied. We found that i) the tools have been greatly improved over the years; ii) some of the approaches suffer from theoretical defects and there is still room for sorting out the essential mechanisms of binding; iii) RNA binding and DNA binding appear to follow similar driving forces and iv) dataset bias may exist in some methods.
Collapse
Affiliation(s)
- Zhichao Miao
- Architecture et Réactivité de l'ARN, Université de Strasbourg, Institut de Biologie Moléculaire et Cellulaire du CNRS, Strasbourg, France
| | - Eric Westhof
- Architecture et Réactivité de l'ARN, Université de Strasbourg, Institut de Biologie Moléculaire et Cellulaire du CNRS, Strasbourg, France
| |
Collapse
|
52
|
Wong KC, Li Y, Peng C, Moses AM, Zhang Z. Computational learning on specificity-determining residue-nucleotide interactions. Nucleic Acids Res 2015; 43:10180-9. [PMID: 26527718 PMCID: PMC4666365 DOI: 10.1093/nar/gkv1134] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2015] [Accepted: 10/18/2015] [Indexed: 01/02/2023] Open
Abstract
The protein–DNA interactions between transcription factors and transcription factor binding sites are essential activities in gene regulation. To decipher the binding codes, it is a long-standing challenge to understand the binding mechanism across different transcription factor DNA binding families. Past computational learning studies usually focus on learning and predicting the DNA binding residues on protein side. Taking into account both sides (protein and DNA), we propose and describe a computational study for learning the specificity-determining residue-nucleotide interactions of different known DNA-binding domain families. The proposed learning models are compared to state-of-the-art models comprehensively, demonstrating its competitive learning performance. In addition, we describe and propose two applications which demonstrate how the learnt models can provide meaningful insights into protein–DNA interactions across different DNA binding families.
Collapse
Affiliation(s)
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
| | - Yue Li
- Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139-4307, USA
| | - Chengbin Peng
- CEMSE Division, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
| | - Alan M Moses
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, Canada Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
| | - Zhaolei Zhang
- Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| |
Collapse
|
53
|
DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Sci Rep 2015; 5:15479. [PMID: 26482832 PMCID: PMC4611492 DOI: 10.1038/srep15479] [Citation(s) in RCA: 83] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2015] [Accepted: 09/28/2015] [Indexed: 02/01/2023] Open
Abstract
DNA-binding proteins play an important role in most cellular processes. Therefore, it is necessary to develop an efficient predictor for identifying DNA-binding proteins only based on the sequence information of proteins. The bottleneck for constructing a useful predictor is to find suitable features capturing the characteristics of DNA binding proteins. We applied PseAAC to DNA binding protein identification, and PseAAC was further improved by incorporating the evolutionary information by using profile-based protein representation. Finally, Combined with Support Vector Machines (SVMs), a predictor called iDNAPro-PseAAC was proposed. Experimental results on an updated benchmark dataset showed that iDNAPro-PseAAC outperformed some state-of-the-art approaches, and it can achieve stable performance on an independent dataset. By using an ensemble learning approach to incorporate more negative samples (non-DNA binding proteins) in the training process, the performance of iDNAPro-PseAAC was further improved. The web server of iDNAPro-PseAAC is available at http://bioinformatics.hitsz.edu.cn/iDNAPro-PseAAC/.
Collapse
|
54
|
Suvorova IA, Korostelev YD, Gelfand MS. GntR Family of Bacterial Transcription Factors and Their DNA Binding Motifs: Structure, Positioning and Co-Evolution. PLoS One 2015; 10:e0132618. [PMID: 26151451 PMCID: PMC4494728 DOI: 10.1371/journal.pone.0132618] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2015] [Accepted: 06/16/2015] [Indexed: 12/03/2022] Open
Abstract
The GntR family of transcription factors (TFs) is a large group of proteins present in diverse bacteria and regulating various biological processes. Here we use the comparative genomics approach to reconstruct regulons and identify binding motifs of regulators from three subfamilies of the GntR family, FadR, HutC, and YtrA. Using these data, we attempt to predict DNA-protein contacts by analyzing correlations between binding motifs in DNA and amino acid sequences of TFs. We identify pairs of positions with high correlation between amino acids and nucleotides for FadR, HutC, and YtrA subfamilies and show that the most predicted DNA-protein interactions are quite similar in all subfamilies and conform well to the experimentally identified contacts formed by FadR from E. coli and AraR from B. subtilis. The most frequent predicted contacts in the analyzed subfamilies are Arg-G, Asn-A, Asp-C. We also analyze the divergon structure and preferred site positions relative to regulated genes in the FadR and HutC subfamilies. A single site in a divergon usually regulates both operons and is approximately in the middle of the intergenic area. Double sites are either involved in the co-operative regulation of both operons and then are in the center of the intergenic area, or each site in the pair independently regulates its own operon and tends to be near it. We also identify additional candidate TF-binding boxes near palindromic binding sites of TFs from the FadR, HutC, and YtrA subfamilies, which may play role in the binding of additional TF-subunits.
Collapse
Affiliation(s)
- Inna A. Suvorova
- Research and Training Center on Bioinformatics, Institute for Information Transmission Problems RAS (The Kharkevich Institute), Moscow, Russia
- * E-mail:
| | - Yuri D. Korostelev
- Research and Training Center on Bioinformatics, Institute for Information Transmission Problems RAS (The Kharkevich Institute), Moscow, Russia
| | - Mikhail S. Gelfand
- Research and Training Center on Bioinformatics, Institute for Information Transmission Problems RAS (The Kharkevich Institute), Moscow, Russia
- Faculty of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia
| |
Collapse
|
55
|
Yan J, Friedrich S, Kurgan L. A comprehensive comparative review of sequence-based predictors of DNA- and RNA-binding residues. Brief Bioinform 2015; 17:88-105. [DOI: 10.1093/bib/bbv023] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2014] [Indexed: 01/07/2023] Open
|
56
|
An overview of the prediction of protein DNA-binding sites. Int J Mol Sci 2015; 16:5194-215. [PMID: 25756377 PMCID: PMC4394471 DOI: 10.3390/ijms16035194] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2014] [Revised: 02/21/2015] [Accepted: 02/27/2015] [Indexed: 02/06/2023] Open
Abstract
Interactions between proteins and DNA play an important role in many essential biological processes such as DNA replication, transcription, splicing, and repair. The identification of amino acid residues involved in DNA-binding sites is critical for understanding the mechanism of these biological activities. In the last decade, numerous computational approaches have been developed to predict protein DNA-binding sites based on protein sequence and/or structural information, which play an important role in complementing experimental strategies. At this time, approaches can be divided into three categories: sequence-based DNA-binding site prediction, structure-based DNA-binding site prediction, and homology modeling and threading. In this article, we review existing research on computational methods to predict protein DNA-binding sites, which includes data sets, various residue sequence/structural features, machine learning methods for comparison and selection, evaluation methods, performance comparison of different tools, and future directions in protein DNA-binding site prediction. In particular, we detail the meta-analysis of protein DNA-binding sites. We also propose specific implications that are likely to result in novel prediction methods, increased performance, or practical applications.
Collapse
|
57
|
Wong MH, Sze-To HYA, Lo LYP, Chan TMC, Leung KS. Discovering Binding Cores in Protein-DNA Binding Using Association Rule Mining with Statistical Measures. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:142-154. [PMID: 26357085 DOI: 10.1109/tcbb.2014.2343952] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Understanding binding cores is of fundamental importance in deciphering Protein-DNA (TF-TFBS) binding and for the deep understanding of gene regulation. Traditionally, binding cores are identified in resolved high-resolution 3D structures. However, it is expensive, labor-intensive and time-consuming to obtain these structures. Hence, it is promising to discover binding cores computationally on a large scale. Previous studies successfully applied association rule mining to discover binding cores from TF-TFBS binding sequence data only. Despite the successful results, there are limitations such as the use of tight support and confidence thresholds, the distortion by statistical bias in counting pattern occurrences, and the lack of a unified scheme to rank TF-TFBS associated patterns. In this study, we proposed an association rule mining algorithm incorporating statistical measures and ranking to address these limitations. Experimental results demonstrated that, even when the threshold on support was lowered to one-tenth of the value used in previous studies, a satisfactory verification ratio was consistently observed under different confidence levels. Moreover, we proposed a novel ranking scheme for TF-TFBS associated patterns based on p-values and co-support values. By comparing with other discovery approaches, the effectiveness of our algorithm was demonstrated. Eighty-four binding cores with PDB support are uniquely identified.
Collapse
|
58
|
Tiwari AK, Srivastava R. A survey of computational intelligence techniques in protein function prediction. INTERNATIONAL JOURNAL OF PROTEOMICS 2014; 2014:845479. [PMID: 25574395 PMCID: PMC4276698 DOI: 10.1155/2014/845479] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 09/10/2014] [Revised: 10/31/2014] [Accepted: 11/07/2014] [Indexed: 02/08/2023]
Abstract
During the past, there was a massive growth of knowledge of unknown proteins with the advancement of high throughput microarray technologies. Protein function prediction is the most challenging problem in bioinformatics. In the past, the homology based approaches were used to predict the protein function, but they failed when a new protein was different from the previous one. Therefore, to alleviate the problems associated with homology based traditional approaches, numerous computational intelligence techniques have been proposed in the recent past. This paper presents a state-of-the-art comprehensive review of various computational intelligence techniques for protein function predictions using sequence, structure, protein-protein interaction network, and gene expression data used in wide areas of applications such as prediction of DNA and RNA binding sites, subcellular localization, enzyme functions, signal peptides, catalytic residues, nuclear/G-protein coupled receptors, membrane proteins, and pathway analysis from gene expression datasets. This paper also summarizes the result obtained by many researchers to solve these problems by using computational intelligence techniques with appropriate datasets to improve the prediction performance. The summary shows that ensemble classifiers and integration of multiple heterogeneous data are useful for protein function prediction.
Collapse
Affiliation(s)
- Arvind Kumar Tiwari
- Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
| | - Rajeev Srivastava
- Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
| |
Collapse
|
59
|
Samant M, Jethva M, Hasija Y. INTERACT-O-FINDER: A Tool for Prediction of DNA-Binding Proteins Using Sequence Features. Int J Pept Res Ther 2014. [DOI: 10.1007/s10989-014-9446-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
60
|
Abstract
Background Protein-DNA interactions play important roles in many biological processes. Computational methods that can accurately predict DNA-binding sites on proteins will greatly expedite research on problems involving protein-DNA interactions. Results This paper presents a method for predicting DNA-binding sites on protein structures. The method represents protein surface patches using labeled graphs and uses a graph kernel method to calculate the similarities between graphs. A new surface patch is predicted to be interface or non-interface patch based on its similarities to known DNA-binding patches and non-DNA-binding patches. The proposed method achieved high accuracy when tested on a representative set of 146 protein-DNA complexes using leave-one-out cross-validation. Then, the method was applied to identify DNA-binding sties on 13 unbound structures of DNA-binding proteins. In each of the unbound structure, the top 1 patch predicted by the proposed method precisely indicated the location of the DNA-binding site. Comparisons with other methods showed that the proposed method was competitive in predicting DNA-binding sites on unbound proteins. Conclusions The proposed method uses graphs to encode the feature's distribution in the 3-dimensional (3D) space. Thus, compared with other vector-based methods, it has the advantage of taking into account the spatial distribution of features on the proteins. Using an efficient kernel method to compare graphs the proposed method also avoids the demanding computations required for 3D objects comparison. It provides a competitive method for predicting DNA-binding sites without requiring structure alignment.
Collapse
|
61
|
Fang C, Noguchi T, Yamana H. Simplified sequence-based method for ATP-binding prediction using contextual local evolutionary conservation. Algorithms Mol Biol 2014; 9:7. [PMID: 24618258 PMCID: PMC3995811 DOI: 10.1186/1748-7188-9-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2013] [Accepted: 03/05/2014] [Indexed: 12/23/2022] Open
Abstract
Background Identifying ligand-binding sites is a key step to annotate the protein functions and to find applications in drug design. Now, many sequence-based methods adopted various predicted results from other classifiers, such as predicted secondary structure, predicted solvent accessibility and predicted disorder probabilities, to combine with position-specific scoring matrix (PSSM) as input for binding sites prediction. These predicted features not only easily result in high-dimensional feature space, but also greatly increased the complexity of algorithms. Moreover, the performances of these predictors are also largely influenced by the other classifiers. Results In order to verify that conservation is the most powerful attribute in identifying ligand-binding sites, and to show the importance of revising PSSM to match the detailed conservation pattern of functional site in prediction, we have analyzed the Adenosine-5'-triphosphate (ATP) ligand as an example, and proposed a simple method for ATP-binding sites prediction, named as CLCLpred (Contextual Local evolutionary Conservation-based method for Ligand-binding prediction). Our method employed no predicted results from other classifiers as input; all used features were extracted from PSSM only. We tested our method on 2 separate data sets. Experimental results showed that, comparing with other 9 existing methods on the same data sets, our method achieved the best performance. Conclusions This study demonstrates that: 1) exploiting the signal from the detailed conservation pattern of residues will largely facilitate the prediction of protein functional sites; and 2) the local evolutionary conservation enables accurate prediction of ATP-binding sites directly from protein sequence.
Collapse
|
62
|
Li BQ, Feng KY, Ding J, Cai YD. Predicting DNA-binding sites of proteins based on sequential and 3D structural information. Mol Genet Genomics 2014; 289:489-99. [PMID: 24448651 DOI: 10.1007/s00438-014-0812-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2013] [Accepted: 01/04/2014] [Indexed: 11/26/2022]
Abstract
Protein-DNA interactions play important roles in many biological processes. To understand the molecular mechanisms of protein-DNA interaction, it is necessary to identify the DNA-binding sites in DNA-binding proteins. In the last decade, computational approaches have been developed to predict protein-DNA-binding sites based solely on protein sequences. In this study, we developed a novel predictor based on support vector machine algorithm coupled with the maximum relevance minimum redundancy method followed by incremental feature selection. We incorporated not only features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure, solvent accessibility, but also five three-dimensional (3D) structural features calculated from PDB data to predict the protein-DNA interaction sites. Feature analysis showed that 3D structural features indeed contributed to the prediction of DNA-binding site and it was demonstrated that the prediction performance was better with 3D structural features than without them. It was also shown via analysis of features from each site that the features of DNA-binding site itself contribute the most to the prediction. Our prediction method may become a useful tool for identifying the DNA-binding sites and the feature analysis described in this paper may provide useful insights for in-depth investigations into the mechanisms of protein-DNA interaction.
Collapse
Affiliation(s)
- Bi-Qing Li
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, People's Republic of China
| | | | | | | |
Collapse
|
63
|
Liu R, Hu J. DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins 2013; 81:1885-99. [DOI: 10.1002/prot.24330] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2013] [Revised: 05/02/2013] [Accepted: 05/12/2013] [Indexed: 01/10/2023]
Affiliation(s)
- Rong Liu
- Department of Computer Science and Engineering; University of South Carolina; Columbia South Carolina 29208
- Center for Bioinformatics; College of Life Science and Technology; Huazhong Agricultural University; Wuhan 430070 People's Republic of China
| | - Jianjun Hu
- Department of Computer Science and Engineering; University of South Carolina; Columbia South Carolina 29208
| |
Collapse
|
64
|
Abstract
In this study, we present the DNA-Binding Site Identifier (DBSI), a new structure-based method for predicting protein interaction sites for DNA binding. DBSI was trained and validated on a data set of 263 proteins (TRAIN-263), tested on an independent set of protein-DNA complexes (TEST-206) and data sets of 29 unbound (APO-29) and 30 bound (HOLO-30) protein structures distinct from the training data. We computed 480 candidate features for identifying protein residues that bind DNA, including new features that capture the electrostatic microenvironment within shells near the protein surface. Our iterative feature selection process identified features important in other models, as well as features unique to the DBSI model, such as a banded electrostatic feature with spatial separation comparable with the canonical width of the DNA minor groove. Validations and comparisons with established methods using a range of performance metrics clearly demonstrate the predictive advantage of DBSI, and its comparable performance on unbound (APO-29) and bound (HOLO-30) conformations demonstrates robustness to binding-induced protein conformational changes. Finally, we offer our feature data table to others for integration into their own models or for testing improved feature selection and model training strategies based on DBSI.
Collapse
Affiliation(s)
- Xiaolei Zhu
- BACTER Institute, University of Wisconsin-Madison, Madison, WI, USA, Departments of Mathematics and Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | | | | |
Collapse
|
65
|
Zhu Y, Zhou W, Dai DQ, Yan H. Identification of DNA-binding and protein-binding proteins using enhanced graph wavelet features. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1017-1031. [PMID: 24334394 DOI: 10.1109/tcbb.2013.117] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Interactions between biomolecules play an essential role in various biological processes. For predicting DNA-binding or protein-binding proteins, many machine-learning-based techniques have used various types of features to represent the interface of the complexes, but they only deal with the properties of a single atom in the interface and do not take into account the information of neighborhood atoms directly. This paper proposes a new feature representation method for biomolecular interfaces based on the theory of graph wavelet. The enhanced graph wavelet features (EGWF) provides an effective way to characterize interface feature through adding physicochemical features and exploiting a graph wavelet formulation. Particularly, graph wavelet condenses the information around the center atom, and thus enhances the discrimination of features of biomolecule binding proteins in the feature space. Experiment results show that EGWF performs effectively for predicting DNA-binding and protein-binding proteins in terms of Matthew's correlation coefficient (MCC) score and the area value under the receiver operating characteristic curve (AUC).
Collapse
Affiliation(s)
- Yuan Zhu
- Guangdong University of Finance and Economics, Guangzhou and Sun Yat-Sen University, Guangzhou
| | | | | | - Hong Yan
- City University of Hong Kong, Hong Kong and University of Sydney, Sydney
| |
Collapse
|
66
|
Wong KC, Chan TM, Peng C, Li Y, Zhang Z. DNA motif elucidation using belief propagation. Nucleic Acids Res 2013; 41:e153. [PMID: 23814189 PMCID: PMC3763557 DOI: 10.1093/nar/gkt574] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
Protein-binding microarray (PBM) is a high-throughout platform that can measure the DNA-binding preference of a protein in a comprehensive and unbiased manner. A typical PBM experiment can measure binding signal intensities of a protein to all the possible DNA k-mers (k = 8 ∼10); such comprehensive binding affinity data usually need to be reduced and represented as motif models before they can be further analyzed and applied. Since proteins can often bind to DNA in multiple modes, one of the major challenges is to decompose the comprehensive affinity data into multimodal motif representations. Here, we describe a new algorithm that uses Hidden Markov Models (HMMs) and can derive precise and multimodal motifs using belief propagations. We describe an HMM-based approach using belief propagations (kmerHMM), which accepts and preprocesses PBM probe raw data into median-binding intensities of individual k-mers. The k-mers are ranked and aligned for training an HMM as the underlying motif representation. Multiple motifs are then extracted from the HMM using belief propagations. Comparisons of kmerHMM with other leading methods on several data sets demonstrated its effectiveness and uniqueness. Especially, it achieved the best performance on more than half of the data sets. In addition, the multiple binding modes derived by kmerHMM are biologically meaningful and will be useful in interpreting other genome-wide data such as those generated from ChIP-seq. The executables and source codes are available at the authors’ websites: e.g. http://www.cs.toronto.edu/∼wkc/kmerHMM.
Collapse
Affiliation(s)
- Ka-Chun Wong
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada, Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada, Department of Integrative Biology and Physiology, University of California Los Angeles, Los Angeles, CA, USA, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Jeddah, KSA, Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada and Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | | | | | | | | |
Collapse
|
67
|
Nagarajan R, Ahmad S, Gromiha MM. Novel approach for selecting the best predictor for identifying the binding sites in DNA binding proteins. Nucleic Acids Res 2013; 41:7606-14. [PMID: 23788679 PMCID: PMC3763535 DOI: 10.1093/nar/gkt544] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
Protein-DNA complexes play vital roles in many cellular processes by the interactions of amino acids with DNA. Several computational methods have been developed for predicting the interacting residues in DNA-binding proteins using sequence and/or structural information. These methods showed different levels of accuracies, which may depend on the choice of data sets used in training, the feature sets selected for developing a predictive model, the ability of the models to capture information useful for prediction or a combination of these factors. In many cases, different methods are likely to produce similar results, whereas in others, the predictors may return contradictory predictions. In this situation, a priori estimates of prediction performance applicable to the system being investigated would be helpful for biologists to choose the best method for designing their experiments. In this work, we have constructed unbiased, stringent and diverse data sets for DNA-binding proteins based on various biologically relevant considerations: (i) seven structural classes, (ii) 86 folds, (iii) 106 superfamilies, (iv) 194 families, (v) 15 binding motifs, (vi) single/double-stranded DNA, (vii) DNA conformation (A, B, Z, etc.), (viii) three functions and (ix) disordered regions. These data sets were culled as non-redundant with sequence identities of 25 and 40% and used to evaluate the performance of 11 different methods in which online services or standalone programs are available. We observed that the best performing methods for each of the data sets showed significant biases toward the data sets selected for their benchmark. Our analysis revealed important data set features, which could be used to estimate these context-specific biases and hence suggest the best method to be used for a given problem. We have developed a web server, which considers these features on demand and displays the best method that the investigator should use. The web server is freely available at http://www.biotech.iitm.ac.in/DNA-protein/. Further, we have grouped the methods based on their complexity and analyzed the performance. The information gained in this work could be effectively used to select the best method for designing experiments.
Collapse
Affiliation(s)
- R Nagarajan
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai 600036, India and National Institute of Biomedical Innovation, Osaka, Japan
| | | | | |
Collapse
|
68
|
Chan TM, Lo LY, Sze-To HY, Leung KS, Xiao X, Wong MH. Modeling associated protein-DNA pattern discovery with unified scores. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:696-707. [PMID: 24091402 DOI: 10.1109/tcbb.2013.60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
Understanding protein-DNA interactions, specifically transcription factor (TF) and transcription factor binding site (TFBS) bindings, is crucial in deciphering gene regulation. The recent associated TF-TFBS pattern discovery combines one-sided motif discovery on both the TF and the TFBS sides. Using sequences only, it identifies the short protein-DNA binding cores available only in high-resolution 3D structures. The discovered patterns lead to promising subtype and disease analysis applications. While the related studies use either association rule mining or existing TFBS annotations, none has proposed any formal unified (both-sided) model to prioritize the top verifiable associated patterns. We propose the unified scores and develop an effective pipeline for associated TF-TFBS pattern discovery. Our stringent instance-level evaluations show that the patterns with the top unified scores match with the binding cores in 3D structures considerably better than the previous works, where up to 90 percent of the top 20 scored patterns are verified. We also introduce extended verification from literature surveys, where the high unified scores correspond to even higher verification percentage. The top scored patterns are confirmed to match the known WRKY binding cores with no available 3D structures and agree well with the top binding affinities of in vivo experiments.
Collapse
|
69
|
Li T, Li QZ, Liu S, Fan GL, Zuo YC, Peng Y. PreDNA: accurate prediction of DNA-binding sites in proteins by integrating sequence and geometric structure information. ACTA ACUST UNITED AC 2013; 29:678-85. [PMID: 23335013 DOI: 10.1093/bioinformatics/btt029] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023]
Abstract
MOTIVATION Protein-DNA interactions often take part in various crucial processes, which are essential for cellular function. The identification of DNA-binding sites in proteins is important for understanding the molecular mechanisms of protein-DNA interaction. Thus, we have developed an improved method to predict DNA-binding sites by integrating structural alignment algorithm and support vector machine-based methods. RESULTS Evaluated on a new non-redundant protein set with 224 chains, the method has 80.7% sensitivity and 82.9% specificity in the 5-fold cross-validation test. In addition, it predicts DNA-binding sites with 85.1% sensitivity and 85.3% specificity when tested on a dataset with 62 protein-DNA complexes. Compared with a recently published method, BindN+, our method predicts DNA-binding sites with a 7% better area under the receiver operating characteristic curve value when tested on the same dataset. Many important problems in cell biology require the dense non-linear interactions between functional modules be considered. Thus, our prediction method will be useful in detecting such complex interactions.
Collapse
Affiliation(s)
- Tao Li
- Laboratory of Theoretical Biophysics, School of Physical Sciences and Technology, College of Computer Science and The National Research Center for Animal Transgenic Biotechnology, Inner Mongolia University, Hohhot, 010021, China
| | | | | | | | | | | |
Collapse
|
70
|
Gromiha MM, Nagarajan R. Computational approaches for predicting the binding sites and understanding the recognition mechanism of protein-DNA complexes. ADVANCES IN PROTEIN CHEMISTRY AND STRUCTURAL BIOLOGY 2013; 91:65-99. [PMID: 23790211 DOI: 10.1016/b978-0-12-411637-5.00003-2] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/27/2022]
Abstract
Protein-DNA recognition plays an important role in the regulation of gene expression. Understanding the influence of specific residues for protein-DNA interactions and the recognition mechanism of protein-DNA complexes is a challenging task in molecular and computational biology. Several computational approaches have been put forward to tackle these problems from different perspectives: (i) development of databases for the interactions between protein and DNA and binding specificity of protein-DNA complexes, (ii) structural analysis of protein-DNA complexes, (iii) discriminating DNA-binding proteins from amino acid sequence, (iv) prediction of DNA-binding sites and protein-DNA binding specificity using sequence and/or structural information, and (v) understanding the recognition mechanism of protein-DNA complexes. In this review, we focus on all these issues and extensively discuss the advancements on the development of comprehensive bioinformatics databases for protein-DNA interactions, efficient tools for identifying the binding sites, and plausible mechanisms for understanding the recognition of protein-DNA complexes. Further, the available online resources for understanding protein-DNA interactions are collectively listed, which will serve as ready-to-use information for the research community.
Collapse
Affiliation(s)
- M Michael Gromiha
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai, Tamil Nadu, India.
| | | |
Collapse
|
71
|
PROSPER: an integrated feature-based tool for predicting protease substrate cleavage sites. PLoS One 2012; 7:e50300. [PMID: 23209700 PMCID: PMC3510211 DOI: 10.1371/journal.pone.0050300] [Citation(s) in RCA: 228] [Impact Index Per Article: 17.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2012] [Accepted: 10/18/2012] [Indexed: 12/04/2022] Open
Abstract
The ability to catalytically cleave protein substrates after synthesis is fundamental for all forms of life. Accordingly, site-specific proteolysis is one of the most important post-translational modifications. The key to understanding the physiological role of a protease is to identify its natural substrate(s). Knowledge of the substrate specificity of a protease can dramatically improve our ability to predict its target protein substrates, but this information must be utilized in an effective manner in order to efficiently identify protein substrates by in silico approaches. To address this problem, we present PROSPER, an integrated feature-based server for in silico identification of protease substrates and their cleavage sites for twenty-four different proteases. PROSPER utilizes established specificity information for these proteases (derived from the MEROPS database) with a machine learning approach to predict protease cleavage sites by using different, but complementary sequence and structure characteristics. Features used by PROSPER include local amino acid sequence profile, predicted secondary structure, solvent accessibility and predicted native disorder. Thus, for proteases with known amino acid specificity, PROSPER provides a convenient, pre-prepared tool for use in identifying protein substrates for the enzymes. Systematic prediction analysis for the twenty-four proteases thus far included in the database revealed that the features we have included in the tool strongly improve performance in terms of cleavage site prediction, as evidenced by their contribution to performance improvement in terms of identifying known cleavage sites in substrates for these enzymes. In comparison with two state-of-the-art prediction tools, PoPS and SitePrediction, PROSPER achieves greater accuracy and coverage. To our knowledge, PROSPER is the first comprehensive server capable of predicting cleavage sites of multiple proteases within a single substrate sequence using machine learning techniques. It is freely available at http://lightning.med.monash.edu.au/PROSPER/.
Collapse
|
72
|
Ma X, Guo J, Liu HD, Xie JM, Sun X. Sequence-based prediction of DNA-binding residues in proteins with conservation and correlation information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:1766-1775. [PMID: 22868682 DOI: 10.1109/tcbb.2012.106] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
The recognition of DNA-binding residues in proteins is critical to our understanding of the mechanisms of DNA-protein interactions, gene expression, and for guiding drug design. Therefore, a prediction method DNABR (DNA Binding Residues) is proposed for predicting DNA-binding residues in protein sequences using the random forest (RF) classifier with sequence-based features. Two types of novel sequence features are proposed in this study, which reflect the information about the conservation of physicochemical properties of the amino acids, and the correlation of amino acids between different sequence positions in terms of physicochemical properties. The first type of feature uses the evolutionary information combined with the conservation of physicochemical properties of the amino acids while the second reflects the dependency effect of amino acids with regards to polarity charge and hydrophobic properties in the protein sequences. Those two features and an orthogonal binary vector which reflect the characteristics of 20 types of amino acids are used to build the DNABR, a model to predict DNA-binding residues in proteins. The DNABR model achieves a value of 0.6586 for Matthew’s correlation coefficient (MCC) and 93.04 percent overall accuracy (ACC) with a68.47 percent sensitivity (SE) and 98.16 percent specificity (SP), respectively. The comparisons with each feature demonstrate that these two novel features contribute most to the improvement in predictive ability. Furthermore, performance comparisons with other approaches clearly show that DNABR has an excellent prediction performance for detecting binding residues in putative DNA-binding protein. The DNABR web-server system is freely available at http://www.cbi.seu.edu.cn/DNABR/.
Collapse
Affiliation(s)
- Xin Ma
- State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University and Nanjing Audit University, Nanjing, P.R. China.
| | | | | | | | | |
Collapse
|
73
|
Wang DD, Li TH, Sun JM, Li DP, Xiong WW, Wang WY, Tang SN. Shape string: a new feature for prediction of DNA-binding residues. Biochimie 2012; 95:354-8. [PMID: 23116714 DOI: 10.1016/j.biochi.2012.10.006] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2012] [Accepted: 10/08/2012] [Indexed: 10/27/2022]
Abstract
Protein-DNA interactions are involved in many biological processes essential for gene expression and regulation. To understand the molecular mechanisms of protein-DNA recognition, it is crucial to analyze and identify DNA-binding residues of protein-DNA complexes. Here, we proposed a novel descriptor shape string and another two related features shape string PSSM and shape string pair composition to characterize DNA-binding residues. We employed the new features and the position-specific scoring matrix (PSSM) for modeling and prediction. The results of a benchmark dataset showed that our approach significantly improved the accuracy of the predictor. The overall accuracy of our approach reached 85.86% with 85.02% sensitivity and 86.02% specificity. The results also demonstrated that shape string is a powerful descriptor for the prediction of DNA-binding residues. The additional two related features enhanced the predictive value.
Collapse
Affiliation(s)
- Duo-Duo Wang
- Department of Chemistry, Tongji University, Shanghai 200092, PR China
| | | | | | | | | | | | | |
Collapse
|
74
|
Chan TM, Leung KS, Lee KH, Wong MH, Lau TCK, Tsui SKW. Subtypes of associated protein-DNA (Transcription Factor-Transcription Factor Binding Site) patterns. Nucleic Acids Res 2012; 40:9392-403. [PMID: 22904079 PMCID: PMC3479201 DOI: 10.1093/nar/gks749] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open
Abstract
In protein–DNA interactions, particularly transcription factor (TF) and transcription factor binding site (TFBS) bindings, associated residue variations form patterns denoted as subtypes. Subtypes may lead to changed binding preferences, distinguish conserved from flexible binding residues and reveal novel binding mechanisms. However, subtypes must be studied in the context of core bindings. While solving 3D structures would require huge experimental efforts, recent sequence-based associated TF-TFBS pattern discovery has shown to be promising, upon which a large-scale subtype study is possible and desirable. In this article, we investigate residue-varying subtypes based on associated TF-TFBS patterns. By re-categorizing the patterns with respect to varying TF amino acids, statistically significant (P values ≤ 0.005) subtypes leading to varying TFBS patterns are discovered without using TF family or domain annotations. Resultant subtypes have various biological meanings. The subtypes reflect familial and functional properties and exhibit changed binding preferences supported by 3D structures. Conserved residues critical for maintaining TF-TFBS bindings are revealed by analyzing the subtypes. In-depth analysis on the subtype pair PKVVIL-CACGTG versus PKVEIL-CAGCTG shows the V/E variation is indicative for distinguishing Myc from MRF families. Discovered from sequences only, the TF-TFBS subtypes are informative and promising for more biological findings, complementing and extending recent one-sided subtype and familial studies with comprehensive evidence.
Collapse
Affiliation(s)
- Tak-Ming Chan
- Department of Computer Science & Engineering, The Chinese University of Hong Kong, Shatin, N T, Hong Kong.
| | | | | | | | | | | |
Collapse
|
75
|
Schaefer C, Bromberg Y, Achten D, Rost B. Disease-related mutations predicted to impact protein function. BMC Genomics 2012; 13 Suppl 4:S11. [PMID: 22759649 PMCID: PMC3394413 DOI: 10.1186/1471-2164-13-s4-s11] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Background Non-synonymous single nucleotide polymorphisms (nsSNPs) alter the protein sequence and can cause disease. The impact has been described by reliable experiments for relatively few mutations. Here, we study predictions for functional impact of disease-annotated mutations from OMIM, PMD and Swiss-Prot and of variants not linked to disease. Results Most disease-causing mutations were predicted to impact protein function. More surprisingly, the raw predictions scores for disease-causing mutations were higher than the scores for the function-altering data set originally used for developing the prediction method (here SNAP). We might expect that diseases are caused by change-of-function mutations. However, it is surprising how well prediction methods developed for different purposes identify this link. Conversely, our predictions suggest that the set of nsSNPs not currently linked to diseases contains very few strong disease associations to be discovered. Conclusions Firstly, annotations of disease-causing nsSNPs are on average so reliable that they can be used as proxies for functional impact. Secondly, disease-causing nsSNPs can be identified very well by methods that predict the impact of mutations on protein function. This implies that the existing prediction methods provide a very good means of choosing a set of suspect SNPs relevant for disease.
Collapse
Affiliation(s)
- Christian Schaefer
- Bioinformatics-i12, Informatics, Technical University Munich, Boltzmannstrasse 3, Garching/Munich, Germany.
| | | | | | | |
Collapse
|
76
|
Abstract
Background Amino acid point mutations (nsSNPs) may change protein structure and function. However, no method directly predicts the impact of mutations on structure. Here, we compare pairs of pentamers (five consecutive residues) that locally change protein three-dimensional structure (3D, RMSD>0.4Å) to those that do not alter structure (RMSD<0.2Å). Mutations that alter structure locally can be distinguished from those that do not through a machine-learning (logistic regression) method. Results The method achieved a rather high overall performance (AUC>0.79, two-state accuracy >72%). This discriminative power was particularly unexpected given the enormous structural variability of pentamers. Mutants for which our method predicted a change of structure were also enriched in terms of disrupting stability and function. Although distinguishing change and no change in structure, the new method overall failed to distinguish between mutants with and without effect on stability or function. Conclusions Local structural change can be predicted. Future work will have to establish how useful this new perspective on predicting the effect of nsSNPs will be in combination with other methods.
Collapse
Affiliation(s)
- Christian Schaefer
- TUM, Bioinformatics-I12, Informatik, Boltzmannstrasse 3, Garching, Germany.
| | | |
Collapse
|
77
|
Chen YC, Wright JD, Lim C. DR_bind: a web server for predicting DNA-binding residues from the protein structure based on electrostatics, evolution and geometry. Nucleic Acids Res 2012; 40:W249-56. [PMID: 22661576 PMCID: PMC3394278 DOI: 10.1093/nar/gks481] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
DR_bind is a web server that automatically predicts DNA-binding residues, given the respective protein structure based on (i) electrostatics, (ii) evolution and (iii) geometry. In contrast to machine-learning methods, DR_bind does not require a training data set or any parameters. It predicts DNA-binding residues by detecting a cluster of conserved, solvent-accessible residues that are electrostatically stabilized upon mutation to Asp−/Glu−. The server requires as input the DNA-binding protein structure in PDB format and outputs a downloadable text file of the predicted DNA-binding residues, a 3D visualization of the predicted residues highlighted in the given protein structure, and a downloadable PyMol script for visualization of the results. Calibration on 83 and 55 non-redundant DNA-bound and DNA-free protein structures yielded a DNA-binding residue prediction accuracy/precision of 90/47% and 88/42%, respectively. Since DR_bind does not require any training using protein–DNA complex structures, it may predict DNA-binding residues in novel structures of DNA-binding proteins resulting from structural genomics projects with no conservation data. The DR_bind server is freely available with no login requirement at http://dnasite.limlab.ibms.sinica.edu.tw.
Collapse
Affiliation(s)
- Yao Chi Chen
- Institute of Biomedical Sciences, Genomics Research Center, Academia Sinica, Taipei 115, Taiwan
| | | | | |
Collapse
|
78
|
Dey S, Pal A, Guharoy M, Sonavane S, Chakrabarti P. Characterization and prediction of the binding site in DNA-binding proteins: improvement of accuracy by combining residue composition, evolutionary conservation and structural parameters. Nucleic Acids Res 2012; 40:7150-61. [PMID: 22641851 PMCID: PMC3424558 DOI: 10.1093/nar/gks405] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022] Open
Abstract
We present a set of four parameters that in combination can predict DNA-binding residues on protein structures to a high degree of accuracy. These are the number of evolutionary conserved residues (Ncons) and their spatial clustering (ρe), hydrogen bond donor capability (Dp) and residue propensity (Rp). We first used these parameters to characterize 130 interfaces in a set of 126 DNA-binding proteins (DBPs). The applicability of these parameters both individually and in combination, to distinguish the true binding region from the rest of the protein surface was then analyzed. Rp shows the best performance identifying the true interface with the top rank in 83% cases. Importantly, we also used the unbound-bound test cases of the protein–DNA docking benchmark to test the efficacy of our method. When applied to the unbound form of the DBPs, Rp can distinguish 86% cases. Finally, we have applied the SVM approach for recognizing the interface region using the above parameters along with the individual amino acid composition as attributes. The accuracy of prediction is 90.5% for the bound structures and 93.6% for the unbound form of the proteins.
Collapse
Affiliation(s)
- Sucharita Dey
- Bioinformatics Centre, Bose Institute, P-1/12 CIT Scheme VIIM, Kolkata 700 054, India
| | | | | | | | | |
Collapse
|
79
|
Qin S, Zhou HX. Structural models of protein-DNA complexes based on interface prediction and docking. Curr Protein Pept Sci 2012; 12:531-9. [PMID: 21787304 DOI: 10.2174/138920311796957694] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2011] [Revised: 04/01/2011] [Accepted: 05/04/2011] [Indexed: 11/22/2022]
Abstract
Protein-DNA interactions are the physical basis of gene expression and DNA modification. Structural models that reveal these interactions are essential for their understanding. As only a limited number of structures for protein-DNA complexes have been determined by experimental methods, computation methods provide a potential way to fill the need. We have developed the DISPLAR method to predict DNA binding sites on proteins. Predicted binding sites have been used to assist the building of structural models by docking, either by guiding the docking or by selecting near-native candidates from the docked poses. Here we applied the DISPLAR method to predict the DNA binding sites for 20 DNA-binding proteins, which have had their DNA binding sites characterized by NMR chemical shift perturbation. For two of these proteins, the structures of their complexes with DNA have also been determined. With the help of the DISPLAR predictions, we built structural models for these two complexes. Evaluations of both the DNA binding sites for 20 proteins and the structural models of the two protein-DNA complexes against experimental results demonstrate the significant promise of our model-building approach.
Collapse
Affiliation(s)
- Sanbo Qin
- Department of Physics and Institute of Molecular Biophysics, Florida State University, Tallahassee, FL 32306, USA
| | | |
Collapse
|
80
|
Walia RR, Caragea C, Lewis BA, Towfic F, Terribilini M, El-Manzalawy Y, Dobbs D, Honavar V. Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art. BMC Bioinformatics 2012; 13:89. [PMID: 22574904 PMCID: PMC3490755 DOI: 10.1186/1471-2105-13-89] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2011] [Accepted: 05/10/2012] [Indexed: 11/15/2022] Open
Abstract
BACKGROUND RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. RESULTS We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. CONCLUSIONS Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.
Collapse
Affiliation(s)
- Rasna R Walia
- Bioinformatics and Computational Biology Program, Iowa State University, USA
- Department of Computer Science, Iowa State University, USA
| | - Cornelia Caragea
- Center for Computational Intelligence, Learning and Discovery, Iowa State University, USA
- College of Information Sciences & Technology, The Pennsylvania State University, University Park, USA
| | - Benjamin A Lewis
- Bioinformatics and Computational Biology Program, Iowa State University, USA
- Department of Genetics, Development and Cell Biology, , USA
| | - Fadi Towfic
- Center for Computational Intelligence, Learning and Discovery, Iowa State University, USA
- The Broad Institute, USA
| | | | - Yasser El-Manzalawy
- Department of Computer Science, Iowa State University, USA
- Center for Computational Intelligence, Learning and Discovery, Iowa State University, USA
- Department of Systems & Computer Engineering, Al-Azhar University, Egypt
| | - Drena Dobbs
- Bioinformatics and Computational Biology Program, Iowa State University, USA
- Department of Genetics, Development and Cell Biology, , USA
| | - Vasant Honavar
- Bioinformatics and Computational Biology Program, Iowa State University, USA
- Department of Computer Science, Iowa State University, USA
- Center for Computational Intelligence, Learning and Discovery, Iowa State University, USA
| |
Collapse
|
81
|
Qi Y, Oja M, Weston J, Noble WS. A unified multitask architecture for predicting local protein properties. PLoS One 2012; 7:e32235. [PMID: 22461885 PMCID: PMC3312883 DOI: 10.1371/journal.pone.0032235] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2011] [Accepted: 01/25/2012] [Indexed: 01/27/2023] Open
Abstract
A variety of functionally important protein properties, such as secondary structure, transmembrane topology and solvent accessibility, can be encoded as a labeling of amino acids. Indeed, the prediction of such properties from the primary amino acid sequence is one of the core projects of computational biology. Accordingly, a panoply of approaches have been developed for predicting such properties; however, most such approaches focus on solving a single task at a time. Motivated by recent, successful work in natural language processing, we propose to use multitask learning to train a single, joint model that exploits the dependencies among these various labeling tasks. We describe a deep neural network architecture that, given a protein sequence, outputs a host of predicted local properties, including secondary structure, solvent accessibility, transmembrane topology, signal peptides and DNA-binding residues. The network is trained jointly on all these tasks in a supervised fashion, augmented with a novel form of semi-supervised learning in which the model is trained to distinguish between local patterns from natural and synthetic protein sequences. The task-independent architecture of the network obviates the need for task-specific feature engineering. We demonstrate that, for all of the tasks that we considered, our approach leads to statistically significant improvements in performance, relative to a single task neural network approach, and that the resulting model achieves state-of-the-art performance.
Collapse
Affiliation(s)
- Yanjun Qi
- Machine Learning Department, NEC Labs America, Princeton, New Jersey, United States of America
| | - Merja Oja
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
| | - Jason Weston
- Google, New York, New York, United States of America
| | - William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- * E-mail:
| |
Collapse
|
82
|
Song J, Tan H, Wang M, Webb GI, Akutsu T. TANGLE: two-level support vector regression approach for protein backbone torsion angle prediction from primary sequences. PLoS One 2012; 7:e30361. [PMID: 22319565 PMCID: PMC3271071 DOI: 10.1371/journal.pone.0030361] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2011] [Accepted: 12/14/2011] [Indexed: 12/29/2022] Open
Abstract
Protein backbone torsion angles (Phi) and (Psi) involve two rotation angles rotating around the Cα-N bond (Phi) and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles can essentially determine the backbone geometry of proteins. Accordingly, the accurate prediction of protein backbone torsion angle from sequence information can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including the evolutionary profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered region as well as other global sequence features. When evaluated based on a large benchmark dataset of 1,526 non-homologous proteins, the mean absolute errors (MAEs) of the Phi and Psi angle prediction are 27.8° and 44.6°, respectively, which are 1% and 3% respectively lower than that using one of the state-of-the-art prediction tools ANGLOR. Moreover, the prediction of TANGLE is significantly better than a random predictor that was built on the amino acid-specific basis, with the p-value<1.46e-147 and 7.97e-150, respectively by the Wilcoxon signed rank test. As a complementary approach to the current torsion angle prediction algorithms, TANGLE should prove useful in predicting protein structural properties and assisting protein fold recognition by applying the predicted torsion angles as useful restraints. TANGLE is freely accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/.
Collapse
Affiliation(s)
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
- * E-mail: (JS); (GIW); (TA)
| | - Hao Tan
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
| | - Mingjun Wang
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
| | - Geoffrey I. Webb
- Faculty of Information Technology, Monash University, Melbourne, Victoria, Australia
- * E-mail: (JS); (GIW); (TA)
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan
- * E-mail: (JS); (GIW); (TA)
| |
Collapse
|
83
|
Xiong Y, Xia J, Zhang W, Liu J. Exploiting a reduced set of weighted average features to improve prediction of DNA-binding residues from 3D structures. PLoS One 2011; 6:e28440. [PMID: 22174808 PMCID: PMC3234263 DOI: 10.1371/journal.pone.0028440] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2011] [Accepted: 11/08/2011] [Indexed: 01/29/2023] Open
Abstract
Predicting DNA-binding residues from a protein three-dimensional structure is a key task of computational structural proteomics. In the present study, based on machine learning technology, we aim to explore a reduced set of weighted average features for improving prediction of DNA-binding residues on protein surfaces. Via constructing the spatial environment around a DNA-binding residue, a novel weighting factor is first proposed to quantify the distance-dependent contribution of each neighboring residue in determining the location of a binding residue. Then, a weighted average scheme is introduced to represent the surface patch of the considering residue. Finally, the classifier is trained on the reduced set of these weighted average features, consisting of evolutionary profile, interface propensity, betweenness centrality and solvent surface area of side chain. Experimental results on 5-fold cross validation and independent tests indicate that the new feature set are effective to describe DNA-binding residues and our approach has significantly better performance than two previous methods. Furthermore, a brief case study suggests that the weighted average features are powerful for identifying DNA-binding residues and are promising for further study of protein structure-function relationship. The source code and datasets are available upon request.
Collapse
Affiliation(s)
- Yi Xiong
- School of Computer, Wuhan University, Wuhan, China
| | - Junfeng Xia
- Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, Tennessee, United States of America
| | - Wen Zhang
- School of Computer, Wuhan University, Wuhan, China
| | - Juan Liu
- School of Computer, Wuhan University, Wuhan, China
- * E-mail:
| |
Collapse
|
84
|
Sarfstein R, Pasmanik-Chor M, Yeheskel A, Edry L, Shomron N, Warman N, Wertheimer E, Maor S, Shochat L, Werner H. Insulin-like growth factor-I receptor (IGF-IR) translocates to nucleus and autoregulates IGF-IR gene expression in breast cancer cells. J Biol Chem 2011; 287:2766-76. [PMID: 22128190 DOI: 10.1074/jbc.m111.281782] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022] Open
Abstract
The insulin-like growth factor (IGF) system plays an important role in mammary gland biology as well as in the etiology of breast cancer. The IGF-I receptor (IGF-IR), which mediates the biological actions of IGF-I and IGF-II, has emerged in recent years as a promising therapeutic target. The IGF and estrogen signaling pathways act in a synergistic manner in breast epithelial cells. The present study was aimed at investigating 1) the putative translocation of IGF-IR and the related insulin receptor (IR) to the nucleus in breast cancer cells, 2) the impact of IGF-IR and IR levels on IGF-IR biosynthesis in estrogen receptor (ER)-positive and ER-depleted breast cancer cells, and 3) the potential transcription factor role of IGF-IR in the specific context of IGF-IR gene regulation. We describe here a novel mechanism of autoregulation of IGF-IR gene expression by cellular IGF-IR, which is seemingly dependent on ER status. Regulation of the IGF-IR gene by IGF-IR protein is mediated at the level of transcription, as demonstrated by 1) binding assays (DNA affinity chromatography and ChIP) showing specific IGF-IR binding to IGF-IR promoter DNA and 2) transient transfection assays showing transactivation of the IGF-IR promoter by exogenous IGF-IR. The IR is also capable of translocating to the nucleus and binding the IGF-IR promoter in ER-depleted, but not in ER-positive, cells. However, transcription factors IGF-IR and IR display diametrically opposite activities in the context of IGF-IR gene regulation. Thus, whereas IGF-IR stimulated IGF-IR gene expression, IR inhibited IGF-IR promoter activity. In summary, we have identified a novel mechanism of IGF-IR gene autoregulation in breast cancer cells. The clinical implications of these findings and, in particular, the impact of IGF-IR/IR nuclear localization on targeted therapy require further investigation.
Collapse
Affiliation(s)
- Rive Sarfstein
- Department of Human Molecular Genetics and Biochemistry, Tel Aviv University, Tel Aviv 69978, Israel
| | | | | | | | | | | | | | | | | | | |
Collapse
|
85
|
Milanowska K, Rother K, Bujnicki JM. Databases and bioinformatics tools for the study of DNA repair. Mol Biol Int 2011; 2011:475718. [PMID: 22091405 PMCID: PMC3200286 DOI: 10.4061/2011/475718] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2011] [Revised: 04/28/2011] [Accepted: 05/22/2011] [Indexed: 12/12/2022] Open
Abstract
DNA is continuously exposed to many different damaging agents such as environmental chemicals, UV light, ionizing radiation, and reactive cellular metabolites. DNA lesions can result in different phenotypical consequences ranging from a number of diseases, including cancer, to cellular malfunction, cell death, or aging. To counteract the deleterious effects of DNA damage, cells have developed various repair systems, including biochemical pathways responsible for the removal of single-strand lesions such as base excision repair (BER) and nucleotide excision repair (NER) or specialized polymerases temporarily taking over lesion-arrested DNA polymerases during the S phase in translesion synthesis (TLS). There are also other mechanisms of DNA repair such as homologous recombination repair (HRR), nonhomologous end-joining repair (NHEJ), or DNA damage response system (DDR). This paper reviews bioinformatics resources specialized in disseminating information about DNA repair pathways, proteins involved in repair mechanisms, damaging agents, and DNA lesions.
Collapse
Affiliation(s)
- Kaja Milanowska
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, ul. Ks. Trojdena 4, 02-109 Warsaw, Poland
| | | | | |
Collapse
|
86
|
Wang CC, Chen CY. Predicting DNA-binding locations and orientation on proteins using knowledge-based learning of geometric properties. Proteome Sci 2011; 9 Suppl 1:S11. [PMID: 22166082 PMCID: PMC3289072 DOI: 10.1186/1477-5956-9-s1-s11] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND DNA-binding proteins perform their functions through specific or non-specific sequence recognition. Although many sequence- or structure-based approaches have been proposed to identify DNA-binding residues on proteins or protein-binding sites on DNA sequences with satisfied performance, it remains a challenging task to unveil the exact mechanism of protein-DNA interactions without crystal complex structures. Without information from complexes, the linkages between DNA-binding proteins and their binding sites on DNA are still missing. METHODS While it is still difficult to acquire co-crystallized structures in an efficient way, this study proposes a knowledge-based learning method to effectively predict DNA orientation and base locations around the protein's DNA-binding sites when given a protein structure. First, the functionally important residues of a query protein are predicted by a sequential pattern mining tool. After that, surface residues falling in the predicted functional regions are determined based on the given structure. These residues are then clustered based on their spatial coordinates and the resultant clusters are ranked by a proposed DNA-binding propensity function. Clusters with high DNA-binding propensities are treated as DNA-binding units (DBUs) and each DBU is analyzed by principal component analysis (PCA) to predict potential orientation of DNA grooves. More specifically, the proposed method is developed to predict the direction of the tangent line to the helix curve of the DNA groove where a DBU is going to bind. RESULTS This paper proposes a knowledge-based learning procedure to determine the spatial location of the DNA groove with respect to the query protein structure by considering geometric propensity between protein side chains and DNA bases. The 11 test cases used in this study reveal that the location and orientation of the DNA groove around a selected DBU can be predicted with satisfied errors. CONCLUSIONS This study presents a method to predict the location and orientation of DNA grooves with respect to the structure of a DNA-binding protein. The test cases shown in this study reveal the possibility of imaging protein-DNA binding conformation before co-crystallized structure can be determined. How the proposed method can be incorporated with existing protein-DNA docking tools to study protein-DNA interactions deserve further studies in the near future.
Collapse
Affiliation(s)
- Chien-Chih Wang
- Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei 106, Taiwan.
| | | |
Collapse
|
87
|
Fedonin GG, Rakhmaninova AB, Korostelev YD, Laikova ON, Gelfand MS. Machine learning study of DNA binding by transcription factors from the LacI family. Mol Biol 2011. [DOI: 10.1134/s0026893311040054] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
88
|
Conformational studies and solvent-accessible surface area analysis of known selective DNA G-Quadruplex binders. Biochimie 2011; 93:1267-74. [DOI: 10.1016/j.biochi.2011.06.014] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2011] [Accepted: 06/14/2011] [Indexed: 12/18/2022]
|
89
|
Si J, Zhang Z, Lin B, Schroeder M, Huang B. MetaDBSite: a meta approach to improve protein DNA-binding sites prediction. BMC SYSTEMS BIOLOGY 2011; 5 Suppl 1:S7. [PMID: 21689482 PMCID: PMC3121123 DOI: 10.1186/1752-0509-5-s1-s7] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Background Protein-DNA interactions play an important role in many fundamental biological activities such as DNA replication, transcription and repair. Identification of amino acid residues involved in DNA binding site is critical for understanding of the mechanism of gene regulations. In the last decade, there have been a number of computational approaches developed to predict protein-DNA binding sites based on protein sequence and/or structural information. Results In this article, we present metaDBSite, a meta web server to predict DNA-binding residues for DNA-binding proteins. MetaDBSite integrates the prediction results from six available online web servers: DISIS, DNABindR, BindN, BindN-rf, DP-Bind and DBS-PRED and it solely uses sequence information of proteins. A large dataset of DNA-binding proteins is constructed from the Protein Data Bank and it serves as a gold-standard benchmark to evaluate the metaDBSite approach and the other six predictors. Conclusions The comparison results show that metaDBSite outperforms single individual approach. We believe that metaDBSite will become a useful and integrative tool for protein DNA-binding residues prediction. The MetaDBSite web-server is freely available at http://projects.biotec.tu-dresden.de/metadbsite/ and http://sysbio.zju.edu.cn/metadbsite.
Collapse
Affiliation(s)
- Jingna Si
- Systems Biology Division, Zhejiang-California International NanoSystems Institute, Zhejiang University, Kaixuan Road 268, Hangzhou, China
| | | | | | | | | |
Collapse
|
90
|
Xiong Y, Liu J, Wei DQ. An accurate feature-based method for identifying DNA-binding residues on protein surfaces. Proteins 2011; 79:509-17. [PMID: 21069866 DOI: 10.1002/prot.22898] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023]
Abstract
Proteins that interact with DNA play vital roles in all mechanisms of gene expression and regulation. In order to understand these activities, it is crucial to analyze and identify DNA-binding residues on DNA-binding protein surfaces. Here, we proposed two novel features B-factor and packing density in combination with several conventional features to characterize the DNA-binding residues in a well-constructed representative dataset of 119 protein-DNA complexes from the Protein Data Bank (PDB). Based on the selected features, a prediction model for DNA-binding residues was constructed using support vector machine (SVM). The predictor was evaluated using a 5-fold cross validation on above dataset of 123 DNA-binding proteins. Moreover, two independent datasets of 83 DNA-bound protein structures and their corresponding DNA-free forms were compiled. The B-factor and packing density features were statistically analyzed on these 83 pairs of holo-apo proteins structures. Finally, we developed the SVM model to accurately predict DNA-binding residues on protein surface, given the DNA-free structure of a protein. Results showed here indicate that our method represents a significant improvement of previously existing approaches such as DISPLAR. The observation suggests that our method will be useful in studying protein-DNA interactions to guide consequent works such as site-directed mutagenesis and protein-DNA docking.
Collapse
Affiliation(s)
- Yi Xiong
- School of Computer, Wuhan University, Wuhan 430072, People's Republic of China
| | | | | |
Collapse
|
91
|
Gromiha MM, Fukui K. Scoring function based approach for locating binding sites and understanding recognition mechanism of protein-DNA complexes. J Chem Inf Model 2011; 51:721-9. [PMID: 21361378 DOI: 10.1021/ci1003703] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
Abstract
Protein-DNA recognition plays an essential role in the regulation of gene expression. Understanding the recognition mechanism of protein-DNA complexes is a challenging task in molecular and computational biology. In this work, a scoring function based approach has been developed for identifying the binding sites and delineating the important residues for binding in protein-DNA complexes. This approach considers both the repulsive interactions and the effect of distance between atoms in protein and DNA. The results showed that positively charged, polar, and aromatic residues are important for binding. These residues influence the formation of electrostatic, hydrogen bonding, and stacking interactions. Our observation has been verified with experimental binding specificity of protein-DNA complexes and found to be in good agreement with experiments. The comparison of protein-RNA and protein-DNA complexes reveals that the contribution of phosphate atoms in DNA is twice as large as in protein-RNA complexes. Furthermore, we observed that the positively charged, polar, and aromatic residues serve as hotspot residues in protein-RNA complexes, whereas other residues also altered the binding specificity in protein-DNA complexes. Based on the results obtained in the present study and related reports, a plausible mechanism has been proposed for the recognition of protein-DNA complexes.
Collapse
Affiliation(s)
- M Michael Gromiha
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai, Tamilnadu, India.
| | | |
Collapse
|
92
|
Wong KC, Peng C, Wong MH, Leung KS. Generalizing and learning protein-DNA binding sequence representations by an evolutionary algorithm. Soft comput 2011. [DOI: 10.1007/s00500-011-0692-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
93
|
van Ham TJ, Holmberg MA, van der Goot AT, Teuling E, Garcia-Arencibia M, Kim HE, Du D, Thijssen KL, Wiersma M, Burggraaff R, van Bergeijk P, van Rheenen J, Jerre van Veluw G, Hofstra RMW, Rubinsztein DC, Nollen EAA. Identification of MOAG-4/SERF as a regulator of age-related proteotoxicity. Cell 2010; 142:601-12. [PMID: 20723760 DOI: 10.1016/j.cell.2010.07.020] [Citation(s) in RCA: 100] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2009] [Revised: 03/31/2010] [Accepted: 06/17/2010] [Indexed: 01/03/2023]
Abstract
Fibrillar protein aggregates are the major pathological hallmark of several incurable, age-related, neurodegenerative disorders. These aggregates typically contain aggregation-prone pathogenic proteins, such as amyloid-beta in Alzheimer's disease and alpha-synuclein in Parkinson's disease. It is, however, poorly understood how these aggregates are formed during cellular aging. Here we identify an evolutionarily highly conserved modifier of aggregation, MOAG-4, as a positive regulator of aggregate formation in C. elegans models for polyglutamine diseases. Inactivation of MOAG-4 suppresses the formation of compact polyglutamine aggregation intermediates that are required for aggregate formation. The role of MOAG-4 in driving aggregation extends to amyloid-beta and alpha-synuclein and is evolutionarily conserved in its human orthologs SERF1A and SERF2. MOAG-4/SERF appears to act independently from HSF-1-induced molecular chaperones, proteasomal degradation, and autophagy. Our results suggest that MOAG-4/SERF regulates age-related proteotoxicity through a previously unexplored pathway, which will open up new avenues for research on age-related, neurodegenerative diseases.
Collapse
Affiliation(s)
- Tjakko J van Ham
- Department of Genetics, University Medical Centre Groningen, University of Groningen, Hanzeplein 1, 9700 RB Groningen, the Netherlands
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
94
|
Cai Y, He Z, Shi X, Kong X, Gu L, Xie L. A novel sequence-based method of predicting protein DNA-binding residues, using a machine learning approach. Mol Cells 2010; 30:99-105. [PMID: 20706794 DOI: 10.1007/s10059-010-0093-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2009] [Revised: 04/06/2010] [Accepted: 04/22/2010] [Indexed: 11/29/2022] Open
Abstract
Protein-DNA interactions play an essential role in transcriptional regulation, DNA repair, and many vital biological processes. The mechanism of protein-DNA binding, however, remains unclear. For the study of many diseases, researchers must improve their understanding of the amino acid motifs that recognize DNA. Because identifying these motifs experimentally is expensive and time-consuming, it is necessary to devise an approach for computational prediction. Some in silico methods have been developed, but there are still considerable limitations. In this study, we used a machine learning approach to develop a new sequence-based method of predicting protein-DNA binding residues. To make these predictions, we used the properties of the micro-environment of each amino acid from the AAIndex as well as conservation scores. Testing by the cross-validation method, we obtained an overall accuracy of 94.89%. Our method shows that the amino acid micro-environment is important for DNA binding, and that it is possible to identify the protein-DNA binding sites with it.
Collapse
Affiliation(s)
- Yudong Cai
- Institute of System Biology, Shanghai University, Shanghai, 200244, People's Republic of China.
| | | | | | | | | | | |
Collapse
|
95
|
Leung KS, Wong KC, Chan TM, Wong MH, Lee KH, Lau CK, Tsui SKW. Discovering protein-DNA binding sequence patterns using association rule mining. Nucleic Acids Res 2010; 38:6324-37. [PMID: 20529874 PMCID: PMC2965231 DOI: 10.1093/nar/gkq500] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
Protein–DNA bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs) play an essential role in transcriptional regulation. Over the past decades, significant efforts have been made to study the principles for protein–DNA bindings. However, it is considered that there are no simple one-to-one rules between amino acids and nucleotides. Many methods impose complicated features beyond sequence patterns. Protein-DNA bindings are formed from associated amino acid and nucleotide sequence pairs, which determine many functional characteristics. Therefore, it is desirable to investigate associated sequence patterns between TFs and TFBSs. With increasing computational power, availability of massive experimental databases on DNA and proteins, and mature data mining techniques, we propose a framework to discover associated TF–TFBS binding sequence patterns in the most explicit and interpretable form from TRANSFAC. The framework is based on association rule mining with Apriori algorithm. The patterns found are evaluated by quantitative measurements at several levels on TRANSFAC. With further independent verifications from literatures, Protein Data Bank and homology modeling, there are strong evidences that the patterns discovered reveal real TF–TFBS bindings across different TFs and TFBSs, which can drive for further knowledge to better understand TF–TFBS bindings.
Collapse
Affiliation(s)
- Kwong-Sak Leung
- Department of Computer Science & Engineering, The Chinese University of Hong Kong, and Hong Kong Bioinformatics Centre, Shatin, NT, Hong Kong, China
| | | | | | | | | | | | | |
Collapse
|
96
|
Ozbek P, Soner S, Erman B, Haliloglu T. DNABINDPROT: fluctuation-based predictor of DNA-binding residues within a network of interacting residues. Nucleic Acids Res 2010; 38:W417-23. [PMID: 20478828 PMCID: PMC2896127 DOI: 10.1093/nar/gkq396] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open
Abstract
DNABINDPROT is designed to predict DNA-binding residues, based on the fluctuations of residues in high-frequency modes by the Gaussian network model. The residue pairs that display high mean-square distance fluctuations are analyzed with respect to DNA binding, which are then filtered with their evolutionary conservation profiles and ranked according to their DNA-binding propensities. If the analyses are based on the exact outcome of fluctuations in the highest mode, using a conservation threshold of 5, the results have a sensitivity, specificity, precision and accuracy of 9.3%, 90.5%, 18.1% and 78.6%, respectively, on a dataset of 36 unbound–bound protein structure pairs. These values increase up to 24.3%, 93.4%, 45.3% and 83.3% for the respective cases, when the neighboring two residues are considered. The relatively low sensitivity appears with the identified residues being selective and susceptible more for the binding core residues rather than all DNA-binding residues. The predicted residues that are not tagged as DNA-binding residues are those whose fluctuations are coupled with DNA-binding sites. They are in close proximity as well as plausible for other functional residues, such as ligand and protein–protein interaction sites. DNABINDPROT is free and open to all users without login requirement available at: http://www.prc.boun.edu.tr/appserv/prc/dnabindprot/.
Collapse
Affiliation(s)
- Pemra Ozbek
- Department of Chemical Engineering and Polymer Research Center, Bogazici University, Bebek, 34342 Istanbul
| | | | | | | |
Collapse
|
97
|
Carson MB, Langlois R, Lu H. NAPS: a residue-level nucleic acid-binding prediction server. Nucleic Acids Res 2010; 38:W431-5. [PMID: 20478832 PMCID: PMC2896077 DOI: 10.1093/nar/gkq361] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
Nucleic acid-binding proteins are involved in a great number of cellular processes. Understanding the mechanisms underlying these proteins first requires the identification of specific residues involved in nucleic acid binding. Prediction of NA-binding residues can provide practical assistance in the functional annotation of NA-binding proteins. Predictions can also be used to expedite mutagenesis experiments, guiding researchers to the correct binding residues in these proteins. Here, we present a method for the identification of amino acid residues involved in DNA- and RNA-binding using sequence-based attributes. The method used in this work combines the C4.5 algorithm with bootstrap aggregation and cost-sensitive learning. Our DNA-binding model achieved 79.1% accuracy, while the RNA-binding model reached an accuracy of 73.2%. The NAPS web server is freely available at http://proteomics.bioengr.uic.edu/NAPS.
Collapse
Affiliation(s)
- Matthew B Carson
- Department of Bioengineering/Bioinformatics, University of Illinois at Chicago, Chicago, IL, USA
| | | | | |
Collapse
|
98
|
Aggarwal P, Das Gupta M, Joseph AP, Chatterjee N, Srinivasan N, Nath U. Identification of specific DNA binding residues in the TCP family of transcription factors in Arabidopsis. THE PLANT CELL 2010; 22:1174-89. [PMID: 20363772 PMCID: PMC2879757 DOI: 10.1105/tpc.109.066647] [Citation(s) in RCA: 101] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/27/2009] [Revised: 03/02/2010] [Accepted: 03/22/2010] [Indexed: 05/18/2023]
Abstract
The TCP transcription factors control multiple developmental traits in diverse plant species. Members of this family share an approximately 60-residue-long TCP domain that binds to DNA. The TCP domain is predicted to form a basic helix-loop-helix (bHLH) structure but shares little sequence similarity with canonical bHLH domain. This classifies the TCP domain as a novel class of DNA binding domain specific to the plant kingdom. Little is known about how the TCP domain interacts with its target DNA. We report biochemical characterization and DNA binding properties of a TCP member in Arabidopsis thaliana, TCP4. We have shown that the 58-residue domain of TCP4 is essential and sufficient for binding to DNA and possesses DNA binding parameters comparable to canonical bHLH proteins. Using a yeast-based random mutagenesis screen and site-directed mutants, we identified the residues important for DNA binding and dimer formation. Mutants defective in binding and dimerization failed to rescue the phenotype of an Arabidopsis line lacking the endogenous TCP4 activity. By combining structure prediction, functional characterization of the mutants, and molecular modeling, we suggest a possible DNA binding mechanism for this class of transcription factors.
Collapse
Affiliation(s)
- Pooja Aggarwal
- Department of Microbiology and Cell Biology, Indian Institute of Science, Bangalore 560 012, India
| | - Mainak Das Gupta
- Department of Microbiology and Cell Biology, Indian Institute of Science, Bangalore 560 012, India
| | - Agnel Praveen Joseph
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
| | - Nirmalya Chatterjee
- Department of Microbiology and Cell Biology, Indian Institute of Science, Bangalore 560 012, India
| | - N. Srinivasan
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
| | - Utpal Nath
- Department of Microbiology and Cell Biology, Indian Institute of Science, Bangalore 560 012, India
| |
Collapse
|
99
|
Mishra NK, Raghava GPS. Prediction of FAD interacting residues in a protein from its primary sequence using evolutionary information. BMC Bioinformatics 2010; 11 Suppl 1:S48. [PMID: 20122222 PMCID: PMC3009520 DOI: 10.1186/1471-2105-11-s1-s48] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
Background Flavin binding proteins (FBP) plays a critical role in several biological functions such as electron transport system (ETS). These flavoproteins contain very tightly bound, sometimes covalently, flavin adenine dinucleotide (FAD) or flavin mono nucleotide (FMN). The interaction between flavin nucleotide and amino acids of flavoprotein is essential for their functionality. Thus identification of FAD interacting residues in a FBP is an important step for understanding their function and mechanism. Results In this study, we describe models developed for predicting FAD interacting residues using 15, 17 and 19 window pattern. Support vector machine (SVM) based models have been developed using binary pattern of amino acid sequence of protein and achieved maximum accuracy 69.65% with Mathew's Correlation Coefficient (MCC) 0.39 and Area Under Curve (AUC) 0.773. The performance of these models have been improved significantly from 69.65% to 82.86% with MCC 0.66 and AUC 0.904, when evolutionary information is used as input in SVM. The evolutionary information was generated in form of position specific score matrix (PSSM) profile by using PSI-BLAST at e-value 0.001. All models were developed on 198 non-redundant FAD binding protein chains containing 5172 FAD interacting residues and evaluated using fivefold cross-validation technique. Conclusion This study suggests that evolutionary information of 17 amino acid patterns perform best for FAD interacting residues prediction. We also developed a web server which predicts FAD interacting residues in a protein which is freely available for academics.
Collapse
Affiliation(s)
- Nitish K Mishra
- Institute of Microbial Technology, Sector 39A, Chandigarh, India.
| | | |
Collapse
|
100
|
Abstract
One of the major challenges in the post-genomic era with hundreds of genomes sequenced is the annotation of protein structure and function. Computational predictions of subcellular localization are an important step toward this end. The development of computational tools that predict targeting and localization has, therefore, been a very active area of research, in particular since the first release of the groundbreaking program PSORT in 1991. The most reliable means of annotating protein structure and function remains homology-based inference, i.e. the transfer of experimental annotations from one protein to its homologs. However, annotations about localization demonstrate how much can be gained from advanced machine learning: more proteins can be annotated more reliably. Contemporary computational tools for the annotation of protein targeting include automatic methods that mine the textual information from the biological literature and molecular biology databases. Some machine learning-based methods that accurately predict features of sorting signals and that use sequence-derived features to predict localization have reached remarkable levels of performance. Sustained prediction accuracy has increased by more than 30 percentage points over the last decade. Here, we review some of the most recent methods for the prediction of subcellular localization and protein targeting that contributed toward this breakthrough.
Collapse
Affiliation(s)
- Shruti Rastogi
- Department of Biochemistry and Molecular Biophysics, Columbia University and Columbia University Center for Computational Biology and Bioinformatics (C2B2), New York, NY, USA
| | | |
Collapse
|