1
Lau AM, Kandathil SM, Jones DT. Merizo: a rapid and accurate protein domain segmentation method using invariant point attention. Nat Commun 2023; 14:8445. [PMID: 38114456 PMCID: PMC10730818 DOI: 10.1038/s41467-023-43934-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Accepted: 11/24/2023] [Indexed: 12/21/2023] [Open Access]
Abstract
The AlphaFold Protein Structure Database, containing predictions for over 200 million proteins, has been met with enthusiasm over its potential in enriching structural biological research and beyond. Currently, access to the database is precluded by an urgent need for tools that allow the efficient traversal, discovery, and documentation of its contents. Identifying domain regions in the database is a non-trivial endeavour and doing so will aid our understanding of protein structure and function, while facilitating drug discovery and comparative genomics. Here, we describe a deep learning method for domain segmentation called Merizo, which learns to cluster residues into domains in a bottom-up manner. Merizo is trained on CATH domains and fine-tuned on AlphaFold2 models via self-distillation, enabling it to be applied to both experimental and AlphaFold2 models. As proof of concept, we apply Merizo to the human proteome, identifying 40,818 putative domains that can be matched to CATH representative domains.
Affiliation(s)
- Andy M Lau
  - Department of Computer Science, University College London, London, WC1E 6BT, UK
- Shaun M Kandathil
  - Department of Computer Science, University College London, London, WC1E 6BT, UK
- David T Jones
  - Department of Computer Science, University College London, London, WC1E 6BT, UK
2
Sidhanta SPD, Sowdhamini R, Srinivasan N. Comparative analysis of permanent and transient domain-domain interactions in multi-domain proteins. Proteins 2023. [PMID: 37828826 DOI: 10.1002/prot.26581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Revised: 08/09/2023] [Accepted: 08/11/2023] [Indexed: 10/14/2023]
Abstract
Protein domains are structural, functional, and evolutionary units. These domains bring out the diversity of functionality by means of interactions with other co-existing domains and provide stability. Hence, it is important to study intra-protein inter-domain interactions from the perspective of types of interactions. Domains within a chain could interact over short timeframes or permanently, rather like protein-protein interactions (PPIs). However, no systematic study has been carried out between two classes, namely permanent and transient domain-domain interactions. In this work, we studied 263 two-domain proteins, belonging to either of these classes and their interfaces on the basis of several factors, such as interface area and details of interactions (number, strength, and types of interactions). We also characterized them based on residue conservation at the interface, correlation of residue motions across domains, its involvement in repeat formation, and their involvement in particular molecular processes. Finally, we could analyze the interactions arising from domains in two-domain monomeric proteins, and we observed significant differences between these two classes of domain interactions and a few similarities. This study will help to obtain a better understanding of structure-function and folding principles of multi-domain proteins.
Affiliation(s)
- Ramanathan Sowdhamini
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
  - Computational Approaches to Protein Science, National Centre for Biological Sciences, Bangalore, India
  - Computational Biology, Institute of Bioinformatics and Applied Biotechnology, Bangalore, India
3
Taheri-Ledari M, Zandieh A, Shariatpanahi SP, Eslahchi C. Assignment of structural domains in proteins using diffusion kernels on graphs. BMC Bioinformatics 2022; 23:369. [PMID: 36076174 PMCID: PMC9461149 DOI: 10.1186/s12859-022-04902-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2021] [Accepted: 08/23/2022] [Indexed: 11/10/2022] [Open Access]
Abstract
Though proposing algorithmic approaches for protein domain decomposition has been of high interest, the inherent ambiguity of the problem keeps it an active area of research. Moreover, accurate automated methods are in high demand as the number of solved structures for complex proteins is on the rise. While the majority of previous efforts to decompose 3D structures have centered on developing clustering algorithms, the use of enhanced measures of proximity between amino acids has remained largely uncharted. If there exists a kernel function in whose reproducing kernel Hilbert space the structural domains of proteins become well separated, then protein structures can be parsed into domains without the need for a complex clustering algorithm. Inspired by this idea, we developed a protein domain decomposition method based on diffusion kernels on protein graphs. We examined all combinations of four graph node kernels and two clustering algorithms to investigate their capability to decompose protein structures. The proposed method is tested on five of the most commonly used benchmark datasets for protein domain assignment plus a comprehensive non-redundant dataset. The results show that the method, using one of the diffusion kernels, performs competitively against four of the best automatic methods. Our method can also offer alternative partitionings for the same structure, which is in line with the subjective definition of a protein domain. With competitive accuracy and balanced performance on simple and complex structures, despite relying on a relatively naive criterion to choose the optimal decomposition, the proposed method shows that diffusion kernels on graphs in particular, and kernel functions in general, are promising measures for parsing proteins into domains and for performing other structural analyses on proteins.
The size and interconnectedness of the protein graphs make them promising targets for diffusion kernels as measures of affinity between amino acids. The versatility of our method allows the implementation of future kernels with higher performance. The source code of the proposed method is accessible at https://github.com/taherimo/kludo. The method is also available as a web application at https://cbph.ir/tools/kludo.
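To make the kernel idea in this abstract concrete, here is a minimal sketch rather than the authors' implementation. It assumes the textbook diffusion kernel K = exp(-beta * L), where L is the graph Laplacian, and evaluates it with a truncated Taylor series in pure Python. The toy six-node graph, the function names, and the values of `beta` and `terms` are illustrative assumptions; the four node kernels and two clustering algorithms evaluated in the paper are not reproduced.

```python
def graph_laplacian(n_nodes, edges):
    """Build L = D - A for an undirected, unweighted graph."""
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
        lap[i][i] += 1.0
        lap[j][j] += 1.0
    return lap


def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(laplacian, beta=0.5, terms=40):
    """Approximate K = exp(-beta * L) by summing the first `terms` Taylor terms."""
    n = len(laplacian)
    scaled = [[-beta * x for x in row] for row in laplacian]
    kernel = [[float(i == j) for j in range(n)] for i in range(n)]  # k = 0 term
    term = [row[:] for row in kernel]
    for k in range(1, terms):
        # term_k = term_{k-1} * (-beta * L) / k, i.e. (-beta * L)^k / k!
        term = [[x / k for x in row] for row in matmul(term, scaled)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return kernel


# Toy "protein graph" (assumption): two tightly connected triangles of residues,
# nodes 0-2 and nodes 3-5, joined by a single contact between nodes 2 and 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
K = diffusion_kernel(graph_laplacian(6, edges), beta=0.5)

# Affinity inside a triangle exceeds affinity across the linker.
print(K[0][1] > K[0][4])
```

In this toy case the kernel affinity between residues in the same triangle is larger than between residues in different triangles, which is the kind of separation a downstream clustering step could exploit.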
Affiliation(s)
- Mohammad Taheri-Ledari
  - Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Amirali Zandieh
  - Department of Biophysics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Seyed Peyman Shariatpanahi
  - Department of Biophysics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Changiz Eslahchi
  - Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
4
Sanchez-Pulido L, Ponting CP. Extending the Horizon of Homology Detection with Coevolution-based Structure Prediction. J Mol Biol 2021; 433:167106. [PMID: 34139218 PMCID: PMC8527833 DOI: 10.1016/j.jmb.2021.167106] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/28/2021] [Revised: 06/09/2021] [Accepted: 06/09/2021] [Indexed: 12/12/2022]
Abstract
Traditional sequence analysis algorithms fail to identify distant homologies when they lie beyond a detection horizon. In this review, we discuss how co-evolution-based contact and distance prediction methods are pushing back this homology detection horizon, thereby yielding new functional insights and experimentally testable hypotheses. Based on correlated substitutions, these methods divine three-dimensional constraints among amino acids in protein sequences that were previously devoid of all annotated domains and repeats. The new algorithms discern hidden structure in an otherwise featureless sequence landscape. Their revelatory impact promises to be as profound as the use, by archaeologists, of ground-penetrating radar to discern long-hidden, subterranean structures. As examples of this, we describe how triplicated structures reflecting longin domains in MON1A-like proteins, or UVR-like repeats in DISC1, emerge from their predicted contact and distance maps. These methods also help to resolve structures that do not conform to a "beads-on-a-string" model of protein domains. In one such example, we describe CFAP298 whose ubiquitin-like domain was previously challenging to perceive owing to a large sequence insertion within it. More generally, the new algorithms permit an easier appreciation of domain families and folds whose evolution involved structural insertion or rearrangement. As we exemplify with α1-antitrypsin, coevolution-based predicted contacts may also yield insights into protein dynamics and conformational change. This new combination of structure prediction (using innovative co-evolution based methods) and homology inference (using more traditional sequence analysis approaches) shows great promise for bringing into view a sea of evolutionary relationships that had hitherto lain far beyond the horizon of homology detection.
Affiliation(s)
- Luis Sanchez-Pulido
  - Medical Research Council Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh EH4 2XU, UK
- Chris P Ponting
  - Medical Research Council Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh EH4 2XU, UK
5
Abstract
Co-evolution techniques were originally conceived to assist in protein structure prediction by inferring pairs of residues that share spatial proximity. However, the functional relationships that can be extrapolated from co-evolution have also proven to be useful in a wide array of structural bioinformatics applications. These techniques are a powerful way to extract structural and functional information in a sequence-rich world.
6
Postic G, Ghouzam Y, Chebrek R, Gelly JC. An ambiguity principle for assigning protein structural domains. Science Advances 2017; 3:e1600552. [PMID: 28097215 PMCID: PMC5235333 DOI: 10.1126/sciadv.1600552] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 03/18/2016] [Accepted: 11/28/2016] [Indexed: 05/20/2023]
Abstract
Ambiguity is the quality of being open to several interpretations. For an image, it arises when the contained elements can be delimited in two or more distinct ways, which may cause confusion. We postulate that it also applies to the analysis of protein three-dimensional structure, which consists in dividing the molecule into subunits called domains. Because different definitions of what constitutes a domain can be used to partition a given structure, the same protein may have different but equally valid domain annotations. However, knowledge and experience generally displace our ability to accept more than one way to decompose the structure of an object (in this case, a protein). This human bias in structure analysis is particularly harmful because it leads to ignoring potential avenues of research. We present an automated method capable of producing multiple alternative decompositions of protein structure (web server and source code available at www.dsimb.inserm.fr/sword/). Our innovative algorithm assigns structural domains through the hierarchical merging of protein units, which are evolutionarily preserved substructures that describe protein architecture at an intermediate level, between domain and secondary structure. To validate the use of these protein units for decomposing protein structures into domains, we set up an extensive benchmark made of expert annotations of structural domains and including state-of-the-art domain parsing algorithms. The relevance of our "multipartitioning" approach is shown through numerous examples of applications covering protein function, evolution, folding, and structure prediction. Finally, we introduce a measure for the structural ambiguity of protein molecules.
Affiliation(s)
- Guillaume Postic
  - INSERM U1134, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
  - Institut National de la Transfusion Sanguine, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
  - Corresponding author
- Yassine Ghouzam
  - INSERM U1134, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
  - Institut National de la Transfusion Sanguine, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
- Romain Chebrek
  - INSERM U1134, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
  - Institut National de la Transfusion Sanguine, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
- Jean-Christophe Gelly
  - INSERM U1134, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
  - Institut National de la Transfusion Sanguine, Paris, France
  - Laboratory of Excellence GR-Ex, Paris, France
  - Corresponding author
7
CATH-Gene3D: Generation of the Resource and Its Use in Obtaining Structural and Functional Annotations for Protein Sequences. Methods Mol Biol 2017; 1558:79-110. [PMID: 28150234 DOI: 10.1007/978-1-4939-6783-4_4] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Indexed: 12/05/2022]
Abstract
This chapter describes the generation of the data in the CATH-Gene3D online resource and how it can be used to study protein domains and their evolutionary relationships. Methods will be presented for: comparing protein structures, recognizing homologs, predicting domain structures within protein sequences, and subclassifying superfamilies into functionally pure families, together with a guide on using the webpages.
8
Xue Z, Jang R, Govindarajoo B, Huang Y, Wang Y. Extending Protein Domain Boundary Predictors to Detect Discontinuous Domains. PLoS One 2015; 10:e0141541. [PMID: 26502173 PMCID: PMC4621036 DOI: 10.1371/journal.pone.0141541] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/03/2015] [Accepted: 10/10/2015] [Indexed: 11/18/2022] [Open Access]
Abstract
A variety of protein domain predictors were developed to predict protein domain boundaries in recent years, but most of them cannot predict discontinuous domains. Considering nearly 40% of multidomain proteins contain one or more discontinuous domains, we have developed DomEx to enable domain boundary predictors to detect discontinuous domains by assembling the continuous domain segments. Discontinuous domains are predicted by matching the sequence profile of concatenated continuous domain segments with the profiles from a single-domain library derived from SCOP and CATH, and Pfam. Then the matches are filtered by similarity to library templates, a symmetric index score and a profile-profile alignment score. DomEx recalled 32.3% discontinuous domains with 86.5% precision when tested on 97 non-homologous protein chains containing 58 continuous and 99 discontinuous domains, in which the predicted domain segments are within ±20 residues of the boundary definitions in CATH 3.5. Compared with our recently developed predictor, ThreaDom, which is the state-of-the-art tool to detect discontinuous-domains, DomEx recalled 26.7% discontinuous domains with 72.7% precision in a benchmark with 29 discontinuous-domain chains, where ThreaDom failed to predict any discontinuous domains. Furthermore, combined with ThreaDom, the method ranked number one among 10 predictors. The source code and datasets are available at https://github.com/xuezhidong/DomEx.
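The recall and precision figures quoted in this abstract follow the usual definitions, which the short sketch below spells out. The abstract gives only the percentages and the total of 99 discontinuous domains. The counts of 32 correct predictions out of 37 predicted are hypothetical values chosen because they reproduce the reported 32.3% and 86.5%; they are not taken from the paper.

```python
def precision_recall(true_positives, predicted, actual):
    """Return (precision, recall) from raw counts."""
    precision = true_positives / predicted  # fraction of predictions that are correct
    recall = true_positives / actual        # fraction of real domains that were found
    return precision, recall


# Hypothetical counts (assumption): 32 correct out of 37 predicted discontinuous
# domains, evaluated against the 99 discontinuous domains named in the abstract.
precision, recall = precision_recall(true_positives=32, predicted=37, actual=99)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```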
Affiliation(s)
- Zhidong Xue
  - School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
  - Corresponding author
- Richard Jang
  - School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, United States of America
- Brandon Govindarajoo
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, United States of America
- Yichu Huang
  - School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Yan Wang
  - School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
  - Corresponding author
9
The history of the CATH structural classification of protein domains. Biochimie 2015; 119:209-17. [PMID: 26253692 PMCID: PMC4678953 DOI: 10.1016/j.biochi.2015.08.004] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 07/01/2015] [Accepted: 08/01/2015] [Indexed: 11/21/2022]
Abstract
This article presents a historical review of the protein structure classification database CATH. Together with the SCOP database, CATH remains comprehensive and reasonably up-to-date with the now more than 100,000 protein structures in the PDB. We review the expansion of the CATH and SCOP resources to capture predicted domain structures in the genome sequence data and to provide information on the likely functions of proteins mediated by their constituent domains. The establishment of comprehensive function annotation resources has also meant that domain families can be functionally annotated allowing insights into functional divergence and evolution within protein families.
10
Wieninger SA, Ullmann GM. CoMoDo: Identifying Dynamic Protein Domains Based on Covariances of Motion. J Chem Theory Comput 2015; 11:2841-54. [DOI: 10.1021/acs.jctc.5b00150] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 12/20/2022]
Affiliation(s)
- Silke A. Wieninger
  - Structural Biology/Bioinformatics, University of Bayreuth, Universitätsstrasse 30, BGI, 95447 Bayreuth, Germany
- G. Matthias Ullmann
  - Structural Biology/Bioinformatics, University of Bayreuth, Universitätsstrasse 30, BGI, 95447 Bayreuth, Germany
11
Ansari ES, Eslahchi C, Pezeshk H, Sadeghi M. ProDomAs, protein domain assignment algorithm using center-based clustering and independent dominating set. Proteins 2014; 82:1937-46. [PMID: 24596179 DOI: 10.1002/prot.24547] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/05/2013] [Revised: 02/09/2014] [Accepted: 02/20/2014] [Indexed: 11/07/2022]
Abstract
Decomposition of structural domains is an essential task in classifying protein structures, predicting protein function, and many other proteomics problems. As the number of known protein structures in PDB grows exponentially, the need for accurate automatic domain decomposition methods becomes more essential. In this article, we introduce a bottom-up algorithm for assigning protein domains using a graph theoretical approach. This algorithm is based on a center-based clustering approach. For constructing initial clusters, members of an independent dominating set for the graph representation of a protein are considered as the centers. A distance matrix is then defined for these clusters. To obtain final domains, these clusters are merged using the compactness principle of domains and a method similar to the neighbor-joining algorithm considering some thresholds. The thresholds are computed using a training set consisting of 50 protein chains. The algorithm is implemented using C++ language and is named ProDomAs. To assess the performance of ProDomAs, its results are compared with seven automatic methods, against five publicly available benchmarks. The results show that ProDomAs outperforms other methods applied on the mentioned benchmarks. The performance of ProDomAs is also evaluated against 6342 chains obtained from ASTRAL SCOP 1.71. ProDomAs is freely available at http://www.bioinf.cs.ipm.ir/software/prodomas.
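The first step described in this abstract, choosing cluster centers from an independent dominating set of the protein graph, can be illustrated with a small sketch. This is not the ProDomAs code, which is written in C++. It uses a generic greedy construction, relying on the fact that any maximal independent set is also a dominating set. The toy graph, the function name, and the degree-first visiting order are assumptions made for illustration.

```python
def greedy_independent_dominating_set(adjacency):
    """Greedily pick centers so that no two centers are adjacent (independent)
    and every node is a center or a neighbour of one (dominating)."""
    centers = set()
    covered = set()
    # Visit high-degree nodes first (an arbitrary heuristic), ties broken by node id.
    for node in sorted(adjacency, key=lambda n: (-len(adjacency[n]), n)):
        if node not in covered:
            centers.add(node)
            covered.add(node)
            covered |= adjacency[node]
    return centers


# Toy residue-contact graph (assumption): two triangles joined by the edge 2-3.
adjacency = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}
centers = greedy_independent_dominating_set(adjacency)
print(sorted(centers))
```

On this toy graph the greedy pass selects nodes 2 and 4 as centers. The merging of the resulting clusters into final domains, which the abstract describes as similar to neighbor joining, is outside the scope of this sketch.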
Affiliation(s)
- Elnaz Saberi Ansari
  - Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
12
Skorupka K, Han SK, Nam HJ, Kim S, Faham S. Protein design by fusion: implications for protein structure prediction and evolution. Acta Crystallographica Section D: Biological Crystallography 2013; 69:2451-60. [PMID: 24311586 DOI: 10.1107/s0907444913022701] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/05/2013] [Accepted: 08/12/2013] [Indexed: 01/21/2023]
Abstract
Domain fusion is a useful tool in protein design. Here, the structure of a fusion of the heterodimeric flagella-assembly proteins FliS and FliC is reported. Although the ability of the fusion protein to maintain the structure of the heterodimer may be apparent, threading-based structural predictions do not properly fuse the heterodimer. Additional examples of naturally occurring heterodimers that are homologous to full-length proteins were identified. These examples highlight that the designed protein was engineered by the same tools as used in the natural evolution of proteins and that heterodimeric structures contain a wealth of information, currently unused, that can improve structural predictions.
Affiliation(s)
- Katarzyna Skorupka
  - Department of Molecular Physiology and Biological Physics, University of Virginia School of Medicine, Charlottesville, VA 22093, USA
13
Seo S, Jang Y, Qian P, Liu WK, Choi JB, Lim BS, Kim MK. Efficient prediction of protein conformational pathways based on the hybrid elastic network model. J Mol Graph Model 2013; 47:25-36. [PMID: 24296313 DOI: 10.1016/j.jmgm.2013.10.009] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 01/12/2013] [Revised: 10/19/2013] [Accepted: 10/22/2013] [Indexed: 11/18/2022]
Abstract
Various computational models have gained immense attention by analyzing the dynamic characteristics of proteins. Several models have achieved recognition by fulfilling either theoretical or experimental predictions. Nonetheless, each method possesses limitations, mostly in computational outlay and physical reality. These limitations remind us that a new model or paradigm should advance theoretical principles to elucidate more precisely the biological functions of a protein and should increase computational efficiency. With these critical caveats, we have developed a new computational tool that satisfies both physical reality and computational efficiency. In the proposed hybrid elastic network model (HENM), a protein structure is represented as a mixture of rigid clusters and point masses that are connected with linear springs. Harmonic analyses based on the HENM have been performed to generate normal modes and conformational pathways. The results of the hybrid normal mode analyses give new physical insight to the 70S ribosome. The feasibility of the conformational pathways of hybrid elastic network interpolation (HENI) was quantitatively evaluated by comparing three different overlap values proposed in this paper. A remarkable observation is that the obtained mode shapes and conformational pathways are consistent with each other. Our timing results show that HENM has some advantage in computational efficiency over a coarse-grained model, especially for large proteins, even though it takes longer to construct the HENM. Consequently, the proposed HENM will be one of the best alternatives to the conventional coarse-grained ENMs and all-atom based methods (such as molecular dynamics) without loss of physical reality.
Affiliation(s)
- Sangjae Seo
  - SKKU Advanced Institute of Nanotechnology, Sungkyunkwan University, Suwon 440-746, Republic of Korea
- Yunho Jang
  - Department of Mechanical and Industrial Engineering, University of Massachusetts, Amherst, MA 01003, USA
- Pengfei Qian
  - School of Mechanical Engineering, Sungkyunkwan University, Suwon 440-746, Republic of Korea
- Wing Kam Liu
  - Department of Mechanical Engineering, Northwestern University, Evanston, IL 60208, USA
- Jae-Boong Choi
  - SKKU Advanced Institute of Nanotechnology, Sungkyunkwan University, Suwon 440-746, Republic of Korea
  - School of Mechanical Engineering, Sungkyunkwan University, Suwon 440-746, Republic of Korea
- Byeong Soo Lim
  - School of Mechanical Engineering, Sungkyunkwan University, Suwon 440-746, Republic of Korea
- Moon Ki Kim
  - SKKU Advanced Institute of Nanotechnology, Sungkyunkwan University, Suwon 440-746, Republic of Korea
  - School of Mechanical Engineering, Sungkyunkwan University, Suwon 440-746, Republic of Korea
14
Arab SS, Gharamaleki MP, Pashandi Z, Mobasseri R. Putracer: a novel method for identification of continuous-domains in multi-domain proteins. J Bioinform Comput Biol 2013; 11:1340012. [PMID: 23427994 DOI: 10.1142/s021972001340012x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Computer-assisted assignment of protein domains is considered an important issue in structural bioinformatics. The exponential increase in the number of known three-dimensional protein structures and the significant role of proteins in biology, medicine and pharmacology illustrate the necessity of a reliable method to automatically detect structural domains as protein units. To this end, we have developed a program based on the accessible surface area (ASA) and the hydrogen bond energy in the protein backbone (HBE). PUTracer (Protein Unit Tracer) uses a fast top-down approach to cut a chain into its domains (contiguous domains) with minimal change in ASA as well as HBE. Performance of the program was assessed on a comprehensive benchmark dataset of 124 protein chains, which is based on agreement among experts (e.g. CATH, SCOP) and was expanded to include structures with different types of domain combinations. An equal number of domains and at least 90% agreement in critical boundary accuracy were considered the conditions for a correct assignment. PUTracer assigned domains correctly in 81.45% of protein chains. Although low critical boundary accuracy in 18.55% of protein chains leads to incorrect assignments, adjusting the scales improves the performance up to 89.5%. We discuss the success or failure of adjusting the scales, with supporting evidence. Availability: PUTracer is available at http://bioinf.modares.ac.ir/software/PUTracer/
Affiliation(s)
- Seyed Shahriar Arab
  - Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University (TMU), Tehran, Iran
15
Ebina T, Umezawa Y, Kuroda Y. IS-Dom: a dataset of independent structural domains automatically delineated from protein structures. J Comput Aided Mol Des 2013; 27:419-26. [PMID: 23715893 DOI: 10.1007/s10822-013-9654-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/27/2012] [Accepted: 05/07/2013] [Indexed: 11/25/2022]
Abstract
Protein domains that can fold in isolation are significant targets in diverse areas of proteomics research as they are often readily analyzed by high-throughput methods. Here, we report IS-Dom, a dataset of Independent Structural Domains (ISDs) that are most likely to fold in isolation. IS-Dom was constructed by filtering domains from SCOP, CATH, and DomainParser using quantitative structural measures, which were calculated by estimating inter-domain hydrophobic clusters and hydrogen bonds from the full-length protein's atomic coordinates. The ISD detection protocol is fully automated, and all of the computed interactions are stored on the server, which enables rapid updating of IS-Dom. We also prepared a standard IS-Dom using parameters optimized by maximizing Youden's index. The standard IS-Dom contained 54,860 ISDs, of which 25.5% had high sequence identity and termini overlap with a Protein Data Bank (PDB) cataloged sequence and are thus experimentally shown to fold in isolation [coined autonomously folded domains (AFDs)]. Furthermore, our ISD detection protocol missed less than 10% of the AFDs, which corroborated our protocol's ability to define structural domains that are able to fold independently. IS-Dom is available through the web server (http://domserv.lab.tuat.ac.jp/IS-Dom.html), and users can download the standard IS-Dom dataset, construct their own IS-Dom by interactively varying the parameters, or assess the structural independence of newly defined putative domains.
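Youden's index, which this abstract uses to optimize parameters, is defined as J = sensitivity + specificity - 1. The sketch below shows how a parameter value could be selected by maximizing J. The candidate thresholds and their confusion counts are hypothetical and are not IS-Dom data, and the function name is an assumption.

```python
def youden_index(tp, fn, tn, fp):
    """Youden's J statistic: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Hypothetical confusion counts for three candidate thresholds (assumption).
candidates = {
    0.3: dict(tp=95, fn=5, tn=50, fp=50),
    0.5: dict(tp=90, fn=10, tn=80, fp=20),
    0.7: dict(tp=60, fn=40, tn=95, fp=5),
}

# Keep the threshold whose confusion counts give the largest J.
best_threshold = max(candidates, key=lambda t: youden_index(**candidates[t]))
print(best_threshold)
```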
Affiliation(s)
- Teppei Ebina
  - Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 12-24-16 Nakamachi, Koganei-shi, Tokyo 184-8588, Japan
16
Gomes M, Hamer R, Reinert G, Deane CM. Mutual information and variants for protein domain-domain contact prediction. BMC Res Notes 2012; 5:472. [PMID: 23244412 PMCID: PMC3532072 DOI: 10.1186/1756-0500-5-472] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/15/2012] [Accepted: 08/10/2012] [Indexed: 01/20/2023] [Open Access]
Abstract
BACKGROUND Predicting protein contacts solely based on sequence information remains a challenging problem, despite the huge amount of sequence data at our disposal. Mutual Information (MI), an information theory measure, has been extensively employed and modified to identify residues within a protein (intra-protein) that are in contact. More recently MI and its variants have also been used in the prediction of contacts between proteins (inter-protein). METHODS Here we assess the predictive power of MI and variants for domain-domain contact prediction. We test original MI and these variants, which are called MIp, MIc and ZNMI, on 40 domain-domain test cases containing 10,753 sequences. We also propose and evaluate two new versions of MI that consider triangles of residues and the physicochemical properties of the amino acids, respectively. RESULTS We found that all versions of MI are skewed towards predicting surface residues. Since domain-domain contacts are on the surface of each domain, we considered only surface residues when attempting to predict contacts. Our analysis shows that MIc is the best current MI domain-domain contact predictor. At 20% recall MIc achieved a precision of 44.9% when only surface residues were considered. Our triangle and reduced alphabet variants of MI highlight the delicate trade-off between signal and noise in the use of MI for domain-domain contact prediction. We also examine a specific "successful" case study and demonstrate that here, when considering surface residues, even the most accurate domain-domain contact predictor, MIc, performs no better than random. CONCLUSIONS All tested variants of MI are skewed towards predicting surface residues. When considering surface residues only, we find MIc to be the best current MI domain-domain contact predictor. Its performance, however, is not as good as a non-MI based contact predictor, i-Patch. Additionally, the intra-protein contact prediction capabilities of MIc outperform its domain-domain contact prediction abilities.
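As an illustration of the basic quantity behind these predictors, the following is a minimal sketch of plain mutual information between two columns of a multiple sequence alignment. It covers only the original MI, not the MIp, MIc or ZNMI corrections or the triangle and reduced-alphabet variants; the function name `mutual_information` and the toy columns are assumptions made for this example, not code from the paper.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equally long alignment columns.

    Each column is a sequence of residue symbols, one per aligned sequence.
    This is plain MI only; the corrected variants are not implemented here.
    """
    if len(col_a) != len(col_b) or len(col_a) == 0:
        raise ValueError("columns must be non-empty and of equal length")

    n = len(col_a)
    count_a = Counter(col_a)               # marginal counts for column A
    count_b = Counter(col_b)               # marginal counts for column B
    count_ab = Counter(zip(col_a, col_b))  # joint counts for residue pairs

    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with counts to limit rounding error
        ratio = (n_ab * n) / (count_a[a] * count_b[b])
        mi += p_ab * math.log2(ratio)
    return mi


# Toy columns: the first pair covaries perfectly, the second is independent.
covarying = mutual_information("AAVV", "LLII")
independent = mutual_information("AAVV", "LILI")
```

High MI between a column in one domain and a column in another is what such methods treat as evidence of a possible contact; as the abstract notes, that signal is noisy and biased towards surface residues.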
Affiliation(s)
- Mireille Gomes
- Department of Statistics, University of Oxford, Oxford, UK
|
17
|
Genoni A, Morra G, Colombo G. Identification of domains in protein structures from the analysis of intramolecular interactions. J Phys Chem B 2012; 116:3331-43. [PMID: 22384792 DOI: 10.1021/jp210568a] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.1] [Indexed: 12/31/2022]
Abstract
The subdivision of protein structures into smaller and independent structural domains has a fundamental importance in understanding protein evolution and function and in the development of protein classification methods as well as in the interpretation of experimental data. Due to the rapid growth in the number of solved protein structures, the need for devising new accurate algorithmic methods has become more and more urgent. In this paper, we propose a new computational approach that is based on the concept of domain as a compact and independent folding unit and on the analysis of the residue-residue energy interactions obtainable through classical all-atom force field calculations. In particular, starting from the analysis of the nonbonded interaction energy matrix associated with a protein, our method filters out and selects only those specific subsets of interactions that define possible independent folding nuclei within a complex protein structure. This allows grouping different protein fragments into energy clusters that are found to correspond to structural domains. The strategy has been tested using proper benchmark data sets, and the results have shown that the new approach is fast and reliable in determining the number of domains in a totally ab initio manner and without making use of any training set or knowledge of the systems under examination. Moreover, our method, identifying the most relevant residues for the stabilization of each domain, may complement the results given by other classification techniques and may provide useful information to design and guide new experiments.
Affiliation(s)
- Alessandro Genoni
- Istituto di Chimica del Riconoscimento Molecolare, CNR, Via Mario Bianco 9, 20131 Milano, Italy.
|
18
|
Abstract
The wealth of available protein structural data provides unprecedented opportunity to study and better understand the underlying principles of protein folding and protein structure evolution. A key to achieving this lies in the ability to analyse these data and to organize them in a coherent classification scheme. Over the past years several protein classifications have been developed that aim to group proteins based on their structural relationships. Some of these classification schemes explore the concept of structural neighbourhood (structural continuum), whereas others utilize the notion of protein evolution and thus provide a discrete rather than continuum view of protein structure space. This chapter presents a strategy for classification of proteins with known three-dimensional structure. Steps in the classification process along with basic definitions are introduced. Examples illustrating some fundamental concepts of protein folding and evolution with a special focus on the exceptions to them are presented.
|
19
|
Flores SC, Gerstein MB. Predicting protein ligand binding motions with the conformation explorer. BMC Bioinformatics 2011; 12:417. [PMID: 22032721 PMCID: PMC3354956 DOI: 10.1186/1471-2105-12-417] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 05/31/2011] [Accepted: 10/27/2011] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND Knowledge of the structure of proteins bound to known or potential ligands is crucial for biological understanding and drug design. Often the 3D structure of the protein is available in some conformation, but binding the ligand of interest may involve a large scale conformational change which is difficult to predict with existing methods. RESULTS We describe how to generate ligand binding conformations of proteins that move by hinge bending, the largest class of motions. First, we predict the location of the hinge between domains. Second, we apply an Euler rotation to one of the domains about the hinge point. Third, we compute a short-time dynamical trajectory using Molecular Dynamics to equilibrate the protein and ligand and correct unnatural atomic positions. Fourth, we score the generated structures using a novel fitness function which favors closed or holo structures. By iterating the second through fourth steps we systematically minimize the fitness function, thus predicting the conformational change required for small ligand binding for five well studied proteins. CONCLUSIONS We demonstrate that the method in most cases successfully predicts the holo conformation given only an apo structure.
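The second step, rotating one domain about the hinge point, can be sketched as follows. This is a deliberately simplified illustration and not the authors' implementation: it applies a single rotation about an axis parallel to z through the hinge, whereas a full Euler rotation would compose three such elementary rotations. The function name `rotate_about_hinge` and the example coordinates are assumptions.

```python
import math


def rotate_about_hinge(coords, hinge, angle_deg):
    """Rotate (x, y, z) points about a z-parallel axis through `hinge`.

    Each point is shifted so the hinge sits at the origin, rotated in the
    xy-plane, and shifted back; z coordinates are left unchanged.
    """
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    hx, hy, _ = hinge

    rotated = []
    for x, y, z in coords:
        dx, dy = x - hx, y - hy
        rotated.append((hx + c * dx - s * dy, hy + s * dx + c * dy, z))
    return rotated


# Hypothetical atom positions of the moving domain, with the hinge at (1, 1, 0).
moving_domain = [(2.0, 1.0, 0.0), (3.0, 1.0, 5.0)]
closed = rotate_about_hinge(moving_domain, hinge=(1.0, 1.0, 0.0), angle_deg=90.0)
```

In the workflow the abstract describes, a rotation of this kind would be followed by a short molecular dynamics equilibration and scoring, with the rotation, equilibration and scoring steps repeated to minimize the fitness function.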
Affiliation(s)
- Samuel C Flores
- Department of Cell and Molecular Biology, Uppsala University, BMC Box 596, Uppsala, 75124, Sweden
- Mark B Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, PO Box 208114 MBB, New Haven, CT, 06520, USA
- Department of Computer Science, Yale University, PO Box 208114 MBB, New Haven, CT, 06520, USA
|
20
|
Hamer R, Luo Q, Armitage JP, Reinert G, Deane CM. i-Patch: interprotein contact prediction using local network information. Proteins 2011; 78:2781-97. [PMID: 20635422 DOI: 10.1002/prot.22792] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 11/07/2022]
Abstract
Biological processes are commonly controlled by precise protein-protein interactions. These connections rely on specific amino acids at the binding interfaces. Here we predict the binding residues of such interprotein complexes. We have developed a suite of methods, i-Patch, which predict the interprotein contact sites by considering the two proteins as a network, with residues as nodes and contacts as edges. i-Patch starts with two proteins, A and B, which are assumed to interact, but for which the structure of the complex is not available. However, we assume that for each protein, we have a reference structure and a multiple sequence alignment of homologues. i-Patch then uses the propensities of patches of residues to interact, to predict interprotein contact sites. i-Patch outperforms several other tested algorithms for prediction of interprotein contact sites. It gives 59% precision with 20% recall on a blind test set of 31 protein pairs. Combining the i-Patch scores with an existing correlated mutation algorithm, McBASC, using a logistic model gave little improvement. Results from a case study, on bacterial chemotaxis protein complexes, demonstrate that our predictions can identify contact residues, as well as suggesting unknown interfaces in multiprotein complexes.
Affiliation(s)
- Rebecca Hamer
- Oxford Centre for Integrative Systems Biology, Department of Biochemistry, University of Oxford, Oxford, United Kingdom
|
21
|
Esque J, Oguey C, de Brevern AG. Comparative Analysis of Threshold and Tessellation Methods for Determining Protein Contacts. J Chem Inf Model 2011; 51:493-507. [DOI: 10.1021/ci100195t] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 11/30/2022]
Affiliation(s)
- Jeremy Esque
- LPTM, CNRS UMR 8089, Université de Cergy Pontoise, 2 av. Adolphe Chauvin, 95302 Cergy-Pontoise, France
- INSERM UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Paris Diderot, Paris 7, INTS, 6, rue Alexandre Cabanel, 75739 Paris Cedex 15, France
- Christophe Oguey
- LPTM, CNRS UMR 8089, Université de Cergy Pontoise, 2 av. Adolphe Chauvin, 95302 Cergy-Pontoise, France
- Alexandre G. de Brevern
- INSERM UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Paris Diderot, Paris 7, INTS, 6, rue Alexandre Cabanel, 75739 Paris Cedex 15, France
|
22
|
Tai CH, Sam V, Gibrat JF, Garnier J, Munson PJ, Lee B. Protein domain assignment from the recurrence of locally similar structures. Proteins 2010; 79:853-66. [PMID: 21287617 DOI: 10.1002/prot.22923] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 07/20/2010] [Revised: 10/14/2010] [Accepted: 10/18/2010] [Indexed: 11/10/2022]
Abstract
Domains are basic units of protein structure and essential for exploring protein fold space and structure evolution. With the structural genomics initiative, the number of protein structures in the Protein Databank (PDB) is increasing dramatically and domain assignments need to be done automatically. Most existing structural domain assignment programs define domains using the compactness of the domains and/or the number and strength of intra-domain versus inter-domain contacts. Here we present a different approach based on the recurrence of locally similar structural pieces (LSSPs) found by one-against-all structure comparisons with a dataset of 6373 protein chains from the PDB. Residues of the query protein are clustered using LSSPs via three different procedures to define domains. This approach gives results that are comparable to several existing programs that use geometrical and other structural information explicitly. Remarkably, most of the proteins that contribute the LSSPs defining a domain do not themselves contain the domain of interest. This study shows that domains can be defined by a collection of relatively small locally similar structural pieces containing, on average, four secondary structure elements. In addition, it indicates that domains are indeed made of recurrent small structural pieces that are used to build protein structures of many different folds as suggested by recent studies.
Affiliation(s)
- Chin-Hsien Tai
- Laboratory of Molecular Biology, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
|
23
|
He Z, Zhao Y, Mei G, Li N, Chen Y. Could protein tertiary structure influence mammary transgene expression more than tissue specific codon usage? Transgenic Res 2010; 19:519-33. [PMID: 20563642 PMCID: PMC2902731 DOI: 10.1007/s11248-010-9411-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/11/2009] [Accepted: 05/19/2010] [Indexed: 12/03/2022]
Abstract
Animal mammary glands have been successfully employed to produce therapeutic recombinant human proteins. However, considerable variation in animal mammary transgene expression efficiency has been reported. We now consider whether aspects of codon usage and/or protein tertiary structure underlie this variation in mammary transgene expression.
Affiliation(s)
- Zuyong He
- State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-Sen University, 510006, Guangzhou, People's Republic of China
|
24
|
Keating KS, Flores SC, Gerstein MB, Kuhn LA. StoneHinge: hinge prediction by network analysis of individual protein structures. Protein Sci 2009; 18:359-71. [PMID: 19180449 DOI: 10.1002/pro.38] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Indexed: 11/10/2022]
Abstract
Hinge motions are important for molecular recognition, and knowledge of their location can guide the sampling of protein conformations for docking. Predicting domains and intervening hinges is also important for identifying structurally self-determinate units and anticipating the influence of mutations on protein flexibility and stability. Here we present StoneHinge, a novel approach for predicting hinges between domains using input from two complementary analyses of noncovalent bond networks: StoneHingeP, which identifies domain-hinge-domain signatures in ProFlex constraint counting results, and StoneHingeD, which does the same for DomDecomp Gaussian network analyses. Predictions for the two methods are compared to hinges defined in the literature and by visual inspection of interpolated motions between conformations in a series of proteins. For StoneHingeP, all the predicted hinges agree with hinge sites reported in the literature or observed visually, although some predictions include extra residues. Furthermore, no hinges are predicted in six hinge-free proteins. On the other hand, StoneHingeD tends to overpredict the number of hinges, while accurately pinpointing hinge locations. By determining the consensus of their results, StoneHinge improves the specificity, predicting 11 of 13 hinges found both visually and in the literature for nine different open protein structures, and making no false-positive predictions. By comparison, a popular hinge detection method that requires knowledge of both the open and closed conformations finds 10 of the 13 known hinges, while predicting four additional, false hinges.
Affiliation(s)
- Kevin S Keating
- Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
|
25
|
Bruylants G, Redfield C. (15)N NMR relaxation data reveal significant chemical exchange broadening in the alpha-domain of human alpha-lactalbumin. Biochemistry 2009; 48:4031-9. [PMID: 19309110 DOI: 10.1021/bi900023m] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Human alpha-lactalbumin (alpha-LA), a 123-residue calcium-binding protein, has been studied using (15)N NMR relaxation methods in order to characterize backbone dynamics of the native state at the level of individual residues. Relaxation data were collected at three magnetic field strengths and analyzed using the model-free formalism of Lipari and Szabo. The order parameters derived from this analysis are generally high, indicating a rigid backbone. A total of 46 residues required an exchange contribution to T(2); 43 of these residues are located in the alpha-domain of the protein. The largest exchange contributions are observed in the A-, B-, D-, and C-terminal 3(10)-helices of the alpha-domain; these residues have been shown previously to form a highly stable core in the alpha-LA molten globule. The observed exchange broadening, along with previous hydrogen/deuterium amide exchange data, suggests that this part of the alpha-domain may undergo a local structural transition between the well-ordered native structure and a less-ordered molten-globule-like structure.
Affiliation(s)
- Gilles Bruylants
- Department of Biochemistry, University of Oxford, South Parks Road, Oxford OX1 3QU, United Kingdom
|
26
|
Faure G, Bornot A, de Brevern AG. Analysis of protein contacts into Protein Units. Biochimie 2009; 91:876-87. [PMID: 19383526 DOI: 10.1016/j.biochi.2009.04.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Received: 07/24/2008] [Accepted: 04/13/2009] [Indexed: 11/18/2022]
Abstract
Three-dimensional structures of proteins are the support of their biological functions. Their folds are maintained by inter-residue interactions which are one of the main focuses to understand the mechanisms of protein folding and stability. Furthermore, protein structures can be composed of single or multiple functional domains that can fold and function independently. Hence, dividing a protein into domains is useful for obtaining an accurate structure and function determination. In previous studies, we elucidated protein contact properties according to different definitions and developed a novel methodology named Protein Peeling. Within protein structures, Protein Peeling characterizes small successive compact units along the sequence called protein units (PUs). The cutting done by Protein Peeling maximizes the number of contacts within the PUs and minimizes the number of contacts between them. This method is thus a relevant tool in the context of protein folding research, particularly regarding the hierarchical model proposed by George Rose. Here, we accurately analyze the PUs at different levels of cutting, using a non-redundant protein databank. Distribution of PU sizes, number of PUs or their accessibility are screened to determine their common and different features. Moreover, we highlight the preferential amino acid interactions inside and between PUs. Our results show that PUs are clearly an intermediate level between secondary structures and protein structural domains.
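The cutting criterion described above, many contacts inside each unit and few between units, can be illustrated with a toy sketch. This is not the Protein Peeling algorithm, which uses its own partition criterion and cuts hierarchically; it only scores a single cut along the sequence by counting contacts. The names `count_contacts` and `best_cut`, and the example contact list, are assumptions for illustration.

```python
def count_contacts(contacts, cut):
    """Count contacts within and between the two segments defined by `cut`.

    Residues with index < cut form the first segment, the rest the second.
    Returns a tuple (intra_segment_contacts, inter_segment_contacts).
    """
    intra = inter = 0
    for i, j in contacts:
        if (i < cut) == (j < cut):
            intra += 1   # both residues fall on the same side of the cut
        else:
            inter += 1   # the contact spans the cut
    return intra, inter


def best_cut(contacts, n_residues):
    """Return the cut position with the fewest contacts spanning it."""
    return min(range(1, n_residues),
               key=lambda cut: count_contacts(contacts, cut)[1])


# Two tightly packed triplets (0-2 and 3-5) joined by a single contact.
contacts = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cut = best_cut(contacts, n_residues=6)
intra, inter = count_contacts(contacts, cut)
```

Here the chosen cut separates the two triplets, leaving six contacts inside the units and one between them. A real implementation would also need to normalize for unit size so that trivially small units are not favoured.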
Affiliation(s)
- Guilhem Faure
- INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire (EBGM), DSIMB, Université Paris Diderot - Paris 7, case 7113, 2 place Jussieu, 75251 Paris, France
|
27
|
Shi S, Pei J, Sadreyev RI, Kinch LN, Majumdar I, Tong J, Cheng H, Kim BH, Grishin NV. Analysis of CASP8 targets, predictions and assessment methods. Database: The Journal of Biological Databases and Curation 2009; 2009:bap003. [PMID: 20157476 PMCID: PMC2794793 DOI: 10.1093/database/bap003] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 01/27/2009] [Accepted: 02/21/2009] [Indexed: 11/17/2022]
Abstract
Results of the recent Critical Assessment of Techniques for Protein Structure Prediction, CASP8, present several valuable sources of information. First, CASP targets comprise a realistic sample of currently solved protein structures and exemplify the corresponding challenges for predictors. Second, the plethora of predictions by all possible methods provides an unusually rich material for evolutionary analysis of target proteins. Third, CASP results show the current state of the field and highlight specific problems in both predicting and assessing. Finally, these data can serve as grounds to develop and analyze methods for assessing prediction quality. Here we present results of our analysis in these areas. Our objective is not to duplicate CASP assessment, but to use our unique experience as former CASP5 assessors and CASP8 predictors to (i) offer more insights into CASP targets and predictions based on expert analysis, including invaluable analysis prior to target structure release; and (ii) develop an assessment methodology tailored towards current challenges in the field. Specifically, we discuss preparing target structures for assessment, parsing protein domains, balancing evaluations based on domains and on whole chains, dividing targets into categories and developing new evaluation scores. We also present evolutionary analysis of the most interesting and challenging targets. Database URL: Our results are available as a comprehensive database of targets and predictions at http://prodata.swmed.edu/CASP8.
Affiliation(s)
- Shuoyong Shi
- Howard Hughes Medical Institute and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA
|
28
|
Majumdar I, Kinch LN, Grishin NV. A database of domain definitions for proteins with complex interdomain geometry. PLoS One 2009; 4:e5084. [PMID: 19352501 PMCID: PMC2662426 DOI: 10.1371/journal.pone.0005084] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Received: 02/10/2009] [Accepted: 03/10/2009] [Indexed: 11/18/2022] Open
Abstract
Protein structural domains are necessary for understanding evolution and protein folding, and may vary widely from functional and sequence based domains. Although various structural domain databases exist, defining domains for some proteins is non-trivial, and definitions of their domain boundaries are not available. Here, we present a novel database of manually defined structural domains for a representative set of proteins from the SCOP “multi-domain proteins” class (http://prodata.swmed.edu/multidom/). We consider our domains as mobile evolutionary units, which may rearrange during protein evolution. Additionally, they may be visualized as structurally compact and possibly independently folding units. We also found that representing domains as evolutionary and folding units does not always lead to a unique domain definition. However, unlike existing databases, we retain and refine these “alternate” domain definitions after careful inspection of structural similarity, functional sites and automated domain definition methods. We provide domain definitions, including actual residue boundaries, for proteins that well-known databases like SCOP and CATH do not attempt to split. Our alternate domain definitions are suitable for sequence and structure searches by automated methods. Additionally, the database can be used for training and testing domain delineation algorithms. Since our domains represent structurally compact evolutionary units, the database may be useful for studying domain properties and evolution.
Affiliation(s)
- Indraneel Majumdar
- Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
- Lisa N. Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
- Nick V. Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
- Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas, United States of America
|
29
|
Liu Q, Huang J, Liu H, Wan P, Ye X, Xu Y. Analyses of domains and domain fusions in human proto-oncogenes. BMC Bioinformatics 2009; 10:88. [PMID: 19292927 PMCID: PMC2679021 DOI: 10.1186/1471-2105-10-88] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/03/2008] [Accepted: 03/17/2009] [Indexed: 11/18/2022] Open
Abstract
Background Understanding the constituent domains of oncogenes, their origins and their fusions may shed new light on the initiation and development of cancers. Results We have developed a computational pipeline for identification of functional domains of human genes, prediction of the origins of these domains and their major fusion events during evolution through integration of existing and new tools of our own. An application of the pipeline to 124 well-characterized human oncogenes has led to the identification of a collection of domains and domain pairs that occur substantially more frequently in oncogenes than in human genes on average. Most of these enriched domains and domain pairs are related to tyrosine kinase activities. In addition, our analyses indicate that a substantial portion of the domain-fusion events of oncogenes took place in metazoans during evolution. Conclusion We expect that the computational pipeline for domain identification, domain origin and domain fusion prediction will prove to be useful for studying other groups of genes.
Affiliation(s)
- Qi Liu
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.
|
30
|
Sikder AR, Zomaya AY. Inferring boundary information of discontinuous-domain proteins. IEEE Trans Nanobioscience 2008; 7:200-5. [PMID: 18779100 DOI: 10.1109/tnb.2008.2002283] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
Wetlaufer introduced the classification of domains into continuous and discontinuous. Continuous domains form from a single-chain segment and discontinuous domains are composed of two or more chain segments. Richardson identified approximately 100 domains in her review. Her assignment was based on the concepts that the domain would be independently stable and/or could undergo rigid-body-like movements with respect to the entire protein. There are now several instances where structurally similar domains occur in different proteins in the absence of noticeable sequence similarity. Possibly the most notable of such domains is the triose-phosphate isomerase (TIM) barrel. With the increase in the number of known sequences, computer algorithms are required to identify the discontinuous domain of an unknown protein chain in order to determine its structure and function. We have developed a novel algorithm for discontinuous-domain boundary prediction based on a machine learning algorithm and inter-residue contact interaction values. We have used 415 proteins, including 100 discontinuous-domain chains, for training. There is no method available that is based solely on sequence for the prediction of discontinuous domains. DomainDiscovery performed significantly well compared to the structure-based methods like structural classification of proteins (SCOP), class, architecture, topology and homologous superfamily (CATH), and DOMain MAKer (DOMAK).
Affiliation(s)
- Abdur R Sikder
- International Computer Science Institute, University of California, Berkeley, CA 94704 USA.
|
31
|
Faure G, Bornot A, de Brevern AG. Protein contacts, inter-residue interactions and side-chain modelling. Biochimie 2008; 90:626-39. [DOI: 10.1016/j.biochi.2007.11.007] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.5] [Received: 06/14/2007] [Accepted: 11/22/2007] [Indexed: 10/22/2022]
|
32
|
Abstract
Protein databases have become a crucial part of modern biology. Huge amounts of data for protein structures, functions, and particularly sequences are being generated. Searching databases is often the first step in the study of a new protein. Comparison between proteins and between protein families in databases provides information about the relationship between proteins within a genome or across different species, and hence offers much more information than can be obtained by studying only an isolated protein. In addition, secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions. The use of multiple databases often helps researchers understand the structure and function of proteins. Although some protein databases are widely known, they are far from being fully utilized in the protein science community. This unit provides a starting point for readers to explore the potential of protein databases on the Internet.
Affiliation(s)
- Dong Xu
- Digital Biology Laboratory, University of Missouri-Columbia, Columbia, Missouri, USA
|
33
|
Redfern OC, Harrison A, Dallman T, Pearl FMG, Orengo CA. CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Comput Biol 2008; 3:e232. [PMID: 18052539 PMCID: PMC2098860 DOI: 10.1371/journal.pcbi.0030232] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.0] [Received: 03/02/2007] [Accepted: 10/11/2007] [Indexed: 11/19/2022] Open
Abstract
We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification. Proteins comprise individual folding units known as domains, with a significant proportion containing two or more (multidomain structures). Each domain is thought to represent a unit of evolution and adopts a specific fold. Detecting domains is often the first step in classifying proteins into evolutionary families for studying the relationship between sequence, structure, and function. Automatically identifying domains from structural data is problematic due to the fact that domains vary substantially in their compactness and geometric separation from one another in the whole protein. We present a novel method, CATHEDRAL, which iteratively identifies each domain by comparing a query structure against a library of manually verified domains in the CATH domain database through computational structure comparison. We find that CATHEDRAL is able to outperform the majority of popular structure comparison methods for finding structural relatives. Furthermore, it is able to accurately identify domain boundaries and outperform other methods of structure-based domain prediction for the majority of proteins. CATHEDRAL is available as a Webserver to provide domain annotations for the community and hence aid in structural and functional characterisation of newly solved protein structures.
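The boundary accuracy figures quoted above rest on a tolerance of ten residues. A minimal sketch of such a check is given below; the abstract does not spell out the exact scoring rule, so the pairing of boundaries in sorted order and the requirement that the counts agree are assumptions, as is the function name `boundaries_match`.

```python
def boundaries_match(predicted, reference, tolerance=10):
    """Check predicted domain boundaries against reference boundaries.

    Boundaries are residue positions. The prediction is accepted only if it
    has the same number of boundaries as the reference and each predicted
    boundary, paired in sorted order, lies within `tolerance` residues of
    its reference counterpart.
    """
    if len(predicted) != len(reference):
        return False
    pairs = zip(sorted(predicted), sorted(reference))
    return all(abs(p - r) <= tolerance for p, r in pairs)


# Hypothetical three-domain chain with reference boundaries at 100 and 220.
ok = boundaries_match([105, 212], [100, 220])
too_far = boundaries_match([105, 240], [100, 220])
wrong_count = boundaries_match([105], [100, 220])
```

Under this rule the first prediction is accepted, the second is rejected because one boundary is 20 residues off, and the third is rejected because a boundary is missing.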
Affiliation(s)
- Oliver C Redfern
- Department of Biochemistry and Molecular Biology, University College London, London, United Kingdom.
|
34
|
Abstract
Domains are considered to be the building blocks of protein structures. A protein can contain a single domain or multiple domains, each one typically associated with a specific function. The combination of domains determines the function of the protein, its subcellular localization and the interactions it is involved in. Determining the domain structure of a protein is important for multiple reasons, including protein function analysis and structure prediction. This chapter reviews the different approaches for domain prediction and discusses lessons learned from the application of these methods.
Affiliation(s)
- Helgi Ingolfsson
- Department of Physiology and Biophysics, Weill Medical College of Cornell University, Ithaca, NY, USA
|
35
|
Russell RB. Classification of protein folds. Mol Biotechnol 2007; 36:238-47. [PMID: 17873410 DOI: 10.1007/s12033-007-0032-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/1999] [Revised: 11/30/1999] [Accepted: 11/30/1999] [Indexed: 11/26/2022]
Abstract
The diversity and complexity of bioinformatics tools currently available for protein sequence analysis can make it difficult to know where to begin when presented with a new sequence. In this article, we present a protocol outlining one approach to sequence analysis that should give as comprehensive a picture as possible as to the likely structure and function of a protein given the limits of available tools. We also provide worked examples showing how these tools can have an impact on the understanding of protein function prior to experimental studies.
Affiliation(s)
- Robert B Russell
- Structural Bioinformatics, EMBL, Meyerhofstrasse 1, Heidelberg, Germany.
|
36
|
Emmert-Streib F, Mushegian A. A topological algorithm for identification of structural domains of proteins. BMC Bioinformatics 2007; 8:237. [PMID: 17608939 PMCID: PMC1933582 DOI: 10.1186/1471-2105-8-237] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2007] [Accepted: 07/03/2007] [Indexed: 11/10/2022] Open
Abstract
Background: Identification of the structural domains of proteins is important for our understanding of the organizational principles and mechanisms of protein folding, and for insights into protein function and evolution. Algorithmic methods of dissecting proteins of known structure into domains developed so far are based on an examination of multiple geometrical, physical and topological features. Successful as many of these approaches are, they employ a lot of heuristics, and it is not clear whether they illuminate any deep underlying principles of protein domain organization. Other well-performing domain dissection methods rely on comparative sequence analysis. These methods are applicable to sequences with known and unknown structure alike, and their success highlights a fundamental principle of protein modularity, but this does not directly improve our understanding of protein spatial structure. Results: We present a novel graph-theoretical algorithm for the identification of domains in proteins with known three-dimensional structure. We represent the protein structure as an undirected, unweighted and unlabeled graph whose nodes correspond to the secondary structure elements and edges represent physical proximity of at least one pair of alpha carbon atoms from two elements. Domains are identified as constrained partitions of the graph, corresponding to sets of vertices obtained by the maximization of the cycle distributions found in the graph. When a partition is found, the algorithm is iteratively applied to each of the resulting subgraphs. The decision to accept or reject a tentative cut position is based on a specific classifier. The algorithm is applied iteratively to each of the resulting subgraphs and terminates automatically if partitions are no longer accepted. The distribution of cycles is the only type of information on which the decision about protein dissection is based. Despite the barebone simplicity of the approach, our algorithm approaches the best heuristic algorithms in accuracy. Conclusion: Our graph-theoretical algorithm uses only topological information present in the protein structure itself to find the domains and does not rely on any geometrical or physical information about the protein molecule. Perhaps unexpectedly, these drastic constraints on resources, which result in a seemingly approximate description of protein structures and leave only a handful of parameters available for analysis, do not lead to any significant deterioration of algorithm accuracy. It appears that protein structures can be rigorously treated as topological rather than geometrical objects and that the majority of information about protein domains can be inferred from the coarse-grained measure of pairwise proximity between secondary structure elements.
Affiliation(s)
- Frank Emmert-Streib
- Stowers Institute for Medical Research, 1000 E. 50th Street, Kansas City, MO 64110, USA
- University of Washington, 1705 NE Pacific St, Box 355065, Seattle WA 98195-5065, USA
- Arcady Mushegian
- Stowers Institute for Medical Research, 1000 E. 50th Street, Kansas City, MO 64110, USA
- University of Kansas Medical Center, Kansas City, KS 66160, USA
|
37
|
Zhou H, Xue B, Zhou Y. DDOMAIN: Dividing structures into domains using a normalized domain-domain interaction profile. Protein Sci 2007; 16:947-55. [PMID: 17456745 PMCID: PMC2206635 DOI: 10.1110/ps.062597307] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Abstract
Dividing protein structures into domains is proven useful for more accurate structural and functional characterization of proteins. Here, we develop a method, called DDOMAIN, that divides structure into DOMAINs using a normalized contact-based domain-domain interaction profile. Results of DDOMAIN are compared to AUTHORS annotations (domain definitions are given by the authors who solved protein structures), as well as to popular SCOP and CATH annotations by human experts and automatic programs. DDOMAIN's automatic annotations are most consistent with the AUTHORS annotations (90% agreement in number of domains and 88% agreement in both number of domains and at least 85% overlap in domain assignment of residues) if its three adjustable parameters are trained by the AUTHORS annotations. By comparison, the agreement is 83% (81% with at least 85% overlap criterion) between SCOP-trained DDOMAIN and SCOP annotations and 77% (73%) between CATH-trained DDOMAIN and CATH annotations. The agreement between DDOMAIN and AUTHORS annotations goes beyond single-domain proteins (97%, 82%, and 56% for single-, two-, and three-domain proteins, respectively). For an "easy" data set of proteins whose CATH and SCOP annotations agree with each other in number of domains, the agreement is 90% (89%) between "easy-set"-trained DDOMAIN and CATH/SCOP annotations. The consistency between SCOP-trained DDOMAIN and SCOP annotations is superior to two other recently developed, SCOP-trained, automatic methods PDP (protein domain parser), and DomainParser 2. We also tested a simple consensus method made of PDP, DomainParser 2, and DDOMAIN and a different version of DDOMAIN based on a more sophisticated statistical energy function. The DDOMAIN server and its executable are available in the services section on http://sparks.informatics.iupui.edu.
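The agreement criterion quoted in this abstract (same number of domains, and at least 85% overlap in the domain assignment of residues) can be illustrated with a minimal Python sketch. The abstract does not define how overlap is computed, so the definition used here is an assumption: the best fraction of residues with matching labels over all one-to-one pairings of predicted and reference domains. The function names `domain_overlap` and `agrees` are hypothetical and are not part of DDOMAIN.

```python
from itertools import permutations

def domain_overlap(pred, ref):
    """Best fraction of residues with matching labels, over all label pairings.

    `pred` and `ref` are equal-length lists of per-residue domain labels.
    Returns 0.0 when the two annotations disagree on the number of domains.
    This overlap definition is an assumption made for illustration.
    """
    if not ref or len(pred) != len(ref):
        raise ValueError("annotations must be non-empty and cover the same residues")
    pred_labels, ref_labels = sorted(set(pred)), sorted(set(ref))
    if len(pred_labels) != len(ref_labels):
        return 0.0
    best = 0
    # Try every one-to-one pairing of predicted labels with reference labels.
    for perm in permutations(ref_labels):
        mapping = dict(zip(pred_labels, perm))
        matches = sum(1 for p, r in zip(pred, ref) if mapping[p] == r)
        best = max(best, matches)
    return best / len(ref)

def agrees(pred, ref, min_overlap=0.85):
    """True when domain counts match and the overlap reaches the threshold."""
    return domain_overlap(pred, ref) >= min_overlap

# Toy two-domain protein of 100 residues; the predicted boundary is 5 residues off.
reference = [1] * 50 + [2] * 50
predicted = ["a"] * 55 + ["b"] * 45
print(domain_overlap(predicted, reference), agrees(predicted, reference))
```

Trying all label pairings is only practical for a handful of domains; that is enough for a sketch, but a larger comparison would call for an assignment algorithm instead.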
Affiliation(s)
- Hongyi Zhou
- Howard Hughes Medical Institute Center for Single Molecule Biophysics, Department of Physiology and Biophysics, State University of New York at Buffalo, Buffalo, New York 14214, USA
|
38
|
FlexOracle: predicting flexible hinges by identification of stable domains. BMC Bioinformatics 2007; 8:215. [PMID: 17587456 PMCID: PMC1933439 DOI: 10.1186/1471-2105-8-215] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2006] [Accepted: 06/22/2007] [Indexed: 11/28/2022] Open
Abstract
Background: Protein motions play an essential role in catalysis and protein-ligand interactions, but are difficult to observe directly. A substantial fraction of protein motions involve hinge bending. For these proteins, the accurate identification of flexible hinges connecting rigid domains would provide significant insight into motion. Programs such as GNM and FIRST have made global flexibility predictions available at low computational cost, but are not designed specifically for finding hinge points. Results: Here we present the novel FlexOracle hinge prediction approach based on the ideas that energetic interactions are stronger within structural domains than between them, and that fragments generated by cleaving the protein at the hinge site are independently stable. We implement this as a tool within the Database of Macromolecular Motions, MolMovDB.org. For a given structure, we generate pairs of fragments based on scanning all possible cleavage points on the protein chain, compute the energy of the fragments compared with the undivided protein, and predict hinges where this quantity is minimal. We present three specific implementations of this approach. In the first, we consider only pairs of fragments generated by cutting at a single location on the protein chain and then use a standard molecular mechanics force field to calculate the enthalpies of the two fragments. In the second, we generate fragments in the same way but instead compute their free energies using a knowledge based force field. In the third, we generate fragment pairs by cutting at two points on the protein chain and then calculate their free energies. Conclusion: Quantitative results demonstrate our method's ability to predict known hinges from the Database of Macromolecular Motions.
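The single-cut scan described in this abstract can be sketched in a few lines of Python. This is an illustration under a strong simplifying assumption: the force-field energies are replaced by a crude contact count, so the energy of the two fragments relative to the undivided chain reduces to the number of residue contacts that the cut breaks. The function names, the 7.0 Å cutoff, and the minimum fragment length are hypothetical choices and are not taken from FlexOracle.

```python
import math

def crossing_contacts(coords, cut, cutoff=7.0):
    """Count residue pairs in contact that fall on opposite sides of `cut`.

    Residues 0..cut-1 form the first fragment and cut..n-1 the second.
    The contact count stands in for the fragment-vs-whole energy difference.
    """
    count = 0
    for i in range(cut):
        for j in range(cut, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                count += 1
    return count

def predict_hinge(coords, min_fragment=3, cutoff=7.0):
    """Scan single cleavage points; return the cut that breaks the fewest contacts."""
    cuts = range(min_fragment, len(coords) - min_fragment + 1)
    return min(cuts, key=lambda c: crossing_contacts(coords, c, cutoff))

# Toy chain: two compact four-residue clusters joined between residues 3 and 4.
toy_coords = [
    (0, 0, 0), (0, 4, 0), (4, 4, 0), (4, 0, 0),
    (10, 0, 0), (10, 4, 0), (14, 4, 0), (14, 0, 0),
]
hinge = predict_hinge(toy_coords)
print(hinge)
```

A contact count ignores everything a molecular mechanics or knowledge-based force field would capture, so the sketch shows only the scanning logic, not the published scoring.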
|
39
|
Hinge Atlas: relating protein sequence to sites of structural flexibility. BMC Bioinformatics 2007; 8:167. [PMID: 17519025 PMCID: PMC1913541 DOI: 10.1186/1471-2105-8-167] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2006] [Accepted: 05/22/2007] [Indexed: 12/03/2022] Open
Abstract
Background: Relating features of protein sequences to structural hinges is important for identifying domain boundaries, understanding structure-function relationships, and designing flexibility into proteins. Efforts in this field have been hampered by the lack of a proper dataset for studying characteristics of hinges. Results: Using the Molecular Motions Database we have created a Hinge Atlas of manually annotated hinges and a statistical formalism for calculating the enrichment of various types of residues in these hinges. Conclusion: We found various correlations between hinges and sequence features. Some of these are expected; for instance, we found that hinges tend to occur on the surface and in coils and turns and to be enriched with small and hydrophilic residues. Others are less obvious and intuitive. In particular, we found that hinges tend to coincide with active sites, but unlike the latter they are not at all conserved in evolution. We evaluate the potential for hinge prediction based on sequence.
Motions play an important role in catalysis and protein-ligand interactions. Hinge bending motions comprise the largest class of known motions. Therefore it is important to relate the hinge location to sequence features such as residue type, physicochemical class, secondary structure, solvent exposure, evolutionary conservation, and proximity to active sites. To do this, we first generated the Hinge Atlas, a set of protein motions with the hinge locations manually annotated, and then studied the coincidence of these features with the hinge location. We found that all of the features have bearing on the hinge location. Most interestingly, we found that hinges tend to occur at or near active sites and yet unlike the latter are not conserved. Less surprisingly, we found that hinge residues tend to be small, not hydrophobic or aliphatic, and occur in turns and random coils on the surface. A functional sequence based hinge predictor was made which uses some of the data generated in this study. The Hinge Atlas is made available to the community for further flexibility studies.
|
40
|
Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, Dibley M, Redfern O, Pearl F, Nambudiry R, Reid A, Sillitoe I, Yeats C, Thornton JM, Orengo CA. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res 2006; 35:D291-7. [PMID: 17135200 PMCID: PMC1751535 DOI: 10.1093/nar/gkl959] [Citation(s) in RCA: 239] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
We report the latest release (version 3.0) of the CATH protein domain database. There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, that have been made easier by the provision of information rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS) which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
Affiliation(s)
- Alison Cuff
- To whom correspondence should be addressed: Tel: +1 44 207 679 3890; Fax: +1 44 207 679 7193;
- Janet M. Thornton
- European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB 10 IRQ, UK
|
41
|
Tanaka T, Yokoyama S, Kuroda Y. Improvement of domain linker prediction by incorporating loop-length-dependent characteristics. Biopolymers 2006; 84:161-8. [PMID: 16134173 DOI: 10.1002/bip.20361] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Protein dissection into structural domains that can fold in isolation is an important issue in both functional and structural proteomics. Here, we analyzed inter- and intradomain loop sequences (respectively named domain linker and nonlinker loops) and computed a domain linker likelihood score, which was used for developing a domain boundary prediction protocol. The analysis confirmed our previous results indicating that the amino acid composition in terms of glycine, proline, aspartic acid, asparagine, lysine, and histidine significantly differs between linker and nonlinker loops. However, a detailed examination revealed that the amino acid composition bias actually depends on the loop length. Indeed, significant frequency deviations were observed for glycine, proline, and aspartic acid in short linker and nonlinker loops, whereas deviations were observed for aspartic acid, proline, asparagine, and lysine in long linker and nonlinker loops. Finally, we incorporated this loop-length-dependent amino acid composition bias in a simple linker prediction protocol, which predicted linkers with a 40.6% specificity and a 36.1% sensitivity. These figures are 4.4 and 2.4% higher than those obtained with our former prediction protocol that does not incorporate loop-length-dependent characteristics. This result should have practical significance for experimental protein dissection, since the probability of obtaining a stably folding structural domain by randomly dissecting a protein sequence is estimated to be 12.6%.
Affiliation(s)
- Takanori Tanaka
- Department of Biophysics and Biochemistry, Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
|
42
|
Kundu S, Sorensen DC, Phillips GN. Automatic domain decomposition of proteins by a Gaussian Network Model. Proteins 2006; 57:725-33. [PMID: 15478120 DOI: 10.1002/prot.20268] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Proteins are often comprised of domains of apparently independent folding units. These domains can be defined in various ways, but one useful definition divides the protein into substructures that seem to move more or less independently. The same methods that allow fairly accurate calculation of motion can be used to help classify these substructures. We show how the Gaussian Network Model (GNM), commonly used for determining motion, can also be adapted to automatically classify domains in proteins. Parallels between this physical network model and graph theory implementation are apparent. The method is applied to a nonredundant set of 55 proteins, and the results are compared to the visual assignments by crystallographers. Apart from decomposing proteins into structural domains, the algorithm can generally be applied to any large macromolecular system to decompose it into motionally decoupled sub-systems.
Affiliation(s)
- Sibsankar Kundu
- Department of Biochemistry, University of Wisconsin, Madison, Wisconsin 53706, USA
|
43
|
Sistla RK, K V B, Vishveshwara S. Identification of domains and domain interface residues in multidomain proteins from graph spectral method. Proteins 2006; 59:616-26. [PMID: 15789418 DOI: 10.1002/prot.20444] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
We present a novel method for the identification of structural domains and domain interface residues in proteins by graph spectral method. This method converts the three-dimensional structure of the protein into a graph by using atomic coordinates from the PDB file. Domain definitions are obtained by constructing either a protein backbone graph or a protein side-chain graph. The graph is constructed based on the interactions between amino acid residues in the three-dimensional structure of the proteins. The spectral parameters of such a graph contain information regarding the domains and subdomains in the protein structure. This is based on the fact that the interactions among amino acids are higher within a domain than across domains. This is evident in the spectra of the protein backbone and the side-chain graphs, thus differentiating the structural domains from one another. Further, residues that occur at the interface of two domains can also be easily identified from the spectra. This method is simple, elegant, and robust. Moreover, a single numeric computation yields both the domain definitions and the interface residues.
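The idea that interactions are denser within a domain than across domains can be illustrated with a generic spectral bisection in Python. This is a sketch under assumptions, not the published procedure: the graph is a plain distance-cutoff contact graph (the abstract's backbone and side-chain graphs are not specified here), the 6.5 Å cutoff is arbitrary, and the split uses the sign of the eigenvector belonging to the second-smallest Laplacian eigenvalue (the Fiedler vector), which may differ from the spectral parameters the authors use. All function names are hypothetical.

```python
import math

def contact_laplacian(coords, cutoff=6.5):
    """Graph Laplacian (degree minus adjacency) of a distance-cutoff contact graph."""
    n = len(coords)
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                lap[i][j] = lap[j][i] = -1.0
                lap[i][i] += 1.0
                lap[j][j] += 1.0
    return lap

def fiedler_vector(lap, iterations=2000):
    """Approximate the eigenvector of the second-smallest Laplacian eigenvalue.

    Power iteration on (shift*I - L), with the constant vector projected out
    at every step; this assumes the contact graph is connected.
    """
    n = len(lap)
    shift = 2.0 * max(lap[i][i] for i in range(n)) + 1.0
    vec = [math.sin(i + 1.0) for i in range(n)]  # deterministic starting guess
    for _ in range(iterations):
        avg = sum(vec) / n
        vec = [v - avg for v in vec]
        new = [shift * vec[i] - sum(lap[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    return vec

def split_by_sign(vec):
    """Label each residue 0 or 1 according to the sign of its vector component."""
    return [0 if v < 0 else 1 for v in vec]

# Toy structure: two tightly packed four-residue clusters with two bridging contacts.
toy_coords = [
    (0, 0, 0), (4, 0, 0), (0, 4, 0), (4, 4, 0),
    (10, 0, 0), (14, 0, 0), (10, 4, 0), (14, 4, 0),
]
labels = split_by_sign(fiedler_vector(contact_laplacian(toy_coords)))
print(labels)
```

Residues whose vector components lie close to zero sit near the cut, which gives one plausible reading of how interface residues could be flagged from the same computation.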
Affiliation(s)
- Ramesh K Sistla
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
|
44
|
Galzitskaya OV, Dovidchenko NV, Lobanov MY, Garbuzynskiy SO. Prediction of protein domain boundaries from statistics of appearance of amino acid residues. Mol Biol 2006. [DOI: 10.1134/s0026893306010146] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
|
45
|
Sapranauskas R, Lubys A. Random gene dissection: a tool for the investigation of protein structural organization. Biotechniques 2005; 39:395-402. [PMID: 16206911 DOI: 10.2144/05393rr01] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open
Abstract
To investigate the domain structure of proteins and the function of individual domains, proteins are usually subjected to limited proteolysis, followed by isolation of protein fragments and determination of their functions. We have developed an approach we call random gene dissection (RGD) for the identification of functional protein domains and their interdomain regions as well as their in vivo complementing fragments. The approach was tested on a two-domain protein, the type IIS restriction endonuclease BfiI. The collection of BfiI insertional mutants was screened for those that are endonucleolytically active and thus induce the SOS DNA repair response. Sixteen isolated mutants of the wild-type specificity contained insertions that were dispersed in a relatively large region of the target recognition domain. They split the gene into two complementing parts that separately were unable to induce the SOS DNA repair response. In contrast, all 19 mutants of relaxed specificity contained the cassette inserted into a very narrow interdomain region that connects BfiI domains responsible for DNA recognition and for cleavage. As expected, only the N-terminal fragment of BfiI was required to induce SOS response. Our results demonstrate that RGD can be used as a general method to identify complementing fragments and functional domains in enzymes.
|
46
|
Simon K, Xu J, Kim C, Skrynnikov NR. Estimating the accuracy of protein structures using residual dipolar couplings. J Biomol NMR 2005; 33:83-93. [PMID: 16258827 DOI: 10.1007/s10858-005-2601-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/07/2005] [Accepted: 08/05/2005] [Indexed: 05/05/2023]
Abstract
It has been commonly recognized that residual dipolar coupling data provide a measure of quality for protein structures. To quantify this observation, a database of 100 single-domain proteins has been compiled where each protein was represented by two independently solved structures. Backbone 1H-15N dipolar couplings were simulated for the target structures and then fitted to the model structures. The fits were characterized by an R-factor which was corrected for the effects of non-uniform distribution of dipolar vectors on a unit sphere. The analyses show that favorable R values virtually guarantee high accuracy of the model structure (where accuracy is defined as the backbone coordinate rms deviation). On the other hand, unfavorable R values do not necessarily suggest low accuracy. Based on the simulated data, a simple empirical formula is proposed to estimate the accuracy of protein structures. The method is illustrated with a number of examples, including PDZ2 domain of human phosphatase hPTP1E.
Affiliation(s)
- Katya Simon
- Department of Chemistry, Purdue University, West Lafayette, IN 47907, USA
|
47
|
Dumontier M, Yao R, Feldman HJ, Hogue CWV. Armadillo: domain boundary prediction by amino acid composition. J Mol Biol 2005; 350:1061-73. [PMID: 15978619 DOI: 10.1016/j.jmb.2005.05.037] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2004] [Revised: 05/16/2005] [Accepted: 05/18/2005] [Indexed: 11/25/2022]
Abstract
The identification and annotation of protein domains provides a critical step in the accurate determination of molecular function. Both computational and experimental methods of protein structure determination may be deterred by large multi-domain proteins or flexible linker regions. Knowledge of domains and their boundaries may reduce the experimental cost of protein structure determination by allowing researchers to work on a set of smaller and possibly more successful alternatives. Current domain prediction methods often rely on sequence similarity to conserved domains and as such are poorly suited to detect domain structure in poorly conserved or orphan proteins. We present here a simple computational method to identify protein domain linkers and their boundaries from sequence information alone. Our domain predictor, Armadillo (http://armadillo.blueprint.org), uses any amino acid index to convert a protein sequence to a smoothed numeric profile from which domains and domain boundaries may be predicted. We derived an amino acid index called the domain linker propensity index (DLI) from the amino acid composition of domain linkers using a non-redundant structure dataset. The index indicates that Pro and Gly show a propensity for linker residues while small hydrophobic residues do not. Armadillo predicts domain linker boundaries from Z-score distributions and obtains 35% sensitivity with DLI in a two-domain, single-linker dataset (within +/-20 residues from linker). The combination of DLI and an entropy-based amino acid index increases the overall Armadillo sensitivity to 56% for two domain proteins. Moreover, Armadillo achieves 37% sensitivity for multi-domain proteins, surpassing most other prediction methods. Armadillo provides a simple, but effective method by which prediction of domain boundaries can be obtained with reasonable sensitivity. 
Armadillo should prove to be a valuable tool for rapidly delineating protein domains in poorly conserved proteins or those with no sequence neighbors. As a first-line predictor, domain meta-predictors could yield improved results with Armadillo predictions.
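The profile-based pipeline summarised in this abstract (amino acid index, smoothed numeric profile, Z-scores) can be sketched in Python as follows. The index values in `TOY_INDEX` are invented for illustration and are not the published DLI; the window size, the Z-score cutoff, and all function names are likewise assumptions rather than Armadillo's actual settings.

```python
from statistics import mean, pstdev

# Hypothetical linker-propensity values, for illustration only; these are
# NOT the published DLI numbers. Unlisted residues default to 0.0.
TOY_INDEX = {"P": 1.0, "G": 0.8, "S": 0.4, "A": -0.3, "V": -0.6, "L": -0.7, "I": -0.7}

def smoothed_profile(seq, index, window=5):
    """Average the per-residue index over a centred sliding window."""
    half = window // 2
    raw = [index.get(aa, 0.0) for aa in seq]
    profile = []
    for i in range(len(raw)):
        lo, hi = max(0, i - half), min(len(raw), i + half + 1)
        profile.append(mean(raw[lo:hi]))
    return profile

def z_scores(values):
    """Standardise a profile; a flat profile yields all zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]

def candidate_linker_positions(seq, index=TOY_INDEX, window=5, z_cutoff=1.5):
    """Return 0-based positions whose smoothed Z-score exceeds the cutoff."""
    z = z_scores(smoothed_profile(seq, index, window))
    return [i for i, score in enumerate(z) if score > z_cutoff]

# Toy sequence: two hydrophobic stretches joined by a Pro/Gly-rich segment (positions 10-14).
toy_seq = "LVLAIVLLAV" + "PGPGP" + "VLIALLVAIL"
positions = candidate_linker_positions(toy_seq)
print(positions)
```

Because the abstract says any amino acid index can be used, the index is passed in as a plain dictionary, so a second index (for example an entropy-based one) could be swapped in or averaged with the first.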
Affiliation(s)
- Michel Dumontier
- Department of Biochemistry, University of Toronto, Toronto, Ont., Canada M5S 1A8
|
48
|
Abstract
We can now assign about two thirds of the sequences from completed genomes to as few as 1400 domain families for which structures are known and thus more ancient evolutionary relationships established. About 200 of these domain families are common to all kingdoms of life and account for nearly 50% of domain structure annotations in the genomes. Some of these domain families have been very extensively duplicated within a genome and combined with different domain partners giving rise to different multidomain proteins. The ways in which these domain combinations evolve tend to be specific to the organism so that less than 15% of the protein families found within a genome appear to be common to all kingdoms of life. Recent analyses of completed genomes, exploiting the structural data, have revealed the extent to which duplication of these domains and modifications of their functions can expand the functional repertoire of the organism, contributing to increasing complexity.
Affiliation(s)
- Christine A Orengo
- Department of Biochemistry and Molecular Biology, University College, London WC1E 6BT, United Kingdom.
|
49
|
Bae K, Mallick BK, Elsik CG. Prediction of protein interdomain linker regions by a hidden Markov model. Bioinformatics 2005; 21:2264-70. [PMID: 15746283 DOI: 10.1093/bioinformatics/bti363] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Motivation: Our aim was to predict protein interdomain linker regions using sequence alone, without requiring known homology. Identifying linker regions will delineate domain boundaries, and can be used to computationally dissect proteins into domains prior to clustering them into families. We developed a hidden Markov model of linker/non-linker sequence regions using a linker index derived from amino acid propensity. We employed an efficient Bayesian estimation of the model using Markov Chain Monte Carlo, Gibbs sampling in particular, to simulate parameters from the posteriors. Our model recognizes sequence data to be continuous rather than categorical, and generates a probabilistic output. Results: We applied our method to a dataset of protein sequences in which domains and interdomain linkers had been delineated using the Pfam-A database. The prediction results are superior to a simpler method that also uses linker index.
Affiliation(s)
- Kyounghwa Bae
- Department of Statistics, Texas A&M University College Station, TX 77843-3143, USA
|
50
|
Abstract
The normal modes of a molecule are utilized, in conjunction with classical conformal vector field theory, to define a function that measures the capability of the molecule to deform at each of its residues. An efficient algorithm is presented to calculate the local chain deformability from the set of normal modes of vibration. This is done by considering each mode as an off-grid sample of a deformation vector field. Predictions of deformability are compared with experimental data in the form of dihedral angle differences between two conformations of ten kinases by using a modified correlation function. Deformability calculations correlate well with experimental results and validate the applicability of this method to protein flexibility predictions.
Affiliation(s)
- Julio A Kovacs
- Department of Molecular Biology, The Scripps Research Institute La Jolla, California 92037, USA.
|