51
Clark NJ, Raththagala M, Wright NT, Buenger EA, Schildbach JF, Krueger S, Curtis JE. Structures of TraI in solution. J Mol Model 2014; 20:2308. [PMID: 24898939 DOI: 10.1007/s00894-014-2308-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/01/2014] [Accepted: 05/12/2014] [Indexed: 10/25/2022]
Abstract
Bacterial conjugation, a DNA transfer mechanism involving transport of one plasmid strand from donor to recipient, is driven by plasmid-encoded proteins. The F TraI protein nicks one F plasmid strand, separates cut and uncut strands, and pilots the cut strand through a secretion pore into the recipient. TraI is a modular protein with identifiable nickase, ssDNA-binding, helicase and protein-protein interaction domains. While domain structures corresponding to roughly 1/3 of TraI have been determined, there has been no comprehensive structural study of the entire TraI molecule, nor an examination of structural changes to TraI upon binding DNA. Here, we combine solution studies using small-angle scattering and circular dichroism spectroscopy with molecular Monte Carlo and molecular dynamics simulations to assess solution behavior of individual and groups of domains. Despite having several long (>100 residues) apparently disordered or highly dynamic regions, TraI folds into a compact molecule. Based on the biophysical characterization, we have generated models of intact TraI. These data and the resulting models have provided clues to the regulation of TraI function.
Affiliation(s)
- Nicholas J Clark
  - NIST Center for Neutron Research, National Institute of Standards and Technology, 100 Bureau Drive, Mail Stop 6102, Gaithersburg, MD, 20899, USA
52
Ali H, Urolagin S, Gurarslan Ö, Vihinen M. Performance of Protein Disorder Prediction Programs on Amino Acid Substitutions. Hum Mutat 2014; 35:794-804. [DOI: 10.1002/humu.22564] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 12/20/2013] [Accepted: 04/04/2014] [Indexed: 01/04/2023]
Affiliation(s)
- Heidi Ali
  - Institute of Biomedical Technology; FI-33014 University of Tampere; Tampere Finland
  - BioMediTech; Tampere Finland
- Siddhaling Urolagin
  - Department of Experimental Medical Science; Lund University; SE-22184 Lund Sweden
- Ömer Gurarslan
  - Institute of Biomedical Technology; FI-33014 University of Tampere; Tampere Finland
  - BioMediTech; Tampere Finland
- Mauno Vihinen
  - Institute of Biomedical Technology; FI-33014 University of Tampere; Tampere Finland
  - BioMediTech; Tampere Finland
  - Department of Experimental Medical Science; Lund University; SE-22184 Lund Sweden
  - Tampere University Hospital; Tampere Finland
53
Kumeta M, Gilmore JL, Umeshima H, Ishikawa M, Kitajiri SI, Horigome T, Kengaku M, Takeyasu K. Caprice/MISP is a novel F-actin bundling protein critical for actin-based cytoskeletal reorganizations. Genes Cells 2014; 19:338-49. [DOI: 10.1111/gtc.12131] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 10/17/2013] [Accepted: 12/18/2013] [Indexed: 02/02/2023]
Affiliation(s)
- Masahiro Kumeta
  - Graduate School of Biostudies; Kyoto University; Kyoto 606-8501 Japan
- Jamie L. Gilmore
  - Graduate School of Biostudies; Kyoto University; Kyoto 606-8501 Japan
- Hiroki Umeshima
  - Institute for Integrated Cell-Material Sciences (iCeMS); Kyoto University; Kyoto 606-8501 Japan
- Masaaki Ishikawa
  - Graduate School of Medicine; Kyoto University; Kyoto 606-8507 Japan
- Tsuneyoshi Horigome
  - Graduate School of Science and Technology; Niigata University; Niigata 950-2181 Japan
- Mineko Kengaku
  - Institute for Integrated Cell-Material Sciences (iCeMS); Kyoto University; Kyoto 606-8501 Japan
- Kunio Takeyasu
  - Graduate School of Biostudies; Kyoto University; Kyoto 606-8501 Japan
54
Kurotani A, Tokmakov AA, Kuroda Y, Fukami Y, Shinozaki K, Sakurai T. Correlations between predicted protein disorder and post-translational modifications in plants. Bioinformatics 2014; 30:1095-1103. [PMID: 24403539 PMCID: PMC3982157 DOI: 10.1093/bioinformatics/btt762] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 07/26/2013] [Accepted: 12/24/2013] [Indexed: 01/24/2023]
Abstract
MOTIVATION: Protein structural research in plants lags behind that in animal and bacterial species. This lag concerns both the structural analysis of individual proteins and the proteome-wide characterization of structure-related properties. Until now, no systematic study concerning the relationships between protein disorder and multiple post-translational modifications (PTMs) in plants has been presented. RESULTS: In this work, we calculated the global degree of intrinsic disorder in the complete proteomes of eight typical monocotyledonous and dicotyledonous plant species. We further predicted multiple sites for phosphorylation, glycosylation, acetylation and methylation and examined the correlations of protein disorder with the presence of the predicted PTM sites. It was found that phosphorylation, acetylation and O-glycosylation displayed a clear preference for occurrence in disordered regions of plant proteins. In contrast, methylation tended to avoid disordered sequence, whereas N-glycosylation did not show a universal structural preference in monocotyledonous and dicotyledonous plants. In addition, the analysis performed revealed significant differences between the integral characteristics of monocot and dicot proteomes. They included elevated disorder degree, increased rate of O-glycosylation and R-methylation, decreased rate of N-glycosylation, K-acetylation and K-methylation in monocotyledonous plant species, as compared with dicotyledonous species. Altogether, our study provides the most compelling evidence so far for the connection between protein disorder and multiple PTMs in plants. CONTACT: tokmak@phoenix.kobe-u.ac.jp or tetsuya.sakurai@riken.jp Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Atsushi Kurotani
  - RIKEN Center for Sustainable Resource Science 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan, Department of Biotechnology and Life Science, Faculty of Technology, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei, Tokyo 184-8588, Japan and Research Center for Environmental Genomics, Kobe University, 1-1 Rokko dai, Nada, Kobe 657-8501, Japan
- Alexander A Tokmakov
  - RIKEN Center for Sustainable Resource Science 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan, Department of Biotechnology and Life Science, Faculty of Technology, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei, Tokyo 184-8588, Japan and Research Center for Environmental Genomics, Kobe University, 1-1 Rokko dai, Nada, Kobe 657-8501, Japan
- Yutaka Kuroda
  - RIKEN Center for Sustainable Resource Science 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan, Department of Biotechnology and Life Science, Faculty of Technology, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei, Tokyo 184-8588, Japan and Research Center for Environmental Genomics, Kobe University, 1-1 Rokko dai, Nada, Kobe 657-8501, Japan
- Yasuo Fukami
  - RIKEN Center for Sustainable Resource Science 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan, Department of Biotechnology and Life Science, Faculty of Technology, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei, Tokyo 184-8588, Japan and Research Center for Environmental Genomics, Kobe University, 1-1 Rokko dai, Nada, Kobe 657-8501, Japan
- Kazuo Shinozaki
  - RIKEN Center for Sustainable Resource Science 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan, Department of Biotechnology and Life Science, Faculty of Technology, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei, Tokyo 184-8588, Japan and Research Center for Environmental Genomics, Kobe University, 1-1 Rokko dai, Nada, Kobe 657-8501, Japan
- Tetsuya Sakurai
  - RIKEN Center for Sustainable Resource Science 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan, Department of Biotechnology and Life Science, Faculty of Technology, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei, Tokyo 184-8588, Japan and Research Center for Environmental Genomics, Kobe University, 1-1 Rokko dai, Nada, Kobe 657-8501, Japan
55
Shimizu K. POODLE: tools predicting intrinsically disordered regions of amino acid sequence. Methods Mol Biol 2014; 1137:131-45. [PMID: 24573479 DOI: 10.1007/978-1-4939-0366-5_10] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/25/2023]
Abstract
Protein intrinsic disorder, a widespread phenomenon characterized by a lack of stable three-dimensional structure, is thought to play an important role in protein function. In the last decade, dozens of computational methods for predicting intrinsic disorder from amino acid sequences have been developed. They are widely used by structural biologists not only for analyzing the biological function of intrinsic disorder but also for finding flexible regions that possibly hinder successful crystallization of the full-length protein. In this chapter, I introduce Prediction Of Order and Disorder by machine LEarning (POODLE), which is a series of programs accurately predicting intrinsic disorder. After giving the theoretical background for predicting intrinsic disorder, I give a detailed guide to using POODLE. I then also briefly introduce a case study where using POODLE for functional analyses of protein disorder led to novel biological findings.
Affiliation(s)
- Kana Shimizu
  - Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
56
Becker J, Maes F, Wehenkel L. On the encoding of proteins for disordered regions prediction. PLoS One 2013; 8:e82252. [PMID: 24358161 PMCID: PMC3864923 DOI: 10.1371/journal.pone.0082252] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 07/11/2013] [Accepted: 10/21/2013] [Indexed: 12/02/2022]
Abstract
Disordered regions, i.e., regions of proteins that do not adopt a stable three-dimensional structure, have been shown to play various and critical roles in many biological processes. Predicting and understanding their formation is therefore a key sub-problem of protein structure and function inference. A wide range of machine learning approaches have been developed to automatically predict disordered regions of proteins. One key factor of the success of these methods is the way in which protein information is encoded into features. Recently, we have proposed a systematic methodology to study the relevance of various feature encodings in the context of disulfide connectivity pattern prediction. In the present paper, we adapt this methodology to the problem of predicting disordered regions and assess it on proteins from the 10th CASP competition, as well as on a very large subset of proteins extracted from PDB. Our results, obtained with ensembles of extremely randomized trees, highlight a novel feature function encoding the proximity of residues according to their accessibility to the solvent, which is playing the second most important role in the prediction of disordered regions, just after evolutionary information. Furthermore, even though our approach treats each residue independently, our results are very competitive in terms of accuracy with respect to the state-of-the-art. A web-application is available at http://m24.giga.ulg.ac.be:81/x3Disorder.
Affiliation(s)
- Julien Becker
  - Bioinformatics and Modeling, GIGA-Research, University of Liege, Liege, Belgium
- Francis Maes
  - Department of Electrical Engineering and Computer Science, Montefiore Institute, University of Liege, Liege, Belgium
  - Declaratieve Talen en Artificiele Intelligentie, Departement Computerwetenschappen, University of Leuven, Leuven, Belgium
- Louis Wehenkel
  - Department of Electrical Engineering and Computer Science, Montefiore Institute, University of Liege, Liege, Belgium
57
Hirose S, Noguchi T. ESPRESSO: a system for estimating protein expression and solubility in protein expression systems. Proteomics 2013; 13:1444-56. [PMID: 23436767 DOI: 10.1002/pmic.201200175] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 04/29/2012] [Revised: 01/27/2013] [Accepted: 02/06/2013] [Indexed: 11/11/2022]
Abstract
Recombinant protein technology is essential for conducting protein science and using proteins as materials in pharmaceutical or industrial applications. Although obtaining soluble proteins is still a major experimental obstacle, knowledge about protein expression/solubility under standard conditions may increase the efficiency and reduce the cost of proteomics studies. In this study, we present a computational approach to estimate the probability of protein expression and solubility for two different protein expression systems: in vivo Escherichia coli and wheat germ cell-free, from only the sequence information. It implements two kinds of methods: a sequence/predicted structural property-based method that uses both the sequence and predicted structural features, and a sequence pattern-based method that utilizes the occurrence frequencies of sequence patterns. In the benchmark test, the proposed methods obtained F-scores of around 70%, and outperformed publicly available servers. Applying the proposed methods to genomic data revealed that proteins associated with translation or transcription have a strong tendency to be expressed as soluble proteins by the in vivo E. coli expression system. The sequence pattern-based method also has the potential to indicate a candidate region for modification, to increase protein solubility. All methods are available for free at the ESPRESSO server (http://mbs.cbrc.jp/ESPRESSO).
Affiliation(s)
- Shuichi Hirose
  - Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.
58
Brangulis K, Petrovskis I, Kazaks A, Tars K, Ranka R. Crystal structure of the infectious phenotype-associated outer surface protein BBA66 from the Lyme disease agent Borrelia burgdorferi. Ticks Tick Borne Dis 2013; 5:63-8. [PMID: 24246708 DOI: 10.1016/j.ttbdis.2013.09.005] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 03/26/2013] [Revised: 09/04/2013] [Accepted: 09/09/2013] [Indexed: 11/16/2022]
Abstract
Borrelia burgdorferi, the causative agent of Lyme disease, is transmitted to mammalian host organisms by infected Ixodes ticks. Transfer of the spirochaetal bacteria from Ixodes ticks to the warm-blooded mammalian organism provides a challenge for the bacteria to adapt and survive in the different environmental conditions. B. burgdorferi has managed to differentially express genes in response to the encountered changes such as temperature and pH variance or metabolic rate to survive in both environments. In recent years, much interest has turned to genes that are upregulated during the borrelial transfer to mammalian organisms as this could reveal the proteins important in the pathogenesis of Lyme disease. BBA66 is one of the upregulated outer surface proteins thought to be important in the pathogenesis of B. burgdorferi, as it has been found that BBA66 is necessary during the transmission and propagation phase to initiate Lyme disease. As there is still little known about the pathogenesis of B. burgdorferi, we have solved the crystal structure of the outer surface protein BBA66 at 2.25 Å resolution. A monomer of BBA66 consists of 6 α-helices packed in a globular domain, and the overall folding is similar to the homologous proteins BBA64, BBA73, and CspA. Structure-based sequence alignment with the homologous protein BBA64 revealed that the conserved residues are mainly located toward the core region of the protein and thus may be required to maintain the overall fold of the protein. Unlike the other homologous proteins, BBA66 has an atypically long disordered region at the N terminus thought to act as a "tether" between the structural domain and the cell surface.
Affiliation(s)
- Kalvis Brangulis
  - Latvian Biomedical Research and Study Centre, Ratsupites 1, LV-1067 Riga, Latvia; Riga Stradins University, Dzirciema 16, LV-1007 Riga, Latvia.
59
Hamaguchi M, Kamikubo H, Suzuki KN, Hagihara Y, Yanagihara I, Sakata I, Kataoka M, Hamada D. Structural basis of α-catenin recognition by EspB from enterohaemorrhagic E. coli based on hybrid strategy using low-resolution structural and protein dissection. PLoS One 2013; 8:e71618. [PMID: 23967227 PMCID: PMC3743801 DOI: 10.1371/journal.pone.0071618] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/19/2013] [Accepted: 07/02/2013] [Indexed: 01/29/2023]
Abstract
Enterohaemorrhagic E. coli (EHEC) induces actin reorganization of host cells by injecting various effectors into host cytosol through type III secretion systems. EspB is the natively partially folded EHEC effector which binds to host α-catenin to promote the actin bundling. However, its structural basis is poorly understood. Here, we characterize the overall structural properties of EspB based on low-resolution structural data in conjunction with protein dissection strategy. EspB showed a unique thermal response involving cold denaturation in the presence of denaturant according to far-UV circular dichroism (CD). Small angle X-ray scattering revealed the formation of a highly extended structure of EspB comparable to the ideal random coil. Various disorder predictions as well as CD spectra of EspB fragments identified the presence of α-helical structures around G41 to Q70. The fragment corresponding to this region indicated the thermal response similar to EspB. Moreover, this fragment showed a high affinity to C-terminal vinculin homology domain of α-catenin. The results clarified the importance of preformed α-helix of EspB for recognition of α-catenin.
Affiliation(s)
- Mitsuhide Hamaguchi
  - Department of Emergency Critical Care Medicine, School of Medicine, Kinki University, Osakasayama, Osaka, Japan
  - Research Institute, Osaka Medical Center for Maternal and Child Health, Izumi, Japan
- Hironari Kamikubo
  - Laboratory of Bioenergetics and Biophysics, Nara Institute of Science and Technology (NAIST), Ikoma, Nara, Japan
- Kayo N. Suzuki
  - Research Institute, Osaka Medical Center for Maternal and Child Health, Izumi, Japan
- Yoshihisa Hagihara
  - National Institute of Advanced Industrial Science and Technology (AIST), Ikeda, Osaka, Japan
- Itaru Yanagihara
  - Research Institute, Osaka Medical Center for Maternal and Child Health, Izumi, Japan
- Ikuhiro Sakata
  - Department of Emergency Critical Care Medicine, School of Medicine, Kinki University, Osakasayama, Osaka, Japan
- Mikio Kataoka
  - Laboratory of Bioenergetics and Biophysics, Nara Institute of Science and Technology (NAIST), Ikoma, Nara, Japan
- Daizo Hamada
  - Division of Structural Biology, Department of Biochemistry and Molecular Biology, Graduate School of Medicine, Kobe University, Chuo-ku, Kobe, Japan
60
Mizianty MJ, Peng Z, Kurgan L. MFDp2: Accurate predictor of disorder in proteins by fusion of disorder probabilities, content and profiles. Intrinsically Disordered Proteins 2013; 1:e24428. [PMID: 28516009 PMCID: PMC5424793 DOI: 10.4161/idp.24428] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.6] [Received: 03/18/2013] [Accepted: 03/23/2013] [Indexed: 11/28/2022]
Abstract
Intrinsically disordered proteins (IDPs) are either entirely disordered or contain disordered regions in their native state. IDPs were found to be abundant in complex organisms and implicated in numerous cellular processes. Experimental annotation of disorder lags behind the rapidly growing sizes of the protein databases, and thus computational methods are used to close this gap and to investigate the disorder. MFDp2 is a novel content-rich and user-friendly web server for sequence-based prediction of protein disorder that builds upon our residue-level disorder predictor MFDp and chain-level disorder content predictor DisCon. It applies novel post-processing filters and uses sequence alignment to improve predictive quality. Using a new benchmark data set, which has reduced sequence identity to corresponding training data sets, MFDp2 is shown to provide competitive predictive quality when compared with MFDp and a comprehensive set of 13 other state-of-the-art predictors, including publicly available versions of the top predictors from CASP9. Our server obtains the highest Matthews Correlation Coefficient (MCC) and the second best Area Under the receiver operating characteristic Curve (AUC). In addition to the disorder predictions, our server also outputs well-described sequence-derived information that allows profiling the predicted disorder. We conveniently visualize sequence conservation, predicted secondary structure, relative solvent accessibility and alignments to chains with annotated disorder. We allow predictions for multiple proteins at the same time and each prediction can be downloaded as a text-based (parsable) file. The web server, which includes help pages and a tutorial, is freely available at biomine.ece.ualberta.ca/MFDp2/.
Affiliation(s)
- Marcin J Mizianty
  - Department of Electrical and Computer Engineering; University of Alberta; Edmonton, AB Canada
- Zhenling Peng
  - Department of Electrical and Computer Engineering; University of Alberta; Edmonton, AB Canada
- Lukasz Kurgan
  - Department of Electrical and Computer Engineering; University of Alberta; Edmonton, AB Canada
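For readers unfamiliar with the two metrics named in the MFDp2 abstract (entry 60), the following is a minimal sketch of how MCC and AUC can be computed for per-residue disorder labels. It is not taken from the MFDp2 paper or server; the function names `matthews_corrcoef` and `roc_auc` and the toy data are illustrative assumptions.

```python
from math import sqrt


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (1 = disordered, 0 = ordered)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def roc_auc(y_true, scores):
    """AUC as the probability that a random disordered residue
    scores higher than a random ordered one (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one residue of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical four-residue example, not real benchmark data.
    labels = [1, 1, 0, 0]
    binary_calls = [1, 0, 0, 0]
    probabilities = [0.9, 0.4, 0.5, 0.1]
    print(matthews_corrcoef(labels, binary_calls))
    print(roc_auc(labels, probabilities))
```

MCC is computed from binary calls, while AUC uses the underlying scores, which is one reason a predictor can rank first on one metric and second on the other.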
61
Fan X, Kurgan L. Accurate prediction of disorder in protein chains with a comprehensive and empirically designed consensus. J Biomol Struct Dyn 2013; 32:448-64. [DOI: 10.1080/07391102.2013.775969] [Citation(s) in RCA: 136] [Impact Index Per Article: 11.3] [Indexed: 10/27/2022]
62
Crystal Structure of PAV1-137: A Protein from the Virus PAV1 That Infects Pyrococcus abyssi. Archaea 2013; 2013:568053. [PMID: 23533329 PMCID: PMC3603647 DOI: 10.1155/2013/568053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2012] [Accepted: 02/12/2013] [Indexed: 11/18/2022]
Abstract
Pyrococcus abyssi virus 1 (PAV1) was the first virus particle infecting a hyperthermophilic Euryarchaeota (Pyrococcus abyssi strain GE23) that has been isolated and characterized. It is lemon shaped and is decorated with a short fibered tail. PAV1 morphologically resembles the fusiform members of the family Fuselloviridae or the genus Salterprovirus. The 18 kb dsDNA genome of PAV1 contains 25 predicted genes, most of them of unknown function. To help assign functions to these proteins, we have initiated structural studies of the PAV1 proteome. We determined the crystal structure of a putative protein of 137 residues (PAV1-137) at a resolution of 2.2 Å. The protein forms dimers both in solution and in the crystal. The fold of PAV1-137 is a four-α-helical bundle analogous to those found in some eukaryotic adhesion proteins such as focal adhesion kinase, suggesting that PAV1-137 is involved in protein-protein interactions.
63
Abstract
The introduction of the term ‘Tubulin Polymerization Promoting Protein (TPPP)-like proteins’ is suggested. They constitute a eukaryotic protein superfamily, characterized by the presence of the p25alpha domain (Pfam05517, IPR008907), and named after the first identified member, TPPP/p25, exhibiting microtubule stabilizing function. TPPP-like proteins can be grouped on the basis of two characteristics: the length of their p25alpha domain, which can be long, short, truncated or partial, and the presence or absence of additional domain(s). TPPPs, in the strict sense, contain no other domains but one long or short p25alpha one (long- and short-type TPPPs, respectively). Proteins possessing a truncated p25alpha domain are first described in this paper. They evolved from the long-type TPPPs and can be considered as arthropod-specific paralogs of long-type TPPPs. Phylogenetic analysis shows that the two groups (long-type and truncated TPPPs) split in the common ancestor of arthropods. Incomplete p25alpha domains can be found in multidomain TPPP-like proteins as well. The various subfamilies occur with a characteristic phyletic distribution: e.g., animal genomes/proteomes contain almost without exception long-type TPPPs; the multidomain apicortins occur almost exclusively in apicomplexan parasites. There are no data about the physiological function of these proteins except two human long-type TPPP paralogs which are involved in developmental processes of the brain and the musculoskeletal system, respectively. I predict that the superfamily members containing a long or partial p25alpha domain are often intrinsically disordered proteins, while those with short or truncated domain(s) are structurally ordered. Interestingly, members of this superfamily connected or possibly connected to diseases are intrinsically disordered proteins.
Affiliation(s)
- Ferenc Orosz
  - Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary.
64
Kasprzak JM, Czerwoniec A, Bujnicki JM. Molecular evolution of dihydrouridine synthases. BMC Bioinformatics 2012; 13:153. [PMID: 22741570 PMCID: PMC3674756 DOI: 10.1186/1471-2105-13-153] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 06/17/2011] [Accepted: 05/24/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Dihydrouridine (D) is a modified base found in conserved positions in the D-loop of tRNA in Bacteria, Eukaryota, and some Archaea. Despite the abundant occurrence of D, little is known about its biochemical roles in mediating tRNA function. It is assumed that D may destabilize the structure of tRNA and thus enhance its conformational flexibility. D is generated post-transcriptionally by the reduction of the 5,6-double bond of a uridine residue in RNA transcripts. The reaction is carried out by dihydrouridine synthases (DUS). DUS constitute a conserved family of enzymes encoded by the orthologous gene family COG0042. In protein sequence databases, members of COG0042 are typically annotated as "predicted TIM-barrel enzymes, possibly dehydrogenases, nifR3 family". RESULTS: To elucidate sequence-structure-function relationships in the DUS family, a comprehensive bioinformatic analysis was carried out. We performed extensive database searches to identify all members of the currently known DUS family, followed by clustering analysis to subdivide it into subfamilies of closely related sequences. We analyzed phylogenetic distributions of all members of the DUS family and inferred the evolutionary tree, which suggested a scenario for the evolutionary origin of dihydrouridine-forming enzymes. For a human representative of the DUS family, the hDus2 protein suggested as a potential drug target in cancer, we generated a homology model. While this article was under review, a crystal structure of a DUS representative has been published, giving us an opportunity to validate the model. CONCLUSIONS: We compared sequences and phylogenetic distributions of all members of the DUS family and inferred the phylogenetic tree, which provides a framework to study the functional differences among these proteins and suggests a scenario for the evolutionary origin of dihydrouridine formation. Our evolutionary and structural classification of the DUS family provides a background to study functional differences among these proteins that will guide experimental analyses.
Affiliation(s)
- Joanna M Kasprzak
  - Institute of Molecular Biology and Biotechnology, Adam Mickiewicz University, Umultowska 89, PL-61-614 Poznan, Poland
65
MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinformatics 2012; 13:111. [PMID: 22624656 PMCID: PMC3465245 DOI: 10.1186/1471-2105-13-111] [Citation(s) in RCA: 265] [Impact Index Per Article: 20.4] [Received: 12/29/2011] [Accepted: 04/26/2012] [Indexed: 11/28/2022]
Abstract
Background: Intrinsically unstructured proteins (IUPs) lack a well-defined three-dimensional structure. Some of them may assume a locally stable structure under specific conditions, e.g. upon interaction with another molecule, while others function in a permanently unstructured state. The discovery of IUPs challenged the traditional protein structure paradigm, which stated that a specific well-defined structure defines the function of the protein. As of December 2011, approximately 60 methods for computational prediction of protein disorder from sequence have been made publicly available. They are based on different approaches, such as utilizing evolutionary information, energy functions, and various statistical and machine learning methods. Results: Given the diversity of existing intrinsic disorder prediction methods, we decided to test whether it is possible to combine them into a more accurate meta-prediction method. We developed a method based on 13 arbitrarily chosen disorder predictors, in which the final consensus was weighted by the accuracy of the methods. We have also developed a disorder predictor GSmetaDisorder3D that used no third-party disorder predictors, but alignments to known protein structures, reported by the protein fold-recognition methods, to infer the potentially structured and unstructured regions. Following the success of our disorder predictors in the CASP8 benchmark, we combined them into a meta-meta predictor called GSmetaDisorderMD, which was the top scoring method in the subsequent CASP9 benchmark. Conclusions: A series of disorder predictors described in this article is available as a MetaDisorder web server at http://iimcb.genesilico.pl/metadisorder/. Results are presented both in an easily interpretable, interactive mode and in a simple text format suitable for machine processing.
|
66
|
Cheng J, Li J, Wang Z, Eickholt J, Deng X. The MULTICOM toolbox for protein structure prediction. BMC Bioinformatics 2012; 13:65. [PMID: 22545707 PMCID: PMC3495398 DOI: 10.1186/1471-2105-13-65] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 01/20/2012] [Accepted: 04/30/2012] [Indexed: 12/31/2022] Open
Abstract
Background As genome sequencing is becoming routine in biomedical research, the total number of protein sequences is increasing exponentially, recently reaching over 108 million. However, only a tiny portion of these proteins (i.e. ~75,000 or < 0.07%) have solved tertiary structures determined by experimental techniques. The gap between protein sequence and structure continues to enlarge rapidly as the throughput of genome sequencing techniques is much higher than that of protein structure determination techniques. Computational software tools for predicting protein structure and structural features from protein sequences are crucial to make use of this vast repository of protein resources. Results To meet the need, we have developed a comprehensive MULTICOM toolbox consisting of a set of protein structure and structural feature prediction tools. These tools include secondary structure prediction, solvent accessibility prediction, disorder region prediction, domain boundary prediction, contact map prediction, disulfide bond prediction, beta-sheet topology prediction, fold recognition, multiple template combination and alignment, template-based tertiary structure modeling, protein model quality assessment, and mutation stability prediction. Conclusions These tools have been rigorously tested by many users in the last several years and/or during the last three rounds of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7-9) from 2006 to 2010, achieving state-of-the-art or near performance. In order to facilitate bioinformatics research and technological development in the field, we have made the MULTICOM toolbox freely available as web services and/or software packages for academic use and scientific research. It is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.
Affiliation(s)
- Jianlin Cheng
- Department of Computer Science, University of Missouri-Columbia, Columbia, MO 65211, USA.
|
67
|
Zhang T, Faraggi E, Xue B, Dunker AK, Uversky VN, Zhou Y. SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. J Biomol Struct Dyn 2012; 29:799-813. [PMID: 22208280 PMCID: PMC3297974 DOI: 10.1080/073911012010525022] [Citation(s) in RCA: 138] [Impact Index Per Article: 10.6] [Indexed: 10/28/2022]
Abstract
Short and long disordered regions of proteins have different preferences for different amino acid residues. Different methods often have to be trained to predict them separately. In this study, we developed a single neural-network-based technique called SPINE-D that makes a three-state prediction first (ordered residues and disordered residues in short and long disordered regions) and reduces it into a two-state prediction afterwards. SPINE-D was tested on various sets composed of different combinations of Disprot annotated proteins and proteins directly from the PDB annotated for disorder by missing coordinates in X-ray determined structures. While disorder annotations are different according to Disprot and X-ray approaches, SPINE-D's prediction accuracy and ability to predict disorder are relatively independent of how the method was trained and what type of annotation was employed but strongly depend on the balance in the relative populations of ordered and disordered residues in short and long disordered regions in the test set. With greater than 85% overall specificity for detecting residues in both short and long disordered regions, the residues in long disordered regions are easier to predict at 81% sensitivity in a balanced test dataset with 56.5% ordered residues but more challenging (at 65% sensitivity) in a test dataset with 90% ordered residues. Compared to eleven other methods, SPINE-D yields the highest area under the curve (AUC), the highest Matthews correlation coefficient for residue-based prediction, and the lowest mean square error in predicting disorder contents of proteins for an independent test set with 329 proteins. In particular, SPINE-D is comparable to a meta predictor in predicting disordered residues in long disordered regions and superior in short disordered regions. SPINE-D participated in CASP9 blind prediction and is one of the top servers according to the official ranking. In addition, SPINE-D was examined for prediction of functional molecular recognition motifs in several case studies.
Affiliation(s)
- Tuo Zhang
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202, USA
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Eshel Faraggi
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202, USA
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Bin Xue
- Department of Molecular Medicine, University of South Florida, Tampa, FL 33612, USA
- A. Keith Dunker
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Vladimir N. Uversky
- Department of Molecular Medicine, University of South Florida, Tampa, FL 33612, USA
- Institute for Biological Instrumentation, Russian Academy of Sciences, 142290 Pushchino, Moscow Region, Russia
- Yaoqi Zhou
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202, USA
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
|
68
|
Zhou Y, Liu J, Han L, Li ZG, Zhang Z. Comprehensive analysis of tandem amino acid repeats from ten angiosperm genomes. BMC Genomics 2011; 12:632. [PMID: 22195734 PMCID: PMC3283746 DOI: 10.1186/1471-2164-12-632] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 08/02/2011] [Accepted: 12/23/2011] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND The presence of tandem amino acid repeats (AARs) is one of the signatures of eukaryotic proteins. AARs were thought to be frequently involved in bio-molecular interactions. Comprehensive studies that primarily focused on metazoan AARs have suggested that AARs are evolving rapidly and are highly variable among species. However, there is still controversy over causal factors of this inter-species variation. In this work, we attempted to investigate this topic mainly by comparing AARs in orthologous proteins from ten angiosperm genomes. RESULTS Angiosperm AAR content is positively correlated with the GC content of the protein coding sequence. However, based on observations from fungal AARs and insect AARs, we argue that the applicability of this kind of correlation is limited by AAR residue composition and species' life history traits. Angiosperm AARs also tend to be fast evolving and structurally disordered, supporting the results of comprehensive analyses of metazoans. The functions of conserved long AARs are summarized. Finally, we propose that the rapid mRNA decay rate, alternative splicing and tissue specificity are regulatory processes that are associated with angiosperm proteins harboring AARs. CONCLUSIONS Our investigation suggests that GC content is a predictor of AAR content in the protein coding sequence under certain conditions. Although angiosperm AARs lack conservation and 3D structure, a fraction of the proteins that contain AARs may be functionally important and are under extensive regulation in plant cells.
Affiliation(s)
- Yuan Zhou
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Jing Liu
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Lei Han
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Zhi-Gang Li
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Ziding Zhang
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
|
69
|
Walsh I, Martin AJM, Di Domenico T, Tosatto SCE. ESpritz: accurate and fast prediction of protein disorder. Bioinformatics 2011; 28:503-9. [PMID: 22190692 DOI: 10.1093/bioinformatics/btr682] [Citation(s) in RCA: 405] [Impact Index Per Article: 28.9] [Indexed: 11/14/2022]
Abstract
MOTIVATION Intrinsically disordered regions are key for the function of numerous proteins, and the scant available experimental annotations suggest the existence of different disorder flavors. While efficient predictions are required to annotate entire genomes, most existing methods require sequence profiles for disorder prediction, making them cumbersome for high-throughput applications. RESULTS In this work, we present an ensemble of protein disorder predictors called ESpritz. These are based on bidirectional recursive neural networks and trained on three different flavors of disorder, including a novel NMR flexibility predictor. ESpritz can produce fast and accurate sequence-only predictions, annotating entire genomes in the order of hours on a single processor core. Alternatively, a slower but slightly more accurate ESpritz variant using sequence profiles can be used for applications requiring maximum performance. Two levels of prediction confidence allow either to maximize reasonable disorder detection or to limit expected false positives to 5%. ESpritz performs consistently well on the recent CASP9 data, reaching a S(w) measure of 54.82 and area under the receiver operator curve of 0.856. The fast predictor is four orders of magnitude faster and remains better than most publicly available CASP9 methods, making it ideal for genomic scale predictions. CONCLUSIONS ESpritz predicts three flavors of disorder at two distinct false positive rates, either with a fast or slower and slightly more accurate approach. Given its state-of-the-art performance, it can be especially useful for high-throughput applications. AVAILABILITY Both a web server for high-throughput analysis and a Linux executable version of ESpritz are available from: http://protein.bio.unipd.it/espritz/.
Affiliation(s)
- Ian Walsh
- Department of Biology, University of Padua, Viale G. Colombo 3, I-35131 Padova, Italy
|
70
|
Fogl C, Puckey L, Hinssen U, Zaleska M, El-Mezgueldi M, Croasdale R, Bowman A, Matsukawa A, Samani NJ, Savva R, Pfuhl M. A structural and functional dissection of the cardiac stress response factor MS1. Proteins 2011; 80:398-409. [PMID: 22081479 DOI: 10.1002/prot.23201] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 06/15/2011] [Revised: 09/02/2011] [Accepted: 09/07/2011] [Indexed: 11/11/2022]
Abstract
MS1 is a protein predominantly expressed in cardiac and skeletal muscle that is upregulated in response to stress and contributes to development of hypertrophy. In the aortic banding model of left ventricular hypertrophy, its cardiac expression was significantly upregulated within 1 h. Its function is postulated to depend on its F-actin binding ability, located to the C-terminal half of the protein, which promotes stabilization of F-actin in the cell thus releasing myocardin-related transcription factors to the nucleus where they stimulate transcription in cooperation with serum response factor. Initial attempts to purify the protein only resulted in heavily degraded samples that showed distinct bands on SDS gels, suggesting the presence of stable domains. Using a combination of combinatorial domain hunting and sequence analysis, a set of potential domains was identified. The C-terminal half of the protein actually contains two independent F-actin binding domains. The most C-terminal fragment (294-375), named actin binding domain 2 (ABD2), is independently folded while a proximal fragment called ABD1 (193-296) binds to F-actin with higher affinity than ABD2 (KD 2.21 ± 0.47 μM vs. 10.61 ± 0.7 μM), but is not structured by itself in solution. NMR interaction experiments show that it binds and folds in a cooperative manner to F-actin, justifying the label of domain. The architecture of the MS1 C-terminus suggests that ABD1 alone could completely fulfill the F-actin binding function opening up the intriguing possibility that ABD2, despite its high level of conservation, could have developed other functions.
Affiliation(s)
- Claudia Fogl
- Department of Biochemistry, University of Leicester, Leicester LE1 9HN, United Kingdom
|
71
|
Monastyrskyy B, Fidelis K, Moult J, Tramontano A, Kryshtafovych A. Evaluation of disorder predictions in CASP9. Proteins 2011; 79 Suppl 10:107-18. [PMID: 21928402 PMCID: PMC3212657 DOI: 10.1002/prot.23161] [Citation(s) in RCA: 105] [Impact Index Per Article: 7.5] [Received: 04/22/2011] [Revised: 07/11/2011] [Accepted: 07/15/2011] [Indexed: 11/10/2022]
Abstract
Lack of stable three-dimensional structure, or intrinsic disorder, is a common phenomenon in proteins. Naturally unstructured regions have been shown to be essential for the function of many proteins, and therefore identification of such regions is an important issue. CASP has been assessing the state of the art in predicting disordered regions from amino acid sequence since 2002. Here, we present the results of the evaluation of the disorder predictions submitted to CASP9. The assessment is based on the evaluation measures and procedures used in previous CASPs. The balanced accuracy and the Matthews correlation coefficient were chosen as basic measures for evaluating the correctness of binary classifications. The area under the receiver operating characteristic curve was the measure of choice for evaluating probability-based predictions of disorder. The CASP9 methods are shown to perform slightly better than the CASP7 methods but not better than the methods in CASP8. It was also shown that the capability of most CASP9 methods to predict disorder decreases with increasing minimum disorder segment length.
Affiliation(s)
- Bohdan Monastyrskyy
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis, CA 95616, USA
- Krzysztof Fidelis
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis, CA 95616, USA
- John Moult
- Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, 9600 Gudelsky Drive, Rockville, MD 20850, USA
- Anna Tramontano
- Department of Physics, Sapienza University of Rome, P.le Aldo Moro 5, 00185 Rome, Italy
- Andriy Kryshtafovych
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis, CA 95616, USA
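The two binary measures named in the CASP9 evaluation abstract above, balanced accuracy and the Matthews correlation coefficient, follow from the standard confusion-matrix definitions. The sketch below uses those textbook formulas; the function names and the convention that 1 means "disordered" are assumptions, and the CASP assessors' exact procedure (for example, how targets are pooled) is not reproduced here.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(
    truth: Sequence[int], predicted: Sequence[int]
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating 1 as disordered and 0 as ordered."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have equal length")
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if t and p:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of sensitivity and specificity."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the reference")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def matthews_cc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical eight-residue target: three disordered residues in the reference.
counts = confusion_counts([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```

Both measures are preferred over plain accuracy for disorder prediction because ordered residues usually far outnumber disordered ones, so a predictor that calls everything ordered would otherwise score well.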
|
72
|
Ikeda T, Kuroda A. Why does the silica-binding protein “Si-tag” bind strongly to silica surfaces? Implications of conformational adaptation of the intrinsically disordered polypeptide to solid surfaces. Colloids Surf B Biointerfaces 2011; 86:359-63. [DOI: 10.1016/j.colsurfb.2011.04.020] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.4] [Received: 11/11/2010] [Revised: 03/15/2011] [Accepted: 04/12/2011] [Indexed: 10/18/2022]
|
73
|
Deng X, Eickholt J, Cheng J. A comprehensive overview of computational protein disorder prediction methods. Mol Biosyst 2011; 8:114-21. [PMID: 21874190 DOI: 10.1039/c1mb05207a] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.6] [Indexed: 11/21/2022]
Abstract
Over the past decade there has been a growing acknowledgement that a large proportion of proteins within most proteomes contain disordered regions. Disordered regions are segments of the protein chain which do not adopt a stable structure. Recognition of disordered regions in a protein is of great importance for protein structure prediction, protein structure determination and function annotation as these regions have a close relationship with protein expression and functionality. As a result, a great many protein disorder prediction methods have been developed so far. Here, we present an overview of current protein disorder prediction methods including an analysis of their advantages and shortcomings. In order to help users to select alternative tools under different circumstances, we also evaluate 23 disorder predictors on the benchmark data of the most recent round of the Critical Assessment of protein Structure Prediction (CASP) and assess their accuracy using several complementary measures.
Affiliation(s)
- Xin Deng
- Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
|
74
|
Orosz F. Apicomplexan apicortins possess a long disordered N-terminal extension. Infect Genet Evol 2011; 11:1037-44. [DOI: 10.1016/j.meegid.2011.03.023] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 11/03/2010] [Revised: 03/24/2011] [Accepted: 03/25/2011] [Indexed: 01/01/2023]
|
75
|
Fu SC, Imai K, Horton P. Prediction of leucine-rich nuclear export signal containing proteins with NESsential. Nucleic Acids Res 2011; 39:e111. [PMID: 21705415 PMCID: PMC3167595 DOI: 10.1093/nar/gkr493] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.6] [Indexed: 12/26/2022] Open
Abstract
The classical nuclear export signal (NES), also known as the leucine-rich NES, is a protein localization signal often involved in important processes such as signal transduction and cell cycle regulation. Although 15 years have passed since its discovery, limited structural information and high sequence diversity have hampered understanding of the NES. Several consensus sequences have been proposed to describe it, but they suffer from poor predictive power. On the other hand, the NetNES server provides the only computational method currently available. Although these two methods have been widely used to attempt to find the correct NES position within potential NES-containing proteins, their performance has not yet been evaluated on the basic task of identifying NES-containing proteins. We propose a new predictor, NESsential, which uses sequence-derived meta-features, such as predicted disorder and solvent accessibility, in addition to primary sequence. We demonstrate that it can identify promising NES-containing candidate proteins (albeit at low coverage), but other methods cannot. We also quantitatively demonstrate that predicted disorder is a useful feature for prediction and investigate the different features of (predicted) ordered versus disordered NESs. Finally, we list 70 recently discovered NES-containing proteins, doubling the number available to the community.
Affiliation(s)
- Szu-Chin Fu
- Department of Computational Biology, Graduate School of Frontier Science, University of Tokyo, Kashiwa 277-8561, Japan
|
76
|
Fukuchi S, Hosoda K, Homma K, Gojobori T, Nishikawa K. Binary classification of protein molecules into intrinsically disordered and ordered segments. BMC Struct Biol 2011; 11:29. [PMID: 21693062 PMCID: PMC3199747 DOI: 10.1186/1472-6807-11-29] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.3] [Received: 02/16/2011] [Accepted: 06/22/2011] [Indexed: 11/17/2022]
Abstract
Background Although structural domains in proteins (SDs) are important, half of the regions in the human proteome are currently left with no SD assignments. These unassigned regions consist not only of novel SDs, but also of intrinsically disordered (ID) regions since proteins, especially those in eukaryotes, generally contain a significant fraction of ID regions. As ID regions can be inferred from amino acid sequences, a method that combines SD and ID region assignments can determine the fractions of SDs and ID regions in any proteome. Results In contrast to other available ID prediction programs that merely identify likely ID regions, the DICHOT system we previously developed classifies the entire protein sequence into SDs and ID regions. Application of DICHOT to the human proteome revealed that residue-wise ID regions constitute 35%, SDs with similarity to PDB structures comprise 52%, while SDs with no similarity to PDB structures account for the remaining 13%. The last group consists of novel structural domains, termed cryptic domains, which serve as good targets of structural genomics. The DICHOT method applied to the proteomes of other model organisms indicated that eukaryotes generally have high ID contents, while prokaryotes do not. In human proteins, ID contents differ among subcellular localizations: nuclear proteins had the highest residue-wise ID fraction (47%), while mitochondrial proteins exhibited the lowest (13%). Phosphorylation and O-linked glycosylation sites were found to be located preferentially in ID regions. As O-linked glycans are attached to residues in the extracellular regions of proteins, the modification is likely to protect the ID regions from proteolytic cleavage in the extracellular environment. Alternative splicing events tend to occur more frequently in ID regions. We interpret this as evidence that natural selection is operating at the protein level in alternative splicing. 
Conclusions We classified entire regions of proteins into the two categories, SDs and ID regions and thereby obtained various kinds of complete genome-wide statistics. The results of the present study are important basic information for understanding protein structural architectures and have been made publicly available at http://spock.genes.nig.ac.jp/~genome/DICHOT.
Affiliation(s)
- Satoshi Fukuchi
- Center for Information Biology & DNA Data Bank of Japan, National Institute of Genetics, Yata 1111, Mishima, Shizuoka 411-8540, Japan.
|
77
|
Mizianty MJ, Zhang T, Xue B, Zhou Y, Dunker AK, Uversky VN, Kurgan L. In-silico prediction of disorder content using hybrid sequence representation. BMC Bioinformatics 2011; 12:245. [PMID: 21682902 PMCID: PMC3212983 DOI: 10.1186/1471-2105-12-245] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Received: 10/22/2010] [Accepted: 06/17/2011] [Indexed: 11/25/2022] Open
Abstract
Background Intrinsically disordered proteins play important roles in various cellular activities and their prevalence was implicated in a number of human diseases. The knowledge of the content of the intrinsic disorder in proteins is useful for a variety of studies including estimation of the abundance of disorder in protein families, classes, and complete proteomes, and for the analysis of disorder-related protein functions. The above investigations currently utilize the disorder content derived from the per-residue disorder predictions. We show that these predictions may over- or under-predict the overall amount of disorder, which motivates development of novel tools for direct and accurate sequence-based prediction of the disorder content. Results We hypothesize that sequence-level aggregation of input information may provide more accurate content prediction when compared with the content extracted from the local window-based residue-level disorder predictors. We propose a novel predictor, DisCon, that takes advantage of a small set of 29 custom-designed descriptors that aggregate and hybridize information concerning sequence, evolutionary profiles, and predicted secondary structure, solvent accessibility, flexibility, and annotation of globular domains. Using these descriptors and a ridge regression model, DisCon predicts the content with a low mean squared error (0.05) and a high Pearson correlation (0.68). This is a statistically significant improvement over the content computed from outputs of ten modern disorder predictors on a test dataset with proteins that share low sequence identity with the training sequences. The proposed predictive model is analyzed to discuss factors related to the prediction of the disorder content. Conclusions DisCon is a high-quality alternative for high-throughput annotation of the disorder content. We also empirically demonstrate that DisCon's predictions can be used to improve binary annotations of the disordered residues from the real-value disorder propensities generated by current residue-level disorder predictors. The web server that implements DisCon is available at http://biomine.ece.ualberta.ca/DisCon/.
Affiliation(s)
- Marcin J Mizianty
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta T6G 2V4, Canada
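The baseline that the DisCon abstract above compares against, disorder content derived from per-residue predictions, can be sketched as follows. This is only the indirect baseline, not DisCon itself (which uses 29 descriptors and ridge regression); the function names and the 0.5 propensity cutoff are assumptions for illustration.

```python
from typing import Sequence


def content_from_propensities(
    propensities: Sequence[float], threshold: float = 0.5
) -> float:
    """Fraction of residues whose disorder propensity meets the cutoff.

    The 0.5 threshold is an assumed default; real predictors publish their own.
    """
    if not propensities:
        raise ValueError("sequence must contain at least one residue")
    disordered = sum(1 for p in propensities if p >= threshold)
    return disordered / len(propensities)


def mean_squared_error(
    predicted: Sequence[float], observed: Sequence[float]
) -> float:
    """Mean squared error between predicted and observed per-protein content."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("need matching, non-empty content lists")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical four-residue protein: two residues exceed the cutoff.
content = content_from_propensities([0.9, 0.8, 0.4, 0.1])
```

Because the content estimate depends entirely on where the cutoff falls, small shifts in per-residue propensities near the threshold can move the estimate substantially, which is one way such a baseline could over- or under-predict.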
|
78
|
Abstract
MOTIVATION Predictions, and experiments to a lesser extent, following the decoding of the human genome showed that a significant fraction of gene products do not have well-defined 3D structures. While the presence of structured domains traditionally suggested function, it was not clear what the absence of structure implied. These and many other findings initiated the extensive theoretical and experimental research into these types of proteins, commonly known as intrinsically disordered proteins (IDPs). Crucial to understanding IDPs is the evaluation of structural predictors based on different principles and trained on various datasets, which is currently the subject of active research. The view is emerging that structural disorder can be considered as a separate structural category and not simply as absence of secondary and/or tertiary structure. IDPs perform essential functions and their improper functioning is responsible for human diseases such as neurodegenerative disorders.
Affiliation(s)
- Ferenc Orosz
- Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Karolina út 29, Budapest, H-1113 Hungary.
|
79
|
Hirose S, Kawamura Y, Yokota K, Kuroita T, Natsume T, Komiya K, Tsutsumi T, Suwa Y, Isogai T, Goshima N, Noguchi T. Statistical analysis of features associated with protein expression/solubility in an in vivo Escherichia coli expression system and a wheat germ cell-free expression system. J Biochem 2011; 150:73-81. [DOI: 10.1093/jb/mvr042] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 01/10/2023]
|
80
|
Dr. PIAS: an integrative system for assessing the druggability of protein-protein interactions. BMC Bioinformatics 2011; 12:50. [PMID: 21303559 PMCID: PMC3228542 DOI: 10.1186/1471-2105-12-50] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Received: 10/08/2010] [Accepted: 02/09/2011] [Indexed: 01/09/2023] Open
Abstract
Background The amount of data on protein-protein interactions (PPIs) available in public databases and in the literature has rapidly expanded in recent years. PPI data can provide useful information for researchers in pharmacology and medicine as well as those in interactome studies. There is an urgent need for a novel methodology or software allowing the efficient utilization of PPI data in pharmacology and medicine. Results To address this need, we have developed the 'Druggable Protein-protein Interaction Assessment System' (Dr. PIAS). Dr. PIAS has a meta-database that stores various types of information (tertiary structures, drugs/chemicals, and biological functions associated with PPIs) retrieved from public sources. By integrating this information, Dr. PIAS assesses whether a PPI is druggable as a target for small chemical ligands by using a supervised machine-learning method, the support vector machine (SVM). Dr. PIAS holds not only known druggable PPIs but also all PPIs of human, mouse, rat, and human immunodeficiency virus (HIV) proteins identified to date. Conclusions The design concept of Dr. PIAS is distinct from other published PPI databases in that it focuses on selecting the PPIs most likely to make good drug targets, rather than merely collecting PPI data.
|
81
|
McDermott JE, Corrigan A, Peterson E, Oehmen C, Niemann G, Cambronne ED, Sharp D, Adkins JN, Samudrala R, Heffron F. Computational prediction of type III and IV secreted effectors in gram-negative bacteria. Infect Immun 2011; 79:23-32. [PMID: 20974833 PMCID: PMC3019878 DOI: 10.1128/iai.00537-10] [Citation(s) in RCA: 86] [Impact Index Per Article: 6.1] [Indexed: 12/28/2022] Open
Abstract
In this review, we provide an overview of the methods employed in four recent studies that described novel methods for computational prediction of secreted effectors from type III and IV secretion systems in Gram-negative bacteria. We present the results of these studies in terms of performance at accurately predicting secreted effectors and similarities found between secretion signals that may reflect biologically relevant features for recognition. We discuss the Web-based tools for secreted effector prediction described in these studies and announce the availability of our tool, the SIEVE server (http://www.sysbep.org/sieve). Finally, we assess the accuracies of the three type III effector prediction methods on a small set of proteins not known prior to the development of these tools that we recently discovered and validated using both experimental and computational approaches. Our comparison shows that all methods use similar approaches and, in general, arrive at similar conclusions. We discuss the possibility of an order-dependent motif in the secretion signal, which was a point of disagreement in the studies. Our results show that there may be classes of effectors in which the signal has a loosely defined motif and others in which secretion is dependent only on compositional biases. Computational prediction of secreted effectors from protein sequences represents an important step toward better understanding the interaction between pathogens and hosts.
Affiliation(s)
- Jason E McDermott
- Computational Biology and Bioinformatics Group, Pacific Northwest National Laboratory, MSIN: J4-33, 902 Battelle Boulevard, P.O. Box 999, Richland, WA 99352, USA.
|
82
|
Ebina T, Toh H, Kuroda Y. DROP: an SVM domain linker predictor trained with optimal features selected by random forest. Bioinformatics 2010; 27:487-94. [PMID: 21169376 DOI: 10.1093/bioinformatics/btq700] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Biologically important proteins are often large, multidomain proteins, which are difficult to characterize by high-throughput experimental methods. Efficient domain/boundary predictions are thus increasingly required in diverse areas of proteomics research for computationally dissecting proteins into readily analyzable domains. RESULTS: We constructed a support vector machine (SVM)-based domain linker predictor, DROP (Domain linker pRediction using OPtimal features), which was trained with 25 optimal features. The optimal combination of features was identified from a set of 3000 features using a random forest algorithm complemented with a stepwise feature selection. DROP demonstrated a prediction sensitivity and precision of 41.3 and 49.4%, respectively. These values were over 19.9% higher than those of control SVM predictors trained with non-optimized features, strongly suggesting the efficiency of our feature selection method. In addition, the mean NDO-Score of DROP for predicting novel domains in seven CASP8 FM multidomain proteins was 0.760, which was higher than any of the 12 published CASP8 DP servers. Overall, these results indicate that the SVM prediction of domain linkers can be improved by identifying optimal features that best distinguish linker from non-linker regions.
Affiliation(s)
- Teppei Ebina
- Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, Koganei-shi, Tokyo 184-8588, Japan
|
83
|
Mizianty MJ, Stach W, Chen K, Kedarisetti KD, Disfani FM, Kurgan L. Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics 2010; 26:i489-96. [PMID: 20823312 PMCID: PMC2935446 DOI: 10.1093/bioinformatics/btq373] [Citation(s) in RCA: 132] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]
Abstract
Motivation: Intrinsically disordered proteins play a crucial role in numerous regulatory processes. Their abundance and ubiquity combined with a relatively low quantity of their annotations motivate research toward the development of computational models that predict disordered regions from protein sequences. Although the prediction quality of these methods continues to rise, novel and improved predictors are urgently needed. Results: We propose a novel method, named MFDp (Multilayered Fusion-based Disorder predictor), that aims to improve over the current disorder predictors. MFDp is an ensemble of 3 Support Vector Machines specialized for the prediction of short, long and generic disordered regions. It combines three complementary disorder predictors, sequence, sequence profiles, predicted secondary structure, solvent accessibility, backbone dihedral torsion angles, residue flexibility and B-factors. Our method utilizes a custom-designed set of features that are based on raw predictions and aggregated raw values and recognizes various types of disorder. The MFDp is compared at the residue level on two datasets against eight recent disorder predictors and top-performing methods from the most recent CASP8 experiment. In spite of using training chains with ≤25% similarity to the test sequences, our method consistently and significantly outperforms the other methods based on the MCC index. The MFDp outperforms modern disorder predictors for the binary disorder assignment and provides competitive real-valued predictions. The MFDp's outputs are also shown to outperform the other methods in the identification of proteins with long disordered regions. Availability: http://biomine.ece.ualberta.ca/MFDp.html Supplementary information: Supplementary data are available at Bioinformatics online. Contact: lkurgan@ece.ualberta.ca
Affiliation(s)
- Marcin J Mizianty
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada
|
84
|
Hirose S, Kawamura Y, Mori M, Yokota K, Noguchi T, Goshima N. Development and evaluation of data-driven designed tags (DDTs) for controlling protein solubility. N Biotechnol 2010; 28:225-31. [PMID: 20837175 DOI: 10.1016/j.nbt.2010.08.012] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2010] [Accepted: 08/30/2010] [Indexed: 11/29/2022]
Abstract
Production of proteins is an important issue in protein science and pharmaceutical studies. Numerous protein expression systems using living cells and cell-free methods have been developed to date. In these systems, a promising strategy for improving the success rate of obtaining soluble proteins is the attachment of various tags into target proteins based on empirical rules. This paper presents a method for the production of data-driven designed tags (DDTs) based on highly frequent sequence property patterns in an experimentally assessed protein solubility dataset in a wheat germ cell-free system. We constructed seven proteins combined with 12 kinds of DDTs (six for enhancing solubility and six for insolubility) at the N-terminal region as tags. Then we investigated their behavior using SDS-PAGE. Results show that three and four proteins respectively showed a trend toward solubilization and insolubilization, which indicates the possibility that the theoretically designed sequence can control protein solubility.
Affiliation(s)
- Shuichi Hirose
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.
|
85
|
Shah AR, Agarwal K, Baker ES, Singhal M, Mayampurath AM, Ibrahim YM, Kangas LJ, Monroe ME, Zhao R, Belov ME, Anderson GA, Smith RD. Machine learning based prediction for peptide drift times in ion mobility spectrometry. Bioinformatics 2010; 26:1601-7. [PMID: 20495001 PMCID: PMC2913656 DOI: 10.1093/bioinformatics/btq245] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2010] [Revised: 04/18/2010] [Accepted: 05/02/2010] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Ion mobility spectrometry (IMS) has gained significant traction over the past few years for rapid, high-resolution separations of analytes based upon gas-phase ion structure, with significant potential impacts in the field of proteomic analysis. IMS coupled with mass spectrometry (MS) affords multiple improvements over traditional proteomics techniques, such as in the elucidation of secondary structure information, identification of post-translational modifications, as well as higher identification rates with reduced experiment times. The high throughput nature of this technique benefits from accurate calculation of cross sections, mobilities and associated drift times of peptides, thereby enhancing downstream data analysis. Here, we present a model that uses physicochemical properties of peptides to accurately predict a peptide's drift time directly from its amino acid sequence. This model is used in conjunction with two mathematical techniques, a partial least squares regression and a support vector regression setting. RESULTS: When tested on an experimentally created high confidence database of 8675 peptide sequences with measured drift times, both techniques statistically significantly outperform the intrinsic size parameters-based calculations, the currently held practice in the field, on all charge states (+2, +3 and +4). AVAILABILITY: The software executable, imPredict, is available for download from http://omics.pnl.gov/software/imPredict.php. CONTACT: rds@pnl.gov. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Anuj R Shah
- Fundamental and Computational Sciences Directorate, Pacific Northwest National Laboratory, 999 Battelle Boulevard, Richland, WA 99352, USA
|
86
|
Uversky VN, Dunker AK. Understanding protein non-folding. BIOCHIMICA ET BIOPHYSICA ACTA 2010; 1804:1231-64. [PMID: 20117254 PMCID: PMC2882790 DOI: 10.1016/j.bbapap.2010.01.017] [Citation(s) in RCA: 935] [Impact Index Per Article: 62.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/21/2009] [Revised: 01/09/2010] [Accepted: 01/21/2010] [Indexed: 02/07/2023]
Abstract
This review describes the family of intrinsically disordered proteins, members of which fail to form rigid 3-D structures under physiological conditions, either along their entire lengths or only in localized regions. Instead, these intriguing proteins/regions exist as dynamic ensembles within which atom positions and backbone Ramachandran angles exhibit extreme temporal fluctuations without specific equilibrium values. Many of these intrinsically disordered proteins are known to carry out important biological functions which, in fact, depend on the absence of a specific 3-D structure. The existence of such proteins does not fit the prevailing structure-function paradigm, which states that a unique 3-D structure is a prerequisite to function. Thus, the protein structure-function paradigm has to be expanded to include intrinsically disordered proteins and alternative relationships among protein sequence, structure, and function. This shift in the paradigm represents a major breakthrough for biochemistry, biophysics and molecular biology, as it opens new levels of understanding with regard to the complex life of proteins. This review will try to answer the following questions: how were intrinsically disordered proteins discovered? Why don't these proteins fold? What is so special about intrinsic disorder? What are the functional advantages of disordered proteins/regions? What is the functional repertoire of these proteins? What are the relationships between intrinsically disordered proteins and human diseases?
Affiliation(s)
- Vladimir N Uversky
- Institute for Intrinsically Disordered Protein Research, Center for Computational Biology and Bioinformatics, Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202, USA.
|
87
|
Bulashevska S, Bulashevska A, Eils R. Bayesian statistical modelling of human protein interaction network incorporating protein disorder information. BMC Bioinformatics 2010; 11:46. [PMID: 20100321 PMCID: PMC2831004 DOI: 10.1186/1471-2105-11-46] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2008] [Accepted: 01/25/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: We present a statistical method for the analysis of biological networks based on the exponential random graph model, namely the p2-model, as opposed to previous descriptive approaches. The model is capable of capturing generic and structural properties of a network as emergent from local interdependencies and uses a limited number of parameters. Here, we consider one global parameter capturing the density of edges in the network, and local parameters representing each node's contribution to the formation of edges in the network. The modelling suggests a novel definition of important nodes in the network, namely social nodes, as revealed by the local sociality parameters of the model. Moreover, the sociality parameters help to reveal organizational principles of the network. An inherent advantage of our approach is the possibility of hypothesis testing: a priori knowledge about biological properties of the nodes can be incorporated into the statistical model to investigate its influence on the structure of the network. RESULTS: We applied the statistical modelling to the human protein interaction network obtained with Y2H experiments. A Bayesian approach was employed for the estimation of the parameters. We deduced social proteins, essential for the formation of the network, while incorporating into the model information on protein disorder. Intrinsically disordered proteins are those that lack a well-defined three-dimensional structure under physiological conditions. We predicted the fold group (ordered or disordered) of proteins in the network from their primary sequences. The network analysis indicated that protein disorder has a positive effect on the connectivity of proteins in the network, but does not fully explain the interactivity. CONCLUSIONS: The approach opens a perspective to study effects of biological properties of individual entities on the structure of biological networks.
Affiliation(s)
- Svetlana Bulashevska
- Theoretical Bioinformatics Department, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Alla Bulashevska
- Theoretical Bioinformatics Department, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Roland Eils
- Theoretical Bioinformatics Department, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Department of Bioinformatics and Functional Genomics, Institute of Pharmacy and Molecular Biotechnology (IPMB) and Bioquant, University of Heidelberg, Germany
|
88
|
Rangwala H, Kauffman C, Karypis G. svmPRAT: SVM-based protein residue annotation toolkit. BMC Bioinformatics 2009; 10:439. [PMID: 20028521 PMCID: PMC2805646 DOI: 10.1186/1471-2105-10-439] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2009] [Accepted: 12/22/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Over the last decade several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines as they provide accurate and generalizable prediction models. RESULTS: We present a general purpose protein residue annotation toolkit (svmPRAT) to allow biologists to formulate residue-wise prediction problems. svmPRAT formulates the annotation problem as a classification or regression problem using support vector machines. One of the key features of svmPRAT is its ease of use in incorporating any user-provided information in the form of feature matrices. For every residue svmPRAT captures local information around the residue to create fixed length feature vectors. svmPRAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that accurately captures signals and patterns for training effective predictive models. CONCLUSIONS: In this work we evaluate svmPRAT on several classification and regression problems including disorder prediction, residue-wise contact order estimation, DNA-binding site prediction, and local structure alphabet prediction. svmPRAT has also been used for the development of a state-of-the-art transmembrane helix prediction method called TOPTMH, and a secondary structure prediction method called YASSPP. The toolkit provides practitioners with an efficient and easy-to-use tool for a wide variety of annotation problems. AVAILABILITY: http://www.cs.gmu.edu/~mlbio/svmprat.
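As a rough illustration of the window-based encoding mentioned in the abstract above, the sketch below builds one fixed-length vector per residue by concatenating the per-residue features of the residues within w positions on either side. The function name `window_features`, the zero-padding at the chain ends and the toy values are assumptions made for illustration; this is not svmPRAT's actual implementation or interface.

```python
def window_features(feature_matrix, w):
    """Return one fixed-length vector per residue.

    feature_matrix: list of per-residue feature lists, all of length d.
    w: number of neighbouring residues taken on each side (assumed parameter).
    Each output vector has (2 * w + 1) * d entries.
    """
    n = len(feature_matrix)
    d = len(feature_matrix[0]) if n else 0
    pad = [0.0] * d  # zero-padding beyond the chain ends (an assumption)
    vectors = []
    for i in range(n):
        vec = []
        for j in range(i - w, i + w + 1):
            # Use the real features inside the chain, padding outside it.
            vec.extend(feature_matrix[j] if 0 <= j < n else pad)
        vectors.append(vec)
    return vectors


# Toy input: 4 residues with 2 made-up features each.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
encoded = window_features(toy, w=1)
```

Vectors of this kind could then be passed to any SVM implementation for classification or regression; a larger w captures more sequence context at the cost of a longer feature vector.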
Affiliation(s)
- Huzefa Rangwala
- Computer Science Department, George Mason University, Fairfax, VA, USA.
|
89
|
Zeng T, Liu J. Mixture classification model based on clinical markers for breast cancer prognosis. Artif Intell Med 2009; 48:129-37. [PMID: 20005686 DOI: 10.1016/j.artmed.2009.07.008] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2008] [Revised: 07/09/2009] [Accepted: 07/20/2009] [Indexed: 01/09/2023]
Abstract
OBJECTIVE: Accurate cancer prognosis prediction is critical to cancer treatment. There have been many prognosis models based on clinical markers, but few of them are satisfactory in clinical applications. With the development of microarray technologies, cancer researchers have discovered many genes as new markers from gene expression data and have further developed powerful prognosis models based on these so-called genetic biomarkers. However, the application of such biomarkers still suffers from some problems. The first is that gene expression data contain a great number of genes but only a few samples, so it is difficult to select a unified gene set to establish a stable classifier for prognosis. The second is that, for experimental and technical reasons, gene expression data contain noise and redundancies, which may lead to a prognosis predictor with poor performance. Last but not least, microarray experiments are currently so expensive that it is hard to obtain abundant samples. Therefore, it is practical to develop prognosis methods mainly based on conventional clinical markers in real cancer treatment applications. This paper aims to establish an accurate classification model for cancer prognosis that makes full use of the invaluable information in clinical data, especially information that is usually ignored by most existing methods when they aim for high prediction accuracy. METHODS: First, this paper gives a formal description of the general classification problem and presents a novel mixture classification model to make full use of the invaluable information in clinical data. The model is similar to traditional ensemble classification models except that it puts strict constraints on the construction of mapping functions to avoid a voting process. 
Then, a two-layer instance of the proposed model, named MRS (Mixture of Rough set and Support vector machine), is constructed by integrating rough set and support vector machine (SVM) classification methods: the rough set classifier acts as the first layer to identify some singular samples in the data, and the SVM classifier acts as the second layer to classify the remaining samples. Finally, MRS is used to make prognosis predictions on two open breast cancer datasets. One dataset, denoted BRC-1 hereafter, is a high-quality, publicly available dataset of 97 breast cancer tumors from node-negative patients. The other, denoted BRC-2 hereafter, uses baseline human primary breast tumor data from the LBL breast cancer cell collection, containing 174 samples. RESULTS: We performed two experiments, on BRC-1 and BRC-2 respectively. In the first experiment, the BRC-1 dataset was divided into a training set of 78 patients (34 in the poor prognosis group and 44 in the good prognosis group) and a test set of 19 patients (12 in the poor prognosis group and 7 in the good prognosis group). After training on the training set, MRS correctly classified all 12 patients with poor prognosis and 6 of the 7 patients with good prognosis in the test set. These results are better than those of previous studies, including the 70-gene biomarkers. In the second experiment, we constructed classifiers using the BRC-2 dataset and compared MRS with other representative methods in the Weka software by 5-fold cross-validation; the comparison shows that MRS has higher prediction accuracy than those methods. CONCLUSIONS: The proposed mixture classification model can easily integrate methods with different characteristics. It can overcome the shortcomings of traditional voting-based ensemble models and thus can make full use of the information in clinical data. 
The experimental results illustrate that our implemented MRS classifier can predict breast cancer prognosis more accurately than previous prognostic methods.
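For orientation, the BRC-1 test-set figures quoted in the abstract above can be turned into summary rates with a few lines of arithmetic. The counts are the ones reported in the abstract; the variable names and the choice of summary metrics are assumptions for illustration, and the paper itself may report its results differently.

```python
# Test-set counts for BRC-1 as reported in the abstract.
poor_correct, poor_total = 12, 12   # poor-prognosis patients classified correctly
good_correct, good_total = 6, 7     # good-prognosis patients classified correctly

# Overall accuracy and per-class recall derived from those counts.
accuracy = (poor_correct + good_correct) / (poor_total + good_total)
poor_recall = poor_correct / poor_total
good_recall = good_correct / good_total

print(f"accuracy={accuracy:.3f}, "
      f"poor recall={poor_recall:.3f}, good recall={good_recall:.3f}")
```

With 18 of 19 test patients classified correctly, the overall accuracy is roughly 0.947, with the single error falling in the good-prognosis group.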
Affiliation(s)
- Tao Zeng
- School of Computer, Wuhan University, Wuhan 430079, China
|
90
|
Dosztanyi Z, Meszaros B, Simon I. Bioinformatical approaches to characterize intrinsically disordered/unstructured proteins. Brief Bioinform 2009; 11:225-43. [DOI: 10.1093/bib/bbp061] [Citation(s) in RCA: 93] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
|
91
|
Liu YC, Yang MH, Lin WL, Huang CK, Oyang YJ. A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins. BMC Genomics 2009; 10 Suppl 3:S22. [PMID: 19958486 PMCID: PMC2788375 DOI: 10.1186/1471-2164-10-s3-s22] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
Background Proteins are dynamic macromolecules which may undergo conformational transitions upon changes in environment. As it has been observed in laboratories that protein flexibility is correlated to essential biological functions, scientists have been designing various types of predictors for identifying structurally flexible regions in proteins. In this respect, there are two major categories of predictors. One category of predictors attempts to identify conformationally flexible regions through analysis of protein tertiary structures. Another category of predictors works completely based on analysis of the polypeptide sequences. As the availability of protein tertiary structures is generally limited, the design of predictors that work completely based on sequence information is crucial for advances of molecular biology research. Results In this article, we propose a novel approach to design a sequence-based predictor for identifying conformationally ambivalent regions in proteins. The novelty in the design stems from incorporating two classifiers based on two distinctive supervised learning algorithms that provide complementary prediction powers. Experimental results show that the overall performance delivered by the hybrid predictor proposed in this article is superior to the performance delivered by the existing predictors. Furthermore, the case study presented in this article demonstrates that the proposed hybrid predictor is capable of providing the biologists with valuable clues about the functional sites in a protein chain. The proposed hybrid predictor provides the users with two optional modes, namely, the high-sensitivity mode and the high-specificity mode. 
The experimental results with an independent testing data set show that the proposed hybrid predictor is capable of delivering sensitivity of 0.710 and specificity of 0.608 under the high-sensitivity mode, while delivering sensitivity of 0.451 and specificity of 0.787 under the high-specificity mode. Conclusion Though experimental results show that the hybrid approach designed to exploit the complementary prediction powers of distinctive supervised learning algorithms works more effectively than conventional approaches, there exists a large room for further improvement with respect to the achieved performance. In this respect, it is of interest to investigate the effects of exploiting additional physiochemical properties that are related to conformational ambivalence. Furthermore, it is of interest to investigate the effects of incorporating lately-developed machine learning approaches, e.g. the random forest design and the multi-stage design. As conformational transition plays a key role in carrying out several essential types of biological functions, the design of more advanced predictors for identifying conformationally ambivalent regions in proteins deserves our continuous attention.
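The sensitivity and specificity values quoted above follow the usual confusion-matrix definitions, which the sketch below spells out. The confusion counts are hypothetical, chosen only so that the output matches the high-sensitivity-mode figures in the abstract (0.710 and 0.608); the abstract does not report the underlying counts, and the function name is likewise an assumption.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): fraction of true positives recovered.
    specificity = TN / (TN + FP): fraction of true negatives recovered.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts (NOT from the paper), picked to give 0.710 and 0.608.
sens, spec = sensitivity_specificity(tp=71, fn=29, tn=152, fp=98)
```

Switching a predictor between a high-sensitivity and a high-specificity mode typically amounts to moving its decision threshold, which trades false negatives against false positives in these two ratios.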
Affiliation(s)
- Yu-Cheng Liu
- Institute of Biomedical Engineering, National Taiwan University, Taipei, Taiwan, Republic of China.
|
92
|
Sakharkar MK, Sakharkar KR, Chow VTK. Human genomic diversity, viral genomics and proteomics, as exemplified by human papillomaviruses and H5N1 influenza viruses. Hum Genomics 2009; 3:320-31. [PMID: 19706363 PMCID: PMC3525194 DOI: 10.1186/1479-7364-3-4-320] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
The diversity of hosts, pathogens and host-pathogen relationships reflects the influence of selective pressures that fuel diversity through ongoing interactions with other rapidly evolving molecules in the environment. This paper discusses specific examples illustrating the phenomenon of diversity of hosts and pathogens, with special reference to human papillomaviruses and H5N1 influenza viruses. We also review the influence of diverse host-pathogen interactions that determine the pathophysiology of infections, and their responses to drugs or vaccines.
Affiliation(s)
- Meena K Sakharkar
- Biomedical Engineering Research Centre, Nanyang Technological University, Singapore
|
93
|
Thusberg J, Vihinen M. Pathogenic or not? And if so, then how? Studying the effects of missense mutations using bioinformatics methods. Hum Mutat 2009; 30:703-14. [PMID: 19267389 DOI: 10.1002/humu.20938] [Citation(s) in RCA: 180] [Impact Index Per Article: 11.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
Many gene defects are relatively easy to identify experimentally, but obtaining information about the effects of sequence variations and elucidation of the detailed molecular mechanisms of genetic diseases will be among the next major efforts in mutation research. Amino acid substitutions may have diverse effects on protein structure and function; thus, a detailed analysis of the mutations is essential. Experimental study of the molecular effects of mutations is laborious, whereas useful and reliable information about the effects of amino acid substitutions can readily be obtained by theoretical methods. Experimentally defined structures and molecular modeling can be used as a basis for interpretation of the mutations. The effects of missense mutations can be analyzed even when the 3D structure of the protein has not been determined, although structure-based analyses are more reliable. Structural analyses include studies of the contacts between residues, their implication for the stability of the protein, and the effects of the introduced residues. Investigations of steric and stereochemical consequences of substitutions provide insights on the molecular fit of the introduced residue. Mutations that change the electrostatic surface potential of a protein have wide-ranging effects. Analyses of the effects of mutations on interactions with ligands and partners have been performed for elucidation of functional mutations. We have employed numerous methods for predicting the effects of amino acid substitutions. We discuss the applicability of these methods in the analysis of genes, proteins, and diseases to reveal protein structure-function relationships, which is essential to gain insights into disease genotype-phenotype correlations.
Affiliation(s)
- Janita Thusberg
- Institute of Medical Technology, FI-33014 University of Tampere, Finland
|
94
|
|
95
|
Ebina T, Toh H, Kuroda Y. Loop-length-dependent SVM prediction of domain linkers for high-throughput structural proteomics. Biopolymers 2009; 92:1-8. [PMID: 18844295 DOI: 10.1002/bip.21105] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
The prediction of structural domains in novel protein sequences is becoming of practical importance. One important area of application is the development of computer-aided techniques for identifying, at a low cost, novel protein domain targets for large-scale functional and structural proteomics. Here, we report a loop-length-dependent support vector machine (SVM) prediction of domain linkers, which are loops separating two structural domains. (DLP-SVM is freely available at: http://www.tuat.ac.jp/~domserv/cgi-bin/DLP-SVM.cgi.) We constructed three loop-length-dependent SVM predictors of domain linkers (SVM-All, SVM-Long and SVM-Short), and also built SVM-Joint, which combines the results of SVM-Short and SVM-Long into a single consolidated prediction. The performances of SVM-Joint were, in most aspects, the highest, with a sensitivity of 59.7% and a specificity of 43.6%, which indicated that the specificity and the sensitivity were improved by over 2 and 3% respectively, when loop-length-dependent characteristics were taken into account. Furthermore, the sensitivity and specificity of SVM-Joint were, respectively, 37.6 and 17.4% higher than those of a random guess, and also superior to those of previously reported domain linker predictors. These results indicate that SVMs can be used to predict domain linkers, and that loop-length-dependent characteristics are useful for improving SVM prediction performances.
Affiliation(s)
- Teppei Ebina
- Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 12-24-16 Naka-machi, Koganei-shi, Tokyo 184-8588, Japan
|
96
|
Fukuchi S, Homma K, Minezaki Y, Gojobori T, Nishikawa K. Development of an accurate classification system of proteins into structured and unstructured regions that uncovers novel structural domains: its application to human transcription factors. BMC STRUCTURAL BIOLOGY 2009; 9:26. [PMID: 19402914 PMCID: PMC2687452 DOI: 10.1186/1472-6807-9-26] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/05/2009] [Accepted: 04/30/2009] [Indexed: 12/26/2022]
Abstract
BACKGROUND: In addition to structural domains, most eukaryotic proteins possess intrinsically disordered (ID) regions. Although ID regions often play important functional roles, their accurate identification is difficult. As human transcription factors (TFs) constitute a typical group of proteins with long ID regions, we regarded them as a model of all proteins and attempted to accurately classify TFs into structural domains and ID regions. Although an extremely high fraction of ID regions besides DNA binding and/or other domains was detected in human TFs in our previous investigation, 20% of the residues were left unassigned. In this report, we exploit the generally higher sequence divergence in ID regions than in structural regions to completely divide proteins into structural domains and ID regions.
RESULTS: The new dichotomic system first identifies domains of known structures, followed by assignment of structural domains and ID regions with a combination of pre-existing tools and a newly developed program based on sequence divergence, taking un-aligned regions into consideration. The system was found to be highly accurate: its application to a set of proteins with experimentally verified ID regions had an error rate as low as 2%. Application of this system to human TFs (401 proteins) showed that 38% of the residues were in structural domains, while 62% were in ID regions. The preponderance of ID regions makes a sharp contrast to TFs of Escherichia coli (229 proteins), in which only 5% fell in ID regions. The method also revealed that 4.0% and 11.8% of the total length in human and E. coli TFs, respectively, are comprised of structural domains whose structures have not been determined.
CONCLUSION: The present system verifies that sequence divergence including information of unaligned regions is a good indicator of ID regions. The system for the first time estimates the complete fractioning of structured/un-structured regions in human TFs, also revealing structural domains without homology to known structures. These predicted novel structural domains are good targets of structural genomics. When applied to other proteins, the system is expected to uncover more novel structural domains.
Affiliation(s)
- Satoshi Fukuchi
- Center for Information Biology & DNA Data Bank of Japan, National Institute of Genetics, Mishima, Shizuoka, Japan.
|
97
|
Samudrala R, Heffron F, McDermott JE. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems. PLoS Pathog 2009; 5:e1000375. [PMID: 19390620 PMCID: PMC2668754 DOI: 10.1371/journal.ppat.1000375] [Citation(s) in RCA: 156] [Impact Index Per Article: 9.8] [Received: 07/30/2008] [Accepted: 03/11/2009] [Indexed: 11/18/2022] Open
Abstract
The type III secretion system is an essential component for virulence in many Gram-negative bacteria. Though components of the secretion system apparatus are conserved, its substrates--effector proteins--are not. We have used a novel computational approach to confidently identify new secreted effectors by integrating protein sequence-based features, including evolutionary measures such as the pattern of homologs in a range of other organisms, G+C content, amino acid composition, and the N-terminal 30 residues of the protein sequence. The method was trained on known effectors from the plant pathogen Pseudomonas syringae and validated on a set of effectors from the animal pathogen Salmonella enterica serovar Typhimurium (S. Typhimurium) after eliminating effectors with detectable sequence similarity. We show that this approach can predict known secreted effectors with high specificity and sensitivity. Furthermore, by considering a large set of effectors from multiple organisms, we computationally identify a common putative secretion signal in the N-terminal 20 residues of secreted effectors. This signal can be used to discriminate 46 out of 68 total known effectors from both organisms, suggesting that it is a real, shared signal applicable to many type III secreted effectors. We use the method to make novel predictions of secreted effectors in S. Typhimurium, some of which have been experimentally validated. We also apply the method to predict secreted effectors in the genetically intractable human pathogen Chlamydia trachomatis, identifying the majority of known secreted proteins in addition to providing a number of novel predictions. This approach provides a new way to identify secreted effectors in a broad range of pathogenic bacteria for further experimental characterization and provides insight into the nature of the type III secretion signal.
Affiliation(s)
- Ram Samudrala
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- Fred Heffron
- Department of Molecular Microbiology and Immunology, Oregon Health and Science University, Portland, Oregon, United States of America
- Jason E. McDermott
- Computational Biology and Bioinformatics, Pacific Northwest National Laboratory, Richland, Washington, United States of America
|
98
|
Kodama Y, Tamura T, Hirasawa W, Nakamura K, Sano H. A novel protein phosphorylation pathway involved in osmotic-stress response in tobacco plants. Biochimie 2009; 91:533-9. [PMID: 19340923 DOI: 10.1016/j.biochi.2009.01.003] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/23/2022]
Abstract
Osmotic stress is one of the severest environmental pressures for plants, commonly occurring under natural growing condition due to drought, salinity, cold and wounding. Plants sensitively respond to these stresses by activating a set of genes, which encode proteins necessary to overcome the crises. We screened such genes from tobacco plants, and identified a particular clone, which encoded a 45 kDa protein kinase belonging to the plant receptor-like cytoplasmic protein kinase class-VII, NAK (novel Arabidopsis protein kinase) group. The clone was consequently designated as NtNAK (Nicotiana tabacum NAK, accession number: DQ447159). GFP-NtNAK fusion protein was localized in both cytoplasm and nucleus, and bacterially expressed NtNAK exhibited in vitro kinase activity. Its transcripts were clearly induced upon treatments of leaves with salt, mannitol, low temperature and also with abscisic and jasmonic acids and ethylene. These properties indicated NtNAK to be a typical osmo-stress-responsive protein kinase. Its target protein(s) were then screened by the yeast two-hybrid system, and one clone encoding a 32 kDa protein was identified. The protein resembled a potato stress-responsive protein CK251806, and designated as NtCK25 (accession number: DQ448851). Bacterially expressed NtCK25 was phosphorylated by NtNAK, and NtCK25-GFP fusion protein was exclusively localized in nucleus. The structure of NtCK25 was found to be similar to a human nuclear body protein, SP110, which is involved in DNA/protein binding regulation. This suggested that, perceiving osmo-stress signal, NtNAK phosphorylates and activates NtCK25, which might function in regulation of nucleus function. The present study thus suggests that NtNAK/NtCK25 constitutes a novel phosphorylation pathway for osmotic-stress response in plants.
Affiliation(s)
- Yutaka Kodama
- Research and Education Center for Genetic Information, Nara Institute of Science and Technology, Nara 630-0192, Japan
|
99
|
Takahashi M, Mizuguchi M, Shinoda H, Aizawa T, Demura M, Okazawa H, Kawano K. Polyglutamine tract binding protein-1 is an intrinsically unstructured protein. Biochimica et Biophysica Acta - Proteins and Proteomics 2009; 1794:936-43. [PMID: 19303059 DOI: 10.1016/j.bbapap.2009.03.001] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 09/30/2008] [Revised: 02/27/2009] [Accepted: 03/03/2009] [Indexed: 12/24/2022]
Abstract
Polyglutamine tract binding protein-1 (PQBP-1) is a nuclear protein that interacts with disease proteins containing expanded polyglutamine repeats. PQBP-1 also interacts with RNA polymerase II and a spliceosomal protein U5-15kD. In the present study, we demonstrate that PQBP-1 is composed of a large unstructured region and a small folded core. Intriguingly, the large unstructured region encompasses two functional domains: a polar amino acid rich domain and a C-terminal domain. These findings suggest that PQBP-1 belongs to the family of intrinsically unstructured/disordered proteins. Furthermore, the binding of the target molecule U5-15kD induces only minor conformational changes into PQBP-1. Our results suggest that PQBP-1 includes high content of unstructured regions in the C-terminal domain, in spite of the binding of U5-15kD.
Affiliation(s)
- Masaki Takahashi
- Faculty of Pharmaceutical Sciences, University of Toyama, 2630, Sugitani, Toyama 930-0194, Japan
|
100
|
Intrinsic disorder in protein interactions: insights from a comprehensive structural analysis. PLoS Comput Biol 2009; 5:e1000316. [PMID: 19282967 PMCID: PMC2646137 DOI: 10.1371/journal.pcbi.1000316] [Citation(s) in RCA: 90] [Impact Index Per Article: 5.6] [Received: 07/17/2008] [Accepted: 02/03/2009] [Indexed: 01/08/2023] Open
Abstract
We perform a large-scale study of intrinsically disordered regions in proteins and protein complexes using a non-redundant set of hundreds of different protein complexes. In accordance with the conventional view that folding and binding are coupled, in many of our cases the disorder-to-order transition occurs upon complex formation and can be localized to binding interfaces. Moreover, analysis of disorder in protein complexes depicts a significant fraction of intrinsically disordered regions, with up to one third of all residues being disordered. We find that the disorder in homodimers, especially in symmetrical homodimers, is significantly higher than in heterodimers and offer an explanation for this interesting phenomenon. We argue that the mechanisms of regulation of binding specificity through disordered regions in complexes can be as common as for unbound monomeric proteins. The fascinating diversity of roles of disordered regions in various biological processes and protein oligomeric forms shown in our study may be a subject of future endeavors in this area.
|