1
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors. Hum Genomics 2024; 18:90. [PMID: 39198917 PMCID: PMC11360829 DOI: 10.1186/s40246-024-00663-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2024] [Accepted: 08/19/2024] [Indexed: 09/01/2024]
Abstract
BACKGROUND: Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). RESULTS: The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past three decades, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 190 VIPs, resulting in a total of 407 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. CONCLUSIONS: VIPdb version 2 summarizes 407 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. VIPdb is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
  - Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
- Arul S Menon
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
  - College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA
- Zhiqiang Hu
  - Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA
  - Illumina, Foster City, CA, 94404, USA
- Steven E Brenner
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
  - Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
  - College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA
  - Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA
2
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: Trends from 25 years of genetic variant impact predictors. bioRxiv 2024:2024.06.25.600283. [PMID: 38979289 PMCID: PMC11230257 DOI: 10.1101/2024.06.25.600283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024]
Abstract
Background: Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). Results: The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past 25 years, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 186 VIPs, resulting in a total of 403 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. Conclusions: VIPdb version 2 summarizes 403 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. Availability: VIPdb version 2 is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
  - Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
  - Center for Computational Biology, University of California, Berkeley, California 94720, USA
- Arul S. Menon
  - Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
  - College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
- Zhiqiang Hu
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
  - Currently at: Illumina, Foster City, California 94404, USA
- Steven E. Brenner
  - Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
  - Center for Computational Biology, University of California, Berkeley, California 94720, USA
  - College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
3
Dumitrescu A, Jokinen E, Paatero A, Kellosalo J, Paavilainen VO, Lähdesmäki H. TSignal: a transformer model for signal peptide prediction. Bioinformatics 2023; 39:i347-i356. [PMID: 37387131 DOI: 10.1093/bioinformatics/btad228] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 07/01/2023]
Abstract
MOTIVATION: Signal peptides (SPs) are short amino acid segments present at the N-terminus of newly synthesized proteins that facilitate protein translocation into the lumen of the endoplasmic reticulum, after which they are cleaved off. Specific regions of SPs influence the efficiency of protein translocation, and small changes in their primary structure can abolish protein secretion altogether. The lack of conserved motifs across SPs, sensitivity to mutations, and variability in the length of the peptides make SP prediction a challenging task that has been extensively pursued over the years. RESULTS: We introduce TSignal, a deep transformer-based neural network architecture that utilizes BERT language models and dot-product attention techniques. TSignal predicts the presence of SPs and the cleavage site between the SP and the translocated mature protein. We use common benchmark datasets and show competitive accuracy in terms of SP presence prediction and state-of-the-art accuracy in terms of cleavage site prediction for most of the SP types and organism groups. We further illustrate that our fully data-driven trained model identifies useful biological information on heterogeneous test sequences. AVAILABILITY AND IMPLEMENTATION: TSignal is available at: https://github.com/Dumitrescu-Alexandru/TSignal.
Affiliation(s)
- Alexandru Dumitrescu
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
  - Institute of Biotechnology, HiLIFE, University of Helsinki, Helsinki 00014, Finland
- Emmi Jokinen
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
- Anja Paatero
  - Institute of Biotechnology, HiLIFE, University of Helsinki, Helsinki 00014, Finland
- Juho Kellosalo
  - Institute of Biotechnology, HiLIFE, University of Helsinki, Helsinki 00014, Finland
- Ville O Paavilainen
  - Institute of Biotechnology, HiLIFE, University of Helsinki, Helsinki 00014, Finland
- Harri Lähdesmäki
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
4
Hu Z, Yu C, Furutsuki M, Andreoletti G, Ly M, Hoskins R, Adhikari AN, Brenner SE. VIPdb, a genetic Variant Impact Predictor Database. Hum Mutat 2019; 40:1202-1214. [PMID: 31283070 PMCID: PMC7288905 DOI: 10.1002/humu.23858] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 06/17/2019] [Accepted: 06/27/2019] [Indexed: 12/30/2022]
Abstract
Genome sequencing identifies a vast number of genetic variants. Predicting these variants' molecular and clinical effects is one of the preeminent challenges in human genetics. Accurate prediction of the impact of genetic variants improves our understanding of how genetic information is conveyed to molecular and cellular functions, and is an essential step towards precision medicine. Over one hundred tools/resources have been developed specifically for this purpose. We summarize these tools, as well as their characteristics, in the genetic Variant Impact Predictor Database (VIPdb). This database will help researchers and clinicians explore appropriate tools, and inform the development of improved methods. VIPdb can be browsed and downloaded at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Zhiqiang Hu
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Changhua Yu
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
  - Department of Bioengineering, University of California, Berkeley, California 94720, USA
- Mabel Furutsuki
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
- Gaia Andreoletti
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Melissa Ly
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
  - Division of Data Sciences, University of California, Berkeley, California 94720, USA
- Roger Hoskins
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Aashish N. Adhikari
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
- Steven E. Brenner
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
5
Abstract
It is well-established that dynamics are central to protein function; their importance is implicitly acknowledged in the principles of the Monod, Wyman and Changeux model of binding cooperativity, which was originally proposed in 1965. Nowadays the concept of protein dynamics is formulated in terms of the energy landscape theory, which can be used to understand protein folding and conformational changes in proteins. Because protein dynamics are so important, a key to understanding protein function at the molecular level is to design experiments that allow their quantitative analysis. Nuclear magnetic resonance (NMR) spectroscopy is uniquely suited for this purpose because major advances in theory, hardware, and experimental methods have made it possible to characterize protein dynamics at an unprecedented level of detail. Unique features of NMR include the ability to quantify dynamics (i) under equilibrium conditions without external perturbations, (ii) using many probes simultaneously, and (iii) over large time intervals. Here we review NMR techniques for quantifying protein dynamics on fast (ps-ns), slow (μs-ms), and very slow (s-min) time scales. These techniques are discussed with reference to some major discoveries in protein science that have been made possible by NMR spectroscopy.
6
Dee DR, Horimoto Y, Yada RY. Conserved prosegment residues stabilize a late-stage folding transition state of pepsin independently of ground states. PLoS One 2014; 9:e101339. [PMID: 24983988 PMCID: PMC4077824 DOI: 10.1371/journal.pone.0101339] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/08/2013] [Accepted: 06/05/2014] [Indexed: 11/24/2022]
Abstract
The native folding of certain zymogen-derived enzymes is completely dependent upon a prosegment domain to stabilize the folding transition state, thereby catalyzing the folding reaction. Generally little is known about how the prosegment accomplishes this task. It was previously shown that the prosegment catalyzes a late-stage folding transition between a stable misfolded state and the native state of pepsin. In this study, the contributions of specific prosegment residues to catalyzing pepsin folding were investigated by introducing individual Ala substitutions and measuring the effects on the bimolecular folding reaction between the prosegment peptide and pepsin. The effects of mutations on the free energies of the individual misfolded and native ground states and the transition state were compared using measurements of prosegment-pepsin binding and folding kinetics. Five out of the seven prosegment residues examined yielded relatively large kinetic effects and minimal ground state perturbations upon mutation, findings which indicate that these residues form strengthened and/or non-native contacts in the transition state. These five residues are semi- to strictly conserved, while only a non-conserved residue had no kinetic effect. One conserved residue was shown to form native structure in the transition state. These results indicated that the prosegment, which is only 44 residues long, has evolved a high density of contacts that preferentially stabilize the folding transition state over the ground states. It is postulated that the prosegment forms extensive non-native contacts during the process of catalyzing correct inter- and intra-domain contacts during the final stages of folding. These results have implications for understanding the folding of multi-domain proteins and for the evolution of prosegment-catalyzed folding.
Affiliation(s)
- Derek R. Dee
  - Biophysics Interdepartmental Group, University of Guelph, Guelph, Ontario, Canada
- Yasumi Horimoto
  - Department of Food Science, University of Guelph, Guelph, Ontario, Canada
- Rickey Y. Yada
  - Biophysics Interdepartmental Group, University of Guelph, Guelph, Ontario, Canada
  - Department of Food Science, University of Guelph, Guelph, Ontario, Canada
7
Computational and experimental approaches to reveal the effects of single nucleotide polymorphisms with respect to disease diagnostics. Int J Mol Sci 2014; 15:9670-717. [PMID: 24886813 PMCID: PMC4100115 DOI: 10.3390/ijms15069670] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 04/08/2014] [Revised: 05/15/2014] [Accepted: 05/16/2014] [Indexed: 12/25/2022]
Abstract
DNA mutations are the cause of many human diseases and they are the reason for natural differences among individuals by affecting the structure, function, interactions, and other properties of DNA and expressed proteins. The ability to predict whether a given mutation is disease-causing or harmless is of great importance for the early detection of patients with a high risk of developing a particular disease and would pave the way for personalized medicine and diagnostics. Here we review existing methods and techniques to study and predict the effects of DNA mutations from three different perspectives: in silico, in vitro and in vivo. It is emphasized that the problem is complicated and successful detection of a pathogenic mutation frequently requires a combination of several methods and a knowledge of the biological phenomena associated with the corresponding macromolecules.
8
Chang CCH, Tey BT, Song J, Ramanan RN. Towards more accurate prediction of protein folding rates: a review of the existing web-based bioinformatics approaches. Brief Bioinform 2014; 16:314-24. [DOI: 10.1093/bib/bbu007] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 01/13/2023]
9
10
Arun PVPS, Bakku RK, Subhashini M, Singh P, Prabhu NP, Suzuki I, Prakash JSS. CyanoPhyChe: a database for physico-chemical properties, structure and biochemical pathway information of cyanobacterial proteins. PLoS One 2012. [PMID: 23185330 PMCID: PMC3504015 DOI: 10.1371/journal.pone.0049425] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/23/2022]
Abstract
CyanoPhyChe is a user-friendly database that one can browse for physico-chemical properties, structure and biochemical pathway information of cyanobacterial proteins. We downloaded all the protein sequences from the cyanobacterial genome database for calculating the physico-chemical properties, such as molecular weight, net charge of protein, isoelectric point, molar extinction coefficient, canonical variable for solubility, grand average hydropathy, aliphatic index, and number of charged residues. Based on the physico-chemical properties, we provide the polarity, structural stability and probability of a protein entering into an inclusion body (PEPIB). We used the data generated on physico-chemical properties, structure and biochemical pathway information of all cyanobacterial proteins to construct CyanoPhyChe. The data can be used for optimizing methods of expression and characterization of cyanobacterial proteins. Moreover, the ‘Search’ and data export options provided will be useful for proteome analysis. Secondary structure was predicted for all the cyanobacterial proteins using the PSIPRED tool, and the data generated are made accessible to researchers working on cyanobacteria. In addition, external links are provided to biological databases such as PDB and KEGG for molecular structure and biochemical pathway information, respectively. External links are also provided to different cyanobacterial databases. CyanoPhyChe can be accessed from the following URL: http://bif.uohyd.ac.in/cpc.
Affiliation(s)
- P. V. Parvati Sai Arun
  - Department of Plant Sciences, School of Life Sciences, University of Hyderabad, Hyderabad, Andhra Pradesh, India
- Ranjith Kumar Bakku
  - Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad, Andhra Pradesh, India
- Mranu Subhashini
  - Department of Plant Sciences, School of Life Sciences, University of Hyderabad, Hyderabad, Andhra Pradesh, India
- Pankaj Singh
  - Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad, Andhra Pradesh, India
- N. Prakash Prabhu
  - Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad, Andhra Pradesh, India
- Iwane Suzuki
  - Faculty of Life and Environmental Science, University of Tsukuba, Tsukuba, Japan
- Jogadhenu S. S. Prakash
  - Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad, Andhra Pradesh, India
11
Qi Y, Oja M, Weston J, Noble WS. A unified multitask architecture for predicting local protein properties. PLoS One 2012; 7:e32235. [PMID: 22461885 PMCID: PMC3312883 DOI: 10.1371/journal.pone.0032235] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Received: 04/27/2011] [Accepted: 01/25/2012] [Indexed: 01/27/2023]
Abstract
A variety of functionally important protein properties, such as secondary structure, transmembrane topology and solvent accessibility, can be encoded as a labeling of amino acids. Indeed, the prediction of such properties from the primary amino acid sequence is one of the core projects of computational biology. Accordingly, a panoply of approaches have been developed for predicting such properties; however, most such approaches focus on solving a single task at a time. Motivated by recent, successful work in natural language processing, we propose to use multitask learning to train a single, joint model that exploits the dependencies among these various labeling tasks. We describe a deep neural network architecture that, given a protein sequence, outputs a host of predicted local properties, including secondary structure, solvent accessibility, transmembrane topology, signal peptides and DNA-binding residues. The network is trained jointly on all these tasks in a supervised fashion, augmented with a novel form of semi-supervised learning in which the model is trained to distinguish between local patterns from natural and synthetic protein sequences. The task-independent architecture of the network obviates the need for task-specific feature engineering. We demonstrate that, for all of the tasks that we considered, our approach leads to statistically significant improvements in performance, relative to a single task neural network approach, and that the resulting model achieves state-of-the-art performance.
Affiliation(s)
- Yanjun Qi
  - Machine Learning Department, NEC Labs America, Princeton, New Jersey, United States of America
- Merja Oja
  - Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Jason Weston
  - Google, New York, New York, United States of America
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
12
Wishart DS. Interpreting protein chemical shift data. Prog Nucl Magn Reson Spectrosc 2011; 58:62-87. [PMID: 21241884 DOI: 10.1016/j.pnmrs.2010.07.004] [Citation(s) in RCA: 191] [Impact Index Per Article: 13.6] [Received: 06/14/2010] [Accepted: 07/29/2010] [Indexed: 05/12/2023]
Affiliation(s)
- David S Wishart
  - Department of Biological Sciences, National Institute for Nanotechnology (NINT), Edmonton, AB, Canada T6G 2E8
13
Berjanskii M, Tang P, Liang J, Cruz JA, Zhou J, Zhou Y, Bassett E, MacDonell C, Lu P, Lin G, Wishart DS. GeNMR: a web server for rapid NMR-based protein structure determination. Nucleic Acids Res 2009; 37:W670-7. [PMID: 19406927 PMCID: PMC2703936 DOI: 10.1093/nar/gkp280] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Indexed: 11/13/2022]
Abstract
GeNMR (GEnerate NMR structures) is a web server for rapidly generating accurate 3D protein structures using sequence data, NOE-based distance restraints and/or NMR chemical shifts as input. GeNMR accepts distance restraints in XPLOR or CYANA format as well as chemical shift files in either SHIFTY or BMRB formats. The web server produces an ensemble of PDB coordinates for the protein within 15-25 min, depending on model complexity and completeness of experimental restraints. GeNMR uses a pipeline of several pre-existing programs and servers to calculate the actual protein structure. In particular, GeNMR combines genetic algorithms for structure optimization along with homology modeling, chemical shift threading, torsion angle and distance predictions from chemical shifts/NOEs as well as ROSETTA-based structure generation and simulated annealing with XPLOR-NIH to generate and/or refine protein coordinates. GeNMR greatly simplifies the task of protein structure determination as users do not have to install or become familiar with complex stand-alone programs or obscure format conversion utilities. Tests conducted on a sample of 90 proteins from the BioMagResBank indicate that GeNMR produces high-quality models for all protein queries, regardless of the type of NMR input data. GeNMR was developed to facilitate rapid, user-friendly structure determination of protein structures via NMR spectroscopy. GeNMR is accessible at http://www.genmr.ca.
Affiliation(s)
- Mark Berjanskii
  - Department of Computing Science, University of Alberta and National Research Council, National Institute for Nanotechnology, Edmonton, AB, Canada T6G 2E8
14
Frank K, Sippl MJ. High-performance signal peptide prediction based on sequence alignment techniques. Bioinformatics 2008; 24:2172-6. [PMID: 18697773 DOI: 10.1093/bioinformatics/btn422] [Citation(s) in RCA: 96] [Impact Index Per Article: 5.6] [Indexed: 11/13/2022]
Abstract
The accuracy of current signal peptide predictors is outstanding. The most successful predictors are based on neural networks and hidden Markov models, reaching a sensitivity of 99% and an accuracy of 95%. Here, we demonstrate that the popular BLASTP alignment tool can be tuned for signal peptide prediction, reaching the same high level of prediction success. Alignment-based techniques provide additional benefits. In spite of high success rates, signal peptide predictors yield false predictions. Simple sequences like polyvaline, for example, are predicted as signal peptides. The general architecture of learning systems makes it difficult to trace the cause of such problems. This kind of false prediction can be recognized or avoided altogether by using sequence comparison techniques. Based on these results we have implemented a public web service, called Signal-BLAST. Predictions returned by Signal-BLAST are transparent and easy to analyze. AVAILABILITY: Signal-BLAST is available online at http://sigpep.services.came.sbg.ac.at/signalblast.html.
Affiliation(s)
- Karl Frank
  - Center of Applied Molecular Engineering, University of Salzburg, Jakob-Haringerstrasse 5, 5020 Salzburg, Austria
15
Shi Y, Zhou J, Arndt D, Wishart DS, Lin G. Protein contact order prediction from primary sequences. BMC Bioinformatics 2008; 9:255. [PMID: 18513429 PMCID: PMC2440764 DOI: 10.1186/1471-2105-9-255] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 08/20/2007] [Accepted: 05/30/2008] [Indexed: 11/11/2022]
Abstract
Background: Contact order is a topological descriptor that has been shown to be correlated with several interesting protein properties such as protein folding rates and protein transition state placements. Contact order has also been used to select for viable protein folds from ab initio protein structure prediction programs. For proteins of known three-dimensional structure, their contact order can be calculated directly. However, for proteins with unknown three-dimensional structure, there is no effective prediction method currently available. Results: In this paper, we propose several simple yet very effective methods to predict contact order from the amino acid sequence only. One set of methods is based on a weighted linear combination of predicted secondary structure content and amino acid composition. Depending on the number of components used in these equations it is possible to achieve a correlation coefficient of 0.857–0.870 between the observed and predicted contact order. A second method, based on sequence similarity to known three-dimensional structures, is able to achieve a correlation coefficient of 0.977. We have also developed a much more robust implementation for calculating contact order directly from PDB coordinates that works for >99% of PDB files. All of these contact order predictors and calculators have been implemented as a web server (see Availability and requirements section for URL). Conclusion: Protein contact order can be effectively predicted from the primary sequence, in the absence of three-dimensional structure. Three factors, percentage of residues in alpha helices, percentage of residues in beta strands, and sequence length, appear to be strongly correlated with the absolute contact order.
Affiliation(s)
- Yi Shi
  - Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8, Canada
16
Wishart DS, Arndt D, Berjanskii M, Tang P, Zhou J, Lin G. CS23D: a web server for rapid protein structure generation using NMR chemical shifts and sequence data. Nucleic Acids Res 2008; 36:W496-502. [PMID: 18515350 PMCID: PMC2447725 DOI: 10.1093/nar/gkn305] [Citation(s) in RCA: 168] [Impact Index Per Article: 9.9] [Indexed: 11/14/2022]
Abstract
CS23D (chemical shift to 3D structure) is a web server for rapidly generating accurate 3D protein structures using only assigned nuclear magnetic resonance (NMR) chemical shifts and sequence data as input. Unlike conventional NMR methods, CS23D requires no NOE and/or J-coupling data to perform its calculations. CS23D accepts chemical shift files in either SHIFTY or BMRB formats, and produces a set of PDB coordinates for the protein in about 10-15 min. CS23D uses a pipeline of several preexisting programs or servers to calculate the actual protein structure. Depending on the sequence similarity (or lack thereof) CS23D uses either (i) maximal subfragment assembly (a form of homology modeling), (ii) chemical shift threading or (iii) shift-aided de novo structure prediction (via Rosetta) followed by chemical shift refinement to generate and/or refine protein coordinates. Tests conducted on more than 100 proteins from the BioMagResBank indicate that CS23D converges (i.e. finds a solution) for >95% of protein queries. These chemical shift generated structures were found to be within 0.2-2.8 Å RMSD of the NMR structure generated using conventional NOE-based NMR methods or conventional X-ray methods. The performance of CS23D is dependent on the completeness of the chemical shift assignments and the similarity of the query protein to known 3D folds. CS23D is accessible at http://www.cs23d.ca.
Affiliation(s)
- David S Wishart
  - Department of Computing Science, Department of Biological Sciences, University of Alberta and National Research Council, National Institute for Nanotechnology (NINT), Edmonton, AB, Canada T6G 2E8
17
Montgomerie S, Cruz JA, Shrivastava S, Arndt D, Berjanskii M, Wishart DS. PROTEUS2: a web server for comprehensive protein structure prediction and structure-based annotation. Nucleic Acids Res 2008; 36:W202-9. [PMID: 18483082 PMCID: PMC2447806 DOI: 10.1093/nar/gkn255] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.8] [Indexed: 02/01/2023]
Abstract
PROTEUS2 is a web server designed to support comprehensive protein structure prediction and structure-based annotation. PROTEUS2 accepts either single sequences (for directed studies) or multiple sequences (for whole proteome annotation) and predicts the secondary and, if possible, tertiary structure of the query protein(s). Unlike most other tools or servers, PROTEUS2 bundles signal peptide identification, transmembrane helix prediction, transmembrane β-strand prediction, secondary structure prediction (for soluble proteins) and homology modeling (i.e. 3D structure generation) into a single prediction pipeline. Using a combination of progressive multi-sequence alignment, structure-based mapping, hidden Markov models, multi-component neural nets and up-to-date databases of known secondary structure assignments, PROTEUS is able to achieve among the highest reported levels of predictive accuracy for signal peptides (Q2 = 94%), membrane spanning helices (Q2 = 87%) and secondary structure (Q3 score of 81.3%). PROTEUS2's homology modeling services also provide high quality 3D models that compare favorably with those generated by SWISS-MODEL and 3D JigSaw (within 0.2 Å RMSD). The average PROTEUS2 prediction takes ∼3 min per query sequence. The PROTEUS2 server, along with source code for many of its modules, is accessible at http://wishart.biology.ualberta.ca/proteus2.
Affiliation(s)
- Scott Montgomerie
  - Department of Computing Science and Department of Biological Sciences, University of Alberta and National Research Council, National Institute for Nanotechnology (NINT), Edmonton, AB, Canada T6G 2E8