1
Sandhya S, Mudgal R, Kumar G, Sowdhamini R, Srinivasan N. Protein sequence design and its applications. Curr Opin Struct Biol 2016; 37:71-80. [PMID: 26773478 DOI: 10.1016/j.sbi.2015.12.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 09/28/2015] [Revised: 12/07/2015] [Accepted: 12/15/2015] [Indexed: 01/14/2023]
Abstract
Design of proteins has far-reaching potential in diverse areas, ranging from repurposing protein scaffolds for reactions and substrates that they were not naturally meant for to catching a glimpse of the ephemeral proteins that nature might have sampled during evolution. These non-natural proteins, in either synthesized or virtual form, have opened the scope for the design of entities that not only rival their natural counterparts but also offer a chance to visualize the protein space continuum, which might help to relate proteins and understand their associations. Here, we review the recent advances in protein engineering and design, in multiple areas, with a view to drawing attention to their future potential.
Affiliation(s)
- Sankaran Sandhya: Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
- Richa Mudgal: Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India; IISc Mathematics Initiative, Indian Institute of Science, Bangalore 560 012, India
- Gayatri Kumar: Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
- Ramanathan Sowdhamini: National Centre for Biological Sciences-TIFR, UAS-GKVK Campus, Bangalore 560065, India
2
Bedoya O, Tischer I. Reducing dimensionality in remote homology detection using predicted contact maps. Comput Biol Med 2015; 59:64-72. [DOI: 10.1016/j.compbiomed.2015.01.020] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/22/2014] [Revised: 01/05/2015] [Accepted: 01/22/2015] [Indexed: 11/28/2022]
3
Daniels NM, Gallant A, Ramsey N, Cowen LJ. MRFy: Remote Homology Detection for Beta-Structural Proteins Using Markov Random Fields and Stochastic Search. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:4-16. [PMID: 26357074 DOI: 10.1109/tcbb.2014.2344682] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/05/2023]
Abstract
We introduce MRFy, a tool for protein remote homology detection that captures beta-strand dependencies in the Markov random field. Over a set of 11 SCOP beta-structural superfamilies, MRFy shows a 14 percent improvement in mean Area Under the Curve for the motif recognition problem as compared to HMMER, a 25 percent improvement as compared to RAPTOR, a 14 percent improvement as compared to HHpred, and an 18 percent improvement as compared to CNFPred and RaptorX. MRFy was implemented in the Haskell functional programming language, and parallelizes well on multi-core systems. MRFy is available, as source code as well as an executable, from http://mrfy.cs.tufts.edu/.
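As a reading aid for the Area Under the Curve figures quoted above, the following is a minimal sketch of how an AUC can be computed from classifier scores using the rank (Mann-Whitney) formulation. It is not taken from MRFy; the function name `auc` and the example scores are assumptions made for illustration only.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count 0.5).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores: true superfamily members vs. non-members.
member_scores = [0.9, 0.3]
non_member_scores = [0.5, 0.1]
example_auc = auc(member_scores, non_member_scores)
```

The quadratic pairwise loop is chosen for clarity; a sort-based version would be preferable for large score lists.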
4
Song T, Gu H. Discriminative motif discovery via simulated evolution and random under-sampling. PLoS One 2014; 9:e87670. [PMID: 24551063 PMCID: PMC3923751 DOI: 10.1371/journal.pone.0087670] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/07/2013] [Accepted: 12/29/2013] [Indexed: 11/22/2022] [Open access]
Abstract
Conserved motifs in biological sequences are closely related to their structure and functions. Recently, discriminative motif discovery methods have attracted increasing attention. However, little attention has been devoted to the data imbalance problem, which is one of the main factors affecting the performance of discriminative models. In this article, a simulated evolution method is applied to solve the multi-class imbalance problem at the data preprocessing stage, and a random under-sampling method is introduced at the Hidden Markov Model (HMM) training stage to address the imbalance between the positive and negative datasets. It is shown that, in the task of discovering targeting motifs of nine subcellular compartments, the motifs found by our method are more conserved than those found by methods that do not consider the data imbalance problem, and recover the most known targeting motifs from Minimotif Miner and InterPro. We also use the discovered motifs to predict protein subcellular localization and achieve higher prediction precision and recall for the minority classes.
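The random under-sampling step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the simplest form of the technique (randomly discarding examples from the larger class until both classes have equal size) and is not the authors' implementation; the function name `random_under_sample`, the `seed` parameter, and the toy sequences are hypothetical.

```python
import random


def random_under_sample(positives, negatives, seed=0):
    """Balance two classes by randomly discarding examples from the larger one.

    Returns a (positives, negatives) pair of equal length. The seed is
    fixed only so the sketch is reproducible.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    # rng.sample draws n items without replacement from each class.
    return rng.sample(positives, n), rng.sample(negatives, n)


# Hypothetical toy data: 3 positive sequences against 30 negatives.
pos = ["MKTLLA", "MLSRAV", "MAAKLL"]
neg = [f"NEGSEQ{i}" for i in range(30)]
balanced_pos, balanced_neg = random_under_sample(pos, neg)
```

In practice the sampling is often repeated with different seeds so that each training round sees a different subset of the majority class.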
Affiliation(s)
- Tao Song: Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
- Hong Gu: Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
5
Abstract
MOTIVATION The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acceleration of programs in the popular PSI/DELTA-BLAST family of tools will not only speed up homology search directly but also the huge collection of other current programs that primarily interact with large protein databases via precisely these tools. RESULTS We introduce a suite of homology search tools, powered by compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than and comparably accurate with all known state-of-the-art tools, including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is that we introduce a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP's runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants, which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks, such as protein structure prediction and orthology mapping, which rely heavily on homology search. AVAILABILITY CaBLASTP is available under the GNU Public License at http://cablastp.csail.mit.edu/ CONTACT bab@mit.edu.
Affiliation(s)
- Noah M Daniels: Department of Computer Science, Tufts University, Medford, MA 02451, USA
6
Lee C, Huang CH. LASAGNA: a novel algorithm for transcription factor binding site alignment. BMC Bioinformatics 2013; 14:108. [PMID: 23522376 PMCID: PMC3747862 DOI: 10.1186/1471-2105-14-108] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 07/24/2012] [Accepted: 03/08/2013] [Indexed: 12/11/2022] [Open access]
Abstract
BACKGROUND Scientists routinely scan DNA sequences for transcription factor (TF) binding sites (TFBSs). Most of the available tools rely on position-specific scoring matrices (PSSMs) constructed from aligned binding sites. Because of the resolutions of assays used to obtain TFBSs, databases such as TRANSFAC, ORegAnno and PAZAR store unaligned variable-length DNA segments containing binding sites of a TF. These DNA segments need to be aligned to build a PSSM. While the TRANSFAC database provides scoring matrices for TFs, nearly 78% of the TFs in the public release do not have matrices available. As work on TFBS alignment algorithms has been limited, it is highly desirable to have an alignment algorithm tailored to TFBSs. RESULTS We designed a novel algorithm named LASAGNA, which is aware of the lengths of input TFBSs and utilizes position dependence. Results on 189 TFs of 5 species in the TRANSFAC database showed that our method significantly outperformed ClustalW2 and MEME. We further compared a PSSM method dependent on LASAGNA to an alignment-free TFBS search method. Results on 89 TFs whose binding sites can be located in genomes showed that our method is significantly more precise at fixed recall rates. Finally, we described LASAGNA-ChIP, a more sophisticated version for ChIP (Chromatin immunoprecipitation) experiments. Under the one-per-sequence model, it showed comparable performance with MEME in discovering motifs in ChIP-seq peak sequences. CONCLUSIONS We conclude that the LASAGNA algorithm is simple and effective in aligning variable-length binding sites. It has been integrated into a user-friendly webtool for TFBS search and visualization called LASAGNA-Search. The tool currently stores precomputed PSSM models for 189 TFs and 133 TFs built from TFBSs in the TRANSFAC Public database (release 7.0) and the ORegAnno database (08Nov10 dump), respectively. The webtool is available at http://biogrid.engr.uconn.edu/lasagna_search/.
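To make the role of the alignment concrete, here is a minimal sketch of the downstream step the abstract refers to: building a position-specific scoring matrix (PSSM) from binding sites that have already been aligned, then scanning a sequence with it. It does not implement LASAGNA itself. The log-odds scoring with a uniform background and a pseudocount of 1 is an assumption, and all function names and example sites are hypothetical.

```python
import math
from collections import Counter


def build_pssm(aligned_sites, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, already aligned sites.

    Assumes a uniform background distribution over the alphabet.
    """
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("aligned sites must all have the same length")
    background = 1.0 / len(alphabet)
    total = len(aligned_sites) + pseudocount * len(alphabet)
    pssm = []
    for i in range(width):
        counts = Counter(site[i] for site in aligned_sites)
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in alphabet
        }
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum the per-position log-odds scores of a segment."""
    return sum(column[base] for column, base in zip(pssm, segment))


def best_hit(pssm, sequence):
    """Return the start index of the highest-scoring window in sequence."""
    width = len(pssm)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: score(pssm, sequence[i:i + width]))


# Hypothetical aligned binding sites for one transcription factor.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pssm = build_pssm(sites)
hit = best_hit(pssm, "GGCTATAATGC")
```

Because every column of the matrix is indexed by alignment position, variable-length unaligned segments cannot be used directly, which is why an alignment step is needed first.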
Affiliation(s)
- Chih Lee: Department of Computer Science and Engineering, University of Connecticut, Fairfield Road, Storrs, CT 06269, USA
- Chun-Hsi Huang: Department of Computer Science and Engineering, University of Connecticut, Fairfield Road, Storrs, CT 06269, USA
7
Daniels NM, Nadimpalli S, Cowen LJ. Formatt: Correcting protein multiple structural alignments by incorporating sequence alignment. BMC Bioinformatics 2012; 13:259. [PMID: 23039758 PMCID: PMC3585936 DOI: 10.1186/1471-2105-13-259] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 05/10/2012] [Accepted: 10/01/2012] [Indexed: 11/10/2022] [Open access]
Abstract
BACKGROUND The quality of multiple protein structure alignments is usually computed and assessed based on geometric functions of the coordinates of the backbone atoms from the protein chains. These purely geometric methods do not directly utilize protein sequence similarity, and in fact, determining the proper way to incorporate sequence similarity measures into the construction and assessment of protein multiple structure alignments has proved surprisingly difficult. RESULTS We present Formatt, a multiple structure alignment program based on the purely geometric Matt multiple structure alignment program, that also takes into account sequence similarity when constructing alignments. We show that Formatt outperforms Matt and other popular structure alignment programs on the popular HOMSTRAD benchmark. For the SABMark twilight zone benchmark set that captures more remote homology, Formatt and Matt outperform other programs; depending on choice of embedded sequence aligner, Formatt produces either better sequence and structural alignments with a smaller core size than Matt, or similarly sized alignments with better sequence similarity, for a small cost in average RMSD. CONCLUSIONS Considering sequence information as well as purely geometric information seems to improve the quality of multiple structure alignments, though defining what constitutes the best alignment when sequence and structural measures would suggest different alignments remains a difficult open question.
Affiliation(s)
- Noah M Daniels: Department of Computer Science, Tufts University, 161 College Ave, Medford, 02155, MA, USA
- Shilpa Nadimpalli: Department of Computer Science, Princeton University, 35 Olden St, Princeton, 08540, NJ, USA
- Lenore J Cowen: Department of Computer Science, Tufts University, 161 College Ave, Medford, 02155, MA, USA
8
A computational framework for boosting confidence in high-throughput protein-protein interaction datasets. Genome Biol 2012; 13:R76. [PMID: 22937800 PMCID: PMC4053744 DOI: 10.1186/gb-2012-13-8-r76] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Received: 05/03/2012] [Accepted: 08/31/2012] [Indexed: 12/28/2022] [Open access]
Abstract
Improving the quality and coverage of the protein interactome is of paramount importance for biomedical research, particularly given the various sources of uncertainty in high-throughput techniques. We introduce a structure-based framework, Coev2Net, for computing a single confidence score that addresses both false-positive and false-negative rates. Coev2Net is easily applied to thousands of binary protein interactions and has superior predictive performance over existing methods. We experimentally validate selected high-confidence predictions in the human MAPK network and show that predicted interfaces are enriched for cancer-related or damaging SNPs. Coev2Net can be downloaded at http://struct2net.csail.mit.edu.
9
Joo H, Chavan AG, Phan J, Day R, Tsai J. An amino acid packing code for α-helical structure and protein design. J Mol Biol 2012; 419:234-54. [PMID: 22426125 DOI: 10.1016/j.jmb.2012.03.004] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 12/29/2011] [Revised: 02/22/2012] [Accepted: 03/07/2012] [Indexed: 11/19/2022]
Abstract
This work demonstrates that all packing in α-helices can be simplified to repetitive patterns of a single motif: the knob-socket. Using the precision of Voronoi Polyhedra/Delaunay Tessellations to identify contacts, the knob-socket is a four-residue tetrahedral motif: a knob residue on one α-helix packs into the three-residue socket on another α-helix. The principle of the knob-socket model relates the packing between levels of protein structure: the intra-helical packing arrangements within secondary structure that permit inter-helix tertiary packing interactions. Within an α-helix, the three-residue sockets arrange residues into a uniform packing lattice. Inter-helix packing results from a definable pattern of interdigitated knob-socket motifs between two α-helices. Furthermore, the knob-socket model classifies three types of sockets: (1) free, favoring only intra-helical packing; (2) filled, favoring inter-helical interactions; and (3) non, disfavoring α-helical structure. The amino acid propensities in these three socket classes essentially represent an amino acid code for structure in α-helical packing. Using this code, we used a novel yet straightforward approach for the design of α-helical structure to validate the knob-socket model. Unique sequences for three peptides were created to produce a predicted amount of α-helical structure: mostly helical, some helical, and no helix. These three peptides were synthesized, and helical content was assessed using CD spectroscopy. The measured α-helicity of each peptide was consistent with the expected predictions. These results and analysis demonstrate that the knob-socket motif functions as the basic unit of packing and presents an intuitive tool to decipher the rules governing packing in protein structure.
Affiliation(s)
- Hyun Joo: Department of Chemistry, University of the Pacific, Stockton, CA 95211, USA
10
Daniels NM, Hosur R, Berger B, Cowen LJ. SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone. Bioinformatics 2012; 28:1216-22. [PMID: 22408192 PMCID: PMC3338012 DOI: 10.1093/bioinformatics/bts110] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Indexed: 11/14/2022] [Open access]
Abstract
Motivation: One of the most successful methods to date for recognizing protein sequences that are evolutionarily related has been profile hidden Markov models (HMMs). However, these models do not capture pairwise statistical preferences of residues that are hydrogen bonded in beta sheets. These dependencies have been partially captured in the HMM setting by simulated evolution in the training phase and can be fully captured by Markov random fields (MRFs). However, the MRFs can be computationally prohibitive when beta strands are interleaved in complex topologies. We introduce SMURFLite, a method that combines both simplified MRFs and simulated evolution to substantially improve remote homology detection for beta structures. Unlike previous MRF-based methods, SMURFLite is computationally feasible on any beta-structural motif. Results: We test SMURFLite on all propeller and barrel folds in the mainly-beta class of the SCOP hierarchy in stringent cross-validation experiments. We show a mean 26% (median 16%) improvement in area under curve (AUC) for beta-structural motif recognition as compared with HMMER (a well-known HMM method) and a mean 33% (median 19%) improvement as compared with RAPTOR (a well-known threading method) and even a mean 18% (median 10%) improvement in AUC over HHpred (a profile–profile HMM method), despite HHpred's use of extensive additional training data. We demonstrate SMURFLite's ability to scale to whole genomes by running a SMURFLite library of 207 beta-structural SCOP superfamilies against the entire genome of Thermotoga maritima, and make over 100 new fold predictions. Availability and implementation: A webserver that runs SMURFLite is available at: http://smurf.cs.tufts.edu/smurflite/ Contact: lenore.cowen@tufts.edu; bab@mit.edu
Affiliation(s)
- Noah M Daniels: Department of Computer Science, Tufts University, Medford, MA 02155, USA