1
The Gene Sculpt Suite: a set of tools for genome editing. Nucleic Acids Res 2019; 47:W175-W182. [PMID: 31127311 PMCID: PMC6602503 DOI: 10.1093/nar/gkz405] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 03/09/2019] [Revised: 04/30/2019] [Accepted: 05/02/2019] [Indexed: 12/26/2022] Open
Abstract
The discovery and development of DNA-editing nucleases (Zinc Finger Nucleases, TALENs, CRISPR/Cas systems) has given scientists the ability to precisely engineer or edit genomes as never before. Several different platforms, protocols and vectors for precision genome editing are now available, leading to the development of supporting web-based software. Here we present the Gene Sculpt Suite (GSS), which comprises three tools: (i) GTagHD, which automatically designs and generates oligonucleotides for use with the GeneWeld knock-in protocol; (ii) MEDJED, a machine learning method, which predicts the extent to which a double-stranded DNA break site will utilize the microhomology-mediated repair pathway; and (iii) MENTHU, a tool for identifying genomic locations likely to give rise to a single predominant microhomology-mediated end joining allele (PreMA) repair outcome. All tools in the GSS are freely available for download under the GPL v3.0 license and can be run locally on Windows, Mac and Linux systems capable of running R and/or Docker. The GSS is also freely available online at www.genesculpt.org.
2
Partner-specific prediction of RNA-binding residues in proteins: A critical assessment. Proteins 2018; 87:198-211. [PMID: 30536635 PMCID: PMC6389706 DOI: 10.1002/prot.25639] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/01/2018] [Revised: 10/10/2018] [Accepted: 11/29/2018] [Indexed: 01/06/2023]
Abstract
RNA-protein interactions play essential roles in regulating gene expression. While some RNA-protein interactions are "specific", that is, the RNA-binding proteins preferentially bind to particular RNA sequence or structural motifs, others are "non-RNA specific." Deciphering the protein-RNA recognition code is essential for comprehending the functional implications of these interactions and for developing new therapies for many diseases. Because of the high cost of experimental determination of protein-RNA interfaces, there is a need for computational methods to identify RNA-binding residues in proteins. While most of the existing computational methods for predicting RNA-binding residues in RNA-binding proteins are oblivious to the characteristics of the partner RNA, there is growing interest in methods for partner-specific prediction of RNA binding sites in proteins. In this work, we assess the performance of two recently published partner-specific protein-RNA interface prediction tools, PS-PRIP, and PRIdictor, along with our own new tools. Specifically, we introduce a novel metric, RNA-specificity metric (RSM), for quantifying the RNA-specificity of the RNA binding residues predicted by such tools. Our results show that the RNA-binding residues predicted by previously published methods are oblivious to the characteristics of the putative RNA binding partner. Moreover, when evaluated using partner-agnostic metrics, RNA partner-specific methods are outperformed by the state-of-the-art partner-agnostic methods. We conjecture that either (a) the protein-RNA complexes in PDB are not representative of the protein-RNA interactions in nature, or (b) the current methods for partner-specific prediction of RNA-binding residues in proteins fail to account for the differences in RNA partner-specific versus partner-agnostic protein-RNA interactions, or both.
3
Older Blacks’ experiences with traditional paper-and-pencil versus computerized cognitive batteries. Innov Aging 2018. [DOI: 10.1093/geroni/igy023.2417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
4
Dilemmas facing nursing homes and assisted living providers during hurricanes. Innov Aging 2018. [DOI: 10.1093/geroni/igy023.2341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
5
Factors associated with whether older adults discuss their EOL care preferences with family members. Innov Aging 2018. [DOI: 10.1093/geroni/igy023.3141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
6
Robust activation of microhomology-mediated end joining for precision gene editing applications. PLoS Genet 2018; 14:e1007652. [PMID: 30208061 PMCID: PMC6152997 DOI: 10.1371/journal.pgen.1007652] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 03/30/2018] [Revised: 09/24/2018] [Accepted: 08/22/2018] [Indexed: 11/18/2022] Open
Abstract
One key problem in precision genome editing is the unpredictable plurality of sequence outcomes at the site of targeted DNA double-stranded breaks (DSBs). This is due to the typical activation of the versatile Non-homologous End Joining (NHEJ) pathway. Such unpredictability limits the utility of somatic gene editing for applications including gene therapy and functional genomics. For germline editing work, the accurate reproduction of the identical alleles using NHEJ is a labor-intensive process. In this study, we propose Microhomology-mediated End Joining (MMEJ) as a viable solution for improving somatic sequence homogeneity in vivo, capable of generating a single predictable allele at high rates (56% to 86% of the entire mutant allele pool). Using a combined dataset from zebrafish (Danio rerio) in vivo and human HeLa cells in vitro, we identified specific contextual sequence determinants surrounding genomic DSBs for robust MMEJ pathway activation. We then applied our observation to prospectively design MMEJ-inducing sgRNAs against a variety of proof-of-principle genes and demonstrated high levels of mutant allele homogeneity. MMEJ-based DNA repair at these target loci successfully generated F0 mutant zebrafish embryos and larvae that faithfully recapitulated previously reported, recessive, loss-of-function phenotypes. We also tested the generalizability of our approach in cultured human cells. Finally, we provide a novel algorithm, MENTHU (http://genesculpt.org/menthu/), for improved and facile prediction of candidate MMEJ loci. We believe that this MMEJ-centric approach will have a broader impact on genome engineering and its applications. For example, whereas somatic mosaicism hinders efficient recreation of a knockout mutant allele at base pair resolution via the standard NHEJ-based approach, we demonstrate that F0 founders transmitted the identical MMEJ allele of interest at high rates. Most importantly, the ability to directly dictate the reading frame of an endogenous target will have important implications for gene therapy applications in human genetic diseases.
7
Identification of a homogenous structural basis for oligomerization by retroviral Rev-like proteins. Retrovirology 2017; 14:40. [PMID: 28830558 PMCID: PMC5568270 DOI: 10.1186/s12977-017-0366-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 07/11/2017] [Accepted: 08/11/2017] [Indexed: 11/17/2022] Open
Abstract
Background Rev-like proteins are post-transcriptional regulatory proteins found in several retrovirus genera, including lentiviruses, betaretroviruses, and deltaretroviruses. These essential proteins mediate the nuclear export of incompletely spliced viral RNA, and act by tethering viral pre-mRNA to the host CRM1 nuclear export machinery. Although all Rev-like proteins are functionally homologous, they share less than 30% sequence identity. In the present study, we computationally assessed the extent of structural homology among retroviral Rev-like proteins within a phylogenetic framework. Results We undertook a comprehensive analysis of overall protein domain architecture and predicted secondary structural features for representative members of the Rev-like family of proteins. Similar patterns of α-helical domains were identified for Rev-like proteins within each genus, with the exception of deltaretroviruses, which were devoid of α-helices. Coiled-coil oligomerization motifs were also identified for most Rev-like proteins, with the notable exceptions of HIV-1, the deltaretroviruses, and some small ruminant lentiviruses. In Rev proteins of primate lentiviruses, the presence of predicted coiled-coil motifs segregated within specific primate lineages: HIV-1 descended from SIVs that lacked predicted coiled-coils in Rev whereas HIV-2 descended from SIVs that contained predicted coiled-coils in Rev. Phylogenetic ancestral reconstruction of coiled-coils for all Rev-like proteins predicted a single origin for the coiled-coil motif, followed by three losses of the predicted signal. The absence of a coiled-coil signal in HIV-1 was associated with replacement of canonical polar residues with non-canonical hydrophobic residues. However, hydrophobic residues were retained in the key ‘a’ and ‘d’ positions, and the α-helical region of HIV-1 Rev oligomerization domain could be modeled as a helical wheel with two predicted interaction interfaces. 
Moreover, the predicted interfaces mapped to the dimerization and oligomerization interfaces in HIV-1 Rev crystal structures. Helical wheel projections of other retroviral Rev-like proteins, including endogenous sequences, revealed similar interaction interfaces that could mediate oligomerization. Conclusions Sequence-based computational analyses of Rev-like proteins, together with helical wheel projections of oligomerization domains, reveal a conserved homogeneous structural basis for oligomerization by retroviral Rev-like proteins.
8
Template-based protein-protein docking exploiting pairwise interfacial residue restraints. Brief Bioinform 2017; 18:458-466. [PMID: 27013645 PMCID: PMC5428999 DOI: 10.1093/bib/bbw027] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 12/18/2015] [Revised: 02/03/2016] [Indexed: 01/26/2023] Open
Abstract
Although many advanced and sophisticated ab initio approaches for modeling protein-protein complexes have been proposed in past decades, template-based modeling (TBM) remains the most accurate and widely used approach, given a reliable template is available. However, there are many different ways to exploit template information in the modeling process. Here, we systematically evaluate and benchmark a TBM method that uses conserved interfacial residue pairs as docking distance restraints [referred to as alpha carbon-alpha carbon (CA-CA)-guided docking]. We compare it with two other template-based protein-protein modeling approaches, including a conserved non-pairwise interfacial residue restrained docking approach [referred to as the ambiguous interaction restraint (AIR)-guided docking] and a simple superposition-based modeling approach. Our results show that, for most cases, the CA-CA-guided docking method outperforms both superposition with refinement and the AIR-guided docking method. We emphasize the superiority of the CA-CA-guided docking on cases with medium to large conformational changes, and interactions mediated through loops, tails or disordered regions. Our results also underscore the importance of a proper refinement of superimposition models to reduce steric clashes. In summary, we provide a benchmarked TBM protocol that uses conserved pairwise interface distance as restraints in generating realistic 3D protein-protein interaction models, when reliable templates are available. The described CA-CA-guided docking protocol is based on the HADDOCK platform, which allows users to incorporate additional prior knowledge of the target system to further improve the quality of the resulting models.
9
Abstract
Identifying individual residues in the interfaces of protein-RNA complexes is important for understanding the molecular determinants of protein-RNA recognition and has many potential applications. Recent technical advances have led to several high-throughput experimental methods for identifying partners in protein-RNA complexes, but determining RNA-binding residues in proteins is still expensive and time-consuming. This chapter focuses on available computational methods for identifying which amino acids in an RNA-binding protein participate directly in contacting RNA. Step-by-step protocols for using three different web-based servers to predict RNA-binding residues are described. In addition, currently available web servers and software tools for predicting RNA-binding sites, as well as databases that contain valuable information about known protein-RNA complexes, RNA-binding motifs in proteins, and protein-binding recognition sites in RNA are provided. We emphasize sequence-based methods that can reliably identify interfacial residues without the requirement for structural information regarding either the RNA-binding protein or its RNA partner.
10
Abstract
Experimental methods for identifying protein(s) bound by a specific promoter-associated RNA (paRNA) of interest can be expensive, difficult, and time-consuming. This chapter describes a general computational framework for identifying potential binding partners in RNA-protein complexes or RNA-protein interaction networks. Protocols for using three web-based tools to predict RNA-protein interaction partners are outlined. Also, tables listing additional webservers and software tools for predicting RNA-protein interactions, as well as databases that contain valuable information about known RNA-protein complexes and recognition sites for RNA-binding proteins, are provided. Although only one of the tools described, lncPro, was designed expressly to identify proteins that bind long noncoding RNAs (including paRNAs), all three approaches can be applied to predict potential binding partners for both coding and noncoding RNAs (ncRNAs).
11
A Plasmodium-like virulence effector of the soybean cyst nematode suppresses plant innate immunity. New Phytologist 2016; 212:444-60. [PMID: 27265684 DOI: 10.1111/nph.14047] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 03/16/2016] [Accepted: 05/04/2016] [Indexed: 05/19/2023]
Abstract
Heterodera glycines, the soybean cyst nematode, delivers effector proteins into soybean roots to initiate and maintain an obligate parasitic relationship. HgGLAND18 encodes a candidate H. glycines effector and is expressed throughout the infection process. We used a combination of molecular, genetic, bioinformatic and phylogenetic analyses to determine the role of HgGLAND18 during H. glycines infection. HgGLAND18 is necessary for pathogenicity in compatible interactions with soybean. The encoded effector strongly suppresses both basal and hypersensitive cell death innate immune responses, and immunosuppression requires the presence and coordination between multiple protein domains. The N-terminal domain in HgGLAND18 contains unique sequence similarity to domains of an immunosuppressive effector of Plasmodium spp., the malaria parasites. The Plasmodium effector domains functionally complement the loss of the N-terminal domain from HgGLAND18. In-depth sequence searches and phylogenetic analyses demonstrate convergent evolution between effectors from divergent parasites of plants and animals as the cause of sequence and functional similarity.
12
Computational prediction of protein interfaces: A review of data driven methods. FEBS Lett 2015; 589:3516-26. [PMID: 26460190 PMCID: PMC4655202 DOI: 10.1016/j.febslet.2015.10.003] [Citation(s) in RCA: 101] [Impact Index Per Article: 11.2] [Received: 07/06/2015] [Revised: 10/01/2015] [Accepted: 10/02/2015] [Indexed: 01/06/2023]
Abstract
Reliably pinpointing which specific amino acid residues form the interface(s) between a protein and its binding partner(s) is critical for understanding the structural and physicochemical determinants of protein recognition and binding affinity, and has wide applications in modeling and validating protein interactions predicted by high-throughput methods, in engineering proteins, and in prioritizing drug targets. Here, we review the basic concepts, principles and recent advances in computational approaches to the analysis and prediction of protein-protein interfaces. We point out caveats for objectively evaluating interface predictors, and discuss various applications of data-driven interface predictors for improving energy model-driven protein-protein docking. Finally, we stress the importance of exploiting binding partner information in reliably predicting interfaces and highlight recent advances in this emerging direction.
13
Erratum. Assessing approaches and barriers to reduce antipsychotic drug use in Florida nursing homes. Aging Ment Health 2015; 19:i. [PMID: 25751410 DOI: 10.1080/13607863.2014.998484] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
14
Computational modeling suggests dimerization of equine infectious anemia virus Rev is required for RNA binding. Retrovirology 2014; 11:115. [PMID: 25533001 PMCID: PMC4299382 DOI: 10.1186/s12977-014-0115-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 08/26/2014] [Accepted: 11/27/2014] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND The lentiviral Rev protein mediates nuclear export of intron-containing viral RNAs that encode structural proteins or serve as the viral genome. Following translation, HIV-1 Rev localizes to the nucleus and binds its cognate sequence, termed the Rev-responsive element (RRE), in incompletely spliced viral RNA. Rev subsequently multimerizes along the viral RNA and associates with the cellular Crm1 export machinery to translocate the RNA-protein complex to the cytoplasm. Equine infectious anemia virus (EIAV) Rev is functionally homologous to HIV-1 Rev, but shares very little sequence similarity and differs in domain organization. EIAV Rev also contains a bipartite RNA binding domain comprising two short arginine-rich motifs (designated ARM-1 and ARM-2) spaced 79 residues apart in the amino acid sequence. To gain insight into the topology of the bipartite RNA binding domain, a computational approach was used to model the tertiary structure of EIAV Rev. RESULTS The tertiary structure of EIAV Rev was modeled using several protein structure prediction and model quality assessment servers. Two types of structures were predicted: an elongated structure with an extended central alpha helix, and a globular structure with a central bundle of helices. Assessment of models on the basis of biophysical properties indicated they were of average quality. In almost all models, ARM-1 and ARM-2 were spatially separated by >15 Å, suggesting that they do not form a single RNA binding interface on the monomer. A highly conserved canonical coiled-coil motif was identified in the central region of EIAV Rev, suggesting that an RNA binding interface could be formed through dimerization of Rev and juxtaposition of ARM-1 and ARM-2. In support of this, purified Rev protein migrated as a dimer in Blue native gels, and mutation of a residue predicted to form a key coiled-coil contact disrupted dimerization and abrogated RNA binding. 
In contrast, mutation of residues outside the predicted coiled-coil interface had no effect on dimerization or RNA binding. CONCLUSIONS Our results suggest that EIAV Rev binding to the RRE requires dimerization via a coiled-coil motif to juxtapose two RNA binding motifs, ARM-1 and ARM-2.
15
G-Quadruplex (G4) Motifs in the Maize (Zea mays L.) Genome Are Enriched at Specific Locations in Thousands of Genes Coupled to Energy Status, Hypoxia, Low Sugar, and Nutrient Deprivation. J Genet Genomics 2014; 41:627-47. [DOI: 10.1016/j.jgg.2014.10.004] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 08/23/2014] [Revised: 10/16/2014] [Accepted: 10/24/2014] [Indexed: 02/07/2023]
16
RNABindRPlus: a predictor that combines machine learning and sequence homology-based methods to improve the reliability of predicted RNA-binding residues in proteins. PLoS One 2014; 9:e97725. [PMID: 24846307 PMCID: PMC4028231 DOI: 10.1371/journal.pone.0097725] [Citation(s) in RCA: 82] [Impact Index Per Article: 8.2] [Received: 03/04/2014] [Accepted: 04/08/2014] [Indexed: 01/18/2023] Open
Abstract
Protein-RNA interactions are central to essential cellular processes such as protein synthesis and regulation of gene expression and play roles in human infectious and genetic diseases. Reliable identification of protein-RNA interfaces is critical for understanding the structural bases and functional implications of such interactions and for developing effective approaches to rational drug design. Sequence-based computational methods offer a viable, cost-effective way to identify putative RNA-binding residues in RNA-binding proteins. Here we report two novel approaches: (i) HomPRIP, a sequence homology-based method for predicting RNA-binding sites in proteins; (ii) RNABindRPlus, a new method that combines predictions from HomPRIP with those from an optimized Support Vector Machine (SVM) classifier trained on a benchmark dataset of 198 RNA-binding proteins. Although highly reliable, HomPRIP cannot make predictions for the unaligned parts of query proteins and its coverage is limited by the availability of close sequence homologs of the query protein with experimentally determined RNA-binding sites. RNABindRPlus overcomes these limitations. We compared the performance of HomPRIP and RNABindRPlus with that of several state-of-the-art predictors on two test sets, RB44 and RB111. On a subset of proteins for which homologs with experimentally determined interfaces could be reliably identified, HomPRIP outperformed all other methods achieving an MCC of 0.63 on RB44 and 0.83 on RB111. RNABindRPlus was able to predict RNA-binding residues of all proteins in both test sets, achieving an MCC of 0.55 and 0.37, respectively, and outperforming all other methods, including those that make use of structure-derived features of proteins. More importantly, RNABindRPlus outperforms all other methods for any choice of tradeoff between precision and recall. 
An important advantage of both HomPRIP and RNABindRPlus is that they rely on readily available sequence and sequence-derived features of RNA-binding proteins. A webserver implementation of both methods is freely available at http://einstein.cs.iastate.edu/RNABindRPlus/.
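The MCC values quoted in this abstract are Matthews Correlation Coefficients computed over per-residue binary labels (RNA-binding versus non-binding). The following is a minimal sketch of that metric using its standard definition; the function names and the toy labels are illustrative assumptions and are not taken from the RNABindRPlus code.

```python
import math


def confusion_counts(true_labels, predicted_labels):
    """Count TP, FP, TN, FN for binary per-residue labels (1 = RNA-binding)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and not truth:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def matthews_corrcoef(tp, fp, tn, fn):
    """Standard MCC; returns 0.0 when the denominator is zero (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy example: 8 residues, 3 of them truly RNA-binding.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
mcc = matthews_corrcoef(*confusion_counts(truth, pred))
```

Because MCC uses all four cells of the confusion matrix, it remains informative when binding residues are a small minority of all residues, which is the usual situation in interface prediction.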
17
DockRank: ranking docked conformations using partner-specific sequence homology-based protein interface prediction. Proteins 2014; 82:250-67. [PMID: 23873600 PMCID: PMC4417613 DOI: 10.1002/prot.24370] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 09/08/2012] [Revised: 06/27/2013] [Accepted: 07/09/2013] [Indexed: 12/11/2022]
Abstract
Selecting near-native conformations from the immense number of conformations generated by docking programs remains a major challenge in molecular docking. We introduce DockRank, a novel approach to scoring docked conformations based on the degree to which the interface residues of the docked conformation match a set of predicted interface residues. DockRank uses interface residues predicted by the partner-specific sequence homology-based protein-protein interface predictor (PS-HomPPI), which predicts the interface residues of a query protein with a specific interaction partner. We compared the performance of DockRank with several state-of-the-art docking scoring functions using Success Rate (the percentage of cases that have at least one near-native conformation among the top m conformations) and Hit Rate (the percentage of near-native conformations that are included among the top m conformations). In cases where it is possible to obtain partner-specific (PS) interface predictions from PS-HomPPI, DockRank consistently outperforms both (i) ZRank and IRAD, two state-of-the-art energy-based scoring functions (improving Success Rate by up to 4-fold); and (ii) variants of DockRank that use predicted interface residues obtained from several protein interface predictors that do not take into account the binding partner in making interface predictions (improving Success Rate by up to 39-fold). The latter result underscores the importance of using partner-specific interface residues in scoring docked conformations. We show that DockRank, when used to re-rank the conformations returned by ClusPro, improves upon the original ClusPro rankings in terms of both Success Rate and Hit Rate. DockRank is available as a server at http://einstein.cs.iastate.edu/DockRank/.
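The two ranking metrics defined in this abstract can be sketched directly from their definitions. The code below is an illustrative assumption about how one might compute them, not the DockRank implementation: each docking case is represented as a list of booleans ordered from best rank to worst, where True marks a near-native conformation.

```python
def success_rate(cases, m):
    """Percentage of cases with at least one near-native conformation in the top m.

    cases: list of ranked lists of booleans (True = near-native), best rank first.
    """
    if not cases:
        return 0.0
    successes = sum(1 for ranked in cases if any(ranked[:m]))
    return 100.0 * successes / len(cases)


def hit_rate(ranked, m):
    """Percentage of a case's near-native conformations that appear in the top m."""
    total_near_native = sum(ranked)
    if total_near_native == 0:
        return 0.0
    return 100.0 * sum(ranked[:m]) / total_near_native


# Hypothetical rankings for three docking cases.
cases = [
    [False, True, False, True],
    [False, False, False, True],
    [True, False, False, False],
]
top2_success = success_rate(cases, 2)
top2_hits_first_case = hit_rate(cases[0], 2)
```

Success Rate is aggregated across cases, whereas Hit Rate here is computed per case; averaging the per-case values is one possible way to summarize it over a benchmark.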
18
Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art. BMC Bioinformatics 2012; 13:89. [PMID: 22574904 PMCID: PMC3490755 DOI: 10.1186/1471-2105-13-89] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.3] [Received: 10/14/2011] [Accepted: 05/10/2012] [Indexed: 11/15/2022] Open
Abstract
Background RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition ‘code’ that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. Results We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. 
Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. Conclusions Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. 
These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.
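The abstract notes that results depend on the distance threshold used to define interface residues. A minimal sketch of such a definition follows, assuming a residue counts as an interface residue when any of its atoms lies within a cutoff distance of any RNA atom. The data layout, residue identifiers, and cutoff values are hypothetical and chosen only to show how the labeled set changes with the threshold.

```python
import math


def interface_residues(protein_atoms, rna_atoms, cutoff):
    """Return the set of residue IDs with any atom within `cutoff` of any RNA atom.

    protein_atoms: dict mapping residue ID -> list of (x, y, z) atom coordinates.
    rna_atoms: list of (x, y, z) coordinates for all RNA atoms.
    cutoff: distance threshold in the same units as the coordinates (e.g. angstroms).
    """
    interface = set()
    for residue_id, atoms in protein_atoms.items():
        if any(math.dist(atom, rna) <= cutoff for atom in atoms for rna in rna_atoms):
            interface.add(residue_id)
    return interface


# Hypothetical coordinates: one atom per residue, one RNA atom.
protein = {"A1": [(0.0, 0.0, 0.0)], "A2": [(4.0, 0.0, 0.0)], "A3": [(10.0, 0.0, 0.0)]}
rna = [(0.0, 0.0, 3.0)]

strict = interface_residues(protein, rna, cutoff=3.5)
loose = interface_residues(protein, rna, cutoff=5.0)
```

With these toy coordinates the stricter cutoff labels only A1, while the looser one also labels A2, which illustrates why predictors trained or evaluated under different thresholds are not directly comparable.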
19
Predicting protein-protein interface residues using local surface structural similarity. BMC Bioinformatics 2012; 13:41. [PMID: 22424103 PMCID: PMC3386866 DOI: 10.1186/1471-2105-13-41] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.8] [Received: 09/12/2011] [Accepted: 03/18/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identification of the residues in protein-protein interaction sites has a significant impact in problems such as drug discovery. Motivated by the observation that the set of interface residues of a protein tend to be conserved even among remote structural homologs, we introduce PrISE, a family of local structural similarity-based computational methods for predicting protein-protein interface residues. RESULTS We present a novel representation of the surface residues of a protein in the form of structural elements. Each structural element consists of a central residue and its surface neighbors. The PrISE family of interface prediction methods uses a representation of structural elements that captures the atomic composition and accessible surface area of the residues that make up each structural element. Each of the members of the PrISE methods identifies for each structural element in the query protein, a collection of similar structural elements in its repository of structural elements and weights them according to their similarity with the structural element of the query protein. PrISEL relies on the similarity between structural elements (i.e. local structural similarity). PrISEG relies on the similarity between protein surfaces (i.e. general structural similarity). PrISEC, combines local structural similarity and general structural similarity to predict interface residues. These predictors label the central residue of a structural element in a query protein as an interface residue if a weighted majority of the structural elements that are similar to it are interface residues, and as a non-interface residue otherwise. The results of our experiments using three representative benchmark datasets show that the PrISEC outperforms PrISEL and PrISEG; and that PrISEC is highly competitive with state-of-the-art structure-based methods for predicting protein-protein interface residues. 
Our comparison of PrISEC with PredUs, a recently developed method for predicting interface residues of a query protein based on the known interface residues of its (global) structural homologs, shows that performance superior or comparable to that of PredUs can be obtained using only local surface structural similarity. PrISEC is available as a Web server at http://prise.cs.iastate.edu/ CONCLUSIONS Local surface structural similarity based methods offer a simple, efficient, and effective approach to predict protein-protein interface residues.
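The weighted-majority labeling rule described in this abstract can be illustrated with a minimal Python sketch. The function name, the (weight, label) input format, and the example weights are hypothetical; the abstract does not specify how the similarity weights are computed, so they are simply taken as given here.

```python
from typing import Iterable, Tuple


def weighted_majority_label(neighbors: Iterable[Tuple[float, bool]]) -> bool:
    """Label a central residue from its similar structural elements.

    Each neighbor is (similarity_weight, is_interface). The residue is
    called an interface residue only if interface neighbors hold a strict
    majority of the total weight (an assumed tie-breaking rule).
    """
    interface_weight = 0.0
    total_weight = 0.0
    for weight, is_interface in neighbors:
        total_weight += weight
        if is_interface:
            interface_weight += weight
    return interface_weight > total_weight / 2.0


# Hypothetical similar elements retrieved for one query structural element.
hits = [(0.9, True), (0.7, True), (0.8, False), (0.2, False)]
print(weighted_majority_label(hits))  # interface weight 1.6 of 2.6 total
```

With no retrieved neighbors the sketch returns False, i.e. it defaults to the non-interface label.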
|
20
|
Predicting RNA-protein interactions using only sequence information. BMC Bioinformatics 2011; 12:489. [PMID: 22192482 PMCID: PMC3322362 DOI: 10.1186/1471-2105-12-489] [Citation(s) in RCA: 334] [Impact Index Per Article: 25.7] [Received: 07/18/2011] [Accepted: 12/22/2011] [Indexed: 11/22/2022] Open
Abstract
Background RNA-protein interactions (RPIs) play important roles in a wide variety of cellular processes, ranging from transcriptional and post-transcriptional regulation of gene expression to host defense against pathogens. High throughput experiments to identify RNA-protein interactions are beginning to provide valuable information about the complexity of RNA-protein interaction networks, but are expensive and time consuming. Hence, there is a need for reliable computational methods for predicting RNA-protein interactions. Results We propose RPISeq, a family of classifiers for predicting RNA-protein interactions using only sequence information. Given the sequences of an RNA and a protein as input, RPISeq predicts whether or not the RNA-protein pair interacts. The RNA sequence is encoded as a normalized vector of its ribonucleotide 4-mer composition, and the protein sequence is encoded as a normalized vector of its 3-mer composition, based on a 7-letter reduced alphabet representation. Two variants of RPISeq are presented: RPISeq-SVM, which uses a Support Vector Machine (SVM) classifier, and RPISeq-RF, which uses a Random Forest classifier. On two non-redundant benchmark datasets extracted from the Protein-RNA Interface Database (PRIDB), RPISeq achieved an AUC (Area Under the Receiver Operating Characteristic (ROC) curve) of 0.96 and 0.92. On a third dataset containing only mRNA-protein interactions, the performance of RPISeq was competitive with that of a published method that requires information regarding many different features (e.g., mRNA half-life, GO annotations) of the putative RNA and protein partners. In addition, RPISeq classifiers trained using the PRIDB data correctly predicted the majority (57-99%) of non-coding RNA-protein interactions in NPInter-derived networks from E. coli, S. cerevisiae, D. melanogaster, M. musculus, and H. sapiens. 
Conclusions Our experiments with RPISeq demonstrate that RNA-protein interactions can be reliably predicted using only sequence-derived information. RPISeq offers an inexpensive method for computational construction of RNA-protein interaction networks, and should provide useful insights into the function of non-coding RNAs. RPISeq is freely available as a web-based server at http://pridb.gdcb.iastate.edu/RPISeq/.
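The sequence encoding described in this abstract (RNA 4-mer composition plus protein 3-mer composition over a 7-letter reduced alphabet) can be sketched in Python as follows. This is an illustration, not the RPISeq implementation: the 7-group amino acid reduction shown is a conjoint-triad-style grouping assumed for the example (the abstract does not list the groups), normalization is assumed to be division by the total k-mer count, and all function names are hypothetical.

```python
from itertools import product

# ASSUMPTION: a conjoint-triad-style 7-group reduction; the abstract
# does not state which grouping RPISeq uses.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(i) for i, grp in enumerate(GROUPS) for aa in grp}


def kmer_composition(seq, alphabet, k):
    """Return k-mer frequencies over `alphabet`, normalized to sum to 1."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing unknown symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def encode_rna(rna):
    """256-dimensional 4-mer composition of an RNA sequence."""
    return kmer_composition(rna.upper().replace("T", "U"), "ACGU", 4)


def encode_protein(protein):
    """343-dimensional 3-mer composition over the reduced alphabet."""
    reduced = "".join(AA_TO_GROUP.get(aa, "X") for aa in protein.upper())
    return kmer_composition(reduced, "0123456", 3)


# Toy sequences, for illustration only.
pair_features = encode_rna("AUGGCUACGUAGC") + encode_protein("MKVLAAGIRKDE")
print(len(pair_features))  # 4**4 + 7**3 = 599
```

The concatenated vector is what a classifier such as an SVM or Random Forest would take as input for one RNA-protein pair.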
|
21
|
Predicting MHC-II binding affinity using multiple instance regression. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:1067-1079. [PMID: 20855923 PMCID: PMC3400677 DOI: 10.1109/tcbb.2010.94] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 05/29/2023]
Abstract
Reliably predicting the ability of antigen peptides to bind to major histocompatibility complex class II (MHC-II) molecules is an essential step in developing new vaccines. Uncovering the amino acid sequence correlates of the binding affinity of MHC-II binding peptides is important for understanding pathogenesis and immune response. The task of predicting MHC-II binding peptides is complicated by the significant variability in their length. Most existing computational methods for predicting MHC-II binding peptides focus on identifying a nine amino acids core region in each binding peptide. We formulate the problems of qualitatively and quantitatively predicting flexible length MHC-II peptides as multiple instance learning and multiple instance regression problems, respectively. Based on this formulation, we introduce MHCMIR, a novel method for predicting MHC-II binding affinity using multiple instance regression. We present results of experiments using several benchmark data sets that show that MHCMIR is competitive with the state-of-the-art methods for predicting MHC-II binding peptides. An online web server that implements the MHCMIR method for MHC-II binding affinity prediction is freely accessible at http://ailab.cs.iastate.edu/mhcmir.
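In a multiple instance formulation, a variable-length peptide is treated as a bag of fixed-length instances and the label (or affinity) is attached to the bag rather than to any single instance. The Python sketch below shows one plausible way to build such a bag, assuming the instances are all overlapping 9-residue windows; that choice, the function name, and the handling of short peptides are assumptions for illustration and are not taken from the abstract.

```python
def peptide_to_bag(peptide, core_len=9):
    """Split a peptide into a bag of overlapping core-length windows.

    ASSUMPTION: peptides shorter than `core_len` form a single-instance
    bag containing the whole peptide.
    """
    if len(peptide) < core_len:
        return [peptide]
    return [peptide[i:i + core_len]
            for i in range(len(peptide) - core_len + 1)]


# A 13-residue example peptide yields 13 - 9 + 1 = 5 candidate cores.
bag = peptide_to_bag("PKYVKQNTLKLAT")
print(len(bag), bag[0], bag[-1])
```

A regression model would then be trained on bags, with the measured affinity of the peptide serving as the bag-level target.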
|
22
|
HomPPI: a class of sequence homology based protein-protein interface prediction methods. BMC Bioinformatics 2011; 12:244. [PMID: 21682895 PMCID: PMC3213298 DOI: 10.1186/1471-2105-12-244] [Citation(s) in RCA: 78] [Impact Index Per Article: 6.0] [Received: 05/25/2010] [Accepted: 06/17/2011] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. RESULTS We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. 
The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the target can be reliably identified. The HomPPI web server is available at http://homppi.cs.iastate.edu/. CONCLUSIONS Sequence homology-based methods offer a class of computationally efficient and reliable approaches for predicting the protein-protein interface residues that participate in either obligate or transient interactions. For query proteins involved in transient interactions, the reliability of interface residue prediction can be improved by exploiting knowledge of putative interaction partners.
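For readers unfamiliar with the per-residue measures quoted in this abstract, the Python sketch below computes a correlation coefficient, sensitivity, and specificity from binary interface labels. The abstract does not define these measures, so the sketch assumes CC is the Matthews correlation coefficient and specificity is TN/(TN+FP); some interface-prediction papers instead use "specificity" for TP/(TP+FP), so the paper's own definitions should be checked before comparing numbers. The function name and toy labels are hypothetical.

```python
import math


def interface_metrics(predicted, actual):
    """Compute CC (assumed MCC), sensitivity and specificity.

    `predicted` and `actual` are equal-length sequences of 0/1 labels,
    where 1 marks an interface residue.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "cc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy per-residue labels, for illustration only.
metrics = interface_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
print(metrics)
```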
|
23
|
Human telomerase model shows the role of the TEN domain in advancing the double helix for the next polymerization step. Proc Natl Acad Sci U S A 2011; 108:9443-8. [PMID: 21606328 PMCID: PMC3111281 DOI: 10.1073/pnas.1015399108] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Indexed: 11/18/2022] Open
Abstract
Telomerases constitute a group of specialized ribonucleoprotein enzymes that remediate chromosomal shrinkage resulting from the "end-replication" problem. Defects in telomere length regulation are associated with several diseases as well as with aging and cancer. Despite significant progress in understanding the roles of telomerase, the complete structure of the human telomerase enzyme bound to telomeric DNA remains elusive, with the detailed molecular mechanism of telomere elongation still unknown. By application of computational methods for distant homology detection, comparative modeling, and molecular docking, guided by available experimental data, we have generated a three-dimensional structural model of a partial telomerase elongation complex composed of three essential protein domains bound to a single-stranded telomeric DNA sequence in the form of a heteroduplex with the template region of the human RNA subunit, TER. This model provides a structural mechanism for the processivity of telomerase and offers new insights into elongation. We conclude that the RNA-DNA heteroduplex is constrained by the telomerase TEN domain through repeated extension cycles and that the TEN domain controls the process by moving the template ahead one base at a time by translation and rotation of the double helix. The RNA region directly following the template can bind complementarily to the newly synthesized telomeric DNA, while the template itself is reused in the telomerase active site during the next reaction cycle. This first structural model of the human telomerase enzyme provides many details of the molecular mechanism of telomerase and immediately provides an important target for rational drug design.
|
24
|
Targeted mutagenesis of duplicated genes in soybean with zinc-finger nucleases. Plant Physiology 2011; 156:466-73. [PMID: 21464476 PMCID: PMC3177250 DOI: 10.1104/pp.111.172981] [Citation(s) in RCA: 151] [Impact Index Per Article: 11.6] [Received: 01/21/2011] [Accepted: 04/03/2011] [Indexed: 05/18/2023]
Abstract
We performed targeted mutagenesis of a transgene and nine endogenous soybean (Glycine max) genes using zinc-finger nucleases (ZFNs). A suite of ZFNs was engineered by the recently described context-dependent assembly platform, a rapid, open-source method for generating zinc-finger arrays. Specific ZFNs targeting dicer-like (DCL) genes and other genes involved in RNA silencing were cloned into a vector under an estrogen-inducible promoter. A hairy-root transformation system was employed to investigate the efficiency of ZFN mutagenesis at each target locus. Transgenic roots exhibited somatic mutations localized at the ZFN target sites for seven out of nine targeted genes. We next introduced a ZFN into soybean via whole-plant transformation and generated independent mutations in the paralogous genes DCL4a and DCL4b. The dcl4b mutation showed efficient heritable transmission of the ZFN-induced mutation in the subsequent generation. These findings indicate that ZFN-based mutagenesis provides an efficient method for making mutations in duplicate genes that are otherwise difficult to study due to redundancy. We also developed a publicly accessible Web-based tool to identify sites suitable for engineering context-dependent assembly ZFNs in the soybean genome.
|
25
|
ZFNGenome: a comprehensive resource for locating zinc finger nuclease target sites in model organisms. BMC Genomics 2011; 12:83. [PMID: 21276248 PMCID: PMC3042413 DOI: 10.1186/1471-2164-12-83] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Received: 10/08/2010] [Accepted: 01/28/2011] [Indexed: 02/04/2023] Open
Abstract
Background Zinc Finger Nucleases (ZFNs) have tremendous potential as tools to facilitate genomic modifications, such as precise gene knockouts or gene replacements by homologous recombination. ZFNs can be used to advance both basic research and clinical applications, including gene therapy. Recently, the ability to engineer ZFNs that target any desired genomic DNA sequence with high fidelity has improved significantly with the introduction of rapid, robust, and publicly available techniques for ZFN design such as the Oligomerized Pool ENgineering (OPEN) method. The motivation for this study is to make resources for genome modifications using OPEN-generated ZFNs more accessible to researchers by creating a user-friendly interface that identifies and provides quality scores for all potential ZFN target sites in the complete genomes of several model organisms. Description ZFNGenome is a GBrowse-based tool for identifying and visualizing potential target sites for OPEN-generated ZFNs. ZFNGenome currently includes more than 11.6 million potential ZFN target sites, mapped within the fully sequenced genomes of seven model organisms: S. cerevisiae, C. reinhardtii, A. thaliana, D. melanogaster, D. rerio, C. elegans, and H. sapiens. These sites can be visualized within the flexible GBrowse environment. Additional model organisms will be included in future updates. ZFNGenome provides information about each potential ZFN target site, including its chromosomal location and position relative to transcription initiation site(s). Users can query ZFNGenome using several different criteria (e.g., gene ID, transcript ID, target site sequence). Tracks in ZFNGenome also provide "uniqueness" and ZiFOpT (Zinc Finger OPEN Targeter) "confidence" scores that estimate the likelihood that a chosen ZFN target site will function in vivo. 
ZFNGenome is dynamically linked to ZiFDB, allowing users access to all available information about zinc finger reagents, such as the effectiveness of a given ZFN in creating double-stranded breaks. Conclusions ZFNGenome provides a user-friendly interface that allows researchers to access resources and information regarding genomic target sites for engineered ZFNs in seven model organisms. This genome-wide database of potential ZFN target sites should greatly facilitate the utilization of ZFNs in both basic and clinical research. ZFNGenome is freely available at: http://bindr.gdcb.iastate.edu/ZFNGenome or at the Zinc Finger Consortium website: http://www.zincfingers.org/.
|
26
|
Selection-free zinc-finger-nuclease engineering by context-dependent assembly (CoDA). Nat Methods 2011; 8:67-9. [PMID: 21151135 PMCID: PMC3018472 DOI: 10.1038/nmeth.1542] [Citation(s) in RCA: 359] [Impact Index Per Article: 27.6] [Received: 06/17/2010] [Accepted: 11/16/2010] [Indexed: 11/30/2022]
Abstract
Engineered zinc-finger nucleases (ZFNs) enable targeted genome modification. Here we describe context-dependent assembly (CoDA), a platform for engineering ZFNs using only standard cloning techniques or custom DNA synthesis. Using CoDA-generated ZFNs, we rapidly altered 20 genes in Danio rerio, Arabidopsis thaliana and Glycine max. The simplicity and efficacy of CoDA will enable broad adoption of ZFN technology and make possible large-scale projects focused on multigene pathways or genome-wide alterations.
|
27
|
Ranking Docked Models of Protein-Protein Complexes Using Predicted Partner-Specific Protein-Protein Interfaces: A Preliminary Study. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2011; 2011:441-445. [PMID: 25905110 DOI: 10.1145/2147805.2147866] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 12/13/2022]
Abstract
Computational protein-protein docking is a valuable tool for determining the conformation of complexes formed by interacting proteins. Selecting near-native conformations from the large number of possible models generated by docking software presents a significant challenge in practice. We introduce a novel method for ranking docked conformations based on the degree of overlap between the interface residues of a docked conformation formed by a pair of proteins with the set of predicted interface residues between them. Our approach relies on a method, called PS-HomPPI, for reliably predicting protein-protein interface residues by taking into account information derived from both interacting proteins. PS-HomPPI infers the residues of a query protein that are likely to interact with a partner protein based on known interface residues of the homo-interologs of the query-partner protein pair, i.e., pairs of interacting proteins that are homologous to the query protein and partner protein. Our results on Docking Benchmark 3.0 show that the quality of the ranking of docked conformations using our method is consistently superior to that produced using ClusPro cluster-size-based and energy-based criteria for 61 out of the 64 docking complexes for which PS-HomPPI produces interface predictions. An implementation of our method for ranking docked models is freely available at: http://einstein.cs.iastate.edu/DockRank/.
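The ranking idea in this abstract, scoring each docked model by how well its interface overlaps the predicted interface, can be sketched in Python as below. The abstract does not state which overlap measure is used, so the sketch assumes a Jaccard index; the function names, residue identifiers, and model names are hypothetical.

```python
def overlap_score(model_interface, predicted_interface):
    """Jaccard overlap between two sets of interface residues.

    ASSUMPTION: Jaccard is a stand-in; the abstract only speaks of a
    "degree of overlap" without naming the measure.
    """
    model, predicted = set(model_interface), set(predicted_interface)
    union = model | predicted
    return len(model & predicted) / len(union) if union else 0.0


def rank_models(models, predicted_interface):
    """Return model names sorted from best to worst overlap."""
    return sorted(
        models,
        key=lambda name: overlap_score(models[name], predicted_interface),
        reverse=True,
    )


# Hypothetical predicted interface and three docked models.
predicted = {"A:10", "A:11", "A:12", "B:5"}
docked = {
    "model_1": {"A:10", "A:11", "B:5"},
    "model_2": {"A:40", "A:41"},
    "model_3": {"A:12", "B:5", "B:6"},
}
ranking = rank_models(docked, predicted)
print(ranking)
```

Because Python's sort is stable, models with equal scores keep their original order.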
|
28
|
Abstract
The Protein–RNA Interface Database (PRIDB) is a comprehensive database of protein–RNA interfaces extracted from complexes in the Protein Data Bank (PDB). It is designed to facilitate detailed analyses of individual protein–RNA complexes and their interfaces, in addition to automated generation of user-defined data sets of protein–RNA interfaces for statistical analyses and machine learning applications. For any chosen PDB complex or list of complexes, PRIDB rapidly displays interfacial amino acids and ribonucleotides within the primary sequences of the interacting protein and RNA chains. PRIDB also identifies ProSite motifs in protein chains and FR3D motifs in RNA chains and provides links to these external databases, as well as to structure files in the PDB. An integrated JMol applet is provided for visualization of interacting atoms and residues in the context of the 3D complex structures. The current version of PRIDB contains structural information regarding 926 protein–RNA complexes available in the PDB (as of 10 October 2010). Atomic- and residue-level contact information for the entire data set can be downloaded in a simple machine-readable format. Also, several non-redundant benchmark data sets of protein–RNA complexes are provided. The PRIDB database is freely available online at http://bindr.gdcb.iastate.edu/PRIDB.
|
29
|
Predicting success of oligomerized pool engineering (OPEN) for zinc finger target site sequences. BMC Bioinformatics 2010; 11:543. [PMID: 21044337 PMCID: PMC3098093 DOI: 10.1186/1471-2105-11-543] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Received: 03/19/2010] [Accepted: 11/02/2010] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Precise and efficient methods for gene targeting are critical for detailed functional analysis of genomes and regulatory networks and for potentially improving the efficacy and safety of gene therapies. Oligomerized Pool ENgineering (OPEN) is a recently developed method for engineering C2H2 zinc finger proteins (ZFPs) designed to bind specific DNA sequences with high affinity and specificity in vivo. Because generation of ZFPs using OPEN requires considerable effort, a computational method for identifying the sites in any given gene that are most likely to be successfully targeted by this method is desirable. RESULTS Analysis of the base composition of experimentally validated ZFP target sites identified important constraints on the DNA sequence space that can be effectively targeted using OPEN. Using alternate encodings to represent ZFP target sites, we implemented Naïve Bayes and Support Vector Machine classifiers capable of distinguishing "active" targets, i.e., ZFP binding sites that can be targeted with a high rate of success, from those that are "inactive" or poor targets for ZFPs generated using current OPEN technologies. When evaluated using leave-one-out cross-validation on a dataset of 135 experimentally validated ZFP target sites, the best Naïve Bayes classifier, designated ZiFOpT, achieved overall accuracy of 87% and specificity+ of 90%, with an ROC AUC of 0.89. When challenged with a completely independent test set of 140 newly validated ZFP target sites, ZiFOpT performance was comparable in terms of overall accuracy (88%) and specificity+ (92%), but with reduced ROC AUC (0.77). Users can rank potentially active ZFP target sites using a confidence score derived from the posterior probability returned by ZiFOpT. 
CONCLUSION ZiFOpT, a machine learning classifier trained to identify DNA sequences amenable for targeting by OPEN-generated zinc finger arrays, can guide users to target sites that are most likely to function successfully in vivo, substantially reducing the experimental effort required. ZiFOpT is freely available and incorporated in the Zinc Finger Targeter web server (http://bindr.gdcb.iastate.edu/ZiFiT).
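To make the workflow in this abstract concrete, the Python sketch below trains a small Naïve Bayes classifier on DNA target sites, evaluates it with leave-one-out cross-validation, and returns a posterior probability that could serve as a confidence score. It is not ZiFOpT: the position-specific nucleotide encoding, the Laplace smoothing, the function names, and the toy sequences are all assumptions for illustration.

```python
import math


def train_nb(sites, labels, alphabet="ACGT", alpha=1.0):
    """Fit a position-specific Naive Bayes model with Laplace smoothing."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    loglik = {}
    for c in classes:
        members = [s for s, y in zip(sites, labels) if y == c]
        loglik[c] = []
        for i in range(len(sites[0])):
            column = [s[i] for s in members]
            denom = len(column) + alpha * len(alphabet)
            loglik[c].append(
                {b: math.log((column.count(b) + alpha) / denom) for b in alphabet}
            )
    return prior, loglik


def posterior(model, site):
    """Return class posterior probabilities for one site."""
    prior, loglik = model
    scores = {
        c: prior[c] + sum(loglik[c][i][b] for i, b in enumerate(site))
        for c in prior
    }
    top = max(scores.values())  # subtract the max for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return {c: math.exp(s - top) / norm for c, s in scores.items()}


def loo_accuracy(sites, labels):
    """Leave-one-out cross-validation accuracy."""
    correct = 0
    for i in range(len(sites)):
        model = train_nb(sites[:i] + sites[i + 1:], labels[:i] + labels[i + 1:])
        post = posterior(model, sites[i])
        correct += max(post, key=post.get) == labels[i]
    return correct / len(sites)


# Toy, made-up sites; real target sites and labels would come from experiments.
toy_sites = ["GGGGGG", "GGGGCG", "GCGGGG", "ATATAT", "TTATAA", "AATTAT"]
toy_labels = ["active"] * 3 + ["inactive"] * 3
confidence = posterior(train_nb(toy_sites, toy_labels), "GGGGGC")["active"]
print(round(confidence, 3), loo_accuracy(toy_sites, toy_labels))
```

Ranking candidate sites by the "active" posterior mirrors the confidence-score usage described above; sites containing characters outside the alphabet would need to be filtered first.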
|
30
|
Abstract
ZiFiT (Zinc Finger Targeter) is a simple and intuitive web-based tool that provides an interface to identify potential binding sites for engineered zinc finger proteins (ZFPs) in user-supplied DNA sequences. In this updated version, ZiFiT identifies potential sites for ZFPs made by both the modular assembly and OPEN engineering methods. In addition, ZiFiT now integrates additional tools and resources including scoring schemes for modular assembly, an interface with the Zinc Finger Database (ZiFDB) of engineered ZFPs, and direct querying of NCBI BLAST servers for identifying potential off-target sites within a host genome. Taken together, these features facilitate design of ZFPs using reagents made available to the academic research community by the Zinc Finger Consortium. ZiFiT is freely available on the web without registration at http://bindr.gdcb.iastate.edu/ZiFiT/.
|
31
|
|
32
|
Molecular and Biological Characterization of Equine Infectious Anemia Virus Rev. Curr HIV Res 2010; 8:87-93. [DOI: 10.2174/157016210790416424] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 09/25/2009] [Accepted: 11/02/2009] [Indexed: 11/22/2022]
|
33
|
Mixture of experts models to exploit global sequence similarity on biomolecular sequence labeling. BMC Bioinformatics 2009; 10 Suppl 4:S4. [PMID: 19426452 PMCID: PMC2681071 DOI: 10.1186/1471-2105-10-s4-s4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identification of functionally important sites in biomolecular sequences has broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. Experimental determination of such sites lags far behind the number of known biomolecular sequences. Hence, there is a need to develop reliable computational methods for identifying functionally important sites from biomolecular sequences. RESULTS We present a mixture of experts approach to biomolecular sequence labeling that takes into account the global similarity between biomolecular sequences. Our approach combines unsupervised and supervised learning techniques. Given a set of sequences and a similarity measure defined on pairs of sequences, we learn a mixture of experts model by using spectral clustering to learn the hierarchical structure of the model and by using Bayesian techniques to combine the predictions of the experts. We evaluate our approach on two biomolecular sequence labeling problems: RNA-protein and DNA-protein interface prediction. The results of our experiments show that global sequence similarity can be exploited to improve the performance of classifiers trained to label biomolecular sequence data. CONCLUSION The mixture of experts model helps improve the performance of machine learning methods for identifying functionally important sites in biomolecular sequences.
|
34
|
An affinity-based scoring scheme for predicting DNA-binding activities of modularly assembled zinc-finger proteins. Nucleic Acids Res 2009; 37:506-15. [PMID: 19056825 PMCID: PMC2632909 DOI: 10.1093/nar/gkn962] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Received: 08/07/2008] [Revised: 11/10/2008] [Accepted: 11/12/2008] [Indexed: 11/12/2022] Open
Abstract
Zinc-finger proteins (ZFPs) have long been recognized for their potential to manipulate genetic information because they can be engineered to bind novel DNA targets. Individual zinc-finger domains (ZFDs) bind specific DNA triplet sequences; their apparent modularity has led some groups to propose methods that allow virtually any desired DNA motif to be targeted in vitro. In practice, however, ZFPs engineered using this 'modular assembly' approach do not always function well in vivo. Here we report a modular assembly scoring strategy that both identifies combinations of modules least likely to function efficiently in vivo and provides accurate estimates of their relative binding affinities in vitro. Predicted binding affinities for 53 'three-finger' ZFPs, computed based on energy contributions of the constituent modules, were highly correlated (r = 0.80) with activity levels measured in bacterial two-hybrid assays. Moreover, K(d) values for seven modularly assembled ZFPs and their intended targets, measured using fluorescence anisotropy, were also highly correlated with predictions (r = 0.91). We propose that success rates for ZFP modular assembly can be significantly improved by exploiting the score-based strategy described here.
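A score "computed based on energy contributions of the constituent modules" can be pictured as a sum over per-module terms, followed by a correlation check against measured activity. The Python sketch below assumes a purely additive score and uses made-up energy values and module names; neither the additivity nor the numbers come from the paper, and the sign convention (lower meaning tighter binding) is likewise an assumption.

```python
import math

# HYPOTHETICAL per-module energy contributions, keyed by DNA triplet.
MODULE_ENERGY = {"GAA": -1.2, "GCG": -0.8, "GTG": -1.5, "GGA": -0.4}


def predicted_score(modules, energy=MODULE_ENERGY):
    """Sum the assumed energy contribution of each module in a ZFP."""
    return sum(energy[m] for m in modules)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# One hypothetical three-finger protein built from three modules.
score = predicted_score(["GAA", "GCG", "GTG"])
print(round(score, 2))
```

Given predicted scores and measured activities for a panel of proteins, `pearson_r` would yield the kind of r value reported in the abstract.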
|
35
|
Structural model of the Rev regulatory protein from equine infectious anemia virus. PLoS One 2009; 4:e4178. [PMID: 19137065 PMCID: PMC2613556 DOI: 10.1371/journal.pone.0004178] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 08/18/2008] [Accepted: 12/03/2008] [Indexed: 11/23/2022] Open
Abstract
Rev is an essential regulatory protein in the equine infectious anemia virus (EIAV) and other lentiviruses, including HIV-1. It binds incompletely spliced viral mRNAs and shuttles them from the nucleus to the cytoplasm, a critical prerequisite for the production of viral structural proteins and genomic RNA. Despite its important role in production of infectious virus, the development of antiviral therapies directed against Rev has been hampered by the lack of an experimentally-determined structure of the full length protein. We have used a combined computational and biochemical approach to generate and evaluate a structural model of the Rev protein. The modeled EIAV Rev (ERev) structure includes a total of 6 helices, four of which form an anti-parallel four-helix bundle. The first helix contains the leucine-rich nuclear export signal (NES). An arginine-rich RNA binding motif, RRDRW, is located in a solvent-exposed loop region. An ERLE motif required for Rev activity is predicted to be buried in the core of modeled structure where it plays an essential role in stabilization of the Rev fold. This structural model is supported by existing genetic and functional data as well as by targeted mutagenesis of residues predicted to be essential for overall structural integrity. Our predicted structure should increase understanding of structure-function relationships in Rev and may provide a basis for the design of new therapies for lentiviral diseases.
|
36
|
Abstract
Choice of one method over another for MHC-II binding peptide prediction is typically based on published reports of their estimated performance on standard benchmark datasets. We show that several standard benchmark datasets of unique peptides used in such studies contain a substantial number of peptides that share a high degree of sequence identity with one or more other peptide sequences in the same dataset. Thus, in a standard cross-validation setup, the test set and the training set are likely to contain sequences that share a high degree of sequence identity with each other, leading to overly optimistic estimates of performance. Hence, to more rigorously assess the relative performance of different prediction methods, we explore the use of similarity-reduced datasets. We introduce three similarity-reduced MHC-II benchmark datasets derived from MHCPEP, MHCBN, and IEDB databases. The results of our comparison of the performance of three MHC-II binding peptide prediction methods estimated using datasets of unique peptides with that obtained using their similarity-reduced counterparts shows that the former can be rather optimistic relative to the performance of the same methods on similarity-reduced counterparts of the same datasets. Furthermore, our results demonstrate that conclusions regarding the superiority of one method over another drawn on the basis of performance estimates obtained using commonly used datasets of unique peptides are often contradicted by the observed performance of the methods on the similarity-reduced versions of the same datasets. These results underscore the importance of using similarity-reduced datasets in rigorously comparing the performance of alternative MHC-II peptide prediction methods.
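One common way to build a similarity-reduced dataset is a greedy filter that keeps a peptide only if it is not too similar to any peptide already kept. The Python sketch below illustrates that idea. The similarity function is a stand-in based on `difflib.SequenceMatcher` rather than an alignment-based sequence identity, and the 0.8 threshold, the function names, and the toy peptides are assumptions; the abstract does not describe the procedure actually used.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT a true alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def similarity_reduce(peptides, threshold=0.8):
    """Greedily keep peptides whose similarity to every kept one is below threshold."""
    kept = []
    for pep in peptides:
        if all(similarity(pep, other) < threshold for other in kept):
            kept.append(pep)
    return kept


# Toy peptides: the second differs from the first by a single residue.
toy = ["AAAKKKLLL", "AAAKKKLLV", "GGGSSSTTT"]
reduced = similarity_reduce(toy)
print(reduced)
```

The result depends on input order, since earlier peptides are kept in preference to later similar ones; sorting the input first makes the reduction reproducible.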
|
37
|
Zinc Finger Database (ZiFDB): a repository for information on C2H2 zinc fingers and engineered zinc-finger arrays. Nucleic Acids Res 2008; 37:D279-83. [PMID: 18812396 PMCID: PMC2686427 DOI: 10.1093/nar/gkn606] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.8] [Indexed: 11/24/2022] Open
Abstract
Zinc fingers are the most abundant DNA-binding motifs encoded by eukaryotic genomes and one of the best understood DNA-recognition domains. Each zinc finger typically binds a 3-nt target sequence, and it is possible to engineer zinc-finger arrays (ZFAs) that recognize extended DNA sequences by linking together individual zinc fingers. Engineered zinc-finger proteins have proven to be valuable tools for gene regulation and genome modification because they target specific sites in a genome. Here we describe ZiFDB (Zinc Finger Database; http://bindr.gdcb.iastate.edu/ZiFDB), a web-accessible resource that compiles information on individual zinc fingers and engineered ZFAs. To enhance its utility, ZiFDB is linked to the output from ZiFiT—a software package that assists biologists in finding sites within target genes for engineering zinc-finger proteins. For many molecular biologists, ZiFDB will be particularly valuable for determining if a given ZFA (or portion thereof) has previously been constructed and whether or not it has the requisite DNA-binding activity for their experiments. ZiFDB will also be a valuable resource for those scientists interested in better understanding how zinc-finger proteins recognize target DNA.
|
38
|
Abstract
The identification and characterization of B‐cell epitopes play an important role in vaccine design, immunodiagnostic tests, and antibody production. Therefore, computational tools for reliably predicting linear B‐cell epitopes are highly desirable. We evaluated Support Vector Machine (SVM) classifiers trained with five different kernel methods using fivefold cross‐validation on a homology‐reduced data set of 701 linear B‐cell epitopes, extracted from the Bcipep database, and 701 non‐epitopes, randomly extracted from SwissProt sequences. Based on the results of our computational experiments, we propose BCPred, a novel method for predicting linear B‐cell epitopes using the subsequence kernel. We show that the predictive performance of BCPred (AUC = 0.758) outperforms 11 SVM‐based classifiers developed and evaluated in our experiments as well as our implementation of AAP (AUC = 0.7), a recently proposed method for predicting linear B‐cell epitopes using amino acid pair antigenicity. Furthermore, we compared BCPred with AAP and ABCPred, a method that uses recurrent neural networks, using two data sets of unique B‐cell epitopes that had been previously used to evaluate ABCPred. Analysis of the data sets used and the results of this comparison show that conclusions about the relative performance of different B‐cell epitope prediction methods drawn on the basis of experiments using data sets of unique B‐cell epitopes are likely to yield overly optimistic estimates of performance of evaluated methods. This argues for the use of carefully homology‐reduced data sets in comparing B‐cell epitope prediction methods to avoid misleading conclusions about how different methods compare to each other. Our homology‐reduced data set and implementations of BCPred as well as the AAP method are publicly available through our web‐based server, BCPREDS, at: http://ailab.cs.iastate.edu/bcpreds/. Copyright © 2008 John Wiley & Sons, Ltd.
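To make the notion of a subsequence kernel concrete, the following Python sketch computes a common gap-weighted formulation by brute-force enumeration. The abstract does not state which exact formulation or parameter values BCPred uses, so the subsequence length `n`, the decay factor `lam`, and the function names are assumptions; the enumeration is only practical for short sequences.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_weights(s, n, lam):
    """Map each length-n subsequence of s (gaps allowed) to the sum of
    lam ** span over its occurrences, where span is the distance from
    the first to the last matched position, inclusive. Requires n >= 1."""
    weights = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        span = idx[-1] - idx[0] + 1
        weights["".join(s[i] for i in idx)] += lam ** span
    return weights


def subsequence_kernel(s, t, n=2, lam=0.5):
    """Inner product of the two gap-weighted subsequence feature maps."""
    ws = subsequence_weights(s, n, lam)
    wt = subsequence_weights(t, n, lam)
    return sum(w * wt[u] for u, w in ws.items() if u in wt)


k_cat_car = subsequence_kernel("cat", "car", n=2, lam=0.5)
k_cat_cat = subsequence_kernel("cat", "cat", n=2, lam=0.5)
```

Here "cat" and "car" share only the subsequence "ca", contiguous in both, so the kernel value is `lam ** 4`. A kernel of this kind can be supplied to an SVM as a precomputed similarity matrix.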
|
39
|
Rapid "open-source" engineering of customized zinc-finger nucleases for highly efficient gene modification. Mol Cell 2008; 31:294-301. [PMID: 18657511 PMCID: PMC2535758 DOI: 10.1016/j.molcel.2008.06.016] [Citation(s) in RCA: 507] [Impact Index Per Article: 31.7] [Received: 03/24/2008] [Revised: 05/16/2008] [Accepted: 06/04/2008] [Indexed: 11/17/2022]
Abstract
Custom-made zinc-finger nucleases (ZFNs) can induce targeted genome modifications with high efficiency in cell types including Drosophila, C. elegans, plants, and humans. A bottleneck in the application of ZFN technology has been the generation of highly specific engineered zinc-finger arrays. Here we describe OPEN (Oligomerized Pool ENgineering), a rapid, publicly available strategy for constructing multifinger arrays, which we show is more effective than the previously published modular assembly method. We used OPEN to construct 37 highly active ZFN pairs which induced targeted alterations with high efficiencies (1%-50%) at 11 different target sites located within three endogenous human genes (VEGF-A, HoxB13, and CFTR), an endogenous plant gene (tobacco SuRA), and a chromosomally integrated EGFP reporter gene. In summary, OPEN provides an "open-source" method for rapidly engineering highly active zinc-finger arrays, thereby enabling broader practice, development, and application of ZFN technology for biological research and gene therapy.
|
40
|
Analysis of the EIAV Rev-responsive element (RRE) reveals a conserved RNA motif required for high affinity Rev binding in both HIV-1 and EIAV. PLoS One 2008; 3:e2272. [PMID: 18523581 PMCID: PMC2386976 DOI: 10.1371/journal.pone.0002272] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 02/11/2008] [Accepted: 04/15/2008] [Indexed: 11/29/2022] Open
Abstract
A cis-acting RNA regulatory element, the Rev-responsive element (RRE), has essential roles in replication of lentiviruses, including human immunodeficiency virus (HIV-1) and equine infectious anemia virus (EIAV). The RRE binds the viral trans-acting regulatory protein, Rev, to mediate nucleocytoplasmic transport of incompletely spliced mRNAs encoding viral structural genes and genomic RNA. Because of its potential as a clinical target, RRE-Rev interactions have been well studied in HIV-1; however, detailed molecular structures of Rev-RRE complexes in other lentiviruses are still lacking. In this study, we investigate the secondary structure of the EIAV RRE and interrogate regulatory protein-RNA interactions in EIAV Rev-RRE complexes. Computational prediction and detailed chemical probing and footprinting experiments were used to determine the RNA secondary structure of EIAV RRE-1, a 555 nt region that provides RRE function in vivo. Chemical probing experiments confirmed the presence of several predicted loop and stem-loop structures, which are conserved among 140 EIAV sequence variants. Footprinting experiments revealed that Rev binding induces significant structural rearrangement in two conserved domains characterized by stable stem-loop structures. Rev binding region-1 (RBR-1) corresponds to a genetically defined Rev binding region that overlaps exon 1 of the EIAV rev gene and contains an exonic splicing enhancer (ESE). RBR-2, characterized for the first time in this study, is required for high affinity binding of EIAV Rev to the RRE. RBR-2 contains an RNA structural motif that is also found within the high affinity Rev binding site in HIV-1 (stem-loop IIB), and within or near mapped RRE regions of four additional lentiviruses.
The powerful integration of computational and experimental approaches in this study has generated a validated RNA secondary structure for the EIAV RRE and provided provocative evidence that high affinity Rev binding sites of HIV-1 and EIAV share a conserved RNA structural motif. The presence of this motif in phylogenetically divergent lentiviruses suggests that it may play a role in highly conserved interactions that could be targeted in novel anti-lentiviral therapies.
|
41
|
Predicting flexible length linear B-cell epitopes. Computational Systems Bioinformatics. Computational Systems Bioinformatics Conference 2008; 7:121-132. [PMID: 19642274 PMCID: PMC3400678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/28/2023]
Abstract
Identifying B-cell epitopes plays an important role in vaccine design, immunodiagnostic tests, and antibody production. Therefore, computational tools for reliably predicting B-cell epitopes are highly desirable. We explore two machine learning approaches for predicting flexible length linear B-cell epitopes. The first approach utilizes four sequence kernels for determining a similarity score between any arbitrary pair of variable length sequences. The second approach utilizes four different methods of mapping a variable length sequence into a fixed length feature vector. Based on our empirical comparisons, we propose FBCPred, a novel method for predicting flexible length linear B-cell epitopes using the subsequence kernel. Our results demonstrate that FBCPred significantly outperforms all other classifiers evaluated in this study. An implementation of FBCPred and the datasets used in this study are publicly available through our linear B-cell epitope prediction server, BCPREDS, at: http://ailab.cs.iastate.edu/bcpreds/.
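The second approach, mapping a variable length sequence into a fixed length feature vector, can be sketched in Python with amino acid composition. The abstract does not list the four mappings evaluated, so composition is used here only as an assumed, illustrative example, and the names below are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_vector(seq):
    """Return a 20-dimensional vector of residue frequencies.
    Sequences of any length map to the same dimensionality;
    non-standard characters are ignored."""
    seq = seq.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


short_vec = composition_vector("AAC")
long_vec = composition_vector("ACDEFGHIKLMNPQRSTVWYACD")
```

Both vectors have 20 entries despite the different input lengths, which is what allows a standard classifier that expects fixed-size inputs to be trained on epitopes of varying length.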
|
42
|
Abstract
We analyze the characteristics of protein-protein interfaces using the largest datasets available from the Protein Data Bank (PDB). We start with a comparison of interfaces with protein cores and non-interface surfaces. The results show that interfaces differ from protein cores and non-interface surfaces in residue composition, sequence entropy, and secondary structure. Since interfaces, protein cores, and non-interface surfaces have different solvent accessibilities, it is important to investigate whether the observed differences are due to the differences in solvent accessibility or differences in functionality. We separate out the effect of solvent accessibility by comparing interfaces with a set of residues having the same solvent accessibility as the interfaces. This strategy reveals residue distribution propensities that are not observable by comparing interfaces with protein cores and non-interface surfaces. We conclude that interfaces contain larger numbers of hydrophobic residues, particularly aromatic residues, and that the interactions apparently favored in interfaces include opposite-charge pairs and hydrophobic pairs. Surprisingly, Pro-Trp pairs are overrepresented in interfaces, presumably because of favorable geometries. The analysis is repeated using three datasets having different constraints on sequence similarity and structure quality. Consistent results are obtained across these datasets. We have also investigated separately the characteristics of heteromeric interfaces and homomeric interfaces.
|
43
|
Striking similarities in diverse telomerase proteins revealed by combining structure prediction and machine learning approaches. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 2008:501-512. [PMID: 18229711 PMCID: PMC2569851] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
Abstract
Telomerase is a ribonucleoprotein enzyme that adds telomeric DNA repeat sequences to the ends of linear chromosomes. The enzyme plays pivotal roles in cellular senescence and aging, and because it provides a telomere maintenance mechanism for approximately 90% of human cancers, it is a promising target for cancer therapy. Despite its importance, a high-resolution structure of the telomerase enzyme has been elusive, although a crystal structure of an N-terminal domain (TEN) of the telomerase reverse transcriptase subunit (TERT) from Tetrahymena has been reported. In this study, we used a comparative strategy, in which sequence-based machine learning approaches were integrated with computational structural modeling, to explore the potential conservation of structural and functional features of TERT in phylogenetically diverse species. We generated structural models of the N-terminal domains from human and yeast TERT using a combination of threading and homology modeling with the Tetrahymena TEN structure as a template. Comparative analysis of predicted and experimentally verified DNA and RNA binding residues, in the context of these structures, revealed significant similarities in nucleic acid binding surfaces of Tetrahymena and human TEN domains. In addition, the combined evidence from machine learning and structural modeling identified several specific amino acids that are likely to play a role in binding DNA or RNA, but for which no experimental evidence is currently available.
|
44
|
Glycosylation site prediction using ensembles of Support Vector Machine classifiers. BMC Bioinformatics 2007; 8:438. [PMID: 17996106 PMCID: PMC2220009 DOI: 10.1186/1471-2105-8-438] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.0] [Received: 06/26/2007] [Accepted: 11/09/2007] [Indexed: 11/10/2022] Open
Abstract
Background: Glycosylation is one of the most complex post-translational modifications (PTMs) of proteins in eukaryotic cells. Glycosylation plays an important role in biological processes ranging from protein folding and subcellular localization, to ligand recognition and cell-cell interactions. Experimental identification of glycosylation sites is expensive and laborious. Hence, there is significant interest in the development of computational methods for reliable prediction of glycosylation sites from amino acid sequences. Results: We explore machine learning methods for training classifiers to predict the amino acid residues that are likely to be glycosylated using information derived from the target amino acid residue and its sequence neighbors. We compare the performance of Support Vector Machine classifiers and ensembles of Support Vector Machine classifiers trained on a dataset of experimentally determined N-linked, O-linked, and C-linked glycosylation sites extracted from O-GlycBase version 6.00, a database of 242 proteins from several different species. The results of our experiments show that the ensembles of Support Vector Machine classifiers outperform single Support Vector Machine classifiers on the problem of predicting glycosylation sites in terms of a range of standard measures for comparing the performance of classifiers. The resulting methods have been implemented in EnsembleGly, a web server for glycosylation site prediction. Conclusion: Ensembles of Support Vector Machine classifiers offer an accurate and reliable approach to automated identification of putative glycosylation sites in glycoprotein sequences.
|
45
|
Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach. BMC Bioinformatics 2007; 8:284. [PMID: 17683567 PMCID: PMC1994202 DOI: 10.1186/1471-2105-8-284] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 12/14/2006] [Accepted: 08/03/2007] [Indexed: 11/29/2022] Open
Abstract
Background: Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors. Results: In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be inconsistent with the UniProt functions assigned to their human counterparts. In contrast, 97% of the predicted annotations generated using a machine learning approach were consistent with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database. Conclusion: We conjecture that most of our predicted annotations are, therefore, correct and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects. Editors' Note: Authors from the original publication (Okazaki et al.: Nature 2002, 420:563–73) have provided their response to Andorf et al., directly following the correspondence.
|
46
|
Standardized reagents and protocols for engineering zinc finger nucleases by modular assembly. Nat Protoc 2007; 1:1637-52. [PMID: 17406455 DOI: 10.1038/nprot.2006.259] [Citation(s) in RCA: 158] [Impact Index Per Article: 9.3] [Indexed: 11/09/2022]
Abstract
Engineered zinc finger nucleases can stimulate gene targeting at specific genomic loci in insect, plant and human cells. Although several platforms for constructing artificial zinc finger arrays using "modular assembly" have been described, standardized reagents and protocols that permit rapid, cross-platform "mixing-and-matching" of the various zinc finger modules are not available. Here we describe a comprehensive, publicly available archive of plasmids encoding more than 140 well-characterized zinc finger modules together with complementary web-based software (termed ZiFiT) for identifying potential zinc finger target sites in a gene of interest. Our reagents have been standardized on a single platform, enabling facile mixing-and-matching of modules and transfer of assembled arrays to expression vectors without the need for specialized knowledge of zinc finger sequences or complicated oligonucleotide design. We also describe a bacterial cell-based reporter assay for rapidly screening the DNA-binding activities of assembled multi-finger arrays. This protocol can be completed in approximately 24-26 d.
|
47
|
Abstract
Zinc Finger Targeter (ZiFiT) is a simple and intuitive web-based tool that facilitates the design of zinc finger proteins (ZFPs) that can bind to specific DNA sequences. The current version of ZiFiT is based on a widely employed method of ZFP design, the ‘modular assembly’ approach, in which pre-existing individual zinc fingers are linked together to recognize desired target DNA sequences. Several research groups have described experimentally characterized zinc finger modules that bind many of the 64 possible DNA triplets. ZiFiT leverages the combined capabilities of three of the largest and best characterized module archives by enabling users to select fingers from any of these sets. ZiFiT searches a query DNA sequence for target sites for which a ZFP can be designed using modules available in one or more of the three archives. In addition, ZiFiT output facilitates identification of specific zinc finger modules that are publicly available from the Zinc Finger Consortium. ZiFiT is freely available at http://bindr.gdcb.iastate.edu/ZiFiT/.
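The target-site search described above can be pictured with a short Python sketch: scan a query sequence for windows whose consecutive triplets are all covered by available modules. This is an illustration only, not ZiFiT's implementation. The module set `GNN_MODULES`, the three-finger (9 bp) site length, and the function name are hypothetical, and only the given strand is scanned.

```python
def find_target_sites(dna, available_triplets, n_fingers=3):
    """Return (start, site) pairs for every window of 3 * n_fingers bases
    whose consecutive triplets all have an available module."""
    dna = dna.upper()
    site_len = 3 * n_fingers
    hits = []
    for start in range(len(dna) - site_len + 1):
        site = dna[start:start + site_len]
        triplets = [site[i:i + 3] for i in range(0, site_len, 3)]
        if all(t in available_triplets for t in triplets):
            hits.append((start, site))
    return hits


# Hypothetical archive: modules exist for every triplet of the form GNN.
GNN_MODULES = {"G" + a + b for a in "ACGT" for b in "ACGT"}

hits = find_target_sites("TTGAAGCTGGATT", GNN_MODULES)
```

In this toy query the only window that decomposes entirely into GNN triplets starts at index 2 (GAA, GCT, GGA). Combining several archives, as the abstract describes, would correspond to passing the union of their triplet sets.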
|
48
|
Abstract
Understanding interactions between proteins and RNA is key to deciphering the mechanisms of many important biological processes. Here we describe RNABindR, a web-based server that identifies and displays RNA-binding residues in known protein–RNA complexes and predicts RNA-binding residues in proteins of unknown structure. RNABindR uses a distance cutoff to identify which amino acids contact RNA in solved complex structures (from the Protein Data Bank) and provides a labeled amino acid sequence and a Jmol graphical viewer in which RNA-binding residues are displayed in the context of the three-dimensional structure. Alternatively, RNABindR can use a Naive Bayes classifier trained on a non-redundant set of protein–RNA complexes from the PDB to predict which amino acids in a protein sequence of unknown structure are most likely to bind RNA. RNABindR automatically displays ‘high specificity’ and ‘high sensitivity’ predictions of RNA-binding residues. RNABindR is freely available at http://bindr.gdcb.iastate.edu/RNABindR.
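The distance-cutoff step can be sketched as follows in Python. The abstract does not give the cutoff value or the atom selection, so the 5.0 Å default, the input layout (residue identifier paired with an atom coordinate), and the function name are assumptions made for illustration.

```python
import math


def binding_residues(protein_atoms, rna_atoms, cutoff=5.0):
    """Return the set of residue identifiers that have at least one atom
    within `cutoff` (same units as the coordinates) of any RNA atom.

    protein_atoms: iterable of (residue_id, (x, y, z))
    rna_atoms:     iterable of (x, y, z)
    """
    rna_atoms = list(rna_atoms)
    binding = set()
    for res_id, coord in protein_atoms:
        if res_id in binding:
            continue  # residue already labelled by an earlier atom
        if any(math.dist(coord, r) <= cutoff for r in rna_atoms):
            binding.add(res_id)
    return binding


protein = [
    ("ARG10", (0.0, 0.0, 0.0)),
    ("ARG10", (1.0, 0.0, 0.0)),
    ("GLY11", (20.0, 0.0, 0.0)),
]
rna = [(3.0, 0.0, 0.0)]
contacts = binding_residues(protein, rna)
```

The resulting set can be used to label each position of the amino acid sequence as binding or non-binding, which is the form of output the abstract describes for solved complexes.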
|
49
|
Codability criterion for picking proteinlike structures from random three-dimensional configurations. Physical Review. E, Statistical, Nonlinear, and Soft Matter Physics 2006; 74:031921. [PMID: 17025681 DOI: 10.1103/physreve.74.031921] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/06/2004] [Revised: 07/24/2006] [Indexed: 05/12/2023]
Abstract
We show that the dominant eigenvectors of real protein structural contact matrices are highly correlated with their amino acid sequences. These results suggest that an ab initio sequence-independent profile exists for every protein structure and that this profile is highly effective in differentiating the ordering of amino acids in natural protein sequences from random sequences. This profile provides a structural code and is a key for understanding the unique behavior of protein structures. Using a lattice model, we show that there are special codable structures highly separated from random structures in the dominant eigenvector space of their structural contact matrices. As an example, we show that our results provide a good explanation for the "designable principle" of protein structures.
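For readers unfamiliar with the quantity involved, the Python sketch below builds a binary contact matrix from residue coordinates and extracts its dominant eigenvector by power iteration. It is an illustration under stated assumptions rather than the authors' procedure: the 6.5 Å cutoff, the one-point-per-residue representation, and the function names are hypothetical.

```python
import math


def contact_matrix(coords, cutoff=6.5):
    """Binary matrix with 1.0 where two distinct residues lie within the
    cutoff distance of each other (cutoff value is an assumption)."""
    n = len(coords)
    return [
        [
            1.0 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def dominant_eigenvector(matrix, iterations=500):
    """Power iteration on (A + I). The identity shift leaves eigenvectors
    unchanged but prevents oscillation when the contact graph is bipartite."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [v[i] + sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Three residues on a line, 5 units apart: only neighbours are in contact.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
vec = dominant_eigenvector(contact_matrix(coords))
```

For this three-residue chain the middle residue has the largest component, reflecting its two contacts. A per-residue profile of this kind is what would be compared with properties of the amino acid sequence.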
|
50
|
Prediction of RNA binding sites in proteins from amino acid sequence. RNA (New York, N.Y.) 2006; 12:1450-62. [PMID: 16790841 PMCID: PMC1524891 DOI: 10.1261/rna.2197306] [Citation(s) in RCA: 109] [Impact Index Per Article: 6.1] [Received: 08/18/2005] [Accepted: 05/13/2006] [Indexed: 05/10/2023]
Abstract
RNA-protein interactions are vitally important in a wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses. We have developed a computational tool for predicting which amino acids of an RNA binding protein participate in RNA-protein interactions, using only the protein sequence as input. RNABindR was developed using machine learning on a validated nonredundant data set of interfaces from known RNA-protein complexes in the Protein Data Bank. It generates a classifier that captures primary sequence signals sufficient for predicting which amino acids in a given protein are located in the RNA-protein interface. In leave-one-out cross-validation experiments, RNABindR identifies interface residues with >85% overall accuracy. It can be calibrated by the user to obtain either high specificity or high sensitivity for interface residues. RNABindR, implementing a Naive Bayes classifier, performs as well as a more complex neural network classifier (to our knowledge, the only previously published sequence-based method for RNA binding site prediction) and offers the advantages of speed, simplicity and interpretability of results. RNABindR predictions on the human telomerase protein hTERT are in good agreement with experimental data. The availability of computational tools for predicting which residues in an RNA binding protein are likely to contact RNA should facilitate design of experiments to directly test RNA binding function and contribute to our understanding of the diversity, mechanisms, and regulation of RNA-protein complexes in biological systems. (RNABindR is available as a Web tool from http://bindr.gdcb.iastate.edu.)
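A window-based Naive Bayes classifier of the general kind described can be sketched in Python as below. This is not RNABindR's code: the window width, the Laplace smoothing, the padding character, and all function names are assumptions. The `threshold` argument illustrates how a score cutoff could trade sensitivity against specificity.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # '-' pads windows at sequence ends


def windows(seq, half_width):
    """One window of 2 * half_width + 1 characters centred on each residue."""
    padded = "-" * half_width + seq + "-" * half_width
    return [padded[i:i + 2 * half_width + 1] for i in range(len(seq))]


def train(sequences, labels, half_width=2):
    """labels: one string of '0'/'1' per sequence, '1' marking interface."""
    width = 2 * half_width + 1
    counts = {c: [defaultdict(int) for _ in range(width)] for c in (0, 1)}
    totals = {0: 0, 1: 0}
    for seq, lab in zip(sequences, labels):
        for win, y in zip(windows(seq, half_width), lab):
            y = int(y)
            totals[y] += 1
            for pos, aa in enumerate(win):
                counts[y][pos][aa] += 1
    return counts, totals, half_width


def log_odds(model, window):
    """Log of P(interface | window) / P(non-interface | window) under the
    Naive Bayes independence assumption, with add-one smoothing."""
    counts, totals, _ = model
    n = totals[0] + totals[1]
    score = math.log((totals[1] + 1) / (n + 2)) - math.log((totals[0] + 1) / (n + 2))
    for pos, aa in enumerate(window):
        p1 = (counts[1][pos][aa] + 1) / (totals[1] + len(ALPHABET))
        p0 = (counts[0][pos][aa] + 1) / (totals[0] + len(ALPHABET))
        score += math.log(p1 / p0)
    return score


def predict(model, seq, threshold=0.0):
    """Label each residue 1 if its log-odds score exceeds the threshold."""
    return [int(log_odds(model, w) > threshold) for w in windows(seq, model[2])]


model = train(["KKKKGGGG", "GGGGKKKK"], ["11110000", "00001111"], half_width=1)
```

With this toy training set, lysine-rich windows score as interface and glycine-rich windows do not. Raising `threshold` yields fewer positive calls (higher specificity), while lowering it yields more (higher sensitivity).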
|