1
|
Discovery of deaminase functions by structure-based protein clustering. Cell 2023:S0092-8674(23)00593-7. [PMID: 37379837 DOI: 10.1016/j.cell.2023.05.041] [Citation(s) in RCA: 21] [Impact Index Per Article: 21.0] [Received: 03/07/2023] [Revised: 04/24/2023] [Accepted: 05/26/2023] [Indexed: 06/30/2023]
Abstract
The elucidation of protein function and its exploitation in bioengineering have greatly advanced the life sciences. Protein mining efforts generally rely on amino acid sequences rather than protein structures. We describe here the use of AlphaFold2 to predict and subsequently cluster an entire protein family based on predicted structure similarities. We selected deaminase proteins to analyze and identified many previously unknown properties. We were surprised to find that most proteins in the DddA-like clade were not double-stranded DNA deaminases. We engineered the smallest single-strand-specific cytidine deaminase, enabling an efficient cytosine base editor (CBE) to be packaged into a single adeno-associated virus (AAV). Importantly, we profiled a deaminase from this clade that edits robustly in soybean plants, which were previously inaccessible to CBEs. These discovered deaminases, based on AI-assisted structural predictions, greatly expand the utility of base editors for therapeutic and agricultural applications.
|
2
|
PredMHC: An Effective Predictor of Major Histocompatibility Complex Using Mixed Features. Front Genet 2022; 13:875112. [PMID: 35547252 PMCID: PMC9081368 DOI: 10.3389/fgene.2022.875112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2022] [Accepted: 03/07/2022] [Indexed: 12/03/2022]
Abstract
The major histocompatibility complex (MHC) is a large locus on vertebrate DNA that contains a tightly linked set of polymorphic genes encoding cell surface proteins essential for the adaptive immune system. The groups of proteins encoded in the MHC play an important role in the adaptive immune system. Therefore, the accurate identification of the MHC is necessary to understand its role in the adaptive immune system. An effective predictor called PredMHC is established in this study to identify the MHC from protein sequences. Firstly, PredMHC encoded a protein sequence with mixed features including 188D, APAAC, KSCTriad, CKSAAGP, and PAAC. Secondly, three classifiers including SGD, SMO, and random forest were trained on the mixed features of the protein sequence. Finally, the prediction result was obtained by the voting of the three classifiers. The experimental results of the 10-fold cross-validation test in the training dataset showed that PredMHC can obtain 91.69% accuracy. Experimental results on comparison with other features, classifiers, and existing methods showed the effectiveness of PredMHC in predicting the MHC.
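The final voting step described above can be sketched as follows. This is a minimal illustration only: it assumes hard majority voting over binary labels, which the abstract does not specify, and the function names and the three label lists are hypothetical stand-ins for the outputs of the SGD, SMO, and random forest classifiers.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most classifiers for a single sample."""
    label, _ = Counter(predictions).most_common(1)[0]
    return label


def vote_ensemble(per_classifier_labels):
    """Combine label lists (one list per classifier) sample by sample."""
    return [majority_vote(sample) for sample in zip(*per_classifier_labels)]


# Hypothetical predictions for four sequences (1 = MHC, 0 = non-MHC).
sgd_labels = [1, 0, 1, 0]
smo_labels = [1, 1, 0, 0]
rf_labels = [0, 1, 1, 0]

ensemble_labels = vote_ensemble([sgd_labels, smo_labels, rf_labels])
print(ensemble_labels)
```

With three voters and two classes a tie cannot occur, so no tie-breaking rule is needed in this sketch.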
|
3
|
Multiple profile models extract features from protein sequence data and resolve functional diversity of very different protein families. Mol Biol Evol 2022; 39:6556147. [PMID: 35353898 PMCID: PMC9016551 DOI: 10.1093/molbev/msac070] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022]
Abstract
Functional classification of proteins from sequences alone has become a critical bottleneck in understanding the myriad of protein sequences that accumulate in our databases. The great diversity of homologous sequences hides, in many cases, a variety of functional activities that cannot be anticipated. Their identification appears critical for a fundamental understanding of the evolution of living organisms and for biotechnological applications. ProfileView is a sequence-based computational method, designed to functionally classify sets of homologous sequences. It relies on two main ideas: the use of multiple profile models whose construction explores evolutionary information in available databases, and a novel definition of a representation space in which to analyse sequences with multiple profile models combined together. ProfileView classifies protein families by enriching known functional groups with new sequences and discovering new groups and subgroups. We validate ProfileView on seven classes of widespread proteins involved in the interaction with nucleic acids, amino acids and small molecules, and in a large variety of functions and enzymatic reactions. ProfileView agrees with the large set of functional data collected for these proteins from the literature regarding the organisation into functional subgroups and residues that characterise the functions. In addition, ProfileView resolves undefined functional classifications and extracts the molecular determinants underlying protein functional diversity, showing its potential to select sequences towards accurate experimental design and discovery of novel biological functions. On protein families with complex domain architecture, ProfileView functional classification reconciles domain combinations, unlike phylogenetic reconstruction.
ProfileView proves to outperform the functional classification approach PANTHER, the two k-mer based methods CUPP and eCAMI and a neural network approach based on Restricted Boltzmann Machines. It overcomes time complexity limitations of the latter.
|
4
|
Prediction of Hormone-Binding Proteins Based on K-mer Feature Representation and Naive Bayes. Front Genet 2021; 12:797641. [PMID: 34887905 PMCID: PMC8650314 DOI: 10.3389/fgene.2021.797641] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/19/2021] [Accepted: 11/05/2021] [Indexed: 11/29/2022]
Abstract
Hormone binding protein (HBP) is a soluble carrier protein that interacts selectively with different types of hormones and has various effects on the body's life activities. HBPs play an important role in the growth process of organisms, but their specific role is still unclear. Therefore, correctly identifying HBPs is the first step towards understanding and studying their biological function. However, due to their high cost and long experimental period, it is difficult for traditional biochemical experiments to correctly identify HBPs from an increasing number of proteins, so the real characterization of HBPs has become a challenging task for researchers. To measure the effectiveness of HBPs, an accurate and reliable prediction model for their identification is desirable. In this paper, we construct the prediction model HBP_NB. First, HBPs data were collected from the UniProt database, and a dataset was established. Then, based on the established high-quality dataset, the k-mer (K = 3) feature representation method was used to extract features. Second, the feature selection algorithm was used to reduce the dimensionality of the extracted features and select the appropriate optimal feature set. Finally, the selected features are input into Naive Bayes to construct the prediction model, and the model is evaluated by using 10-fold cross-validation. The final results were 95.45% accuracy, 94.17% sensitivity and 96.73% specificity. These results indicate that our model is feasible and effective.
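The k-mer (K = 3) feature extraction step mentioned above can be illustrated with a short sketch. It assumes the 20 standard amino acids and normalised tripeptide frequencies, which gives a vector of 20^3 = 8000 dimensions; the function name and the toy sequence are hypothetical, and the feature selection and Naive Bayes steps of HBP_NB are not reproduced here.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)


def kmer_frequencies(sequence, k=3):
    """Return normalised k-mer frequencies in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing non-standard residues are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


# Toy sequence: windows are ACD, CDA, DAC, ACD.
features = kmer_frequencies("ACDACD")
print(len(features), features[22])
```

A vector of this kind could then be passed to a feature selection routine and a Naive Bayes classifier, as the abstract outlines.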
|
5
|
Abstract
The Conserved Domain Database (CDD) is a freely available resource for the annotation of sequences with the locations of conserved protein domain footprints, as well as functional sites and motifs inferred from these footprints. It includes protein domain and protein family models curated in house by CDD staff, as well as imported from a variety of other sources. The latest CDD release (v3.17, April 2019) contains more than 57,000 domain models, of which almost 15,000 were curated by CDD staff. The CDD curation effort increases coverage and provides finer-grained classifications of common and widely distributed protein domain families, for which a wealth of functional and structural data have become available. The CDD maintains both live search capabilities and an archive of pre-computed domain annotations for a selected subset of sequences tracked by the NCBI's Entrez protein database. These can be retrieved or computed for a single sequence using CD-Search or in bulk using Batch CD-Search, or computed via standalone RPS-BLAST plus the rpsbproc software package. The CDD can be accessed via https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The three protocols listed here describe how to perform a CD-Search (Basic Protocol 1), a Batch CD-Search (Basic Protocol 2), and a Standalone RPS-BLAST and rpsbproc (Basic Protocol 3). © 2019 The Authors.
|
6
|
Development of a TSR-Based Method for Protein 3-D Structural Comparison With Its Applications to Protein Classification and Motif Discovery. Front Chem 2021; 8:602291. [PMID: 33520934 PMCID: PMC7838567 DOI: 10.3389/fchem.2020.602291] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/16/2020] [Accepted: 12/14/2020] [Indexed: 11/24/2022]
Abstract
Development of protein 3-D structural comparison methods is important in understanding protein functions. At the same time, developing such a method is very challenging. In the last 40 years, ever since the development of the first automated structural method, ~200 papers have been published using different representations of structures. The existing methods can be divided into five categories: sequence-, distance-, secondary structure-, geometry-, and network-based structural comparisons. Each has its strengths, but also limitations. We have developed a novel method where the 3-D structure of a protein is modeled using the concept of Triangular Spatial Relationship (TSR), where triangles are constructed with the Cα atoms of a protein as vertices. Every triangle is represented using an integer, which we denote as a “key.” A key is computed using the length, angle, and vertex labels based on a rule-based formula, which ensures assignment of the same key to identical TSRs across proteins. A structure is thereby represented by a vector of integers. Our method is able to accurately quantify similarity of structure or substructure by matching numbers of identical keys between two proteins. The uniqueness of our method includes: (i) a unique way to represent structures to avoid performing structural superimposition; (ii) use of triangles to represent substructures, as the triangle is the simplest primitive to capture shape; (iii) complex structure comparison achieved by matching integers corresponding to multiple TSRs. Every substructure of one protein is compared to every other substructure in a different protein. The method is used in the studies of proteases and kinases because they play essential roles in cell signaling, and a majority of these constitute drug targets. The new motifs or substructures we identified specifically for proteases and kinases provide a deeper insight into their structural relations.
Furthermore, the method provides a unique way to study protein conformational changes. In addition, the results from CATH and SCOP data sets clearly demonstrate that our method can distinguish alpha helices from beta pleated sheets and vice versa. Our method has the potential to be developed into a powerful tool for efficient structure-BLAST search and comparison, just as BLAST is for sequence search and alignment.
|
7
|
Is There Scope for a Novel Mycelium Category of Proteins alongside Animals and Plants? Foods 2020; 9:E1151. [PMID: 32825591 PMCID: PMC7555420 DOI: 10.3390/foods9091151] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 08/03/2020] [Revised: 08/13/2020] [Accepted: 08/17/2020] [Indexed: 12/20/2022]
Abstract
In the 21st century, we face a troubling trilemma of expanding populations, planetary and public wellbeing. Given this, shifts from animal to plant food protein are gaining momentum and are an important part of reducing carbon emissions and consumptive water use. However, as this fast pace of change sets in and begins to firmly embed itself within food-based dietary guidelines (FBDG) and food policies, we must raise an important question: is now an opportune time to include other novel, nutritious and sustainable proteins within FBDG? The current paper describes how food proteins are typically categorised within FBDG and discusses how these could further evolve. Presently, food proteins tend to fall under the umbrella of being 'animal-derived' or 'plant-based' whilst other valuable proteins, i.e., fungal-derived ones, appear to be comparatively overlooked. A PubMed search of systematic reviews and meta-analytical studies published over the last 5 years shows an established body of evidence for animal-derived proteins (although some findings were less favourable), plant-based proteins and an expanding body of science for mycelium/fungal-derived proteins. Given this, along with elevated demands for alternative proteins, there appears to be scope to introduce a 'third' protein category when compiling FBDG. This could fall under the potential heading of 'fungal' protein, with scope to include mycelium such as mycoprotein within this, for which the evidence base is accruing.
|
8
|
A new method for protein characterization and classification using geometrical features for 3D face analysis: An example of tubulin structures. Proteins 2020; 89:e25993. [PMID: 32779779 DOI: 10.1002/prot.25993] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/13/2020] [Revised: 07/22/2020] [Accepted: 07/26/2020] [Indexed: 11/12/2022]
Abstract
This article reports on the results of research aimed at translating biometric 3D face recognition concepts and algorithms into the field of protein biophysics in order to precisely and rapidly classify morphological features of protein surfaces. Both human faces and protein surfaces are free-forms, and some descriptors used in differential geometry can be used to describe them by applying the principles of feature extraction developed for computer vision and pattern recognition. The first part of this study focused on building the protein dataset using a simulation tool and performing feature extraction using novel geometrical descriptors. The second part tested the method on two examples: the first involved a classification of tubulin isotypes and the second compared tubulin with the FtsZ protein, which is its bacterial analog. An additional test involved several unrelated proteins. Different classification methodologies have been used: a classic approach with a support vector machine (SVM) classifier and unsupervised learning with a k-means approach. The best result was obtained with SVM and the radial basis function kernel. The results are significant and competitive with the state-of-the-art protein classification methods. This leads to a new methodological direction in protein structure analysis.
|
9
|
Abstract
In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. We propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the Sufficient Input Subsets (SIS) technique, which we use to identify subsets of features in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools show that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential.
|
10
|
Roles of membrane transporters: connecting the dots from sequence to phenotype. Annals of Botany 2019; 124:201-208. [PMID: 31162525 PMCID: PMC6758574 DOI: 10.1093/aob/mcz066] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 12/30/2018] [Accepted: 05/06/2019] [Indexed: 05/21/2023]
Abstract
BACKGROUND Plant membrane transporters are involved in diverse cellular processes underpinning plant physiology, such as nutrient acquisition, hormone movement, resource allocation, exclusion or sequestration of various solutes from cells and tissues, and environmental and developmental signalling. A comprehensive characterization of transporter function is therefore key to understanding and improving plant performance. SCOPE AND CONCLUSIONS In this review, we focus on the complexities involved in characterizing transporter function and the impact that this has on current genomic annotations. Specific examples are provided that demonstrate why sequence homology alone cannot be relied upon to annotate and classify transporter function, and to show how even single amino acid residue variations can influence transporter activity and specificity. Misleading nomenclature of transporters is often a source of confusion in transporter characterization, especially for people new to or outside the field. Here, to aid researchers dealing with interpretation of large data sets that include transporter proteins, we provide examples of transporters that have been assigned names that misrepresent their cellular functions. Finally, we discuss the challenges in connecting transporter function at the molecular level with physiological data, and propose a solution through the creation of new databases. Further fundamental in-depth research on specific transport (and other) proteins is still required; without it, significant deficiencies in large-scale data sets and systems biology approaches will persist. Reliable characterization of transporter function requires integration of data at multiple levels, from amino acid residue sequence annotation to more in-depth biochemical, structural and physiological studies.
|
11
|
Abstract
Protein kinase C (PKC) is a superfamily of enzymes, which regulate numerous cellular responses. The specific function of PKC protein family is mainly governed by its individual protein domains. However, existing protein sequence classification methods based on sequence alignment and sequence analysis models focused little on the domain analysis. In this study, we introduce a novel protein kinase classification method that considers both domain sequence similarity and whole sequence similarity to quantify the evolutionary distance from a specific protein to a protein family. Using the natural vector method, we establish a 60-dimensional space, where each protein is uniquely represented by a vector. We also define a convex hull, consisting of the natural vectors corresponding to all members of a protein family. The sequence similarity between a protein and a protein family, therefore, can be quantified as the distance between the protein vector and the protein family convex hull. We have applied this method in a PKC sample library and the results showed a higher accuracy of classification compared with other alignment-free methods.
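The 60-dimensional representation mentioned above can be sketched as follows. This assumes the commonly used form of the natural vector, in which each of the 20 amino acids contributes three numbers: its count, its mean position, and its normalised second central moment of position (20 × 3 = 60). The abstract does not spell out these formulas, so the definitions, the function name, and the toy sequence here are illustrative assumptions; the convex hull distance step is not reproduced.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)


def natural_vector(sequence):
    """Return counts, mean positions, and second moments as one 60-dim list."""
    n = len(sequence)
    counts, means, moments = [], [], []
    for aa in AMINO_ACIDS:
        # 1-based positions at which this amino acid occurs.
        positions = [i + 1 for i, res in enumerate(sequence) if res == aa]
        n_k = len(positions)
        if n_k == 0:
            # Absent residues contribute zeros in all three components.
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


vector = natural_vector("ACA")
print(len(vector), vector[0], vector[20], vector[40])
```

Under these assumptions, two sequences map to the same vector only if every residue type has the same count, mean position, and positional spread, which is what makes distances between vectors usable as an alignment-free similarity measure.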
|
12
|
RFAmyloid: A Web Server for Predicting Amyloid Proteins. Int J Mol Sci 2018; 19:ijms19072071. [PMID: 30013015 PMCID: PMC6073578 DOI: 10.3390/ijms19072071] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 06/12/2018] [Revised: 07/10/2018] [Accepted: 07/12/2018] [Indexed: 12/22/2022]
Abstract
Amyloid is an insoluble fibrous protein and its mis-aggregation can lead to some diseases, such as Alzheimer’s disease and Creutzfeldt–Jakob disease. Therefore, the identification of amyloid is essential for the discovery and understanding of disease. We established a novel predictor called RFAmy based on random forest to identify amyloid. It employs the SVMProt 188-D feature extraction method, based on protein composition and physicochemical properties, and the pse-in-one feature extraction method, based on amino acid composition, autocorrelation pseudo acid composition, profile-based features and predicted structure features. In the ten-fold cross-validation test, RFAmy’s overall accuracy was 89.19% and F-measure was 0.891. Comparison experiments with other features, classifiers, and existing methods show the effectiveness of RFAmy in predicting amyloid protein. The RFAmy proposed in this paper can be accessed through the URL http://server.malab.cn/RFAmyloid/.
|
13
|
Clustering of multi-domain protein sequences. Proteins 2018; 86:759-776. [PMID: 29675880 DOI: 10.1002/prot.25510] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/03/2017] [Revised: 04/09/2018] [Accepted: 04/16/2018] [Indexed: 11/06/2022]
Abstract
The overall function of a multi-domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence alignment-based methods commonly utilize domain-level information and provide classification only at the level of domains. Such methods cannot take into account the contributions of other domains in the proteins, or of domain-linker regions, when classifying multi-domain proteins. An alignment-free protein sequence comparison tool, CLAP (CLAssification of Proteins), was previously developed in our laboratory specifically to handle multi-domain protein sequences without a requirement of defining domain boundaries and sequential order of domains. Through this method we aim to achieve a biologically meaningful classification scheme for multi-domain protein sequences. In this article, CLAP-based classification has been explored on 5 datasets of multi-domain proteins and we present detailed analysis for proteins containing (1) Tyrosine phosphatase and (2) SH3 domain. At the domain level, the CLAP-based classification scheme resulted in a clustering similar to that obtained from an alignment-based method. CLAP-based clusters obtained for full-length datasets were shown to comprise proteins with similar functions and domain architectures. Our study demonstrates that multi-domain proteins could be classified effectively by considering full-length sequences without a requirement of identification of domains in the sequence.
|
15
|
Special Protein Molecules Computational Identification. Int J Mol Sci 2018; 19:ijms19020536. [PMID: 29439426 PMCID: PMC5855758 DOI: 10.3390/ijms19020536] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/16/2018] [Revised: 02/02/2018] [Accepted: 02/10/2018] [Indexed: 01/29/2023]
Abstract
Computational identification of special protein molecules is a key issue in understanding protein function. It can guide molecular experiments and help to save costs. I assessed 18 papers published in the special issue of Int. J. Mol. Sci., and also discussed the related works. The computational methods employed in this special issue focused on machine learning, network analysis, and molecular docking. New methods and new topics were also proposed. In addition, there were several wet-lab experiments with promising results. I hope our special issue will aid research on the identification of protein molecules.
|
16
|
Performance of Hidden Markov Models in Recovering the Standard Classification of Glycoside Hydrolases. Evol Bioinform Online 2017; 13:1176934317703401. [PMID: 28469382 PMCID: PMC5404901 DOI: 10.1177/1176934317703401] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/29/2016] [Accepted: 03/09/2017] [Indexed: 12/02/2022]
Abstract
Glycoside hydrolases (GHs) are carbohydrate-active enzymes that assist the hydrolysis of glycoside bonds of complex sugars into carbohydrates. The current standard GH family classification is available in the CAZy database, which is based on the similarities of amino acid sequences and curated semi-automatically. However, with the exponential increase in data availability from genome sequences, automated classification methods are required for the fast annotation of coding sequences. Currently, the dbCAN database offers automatic annotations of signature domains from CAZy-defined classifications using a statistical approach, the hidden Markov models (HMMs). However, dbCAN does not contain the entire set of CAZy GH families. Moreover, no evaluation has been conducted so far of the viability of using HMM profiles as a means of automatically assigning GH amino acid sequences to the standard CAZy GH family classification itself. In this work, we performed a meta-analysis in which amino acid sequences from CAZy-defined GH families were used to build HMM family-specific profiles. We then queried a set with ~300 000 GH sequences against our database of HMM profiles estimated from CAZy families. We conducted the same evaluation against the available dbCAN HMM profiles. Our analyses recovered 65% of matches with the standard CAZy classification, whereas dbCAN HMMs resulted in 61% of matches. We also provided an analysis of the types of errors commonly found when HMMs are used to recover CAZy-based classifications. Although the performance of HMM was good, further developments are necessary for a fully automated classification of GH, allowing the standardization of GH classification among protein databases.
|
17
|
A Comparative Analysis Between k-Mers and Community Detection-Based Features for the Task of Protein Classification. IEEE Trans Nanobioscience 2016; 15:84-92. [PMID: 26863669 PMCID: PMC6245644 DOI: 10.1109/tnb.2016.2523501] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
Machine learning algorithms are widely used to annotate biological sequences. Low-dimensional informative feature vectors can be crucial for the performance of the algorithms. In prior work, we have proposed the use of a community detection approach to construct low-dimensional feature sets for nucleotide sequence classification. Our approach used the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and subsequently used community detection to identify groups of k-mers that appear frequently in a set of sequences. Whereas this approach worked well for nucleotide sequence classification, it could not be directly used for protein sequences, as the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extended our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
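A small sketch may help show why the Hamming distance is a weak measure for protein k-mers and what replacing it with substitution scores changes. The score values below are hypothetical toy numbers chosen for illustration, not entries from BLOSUM or any other published matrix, and the function names are likewise assumptions.

```python
def hamming_distance(a, b):
    """Count positions at which two equal-length k-mers differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy scores: similar residues (I/L) score higher than unrelated ones.
TOY_SCORES = {("I", "L"): 2, ("L", "I"): 2}
MATCH_SCORE = 4
MISMATCH_SCORE = -1


def substitution_similarity(a, b):
    """Sum per-position substitution scores for two equal-length k-mers."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == y:
            total += MATCH_SCORE
        else:
            total += TOY_SCORES.get((x, y), MISMATCH_SCORE)
    return total


# Both pairs differ at exactly one position, so Hamming distance cannot rank them.
print(hamming_distance("AIK", "ALK"), hamming_distance("AIK", "AWK"))
# Substitution scores rank the conservative I -> L change above I -> W.
print(substitution_similarity("AIK", "ALK"), substitution_similarity("AIK", "AWK"))
```

In a network built from such scores, k-mers that differ only by conservative substitutions would be linked more strongly than k-mers with unrelated residues, which the Hamming distance cannot express.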
|
18
|
Identification of family determining residues in Jumonji-C lysine demethylases: A sequence-based, family wide classification. Proteins 2016; 84:397-407. [PMID: 26757344 PMCID: PMC4755873 DOI: 10.1002/prot.24986] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/19/2015] [Revised: 12/31/2015] [Accepted: 01/04/2016] [Indexed: 12/12/2022]
Abstract
Histone post-translational modifications play a critical role in the regulation of gene expression. Methylation of lysines at N-terminal tails of histones has been shown to be involved in such regulation. While this modification was long considered to be irreversible, two different classes of enzymes capable of carrying out the demethylation of histone lysines were recently identified: the oxidases, such as LSD1, and the oxygenases (JmjC-containing). Here, a family-wide analysis of the second of these classes is proposed, with over 300 proteins studied at the sequence level. We show that a correlated evolution analysis yields some position/residue pairs which are critical for comparing JmjC sequences and enables the classification of JmjC domains into five families. A few positions appear more frequently among conditions, such as positions 23 (directly C-terminal to the second iron ligand), 24, 252 and 253 (directly N-terminal to a conserved Asn). Implications of family conditions are studied in detail on PHF2, revealing the meaningfulness of the sequence-derived conditions at the structural level. These results should help obtain insights into the diversity of JmjC-containing proteins solely by considering some of the amino acids present in their JmjC domain.
|
19
|
Classification of proteins with shared motifs and internal repeats in the ECOD database. Protein Sci 2016; 25:1188-203. [PMID: 26833690 DOI: 10.1002/pro.2893] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 12/09/2015] [Revised: 01/23/2016] [Accepted: 01/27/2016] [Indexed: 12/19/2022]
Abstract
Proteins and their domains evolve by a set of events commonly including the duplication and divergence of small motifs. The presence of short repetitive regions in domains has generally constituted a difficult case for structural domain classifications and their hierarchies. We developed the Evolutionary Classification Of protein Domains (ECOD) in part to implement a new schema for the classification of these types of proteins. Here we document the ways in which ECOD classifies proteins with small internal repeats, widespread functional motifs, and assemblies of small domain-like fragments in its evolutionary schema. We illustrate the ways in which the structural genomics project impacted the classification and characterization of new structural domains and sequence families over the decade.
|
20
|
Abstract
The CATH database is a classification of protein structures found in the Protein Data Bank (PDB). Protein structures are chopped into individual units of structural domains, and these domains are grouped together into superfamilies if there is sufficient evidence that they have diverged from a common ancestor during the process of evolution. A sister resource, Gene3D, extends this information by scanning sequence profiles of these CATH domain superfamilies against many millions of known proteins to identify related sequences. Thus the combined CATH-Gene3D resource provides confident predictions of the likely structural fold, domain organisation, and evolutionary relatives of these proteins. In addition, this resource incorporates annotations from a large number of external databases such as known enzyme active sites, GO molecular functions, physical interactions, and mutations. This unit details how to access and understand the information contained within the CATH-Gene3D Web pages, the downloadable data files, and the remotely accessible Web services.
|
21
|
An array-based approach to determine different subtype and differentiation of non-small cell lung cancer. Am J Cancer Res 2015; 5:62-70. [PMID: 25553098 PMCID: PMC4265748 DOI: 10.7150/thno.10145] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 07/19/2014] [Accepted: 09/16/2014] [Indexed: 11/18/2022]
Abstract
Simple and accurate methods of discriminating the subtype or differentiation of human tumors are critical for designing treatment strategies and predicting disease prognosis. The method currently used to determine these two factors depends mainly on histological examination by microscopy, which is laborious, requires a highly trained operator, and is prone to inconsistency because judgment varies between individuals. Here we report a novel array-based method based on the interaction of graphene oxide (GO) and single-strand DNA modified gold nanoparticles (ssDNA-AuNPs) to distinguish between different subtypes and grades of tumors through their overall intracellular proteome signatures. We first select eight proteins at 0.5 nM concentration in buffer or 10 nM in human serum to verify the discriminant ability of our method, then choose adenocarcinoma and squamous-cell carcinoma, which account for 90% of non-small cell lung cancer, as well as their respective three tumor grades, as a model system to provide a realistic testing ground for clinical cancer analysis. Total differentiation between different subtypes and grades of tumor tissues has been achieved with as little as 100 ng of intracellular protein, suggesting the high sensitivity and selectivity of this sensor array. Overall, this array-based approach may provide the possibility for unbiased and simplified personalized tumor classification diagnostics in the future.
|
22
|
Geometrical comparison of two protein structures using Wigner-D functions. Proteins 2014; 82:2756-69. [PMID: 25043646 DOI: 10.1002/prot.24640] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 12/24/2013] [Revised: 05/20/2014] [Accepted: 06/18/2014] [Indexed: 12/13/2022]
Abstract
In this article, we develop a quantitative comparison method for two arbitrary protein structures. This method uses a root-mean-square deviation characterization and employs a series expansion of the protein's shape function in terms of the Wigner-D functions to define a new criterion, which is called a "similarity value." We further demonstrate that the expansion coefficients for the shape function obtained with the help of the Wigner-D functions correspond to structure factors. Our method addresses the common problem of comparing two proteins with different numbers of atoms. We illustrate it with a worked example.
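For orientation, the sketch below shows only the plain coordinate root-mean-square deviation that this kind of comparison starts from, computed after moving both structures to a common centroid and without any rotational superposition. It requires equal numbers of atoms, which is exactly the restriction the abstract says the Wigner-D expansion is meant to lift. The Wigner-D "similarity value" itself is not reproduced here, and all names below are hypothetical.

```python
from math import sqrt


def centroid(coords):
    """Arithmetic mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(p[k] for p in coords) / n for k in range(3))


def centered(coords):
    """Translate coordinates so that their centroid sits at the origin."""
    c = centroid(coords)
    return [tuple(p[k] - c[k] for k in range(3)) for p in coords]


def rmsd(a, b):
    """RMSD between two equally sized, identically ordered coordinate sets.

    Assumption: atoms are already paired by index and no rotation is applied,
    so this is only the translational part of a full superposition.
    """
    if len(a) != len(b):
        raise ValueError("plain RMSD needs the same number of atoms in both sets")
    a0, b0 = centered(a), centered(b)
    squared = sum((p[k] - q[k]) ** 2 for p, q in zip(a0, b0) for k in range(3))
    return sqrt(squared / len(a))


# Hypothetical toy coordinates (arbitrary units).
reference = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
shifted = [(5.0, 5.0, 5.0), (7.0, 5.0, 5.0)]    # same shape, translated
stretched = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]  # same axis, twice as long
```

A pure translation leaves the value at zero, while stretching the two-atom toy structure changes it, which is the behavior a shape-comparison score is expected to have.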
|
23
|
Using linear algebra for protein structural comparison and classification. Genet Mol Biol 2009; 32:645-51. [PMID: 21637532 PMCID: PMC3036040 DOI: 10.1590/s1415-47572009000300032] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 11/04/2008] [Accepted: 05/25/2009] [Indexed: 11/23/2022]
Abstract
In this article, we describe a novel methodology to extract semantic characteristics from protein structures using linear algebra in order to compose structural signature vectors which may be used efficiently to compare and classify protein structures into fold families. These signatures are built from the pattern of hydrophobic intrachain interactions using Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) techniques. Considering proteins as documents and contacts as terms, we have built a retrieval system which is able to find conserved contacts in samples of the myoglobin fold family and to retrieve these proteins among proteins of varied folds with precision of up to 80%. The classifier is a web tool available at our laboratory website. Users can search for similar chains from a specific PDB entry, view and compare their contact maps and browse their structures using a JMol plug-in.
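The "proteins as documents, contacts as terms" idea can be sketched as a small latent semantic indexing example: build a contact-by-protein matrix, take a truncated SVD, and compare proteins by cosine similarity of their reduced signatures. This is a minimal sketch under stated assumptions, not the authors' pipeline: the binary toy matrix, the rank of 2, and the function names are all hypothetical, and the real method derives its terms from hydrophobic intrachain contacts.

```python
import numpy as np


def latent_signatures(term_doc, rank):
    """Project each protein (column) into a rank-dimensional latent space.

    term_doc has one row per contact (term) and one column per protein
    (document). The signature of protein j is column j of diag(s_k) @ Vt_k.
    """
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return (s[:rank, None] * vt[:rank]).T  # shape: (n_proteins, rank)


def cosine(a, b):
    """Cosine similarity between two signature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy data: 6 contacts x 4 proteins. Proteins 0 and 1 share
# most contacts, as do proteins 2 and 3; one contact links proteins 1 and 3.
term_doc = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
], dtype=float)

signatures = latent_signatures(term_doc, rank=2)
same_family = cosine(signatures[0], signatures[1])
other_family = cosine(signatures[0], signatures[2])
```

In a retrieval setting, a query protein would be ranked against all stored signatures by this cosine score, and the chosen rank controls how much fine-grained contact detail is discarded.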
|
24
|
Is protein classification necessary? Toward alternative approaches to function annotation. Curr Opin Struct Biol 2009; 19:363-8. [PMID: 19269161 PMCID: PMC2745633 DOI: 10.1016/j.sbi.2009.02.001] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.7] [Received: 01/28/2009] [Accepted: 02/02/2009] [Indexed: 11/16/2022]
Abstract
The current nonredundant protein sequence database contains over seven million entries and the number of individual functional domains is significantly larger than this value. The vast quantity of data associated with these proteins poses enormous challenges to any attempt at function annotation. Classification of proteins into sequence and structural groups has been widely used as an approach to simplifying the problem. In this article we question such strategies. We describe how the multifunctionality and structural diversity of even closely related proteins confound efforts to assign function on the basis of overall sequence or structural similarity. Rather, we suggest that strategies that avoid classification may offer a more robust approach to protein function annotation.
|