1
Rahbar MR, Nezafat N, Morowvat MH, Savardashtaki A, Ghoshoon MB, Mehrabani-Zeinabad K, Ghasemi Y. Targeting Efficient Features of Urate Oxidase to Increase Its Solubility. Appl Biochem Biotechnol 2024; 196:6269-6295. [PMID: 38308671 DOI: 10.1007/s12010-023-04819-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/19/2023] [Indexed: 02/05/2024]
Abstract
With the demand for mass production of protein drugs, solubility has become a serious issue. Extrinsic and intrinsic factors both affect this property. A homotetrameric cofactor-free urate oxidase (UOX) is not sufficiently soluble. To engineer UOX for optimum solubility, it is important to identify the most effective factor that influences solubility. The most effective feature to target for protein engineering was determined by measuring various solubility-related factors of UOX. A large library of homologous sequences was obtained from the databases. The data was reduced to six enzymes from different organisms. On the basis of various sequence- and structure-derived elements, the most and the least soluble enzymes were defined. To determine the best protein engineering target for modification, features of the most and least soluble enzymes were compared. Metabacillus fastidiosus UOX was the most soluble enzyme, while Agrobacterium globiformis UOX was the least soluble. According to the comparison-constant method, positive surface patches caused by arginine residue distribution are appropriate targets for modification. Two Arg to Ala mutations were introduced to the least soluble enzyme to test this hypothesis. These mutations significantly enhanced the mutant's solubility. While different algorithms produced conflicting results, it was difficult to determine which proteins were most and least soluble. Solubility prediction requires multiple algorithms based on these controversies. Protein surfaces should be investigated regionally rather than globally, and both sequence and structural data should be considered. Several other biotechnological products could be engineered using the data reduction and comparison-constant methods used in this study.
Affiliation(s)
- Mohammad Reza Rahbar
  - Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
- Navid Nezafat
  - Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
  - Department of Pharmaceutical Biotechnology, School of Pharmacy, Shiraz University of Medical Sciences, P.O. Box 71345-1583, Shiraz, Iran
- Mohammad Hossein Morowvat
  - Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
  - Department of Pharmaceutical Biotechnology, School of Pharmacy, Shiraz University of Medical Sciences, P.O. Box 71345-1583, Shiraz, Iran
- Amir Savardashtaki
  - Department of Medical Biotechnology, School of Advanced Medical Sciences and Technologies, Shiraz University of Medical Sciences, Shiraz, Iran
- Mohammad Bagher Ghoshoon
  - Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
  - Department of Pharmaceutical Biotechnology, School of Pharmacy, Shiraz University of Medical Sciences, P.O. Box 71345-1583, Shiraz, Iran
- Kamran Mehrabani-Zeinabad
  - Department of Biostatistics, Faculty of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran
- Younes Ghasemi
  - Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
  - Department of Pharmaceutical Biotechnology, School of Pharmacy, Shiraz University of Medical Sciences, P.O. Box 71345-1583, Shiraz, Iran
2
Broni E, Miller WA. Computational Analysis Predicts Correlations among Amino Acids in SARS-CoV-2 Proteomes. Biomedicines 2023; 11:512. [PMID: 36831052 PMCID: PMC9953644 DOI: 10.3390/biomedicines11020512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Revised: 02/03/2023] [Accepted: 02/08/2023] [Indexed: 02/12/2023] Open
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a serious global challenge requiring urgent and permanent therapeutic solutions. These solutions can only be engineered if the patterns and rate of mutations of the virus can be elucidated. Predicting mutations, and the structure of proteins based on these mutations, has become necessary for early drug and vaccine design in anticipation of future viral mutations. The amino acid composition (AAC) of proteomes and individual viral proteins provides avenues for exploitation, since AACs have previously been used to predict structure, shape and evolutionary rates. Herein, the frequencies of amino acid residues found in 1637 complete proteomes belonging to 11 SARS-CoV-2 variants/lineages were analyzed. Leucine is the most abundant amino acid residue in SARS-CoV-2, with an average AAC of 9.658%, while tryptophan is the least abundant at 1.11%. The AAC and ranking of lysine and glycine varied among the proteomes: in some variants glycine had a higher frequency and AAC than lysine, and in others the reverse. Tryptophan was also observed to be the most intolerant to mutation in the proteomes of the variants used. A correlogram revealed a very strong correlation of 0.999992 between the B.1.525 (Eta) and B.1.526 (Iota) variants. Furthermore, isoleucine and threonine had a very strong negative correlation of -0.912, while cysteine and isoleucine had a very strong positive correlation of 0.835 at p < 0.001. A Shapiro-Wilk normality test revealed that AAC values for all amino acid residues except methionine showed no evidence of non-normality at p < 0.05. Thus, AACs of SARS-CoV-2 variants can be predicted using probability and z-scores. AACs may be beneficial in classifying viral strains, predicting viral disease types, members of protein families and protein interactions, and for diagnostic purposes. They may also be used as a feature, along with other crucial factors, in machine-learning-based algorithms to predict viral mutations. These mutation-predicting algorithms may help in developing effective therapeutics and vaccines for SARS-CoV-2.
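The quantities this abstract relies on (percent amino acid composition, a correlation between composition vectors, and a z-score) can be illustrated with a minimal sketch. The function names and the toy sequences below are hypothetical and are not taken from the paper; the sketch only assumes that AAC means the percentage of each of the 20 standard residues and that the reported correlations are Pearson coefficients.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Return the percentage of each standard residue in `sequence`."""
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(vx * vy)


def z_score(value, mean, sd):
    """Standard score of `value` under a normal model (mean, sd)."""
    return (value - mean) / sd


if __name__ == "__main__":
    # Hypothetical toy sequences standing in for two variant proteomes.
    variant_a = aac("MLLKGWLLAG")
    variant_b = aac("MLLKGWLLAK")
    r = pearson([variant_a[aa] for aa in AMINO_ACIDS],
                [variant_b[aa] for aa in AMINO_ACIDS])
    print(f"Leu AAC: {variant_a['L']:.1f}%  correlation: {r:.3f}")
```

In practice the per-residue mean and standard deviation passed to `z_score` would be estimated across many proteomes, which is what makes the normality check in the abstract relevant.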
Affiliation(s)
- Emmanuel Broni
  - Department of Medicine, Loyola University Medical Center, Loyola University Chicago, Maywood, IL 60153, USA
- Whelton A. Miller
  - Department of Medicine, Loyola University Medical Center, Loyola University Chicago, Maywood, IL 60153, USA
  - Department of Molecular Pharmacology & Neuroscience, Loyola University Medical Center, Loyola University Chicago, Maywood, IL 60153, USA
3
Miotto M, Di Rienzo L, Corsi P, Ruocco G, Raimondo D, Milanetti E. Simulated Epidemics in 3D Protein Structures to Detect Functional Properties. J Chem Inf Model 2020; 60:1884-1891. [PMID: 32011881 DOI: 10.1021/acs.jcim.9b01027] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 11/30/2022]
Abstract
The outcome of an epidemic is closely related to the network of interactions between individuals. Likewise, protein functions depend on the 3D arrangement of their residues and the underlying energetic interaction network. Borrowing ideas from the theoretical framework that has been developed to address the spreading of real diseases, we study for the first time the diffusion of a fictitious epidemic inside the protein nonbonded interaction network, aiming to study network features and properties. Our approach allows us to probe the overall stability and the capability of propagating information in complex 3D structures, proving to be very efficient in addressing different problems, from the assessment of thermal stability to the identification of functional sites.
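The idea of spreading a fictitious epidemic over a residue network can be sketched as follows. This is only an illustration under loud assumptions: the network here is built from a plain distance cutoff (`cutoff=8.0` is an arbitrary choice) rather than from the energetic nonbonded interactions the abstract describes, and the spreading rule is a simple susceptible-infected model with per-contact probability `beta`. All names are hypothetical.

```python
import random
from math import dist


def contact_network(coords, cutoff=8.0):
    """Link residues whose representative atoms lie within `cutoff`.

    Assumption: a geometric cutoff stands in for the energetic
    nonbonded interaction network used in the paper.
    """
    neighbors = {i: set() for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if dist(coords[i], coords[j]) <= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def simulate_si(neighbors, seed_node, beta, max_steps=100, rng=None):
    """Discrete-time susceptible-infected spreading.

    Returns the number of infected residues after each step,
    starting with the seed alone.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    infected = {seed_node}
    history = [len(infected)]
    for _ in range(max_steps):
        if len(infected) == len(neighbors):
            break  # every residue has been reached
        newly = {nb for node in infected for nb in neighbors[node]
                 if nb not in infected and rng.random() < beta}
        infected |= newly
        history.append(len(infected))
    return history


if __name__ == "__main__":
    # Hypothetical coordinates: four residues on a line, 5 units apart.
    coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0),
              (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
    net = contact_network(coords)
    print(simulate_si(net, seed_node=0, beta=1.0))
```

How quickly `history` saturates from different seed residues is the kind of network-level readout such an approach would compare across structures.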
Affiliation(s)
- Mattia Miotto
  - Department of Physics, Sapienza University, Rome 00185, Italy
  - Center for Life Nanoscience, Istituto Italiano di Tecnologia, Rome 00161, Italy
- Pietro Corsi
  - Department of Science, Roma Tre University, Rome 00154, Italy
- Giancarlo Ruocco
  - Department of Physics, Sapienza University, Rome 00185, Italy
  - Center for Life Nanoscience, Istituto Italiano di Tecnologia, Rome 00161, Italy
- Domenico Raimondo
  - Department of Molecular Medicine, Sapienza University, Rome 00161, Italy
- Edoardo Milanetti
  - Department of Physics, Sapienza University, Rome 00185, Italy
  - Center for Life Nanoscience, Istituto Italiano di Tecnologia, Rome 00161, Italy
4
Kumar A, Biswas P. Effect of site-directed point mutations on protein misfolding: A simulation study. Proteins 2019; 87:760-773. [PMID: 31017329 DOI: 10.1002/prot.25702] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 09/22/2018] [Revised: 03/19/2019] [Accepted: 04/22/2019] [Indexed: 11/09/2022]
Abstract
A Monte Carlo simulation-based sequence design method is proposed to investigate the role of site-directed point mutations in protein misfolding. Site-directed point mutations are incorporated in the designed sequences of selected proteins. While most mutated sequences correctly fold to their native conformation, some of them stabilize in other nonnative conformations and thus misfold/unfold. The results suggest that a critical number of hydrophobic amino acid residues must be present in the core of correctly folded proteins, whereas proteins misfold/unfold if the number of hydrophobic residues falls below this critical limit. A protein can accommodate only a particular number of hydrophobic residues at the surface, provided a large number of hydrophilic residues are present at the surface and the critical hydrophobicity of the core is preserved. Some surface sites are observed to be as sensitive to site-directed point mutations as the core sites. Point mutations to highly polar and charged amino acids increase the misfold/unfold propensity of proteins. Substitution of natural amino acids at sites with different numbers of nonbonded contacts suggests that both the identity of an amino acid and its site-specificity determine the stability of a protein. A clash-match method is developed to calculate the number of matching and clashing interactions in the mutated protein sequences. While misfolded/unfolded sequences have a higher number of clashing and a lower number of matching interactions, the correctly folded sequences have a lower number of clashing and a higher number of matching interactions. These results are valid for different SCOP classes of proteins.
Affiliation(s)
- Adesh Kumar
  - Department of Chemistry, University of Delhi, Delhi, India
- Parbati Biswas
  - Department of Chemistry, University of Delhi, Delhi, India
5
Enzyme annotation for orphan and novel reactions using knowledge of substrate reactive sites. Proc Natl Acad Sci U S A 2019; 116:7298-7307. [PMID: 30910961 PMCID: PMC6462048 DOI: 10.1073/pnas.1818877116] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Indexed: 11/18/2022] Open
Abstract
Recent advances in synthetic biochemistry have resulted in a wealth of novel hypothetical enzymatic reactions that are not matched to protein-encoding genes, deeming them “orphan.” A large number of known metabolic enzymes are also orphan, leaving important gaps in metabolic network maps. Proposing genes for the catalysis of orphan reactions is critical for applications ranging from biotechnology to medicine. In this work, the computational method BridgIT identified potential enzymes of orphan reactions and nearly all theoretically possible biochemical transformations, providing candidate genes to catalyze these reactions to the research community. The BridgIT online tool will allow researchers to fill the knowledge gaps in metabolic networks and will act as a starting point for designing novel enzymes to catalyze nonnatural transformations. Thousands of biochemical reactions with characterized activities are “orphan,” meaning they cannot be assigned to a specific enzyme, leaving gaps in metabolic pathways. Novel reactions predicted by pathway-generation tools also lack associated sequences, limiting protein engineering applications. Associating orphan and novel reactions with known biochemistry and suggesting enzymes to catalyze them is a daunting problem. We propose the method BridgIT to identify candidate genes and catalyzing proteins for these reactions. This method introduces information about the enzyme binding pocket into reaction-similarity comparisons. BridgIT assesses the similarity of two reactions, one orphan and one well-characterized nonorphan reaction, using their substrate reactive sites, their surrounding structures, and the structures of the generated products to suggest enzymes that catalyze the most-similar nonorphan reactions as candidates for also catalyzing the orphan ones. We performed two large-scale validation studies to test BridgIT predictions against experimental biochemical evidence. 
For the 234 orphan reactions from the Kyoto Encyclopedia of Genes and Genomes (KEGG) 2011 (a comprehensive enzymatic-reaction database) that became nonorphan in KEGG 2018, BridgIT predicted the exact or a highly related enzyme for 211 of them. Moreover, for 334 of 379 novel reactions in 2014 that were later cataloged in KEGG 2018, BridgIT predicted the exact or highly similar enzymes. BridgIT requires knowledge about only four connecting bonds around the atoms of the reactive sites to correctly annotate proteins for 93% of analyzed enzymatic reactions. Increasing to seven connecting bonds allowed for the accurate identification of a sequence for nearly all known enzymatic reactions.
6
Iwański J, Suchacka G, Chodak G. Application of the Information Bottleneck method to discover user profiles in a Web store. Journal of Organizational Computing and Electronic Commerce 2018. [DOI: 10.1080/10919392.2018.1444340] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/17/2022]
Affiliation(s)
- Jacek Iwański
  - Institute of Mathematics and Informatics, University of Opole, Opole, Poland
- Grażyna Suchacka
  - Institute of Mathematics and Informatics, University of Opole, Opole, Poland
- Grzegorz Chodak
  - Department of Operations Research, Wroclaw University of Science and Technology, Wroclaw, Poland
7
GRAFENE: Graphlet-based alignment-free network approach integrates 3D structural and sequence (residue order) data to improve protein structural comparison. Sci Rep 2017; 7:14890. [PMID: 29097661 PMCID: PMC5668259 DOI: 10.1038/s41598-017-14411-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 11/29/2016] [Accepted: 10/11/2017] [Indexed: 12/26/2022] Open
Abstract
Initial protein structural comparisons were sequence-based. Since amino acids that are distant in the sequence can be close in the 3-dimensional (3D) structure, 3D contact approaches can complement sequence approaches. Traditional 3D contact approaches study 3D structures directly and are alignment-based. Instead, 3D structures can be modeled as protein structure networks (PSNs). Then, network approaches can compare proteins by comparing their PSNs. These can be alignment-based or alignment-free. We focus on the latter. Existing network alignment-free approaches have drawbacks: 1) They rely on naive measures of network topology. 2) They are not robust to PSN size. They cannot integrate 3) multiple PSN measures or 4) PSN data with sequence data, although this could improve comparison because the different data types capture complementary aspects of the protein structure. We address this by: 1) exploiting well-established graphlet measures via a new network alignment-free approach, 2) introducing normalized graphlet measures to remove the bias of PSN size, 3) allowing for integrating multiple PSN measures, and 4) using ordered graphlets to combine the complementary PSN data and sequence (specifically, residue order) data. We compare synthetic networks and real-world PSNs more accurately and faster than existing network (alignment-free and alignment-based), 3D contact, or sequence approaches.
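To make the notion of a size-normalized graphlet measure concrete, here is a minimal sketch that counts the two connected 3-node graphlets (open paths and triangles) in a small network and converts the counts to fractions. It is not GRAFENE's actual feature set, which the abstract does not spell out; the function names, the restriction to 3-node graphlets, and the normalization by total count are all assumptions made for illustration.

```python
from itertools import combinations


def three_node_graphlets(neighbors):
    """Count induced 3-node paths and triangles in an undirected graph.

    `neighbors` maps each node to the set of its adjacent nodes.
    """
    counts = {"path": 0, "triangle": 0}
    for a, b, c in combinations(sorted(neighbors), 3):
        edges = ((b in neighbors[a]) + (c in neighbors[a])
                 + (c in neighbors[b]))
        if edges == 3:
            counts["triangle"] += 1
        elif edges == 2:
            counts["path"] += 1
    return counts


def normalize(counts):
    """Turn raw counts into fractions so networks of different size compare."""
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy network: a triangle (0, 1, 2) with a pendant node 3.
    net = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(normalize(three_node_graphlets(net)))
```

The brute-force enumeration is cubic in the number of nodes, which is acceptable for a toy example but not for large protein structure networks.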
8
Saidijam M, Patching SG. Amino acid composition analysis of secondary transport proteins from Escherichia coli with relation to functional classification, ligand specificity and structure. J Biomol Struct Dyn 2015; 33:2205-20. [DOI: 10.1080/07391102.2014.998283] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 01/09/2023]
Affiliation(s)
- Massoud Saidijam
  - Department of Molecular Medicine and Genetics, Research Centre for Molecular Medicine, School of Medicine, Hamadan University of Medical Sciences, Hamadan, Iran
- Simon G. Patching
  - Department of Molecular Medicine and Genetics, Research Centre for Molecular Medicine, School of Medicine, Hamadan University of Medical Sciences, Hamadan, Iran
9
Feiglin A, Ashkenazi S, Schlessinger A, Rost B, Ofran Y. Co-expression and co-localization of hub proteins and their partners are encoded in protein sequence. Molecular BioSystems 2014; 10:787-94. [PMID: 24457447 DOI: 10.1039/c3mb70411d] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/23/2023]
Abstract
Spatiotemporal coordination is a critical factor in biological processes. Some hubs in protein-protein interaction networks tend to be co-expressed and co-localized with their partners more strongly than others, a difference which is arguably related to functional differences between the hubs. Based on numerous analyses of yeast hubs, it has been suggested that differences in co-expression and co-localization are reflected in the structural and molecular characteristics of the hubs. We hypothesized that if indeed differences in co-expression and co-localization are encoded in the molecular characteristics of the protein, it may be possible to predict the tendency for co-expression and co-localization of human hubs based on features learned from systematically characterized yeast hubs. Thus, we trained a prediction algorithm on hubs from yeast that were classified as either strongly or weakly co-expressed and co-localized with their partners, and applied the trained model to 800 human hub proteins. We found that the algorithm significantly distinguishes between human hubs that are co-expressed and co-localized with their partners and hubs that are not. The prediction is based on sequence derived features such as "stickiness", i.e. the existence of multiple putative binding sites that enable multiple simultaneous interactions, "plasticity", i.e. the existence of predicted structural disorder which conjecturally allows for multiple consecutive interactions with the same binding site and predicted subcellular localization. These results suggest that spatiotemporal dynamics is encoded, at least in part, in the amino acid sequence of the protein and that this encoding is similar in yeast and in human.
Affiliation(s)
- Ariel Feiglin
  - The Goodman Faculty of Life Sciences, Bar Ilan University, Ramat Gan 52900, Israel
10
Rackovsky S, Scheraga HA. On the information content of protein sequences. J Biomol Struct Dyn 2011; 28:593-4; discussion 669-674. [PMID: 21142228 DOI: 10.1080/073911011010524957] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 10/28/2022]
Affiliation(s)
- S Rackovsky
  - Dept of Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine of NYU, One Gustave L Levy Place, New York, NY 10029, USA
11
Abstract
The effectiveness of sequence alignment in detecting structural homology among protein sequences decreases markedly when pairwise sequence identity is low (the so-called "twilight zone" problem of sequence alignment). Alternative sequence comparison strategies able to detect structural kinship among highly divergent sequences are necessary to address this need. Among them are alignment-free methods, which use global sequence properties (such as amino acid composition) to identify structural homology in a rapid and straightforward way. We explore the viability of using tetramer sequence fragment composition profiles in finding structural relationships that lie undetected by traditional alignment. We establish a strategy to recast any given protein sequence into a tetramer sequence fragment composition profile, using a series of amino acid clustering steps that have been optimized for mutual information. Our method has the effect of compressing the set of 160,000 unique tetramers (if using the 20-letter amino acid alphabet) into a more tractable number of reduced tetramers (approximately 15-30), so that a meaningful tetramer composition profile can be constructed. We test remote homology detection at the topology and fold superfamily levels using a comprehensive set of fold homologs, culled from the CATH database that share low pairwise sequence similarity. Using the receiver-operating characteristic measure, we demonstrate potentially significant improvement in using information-optimized reduced tetramer composition, over methods relying only on the raw amino acid composition or on traditional sequence alignment, in homology detection at or below the "twilight zone".
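The compression step described here can be sketched in a few lines: map each residue to a reduced alphabet, slide a window of four, and tabulate tetramer frequencies. The two-letter hydrophobic/polar grouping below is a hypothetical stand-in, not the mutual-information-optimized clustering the abstract refers to, and the function names are invented for this example. The sketch also confirms the figure of 160,000 unique tetramers for a 20-letter alphabet.

```python
from collections import Counter
from itertools import product

# Assumption: an illustrative two-class grouping, NOT the
# information-optimized clustering described in the abstract.
HYDROPHOBIC = set("AVLIMFWC")

FULL_TETRAMERS = 20 ** 4  # number of tetramers over the 20-letter alphabet


def reduce_sequence(sequence):
    """Map each residue to 'h' (hydrophobic) or 'p' (everything else)."""
    return "".join("h" if r in HYDROPHOBIC else "p" for r in sequence.upper())


def tetramer_profile(sequence):
    """Frequency of every reduced tetramer over overlapping windows."""
    reduced = reduce_sequence(sequence)
    windows = [reduced[i:i + 4] for i in range(len(reduced) - 3)]
    counts = Counter(windows)
    keys = ["".join(p) for p in product("hp", repeat=4)]
    total = len(windows)
    return {k: (counts[k] / total if total else 0.0) for k in keys}


if __name__ == "__main__":
    profile = tetramer_profile("MKVLAAGIWE")  # hypothetical toy sequence
    print(FULL_TETRAMERS, len(profile))
    print({k: round(v, 3) for k, v in profile.items() if v > 0})
```

A two-letter alphabet yields 16 reduced tetramers, which happens to fall within the range of roughly 15 to 30 mentioned in the abstract; two such profiles can then be compared with any vector distance.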
Affiliation(s)
- Armando D. Solis
  - Biological Sciences Department, New York City College of Technology, The City University of New York, Brooklyn, NY 11201, U.S.A. phone: (718) 260-5894, fax: (718)
- Shalom R. Rackovsky
  - Department of Pharmacology and Systems Therapeutics, Box 1603, Mount Sinai School of Medicine, One Gustave L. Levy Place, New York, NY 10029, U.S.A. phone: (212) 241-4868, fax: (212) 996-7214
12
Yomtovian I, Teerakulkittipong N, Lee B, Moult J, Unger R. Composition bias and the origin of ORFan genes. Bioinformatics 2010; 26:996-9. [PMID: 20231229 PMCID: PMC2853687 DOI: 10.1093/bioinformatics/btq093] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/23/2022]
Abstract
Motivation: Intriguingly, sequence analysis of genomes reveals that a large number of genes are unique to each organism. The origin of these genes, termed ORFans, is not known. Here, we explore the origin of ORFan genes by defining a simple measure called ‘composition bias’, based on the deviation of the amino acid composition of a given sequence from the average composition of all proteins of a given genome. Results: For a set of 47 prokaryotic genomes, we show that the amino acid composition bias of real proteins, random ‘proteins’ (created by using the nucleotide frequencies of each genome) and ‘proteins’ translated from intergenic regions are distinct. For ORFans, we observed a correlation between their composition bias and their relative evolutionary age. Recent ORFan proteins have compositions more similar to those of random ‘proteins’, while the compositions of more ancient ORFan proteins are more similar to those of the set of all proteins of the organism. This observation is consistent with an evolutionary scenario wherein ORFan genes emerged and underwent a large number of random mutations and selection, eventually adapting to the composition preference of their organism over time. Contact: ron@biocoml.ls.biu.ac.il Supplementary information: Supplementary data are available at Bioinformatics online.
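A composition-bias score of the kind described can be sketched as the distance between a sequence's residue frequencies and the genome-wide frequencies. The abstract does not give the exact formula, so the Euclidean distance and the pooling of all residues to obtain the genome average are assumptions, as are the function names and toy sequences.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def frequencies(sequences):
    """Pooled residue frequencies over one or more sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(r for r in seq.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def composition_bias(sequence, proteome):
    """Euclidean deviation of a sequence from the proteome composition.

    Assumption: the paper's exact deviation measure may differ.
    """
    seq_freq = frequencies([sequence])
    genome_freq = frequencies(proteome)
    return sqrt(sum((s - g) ** 2 for s, g in zip(seq_freq, genome_freq)))


if __name__ == "__main__":
    proteome = ["MKLAVLG", "MSTLLEA", "MGKVALI"]  # hypothetical toy proteome
    print(round(composition_bias("MWWCCHH", proteome), 3))
    print(round(composition_bias("MKLAVLG", proteome), 3))
```

Under this definition a sequence whose composition matches the proteome scores zero, and more atypical sequences score higher, which is the direction of comparison the abstract draws between recent and ancient ORFans.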
Affiliation(s)
- Inbal Yomtovian
  - Department of Computer Sciences, Bar-Ilan University, Ramat-Gan 52900, Israel
13
Sequence physical properties encode the global organization of protein structure space. Proc Natl Acad Sci U S A 2009; 106:14345-8. [PMID: 19706520 DOI: 10.1073/pnas.0903433106] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Indexed: 11/18/2022] Open
Abstract
It is demonstrated that, properly represented, the amino acid composition of protein sequences contains the information necessary to delineate the global properties of protein structure space. A numerical representation of amino acid sequence in terms of a set of property factors is used, and the values of those property factors are averaged over individual sequences and then over sets of sequences belonging to structurally defined groups. These sequence sets then can be viewed as points in a 10-dimensional space, and the organization of that space, determined only by sequence properties, is similar at both local and global scales to that of the space of protein structures determined previously.
14
Identification of protein functions using a machine-learning approach based on sequence-derived properties. Proteome Sci 2009; 7:27. [PMID: 19664241 PMCID: PMC2731080 DOI: 10.1186/1477-5956-7-27] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Received: 03/13/2009] [Accepted: 08/09/2009] [Indexed: 02/07/2023] Open
Abstract
Background: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities. Results: A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function. Conclusion: We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.
15
Lisewski AM. Random amino acid mutations and protein misfolding lead to Shannon limit in sequence-structure communication. PLoS One 2008; 3:e3110. [PMID: 18769673 PMCID: PMC2518838 DOI: 10.1371/journal.pone.0003110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/20/2008] [Accepted: 07/28/2008] [Indexed: 11/18/2022] Open
Abstract
The transmission of genomic information from coding sequence to protein structure during protein synthesis is subject to stochastic errors. To analyze transmission limits in the presence of spurious errors, Shannon's noisy channel theorem is applied to a communication channel between amino acid sequences and their structures established from a large-scale statistical analysis of protein atomic coordinates. While Shannon's theorem confirms that in close to native conformations information is transmitted with limited error probability, additional random errors in sequence (amino acid substitutions) and in structure (structural defects) trigger a decrease in communication capacity toward a Shannon limit at 0.010 bits per amino acid symbol at which communication breaks down. In several controls, simulated error rates above a critical threshold and models of unfolded structures always produce capacities below this limiting value. Thus an essential biological system can be realistically modeled as a digital communication channel that is (a) sensitive to random errors and (b) restricted by a Shannon error limit. This forms a novel basis for predictions consistent with observed rates of defective ribosomal products during protein synthesis, and with the estimated excess of mutual information in protein contact potentials.
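The channel view in this abstract rests on mutual information between an input symbol (an amino acid) and an output state (a structural class), measured in bits per symbol. The sketch below computes that quantity from a joint count table and shows how mixing in random substitutions lowers it. It does not reproduce the paper's channel or its 0.010-bit limit: the tables are hypothetical, the noise model is a simple uniform mix, and mutual information at the empirical input distribution is only a lower bound on channel capacity.

```python
from math import log2


def mutual_information(joint):
    """I(X;Y) in bits from a table of joint counts (rows X, columns Y)."""
    total = sum(sum(row) for row in joint)
    if total <= 0:
        raise ValueError("joint table is empty")
    px = [sum(row) / total for row in joint]
    py = [sum(row[j] for row in joint) / total for j in range(len(joint[0]))]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, count in enumerate(row):
            if count > 0:
                pxy = count / total
                mi += pxy * log2(pxy / (px[i] * py[j]))
    return mi


def add_uniform_noise(joint, eps):
    """Mix the joint distribution with an independent uniform one.

    Assumption: a crude stand-in for random substitutions and
    structural defects, with 0 <= eps <= 1.
    """
    total = sum(sum(row) for row in joint)
    cells = len(joint) * len(joint[0])
    return [[(1 - eps) * c / total + eps / cells for c in row]
            for row in joint]


if __name__ == "__main__":
    clean = [[40, 10], [10, 40]]  # hypothetical symbol-vs-state counts
    for eps in (0.0, 0.5, 0.9):
        print(eps, round(mutual_information(add_uniform_noise(clean, eps)), 4))
```

Computing a true capacity would additionally require maximizing over input distributions, for example with the Blahut-Arimoto algorithm, which is omitted here.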
Affiliation(s)
- Andreas Martin Lisewski
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
16
DNA codes and information: formal structures and relational causes. Acta Biotheor 2008; 56:205-32. [PMID: 18465197 DOI: 10.1007/s10441-008-9049-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/10/2007] [Accepted: 03/09/2008] [Indexed: 02/03/2023]
Abstract
Recently the terms "codes" and "information" as used in the context of molecular biology have been the subject of much discussion. Here I propose that a variety of structural realism can assist us in rethinking the concepts of DNA codes and information apart from semantic criteria. Using the genetic code as a theoretical backdrop, a necessary distinction is made between codes qua symbolic representations and information qua structure that accords with data. Structural attractors are also shown to be entailed by the mapping relation that any DNA code is a part of (as the domain). In this framework, these attractors are higher-order informational structures that obviate any "DNA-centric" reductionism. In addition to the implications that are discussed, this approach validates the array of coding systems now recognized in molecular biology.
17
Taguchi YH, Gromiha MM. Application of amino acid occurrence for discriminating different folding types of globular proteins. BMC Bioinformatics 2007; 8:404. [PMID: 17953741 PMCID: PMC2174517 DOI: 10.1186/1471-2105-8-404] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.3] [Received: 06/01/2007] [Accepted: 10/22/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Predicting the three-dimensional structure of a protein from its amino acid sequence is a long-standing goal in computational and molecular biology. Discriminating structural classes and folding types is an intermediate step in protein structure prediction. RESULTS: We propose a method based on linear discriminant analysis (LDA) for discriminating 30 different folding types of globular proteins using amino acid occurrence. The method was tested on a non-redundant set of 1612 proteins and discriminated them with an accuracy of 38%, which is comparable to or better than other methods in the literature. A web server for discriminating the folding type of a query protein from its amino acid sequence is available at http://granular.com/PROLDA/. CONCLUSION: Amino acid occurrence has been successfully used to discriminate different folding types of globular proteins. The discrimination accuracy obtained with amino acid occurrence is better than that obtained with amino acid composition and/or amino acid properties. In addition, the method produces results very quickly.
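To make the distinction between the two feature types concrete, here is a minimal sketch that assumes amino acid occurrence means the raw count of each of the 20 standard residues and amino acid composition means those counts divided by the number of standard residues. These definitions, the function names `occurrence` and `composition`, and the example sequence are assumptions for illustration. The LDA step of the cited method is not reproduced here.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code, in a fixed feature order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def occurrence(sequence):
    """Return a 20-element list of raw residue counts (assumed definition)."""
    counts = Counter(sequence.upper())
    # Non-standard symbols (e.g. X) are ignored by construction.
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def composition(sequence):
    """Return a 20-element list of residue fractions that sum to 1."""
    occ = occurrence(sequence)
    total = sum(occ)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [n / total for n in occ]


# Hypothetical short sequence used only to show the two feature vectors.
seq = "AACD"
print(occurrence(seq)[:3])
print(composition(seq)[:3])
```

Under these assumed definitions, occurrence retains information about sequence length that composition normalizes away, so the two vectors differ only by a per-protein scale factor.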
Affiliation(s)
- Y-h Taguchi
- Department of Physics, Faculty of Science and Technology, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
- Institute for Science and Technology, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
- M Michael Gromiha
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), AIST Tokyo Waterfront Bio-IT Research Building, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
|
18
|
Fujishima K, Komasa M, Kitamura S, Suzuki H, Tomita M, Kanai A. Proteome-wide prediction of novel DNA/RNA-binding proteins using amino acid composition and periodicity in the hyperthermophilic archaeon Pyrococcus furiosus. DNA Res 2007; 14:91-102. [PMID: 17573465 PMCID: PMC2779898 DOI: 10.1093/dnares/dsm011] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 11/29/2022] Open
Abstract
Proteins play a critical role in complex biological systems, yet about half of the proteins in publicly available databases are annotated as functionally unknown. Proteome-wide functional classification using bioinformatics approaches is thus becoming an important method for revealing unknown protein functions. Using the hyperthermophilic archaeon Pyrococcus furiosus as a model species, we used the support vector machine (SVM) method to discriminate DNA/RNA-binding proteins from proteins with other functions, using amino acid composition and periodicities as feature vectors. We defined the resulting values as the composition (CO) score and the periodicity (PD) score. The P. furiosus proteins were classified into three classes (I–III) on the basis of a two-dimensional correlation analysis of the CO and PD scores. As a result, approximately 87% of the functionally known proteins categorized as class I proteins (CO score + PD score > 0.6) were found to be DNA/RNA-binding proteins. Applying the two-dimensional correlation analysis to the 994 hypothetical proteins in P. furiosus, a total of 151 proteins were predicted to be novel DNA/RNA-binding protein candidates. The DNA/RNA-binding activities of randomly chosen hypothetical proteins were experimentally verified. Six out of seven candidate proteins in class I possessed DNA/RNA-binding activities, supporting the efficacy of our method.
Affiliation(s)
- Kosuke Fujishima
- Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0017, Japan
- Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa 252-8520, Japan
- Mizuki Komasa
- Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0017, Japan
- Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa 252-8520, Japan
- Sayaka Kitamura
- Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0017, Japan
- Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa 252-8520, Japan
- Haruo Suzuki
- Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0017, Japan
- Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa 252-8520, Japan
- Masaru Tomita
- Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0017, Japan
- Faculty of Environment and Information Studies, Keio University, Fujisawa 252-8520, Japan
- Akio Kanai
- Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0017, Japan
- Faculty of Environment and Information Studies, Keio University, Fujisawa 252-8520, Japan
- To whom correspondence should be addressed. Tel. +81 235-29-0524. Fax. +81 235-29-0525. E-mail:
|