1
Zhang Y, Pang D, Wang Z, Ma L, Chen Y, Yang L, Xiao W, Yuan H, Chang F, Ouyang H. An integrative analysis of genotype-phenotype correlation in Charcot Marie Tooth type 2A disease with MFN2 variants: A case and systematic review. Gene 2023; 883:147684. [PMID: 37536398 DOI: 10.1016/j.gene.2023.147684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Revised: 06/24/2023] [Accepted: 07/31/2023] [Indexed: 08/05/2023]
Abstract
Dominant genetic variants in the mitofusin 2 (MFN2) gene lead to Charcot-Marie-Tooth type 2A (CMT2A), a neurodegenerative disease caused by genetic defects that directly damage axons. In this study, we report a proband with a pathogenic variant in the GTPase domain of MFN2, c.494A>G (p.His165Arg). To date, at least 184 distinct MFN2 variants identified in 944 independent probands have been reported in 131 references. However, the field of medical genetics has long been challenged by the question of how genetic variation in the MFN2 gene is associated with disease phenotypes. Here, by collating MFN2 variant data and patient clinical information from the Leiden Open Variation Database 3.0, the NCBI ClinVar database, and available related references in PubMed, we determined the mutation frequency, age of onset, sex ratio, and geographical distribution. Furthermore, an analysis of the relationship between variants and phenotypes from multiple genetic perspectives indicated that insertions and deletions (indels), copy number variants (CNVs), duplication variants, and nonsense mutations among single nucleotide variants (SNVs) tend to be pathogenic, and the results emphasized the importance of the GTPase domain to the structure and function of MFN2. Overall, three reliable classification methods of MFN2 genotype-phenotype associations provide insights into the prediction of CMT2A disease severity. Many MFN2 variants still lack a clear clinical significance, which requires clinicians to make more accurate clinical diagnoses.
Affiliation(s)
- Yuanzhu Zhang
- Key Laboratory of Zoonosis Research, Ministry of Education, College of Animal Sciences, Jilin University, Changchun 130062, China.
- Daxin Pang
- Key Laboratory of Zoonosis Research, Ministry of Education, College of Animal Sciences, Jilin University, Changchun 130062, China; Chongqing Research Institute, Jilin University, Chongqing 401120, China; Chongqing Jitang Biotechnology Research Institute Co., Ltd., Chongqing 401120, China.
- Ziru Wang
- Key Laboratory of Zoonosis Research, Ministry of Education, College of Animal Sciences, Jilin University, Changchun 130062, China.
- Lerong Ma
- Key Laboratory of Zoonosis Research, Ministry of Education, College of Animal Sciences, Jilin University, Changchun 130062, China.
- Yiwu Chen
- Key Laboratory of Zoonosis Research, Ministry of Education, College of Animal Sciences, Jilin University, Changchun 130062, China.
- Lin Yang
- Key Laboratory of Zoonosis Research, Ministry of Education, College of Animal Sciences, Jilin University, Changchun 130062, China.
- Wenyu Xiao
- Key Laboratory of Zoonosis Research, Ministry of Education, College of Animal Sciences, Jilin University, Changchun 130062, China.
- Hongming Yuan
- Key Laboratory of Zoonosis Research, Ministry of Education, College of Animal Sciences, Jilin University, Changchun 130062, China; Chongqing Research Institute, Jilin University, Chongqing 401120, China.
- Fei Chang
- Department of Orthopedics, The Second Hospital of Jilin University, Changchun 130022, China.
- Hongsheng Ouyang
- Key Laboratory of Zoonosis Research, Ministry of Education, College of Animal Sciences, Jilin University, Changchun 130062, China; Chongqing Research Institute, Jilin University, Chongqing 401120, China; Chongqing Jitang Biotechnology Research Institute Co., Ltd., Chongqing 401120, China.
2
Computer Simulation and Additive-Based Refolding Process of Cysteine-Rich Proteins: VEGF-A as a Model. Int J Pept Res Ther 2017. [DOI: 10.1007/s10989-017-9644-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/13/2023]
3
Fang Y, Middaugh CR, Fang J. In silico classification of proteins from acidic and neutral cytoplasms. PLoS One 2012; 7:e45585. [PMID: 23049817 PMCID: PMC3458925 DOI: 10.1371/journal.pone.0045585] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 03/25/2012] [Accepted: 08/23/2012] [Indexed: 01/05/2023] [Open access]
Abstract
Protein acidostability is a common problem in biopharmaceutical and other industries. However, it remains a great challenge to engineer proteins for enhanced acidostability because our knowledge of protein acidostabilization is still very limited. In this paper, we present a comparative study of proteins from bacteria with acidic (AP) and neutral cytoplasms (NP) using an integrated statistical and machine learning approach. We construct a set of 393 non-redundant AP-NP ortholog pairs and calculate a total of 889 sequence-based features for these proteins. The pairwise alignments of these ortholog pairs are used to build a residue substitution propensity matrix between APs and NPs. We use the Gini importance provided by the Random Forest algorithm to rank the relative importance of these features. A scoring function using the 10 most significant features is developed and optimized using a hill climbing algorithm. The accuracy of the scoring function is 86.01% in predicting AP-NP ortholog pairs and 76.65% in predicting non-ortholog AP-NP pairs, suggesting that there are significant differences between APs and NPs which can be used to predict the relative acidostability of proteins. The overall trends uncovered in the study can be used as general guidelines for designing acidostable proteins. To the best of our knowledge, this work represents the first systematic comparative study of acidostable proteins and their non-acidostable orthologs.
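Editorial sketch: the abstract above mentions a linear scoring function whose weights are tuned by hill climbing. The abstract does not give the update rule, so the coordinate-wise search below is only an assumption, and `pair_accuracy`, `hill_climb`, and the toy feature pairs are hypothetical names and data used for illustration.

```python
def pair_accuracy(weights, pairs):
    """Fraction of (ap_features, np_features) pairs in which the AP member scores higher."""
    def score(features):
        return sum(w * x for w, x in zip(weights, features))
    correct = sum(1 for ap, np_ in pairs if score(ap) > score(np_))
    return correct / len(pairs)


def hill_climb(pairs, n_features, step_size=0.1, max_rounds=100):
    """Greedy coordinate search: keep any single-weight change that raises accuracy."""
    weights = [0.0] * n_features
    best = pair_accuracy(weights, pairs)
    for _ in range(max_rounds):
        improved = False
        for i in range(n_features):
            for delta in (step_size, -step_size):
                candidate = list(weights)
                candidate[i] += delta
                acc = pair_accuracy(candidate, pairs)
                if acc > best:
                    weights, best, improved = candidate, acc, True
        if not improved:
            break  # local optimum reached: no single step helps
    return weights, best


# Toy data (hypothetical): feature 0 is consistently higher in the AP member.
toy_pairs = [((1.0, 0.2), (0.5, 0.3)),
             ((0.9, 0.5), (0.2, 0.4)),
             ((0.8, 0.1), (0.1, 0.9))]
fitted_weights, fitted_accuracy = hill_climb(toy_pairs, n_features=2)
```

Hill climbing of this kind only finds a local optimum, so in practice one would probably restart from several initial weight vectors.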
Affiliation(s)
- Yaping Fang
- Applied Bioinformatics Laboratory, The University of Kansas, Lawrence, Kansas, United States of America
- C. Russell Middaugh
- Department of Pharmaceutical Chemistry, The University of Kansas, Lawrence, Kansas, United States of America
- Jianwen Fang
- Applied Bioinformatics Laboratory, The University of Kansas, Lawrence, Kansas, United States of America
- * E-mail:
4
Lee TY, Lu CT, Chen SA, Bretaña NA, Cheng TH, Su MG, Huang KY. Investigation and identification of protein γ-glutamyl carboxylation sites. BMC Bioinformatics 2011; 12 Suppl 13:S10. [PMID: 22372765 PMCID: PMC3278826 DOI: 10.1186/1471-2105-12-s13-s10] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/24/2023] [Open access]
Abstract
BACKGROUND: Carboxylation is a post-translational modification of glutamate (Glu) residues that is catalyzed by γ-glutamyl carboxylase in the lumen of the endoplasmic reticulum. Vitamin K is a critical co-factor in the post-translational conversion of Glu residues to γ-carboxyglutamate (Gla) residues. It has been shown that the process of carboxylation is involved in the blood clotting cascade, bone growth, and extraosseous calcification. However, studies in this field have been limited by the difficulty of experimentally studying substrate site specificity in γ-glutamyl carboxylation. In silico investigations have the potential for characterizing carboxylated sites before experiments are carried out. RESULTS: Because of the importance of γ-glutamyl carboxylation in biological mechanisms, this study investigates the substrate site specificity in carboxylation sites. It considers not only the composition of amino acids that surround carboxylation sites, but also the structural characteristics of these sites, including secondary structure and solvent-accessible surface area (ASA). The explored features are used to establish a predictive model for differentiating between carboxylation sites and non-carboxylation sites. A support vector machine (SVM) is employed to establish a predictive model with various features. A five-fold cross-validation evaluation reveals that the SVM model, trained with the combined features of positional weighted matrix (PWM), amino acid composition (AAC), and ASA, yields the highest accuracy (0.892). Furthermore, an independent testing set is constructed to evaluate whether the predictive model is over-fitted to the training set. CONCLUSIONS: Independent testing data that did not undergo the cross-validation process show that the proposed model can differentiate between carboxylation sites and non-carboxylation sites. This investigation is the first to study carboxylation sites and to develop a system for identifying them. The proposed method is a practical means of preliminary analysis and greatly diminishes the total number of potential carboxylation sites requiring further experimental confirmation.
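Editorial sketch: the abstract above lists a positional weighted matrix (PWM) among the SVM input features but does not define it. The code below assumes the simplest reading, per-position amino acid frequencies over equal-length windows centred on known sites; `positional_frequencies`, `pwm_features`, and the three-residue windows are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def positional_frequencies(windows):
    """Per-position amino acid frequencies for equal-length sequence windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    matrix = []
    for pos in range(length):
        counts = Counter(w[pos] for w in windows)
        matrix.append({aa: counts[aa] / len(windows) for aa in AMINO_ACIDS})
    return matrix


def pwm_features(window, matrix):
    """Encode one window as the frequency of its residue at each position."""
    return [matrix[pos].get(aa, 0.0) for pos, aa in enumerate(window)]


# Toy windows (hypothetical), each centred on a Glu (E) residue.
site_windows = ["AEC", "AEG", "KEC", "AED"]
site_matrix = positional_frequencies(site_windows)
encoded = pwm_features("AEC", site_matrix)
```

The resulting vector could then be concatenated with composition and accessibility features before being passed to a classifier.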
Affiliation(s)
- Tzong-Yi Lee
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 320, Taiwan.
5
Li Y, Middaugh CR, Fang J. A novel scoring function for discriminating hyperthermophilic and mesophilic proteins with application to predicting relative thermostability of protein mutants. BMC Bioinformatics 2010; 11:62. [PMID: 20109199 PMCID: PMC3098108 DOI: 10.1186/1471-2105-11-62] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.3] [Received: 09/28/2009] [Accepted: 01/28/2010] [Indexed: 11/10/2022] [Open access]
Abstract
BACKGROUND: The ability to design thermostable proteins is theoretically important and practically useful. Robust and accurate algorithms, however, remain elusive. One critical problem is the lack of reliable methods to estimate the relative thermostability of possible mutants. RESULTS: We report a novel scoring function for discriminating hyperthermophilic and mesophilic proteins with application to predicting the relative thermostability of protein mutants. The scoring function was developed based on an elaborate analysis of a set of features calculated or predicted from 540 pairs of hyperthermophilic and mesophilic protein ortholog sequences. It was constructed by a linear combination of ten important features identified by a feature ranking procedure based on the random forest classification algorithm. The weights of these features in the scoring function were fitted by a hill-climbing algorithm. This scoring function has shown an excellent ability to discriminate hyperthermophilic from mesophilic sequences. The prediction accuracies reached 98.9% and 97.3% in discriminating orthologous pairs in training and the holdout testing datasets, respectively. Moreover, the scoring function can distinguish non-homologous sequences with an accuracy of 88.4%. Additional blind tests using two datasets of experimentally investigated mutations demonstrated that the scoring function can be used to predict the relative thermostability of proteins and their mutants at very high accuracies (92.9% and 94.4%). We also developed an amino acid substitution preference matrix between mesophilic and hyperthermophilic proteins, which may be useful in designing more thermostable proteins. CONCLUSIONS: We have presented a novel scoring function which can distinguish not only HP/MP ortholog pairs, but also non-homologous pairs at high accuracies. Most importantly, it can be used to accurately predict the relative stability of proteins and their mutants, as demonstrated in two blind tests. In addition, the residue substitution preference matrix assembled in this study may reflect the substitution biases induced by thermal adaptation. A web server implementing the scoring function and the dataset used in this study are freely available at http://www.abl.ku.edu/thermorank/.
Affiliation(s)
- Yunqi Li
- Applied Bioinformatics Laboratory, the University of Kansas, Lawrence, KS 66047, USA
6
Su Y, Zou Z, Feng S, Zhou P, Cao L. The acidity of protein fusion partners predominantly determines the efficacy to improve the solubility of the target proteins expressed in Escherichia coli. J Biotechnol 2007; 129:373-82. [PMID: 17374413 DOI: 10.1016/j.jbiotec.2007.01.015] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Received: 10/28/2006] [Revised: 01/14/2007] [Accepted: 01/18/2007] [Indexed: 11/17/2022]
Abstract
Maximization of soluble protein expression in Escherichia coli (E. coli) via the fusion expression strategy is usually preferred for academic, industrial and pharmaceutical purposes. In this study, a set of distinct protein fusion partners was comparatively evaluated for the ability to promote the soluble expression of two target proteins: bovine enterokinase, which is largely prone to aggregation, and green fluorescent protein, which has moderate native solubility. Among the protein attributes that are putatively involved in protein solubility, protein acidity was of particular concern. Our results indicated that protein fusion partners with stronger acidity exhibited a markedly higher capacity to enhance the solubility of the target proteins. Among them, msyB, an acidic E. coli protein that suppresses mutants lacking protein export function, was revealed as an excellent protein fusion partner with distinguishing features including a high potential to enhance protein solubility, efficient expression, relatively small size and an origin in E. coli itself. In principle, our results confirmed the modified solubility model of Wilkinson-Harrison and deepened the understanding of its essence. The roles of other parameters such as protein hydrophilicity in solubility enhancement were also discussed, and a guideline for designing or searching for an optimal protein solubility enhancer was proposed.
Affiliation(s)
- Yu Su
- School of Life Sciences, East China Normal University, Zhongshan North Road 3663, Shanghai 200062, China
7
Ofran Y, Margalit H. Proteins of the same fold and unrelated sequences have similar amino acid composition. Proteins 2006; 64:275-9. [PMID: 16565950 DOI: 10.1002/prot.20964] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/11/2022]
Abstract
It is well established that there is a relationship between the amino acid composition of a protein and its structural class (i.e., alpha, beta, alpha + beta, or alpha/beta). Several studies have even shown the power of amino acid composition in predicting the secondary structure class of a protein. Herein, we show that significant similarity in amino acid composition exists not only between proteins of the same class, but even between proteins of the same fold. To test conjectural explanations for this phenomenon, we analyzed a set of structurally similar proteins that are dissimilar in sequence. Based on this analysis, we suggest that specific residues that are involved in intramolecular interactions may account for this surprising relationship between composition and structure.
Affiliation(s)
- Yanay Ofran
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA.
8
Sun XD, Huang RB. Prediction of protein structural classes using support vector machines. Amino Acids 2006; 30:469-75. [PMID: 16622605 DOI: 10.1007/s00726-005-0239-0] [Citation(s) in RCA: 100] [Impact Index Per Article: 5.6] [Received: 06/01/2005] [Accepted: 07/12/2005] [Indexed: 11/24/2022]
Abstract
The support vector machine, a machine-learning method, is used to predict the four structural classes, i.e. mainly alpha, mainly beta, alpha-beta and fss, from the topology level of the CATH protein structure database. For the binary classification, any two structural classes which do not share any secondary structure elements, such as alpha and beta, could be classified with as high as 90% accuracy. The accuracy, however, decreases to less than 70% if the structural classes to be classified contain structure elements in common. Our study also shows that feature spaces of dimension 20^2 = 400 (for dipeptides) and 20^3 = 8000 (for tripeptides) give nearly the same prediction accuracy. Among these four structural classes, multi-class classification gives an overall accuracy of about 52%, indicating that the multi-class classification technique in support vector machines may still need to be further improved in future investigation.
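Editorial sketch: the 400-dimensional feature space mentioned above corresponds to one entry per ordered pair of the 20 standard amino acids. A minimal way to build such a vector is shown below; the normalisation by the number of counted dipeptides is an assumption, and `dipeptide_composition` is a hypothetical helper name.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All ordered pairs of residues: 20 * 20 = 400 dipeptides.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element vector of overlapping dipeptide frequencies."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


vector = dipeptide_composition("ACAC")
```

Replacing `repeat=2` with `repeat=3` and the window width with 3 would give the 8000-dimensional tripeptide analogue.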
Affiliation(s)
- X-D Sun
- College of Life Science and Biotechnology, Guangxi University, Nanning, Guangxi, China
9
Milac AL, Avram S, Petrescu AJ. Evaluation of a neural networks QSAR method based on ligand representation using substituent descriptors. Application to HIV-1 protease inhibitors. J Mol Graph Model 2005; 25:37-45. [PMID: 16325439 DOI: 10.1016/j.jmgm.2005.09.014] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 04/01/2005] [Revised: 06/17/2005] [Accepted: 09/29/2005] [Indexed: 11/18/2022]
Abstract
We present here a neural networks method designed to predict biological activity based on a local representation of the ligand. The compounds of the series are represented by a vector mapping for each of four substituent properties: volume, log P, dipole moment and a simple 'steric' parameter relating to its shape. This ligand representation was tested using neural networks on a set of 42 cyclic-urea derivatives inhibiting HIV-1 protease. The leave-one-out cross-validation using all descriptors in the input gave a correlation factor between prediction and experiment of 0.76 for the overall set and 0.88 when three outliers were left out. To rank the significance of the four descriptors, we further tested all combinations of two and three parameters for each substituent, using two disjunctive testing sets of five inhibitors. In these sets, vectors with extreme descriptor values were used either in the training or the testing set (sets A and B, respectively). The method is a very good interpolator (set A, 95 ± 2% accuracy) but a less effective extrapolator (set B, 85 ± 2% accuracy). Generally, the combinations including the 'steric' parameter predict better than average, while those containing the volume are less effective. The best prediction, 98.8 ± 1.2%, was obtained when log P, the dipole and the steric parameter were used on set A. At the opposite end, the lowest ranked descriptor set was obtained when replacing log P with the volume, giving 92.3 ± 6.7% accuracy over set A.
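Editorial sketch: leave-one-out cross-validation, as used above, holds out each compound in turn and predicts it from a model fitted to the rest. The loop below shows only that protocol. The study used neural networks; the nearest-neighbour predictor here is a deliberately simple stand-in, and all names and data are hypothetical.

```python
def leave_one_out_predictions(xs, ys, fit_predict):
    """Predict each sample from a model fitted on all the other samples."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predictions.append(fit_predict(train_x, train_y, xs[i]))
    return predictions


def nearest_neighbour(train_x, train_y, query):
    """Stand-in predictor: return the target of the closest training vector."""
    distances = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[distances.index(min(distances))]


# Toy descriptors and activities (hypothetical).
descriptors = [[0.0], [1.0], [2.0], [3.0]]
activities = [0.0, 1.0, 2.0, 3.0]
loo = leave_one_out_predictions(descriptors, activities, nearest_neighbour)
```

Comparing `loo` with `activities`, for example through a correlation coefficient, gives a cross-validated quality measure of the kind reported in the abstract.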
Affiliation(s)
- Adina-Luminiţa Milac
- Institute of Biochemistry, Splaiul Independenţei 296, Sector 6, Bucharest, Romania
10
Huang Y, Cai J, Ji L, Li Y. Classifying G-protein coupled receptors with bagging classification tree. Comput Biol Chem 2004; 28:275-80. [PMID: 15548454 DOI: 10.1016/j.compbiolchem.2004.08.001] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.2] [Received: 03/30/2004] [Revised: 08/05/2004] [Accepted: 08/06/2004] [Indexed: 11/17/2022]
Abstract
G-protein coupled receptors (GPCRs) play a key role in different biological processes, such as regulation of growth, death and metabolism of cells. They are major therapeutic targets of numerous prescribed drugs. However, the ligand specificity of many receptors is unknown and there is little structural information available. Bioinformatics may offer one approach to bridge the gap between sequence data and functional knowledge of a receptor. In this paper, we use a bagging classification tree algorithm to predict the type of the receptor based on its amino acid composition. The prediction is performed for GPCRs at the sub-family and sub-sub-family level. In a cross-validation test, we achieved an overall predictive accuracy of 91.1% for GPCR sub-family classification, and 82.4% for sub-sub-family classification. These results demonstrate the applicability of this relatively simple method and its potential for improving prediction accuracy.
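Editorial sketch: bagging, as named above, trains each base classifier on a bootstrap resample of the training set and combines the predictions by majority vote. The study used classification trees as base learners; to stay self-contained, the sketch below swaps in a nearest-neighbour learner, and every name and data point is hypothetical.

```python
import random
from collections import Counter


def bagging_predict(train_x, train_y, query, fit, n_models=25, seed=0):
    """Majority vote over base models fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(train_x)
    votes = []
    for _ in range(n_models):
        # Bootstrap: draw n training indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit([train_x[i] for i in idx], [train_y[i] for i in idx])
        votes.append(model(query))
    return Counter(votes).most_common(1)[0][0]


def fit_nearest_neighbour(xs, ys):
    """Stand-in base learner (not a classification tree)."""
    def predict(query):
        distances = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in xs]
        return ys[distances.index(min(distances))]
    return predict


# Toy one-dimensional data (hypothetical): two well-separated classes.
toy_x = [[0.01 * i] for i in range(5)] + [[1.0 + 0.01 * i] for i in range(5)]
toy_y = ["A"] * 5 + ["B"] * 5
label_low = bagging_predict(toy_x, toy_y, [0.02], fit_nearest_neighbour)
label_high = bagging_predict(toy_x, toy_y, [1.03], fit_nearest_neighbour)
```

Bagging mainly helps unstable base learners such as trees, so the benefit with a nearest-neighbour stand-in would be small; the sketch is meant to show the resampling and voting steps only.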
Affiliation(s)
- Ying Huang
- Department of Automation, MOE Key Laboratory of Bioinformatics, Institute of Bioinformatics, Tsinghua University, Beijing 10084, China.
11
Abstract
A correlation analysis among the 20 amino acids is performed for four protein structural classes (alpha, beta, alpha/beta, and alpha+beta) in a total of 204 proteins. The correlation relationships among amino acids can be classified into the following four types: (1) strong positive correlation, (2) strong negative correlation, (3) weak correlation, and (4) no correlation. The correlation relationships are different for different proteins and are correlated with the features of their structural classes. The amino acids with a weak correlation relationship can be treated as independent basis functions for the space in which proteins are defined. The amino acids with large correlation coefficients are linearly correlated with each other and are not independent. The strong correlation among amino acids reflects their mutually constrained relationship, as exhibited by their relevant structural features. The information obtained through the correlation analysis is used for predicting protein structural classes, and a better prediction quality is obtained than with simple geometric distance methods that do not take the correlation effects into account.
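Editorial sketch: one plausible reading of the correlation analysis above is a Pearson correlation between the frequencies of two amino acids, computed across the proteins of a class. The abstract does not state the exact statistic, so this is an assumption; `composition`, `pearson`, `residue_correlation`, and the toy sequences are hypothetical.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fraction of each standard amino acid in a sequence."""
    return {aa: sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def residue_correlation(sequences, aa1, aa2):
    """Correlation between the frequencies of two residues across proteins."""
    comps = [composition(s) for s in sequences]
    return pearson([c[aa1] for c in comps], [c[aa2] for c in comps])


# Toy sequences (hypothetical): A and G trade off against each other.
toy_sequences = ["AAAG", "AAGG", "AGGG"]
r_ag = residue_correlation(toy_sequences, "A", "G")
r_aw = residue_correlation(toy_sequences, "A", "W")
```

In the toy set every A is replaced by a G, which yields a strong negative correlation; W never occurs, so its correlation with A is reported as 0.0 by the convention chosen here.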
Affiliation(s)
- Qishi Du
- Tianjin Institute of Bioinformatics and Drug Discovery (TIBDD), Tianjin Normal University, 241 Weijin Road, Hexi District, Tianjin 300074, China
12
Jin L, Fang W, Tang H. Prediction of protein structural classes by a new measure of information discrepancy. Comput Biol Chem 2003; 27:373-80. [PMID: 12927111 DOI: 10.1016/s1476-9271(02)00087-7] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.7] [Indexed: 10/27/2022]
Abstract
Since it was observed that the structural class of a protein is related to its amino acid composition, various methods based on amino acid composition have been proposed to predict protein structural classes. Though those methods are effective to some degree, their predictive quality is limited because amino acid composition cannot sufficiently capture the information in protein sequences. In this paper, a measure of information discrepancy is applied to the prediction of protein structural classes. Unlike the previous methods, this new approach is based on comparisons of subsequence distributions; therefore, the effect of residue order on protein structure is taken into account. The predictive results of the new approach on the same data set are better than those of the previous methods. For a data set of 1401 sequences with no more than 30% redundancy, the overall correctness rates of the resubstitution test and the jackknife test are 99.4% and 75.02%, respectively, and similar results are obtained on other data sets. All tests demonstrate that the residue order along protein sequences plays an important role in the recognition of protein structural classes, especially for alpha/beta proteins and alpha+beta proteins. In addition, the tests also show that the new method is simple and efficient.
Affiliation(s)
- Lixia Jin
- Institute of Computational Biology and Bioinformatics, Dalian University of Technology, 116025, Dalian, People's Republic of China.
13
Abstract
A family of global geometric measures is constructed for protein structure classification. These measures originate from integral formulas of Vassiliev knot invariants and give rise to a unique classification scheme. Our measures can better discriminate between many known protein structures than the simple measures of the secondary structure content of these protein structures.
Affiliation(s)
- Peter Røgen
- Department of Mathematics, Technical University of Denmark, Building 303, DK-2800 Kongens Lyngby, Denmark.
14
Yu CS, Wang JY, Yang JM, Lyu PC, Lin CJ, Hwang JK. Fine-grained protein fold assignment by support vector machines using generalized npeptide coding schemes and jury voting from multiple-parameter sets. Proteins 2003; 50:531-6. [PMID: 12577258 DOI: 10.1002/prot.10313] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022]
Abstract
In the coarse-grained fold assignment of major protein classes, such as all-alpha, all-beta, alpha + beta, and alpha/beta proteins, one can easily achieve high prediction accuracy from primary amino acid sequences. However, the fine-grained assignment of folds, such as those defined in the Structural Classification of Proteins (SCOP) database, presents a challenge due to the larger number of folds available. A recent study yielded a reasonable prediction accuracy of 56.0% on an independent set of the 27 most populated folds. In this communication, we apply the support vector machine (SVM) method, using a combination of protein descriptors based on the properties derived from the composition of n-peptides and jury voting, to fine-grained fold prediction, and are able to achieve an overall prediction accuracy of 69.6% on the same independent set, significantly higher than the previous results. On 10-fold cross-validation, we obtained a prediction accuracy of 65.3%. Our results show that SVM coupled with suitable global sequence-coding schemes can significantly improve fine-grained fold prediction. Our approach should be useful in structure prediction and modeling.
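Editorial sketch: jury voting, as mentioned above, combines the fold labels proposed by several classifiers trained with different parameter sets. The abstract does not spell out the voting rule, so plain majority voting with ties resolved in favour of the earliest classifier is an assumption; `jury_vote` and the fold labels are hypothetical.

```python
from collections import Counter


def jury_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Labels proposed by three hypothetical classifiers for one protein.
consensus = jury_vote(["fold_3", "fold_7", "fold_3"])
tie_break = jury_vote(["fold_1", "fold_2"])
```

A weighted variant, in which each classifier's vote counts in proportion to its validation accuracy, would be a natural alternative.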
Affiliation(s)
- Chin-Sheng Yu
- Department of Biological Science and Technology, National Chiao Tung University, Hsin Chu, Taiwan
15
Cai YD, Liu XJ, Xu XB, Chou KC. Prediction of protein structural classes by support vector machines. Computers & Chemistry 2002; 26:293-6. [PMID: 11868916 DOI: 10.1016/s0097-8485(01)00113-9] [Citation(s) in RCA: 195] [Impact Index Per Article: 8.9] [Indexed: 11/24/2022]
Abstract
In this paper, we apply a new machine learning method, the support vector machine, to the prediction of protein structural class. The support vector machine method is applied to a database derived from SCOP, which is based upon domains of known structure, their evolutionary relationships, and the principles that govern their 3D structure. As a result, high rates in both the self-consistency and jackknife tests are obtained. This indicates that the structural class of a protein is considerably correlated with its amino acid composition, and that the support vector machine can serve as a powerful computational tool for predicting the structural classes of proteins.
Affiliation(s)
- Yu-Dong Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences.
16
Cai YD, Liu XJ, Xu XB, Zhou GP. Support vector machines for predicting protein structural class. BMC Bioinformatics 2001; 2:3. [PMID: 11483157 PMCID: PMC35360 DOI: 10.1186/1471-2105-2-3] [Citation(s) in RCA: 74] [Impact Index Per Article: 3.2] [Received: 05/24/2001] [Accepted: 06/29/2001] [Indexed: 11/10/2022] [Open access]
Abstract
BACKGROUND We apply a new machine learning method, the so-called Support Vector Machine method, to predict the protein structural class. Support Vector Machine method is performed based on the database derived from SCOP, in which protein domains are classified based on known structures and the evolutionary relationships and the principles that govern their 3-D structure. RESULTS High rates of both self-consistency and jackknife tests are obtained. The good results indicate that the structural class of a protein is considerably correlated with its amino acid composition. CONCLUSIONS It is expected that the Support Vector Machine method and the elegant component-coupled method, also named as the covariant discrimination algorithm, if complemented with each other, can provide a powerful computational tool for predicting the structural classes of proteins.
Affiliation(s)
- Yu-Dong Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, Shanghai, 200233, China
- Xiao-Jun Liu
- Institute of Cell, Animal and Population Biology, University of Edinburgh, West Mains Road, Edinburgh EH9 3JT, U.K.
- Xue-biao Xu
- Department of Computing Science, University of Wales, College of Cardiff, Queens Buildings, Newport Road, PO Box 916, Cardiff CF2 3XF, U.K.
- Guo-Ping Zhou
- Department of Structural Biology, Burnham Institute, La Jolla, California 92037, USA
17
Abstract
A tight turn in protein structure is defined as a site where (i) a polypeptide chain reverses its overall direction, i.e., leads the chain to fold back on itself by nearly 180 degrees, and (ii) the amino acid residues directly involved in forming the turn are no more than six. Tight turns are generally categorized as delta-turn, gamma-turn, beta-turn, alpha-turn, and pi-turn, which are formed by two, three, four, five, and six amino acid residues, respectively. According to the folding mode, each of these tight turns can be further classified into several different types. Tight turns play an important role in globular proteins from both the structural and functional points of view. In view of this, various efforts have been made to predict tight turns and their types. This Review summarizes the development in this area, with an emphasis on the most recent work featuring the sequence-coupled model. The future challenges in this area are also briefly addressed.
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Pharmacia & Upjohn, Kalamazoo, Michigan, 49007-4940, USA
18
Affiliation(s)
- Y Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, 200233, Shanghai, China.
19
Abstract
All existing algorithms for predicting the content of protein secondary structure elements have been based on the conventional amino-acid-composition, where no sequence coupling effects are taken into account. In this article, an algorithm was developed for predicting the content of protein secondary structure elements that was based on a new amino-acid-composition, in which the sequence coupling effects are explicitly included through a series of conditional probability elements. The prediction was examined by a self-consistency test and an independent dataset test. Both indicated a remarkable improvement obtained when using the current algorithm to predict the contents of alpha-helix, beta-sheet, beta-bridge, 3(10)-helix, pi-helix, H-bonded turn, bend and random coil. Examples of the improved accuracy by introducing the new amino-acid-composition, as well as its impact on the study of protein structural class and biologically function, are discussed.
Affiliation(s)
- W Liu
- Computer-Aided Drug Discovery, Pharmacia and Upjohn, Kalamazoo, MI 49007-4940, USA
|
20
|
Galat A. Variations of sequences and amino acid compositions of proteins that sustain their biological functions: An analysis of the cyclophilin family of proteins. Arch Biochem Biophys 1999; 371:149-62. [PMID: 10545201 DOI: 10.1006/abbi.1999.1434] [Citation(s) in RCA: 87] [Impact Index Per Article: 3.5] [Indexed: 01/12/2023]
Abstract
The sequences of the ubiquitous and phylogenetically diversified cyclophilin family of proteins were divided into six groups, namely, vertebrates, invertebrates, other metazoa, plants, fungi, and prokaryotes. These groups of sequences were aligned with the multiple sequence alignment program Clustal-W. The variations of amino acid substitutions and amino acid compositions for these six groups of cyclophilins were calculated using a novel suite of multiple-sequence alignment analysis routines. The cyclophilins from vertebrates can be divided into at least two distinct structural classes that differ from each other by a variable-length amino acid insert within the loop that links alpha-helix II and beta-strand III. A similar structural feature is also present in the other groups of cyclophilins, namely, those from invertebrates, other metazoa, plants, and fungi. The sequences of cyclophilins from fungi and prokaryotes are more diversified than those from vertebrates, and their alterations involve structures other than the amino acid inserts within the loops. Variations of the hydrophobicity and bulkiness of amino acid substitutions of the aligned sequences were calculated for each group of cyclophilins and for the alignment of all the sequences. The variations have clear asymmetry that may signify the need for modification of the physical properties of certain fragments of cyclophilins that are involved in interactions with various cellular components in the evolving environment.
Affiliation(s)
- A Galat
- Département d'Ingénierie et d'Etudes des Protéines, DSV/CEA, Gif-sur-Yvette Cedex, CE-Saclay, F-91191, France
|
21
|
Abstract
The three-dimensional structure of a protein is uniquely dictated by its primary sequence. However, owing to the highly degenerate nature of the sequence-structure relationship, proteins are generally folded into one of only a few structural classes that are closely correlated with the amino-acid composition. This suggests that the interaction among the components of amino acid composition may play a considerable role in determining the structural class of a protein. To quantitatively test such a hypothesis at a deeper level, three potential functions, $U^{(0)}$, $U^{(1)}$, and $U^{(2)}$, were formulated that respectively represent the 0th-order, 1st-order, and 2nd-order approximations for the interaction among the components of the amino acid composition in a protein. It was observed that the correct rates in recognizing protein structural classes by $U^{(2)}$ are significantly higher than those by $U^{(0)}$ and $U^{(1)}$, indicating that an algorithm that can more completely incorporate the interaction contributions will yield better recognition quality, and hence further demonstrating that the interaction among the components of amino acid composition is an important driving force in determining the structural class of a protein during the sequence folding process.
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Pharmacia and Upjohn, Kalamazoo, Michigan, 49007-4940, USA
|
22
|
Apostol I, Szpankowski W. Indexing and mapping of proteins using a modified nonlinear Sammon projection. J Comput Chem 1999. [DOI: 10.1002/(sici)1096-987x(19990730)20:10<1049::aid-jcc7>3.0.co;2-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
|
23
|
|
24
|
|
25
|
Chou KC. Using pair-coupled amino acid composition to predict protein secondary structure content. J Protein Chem 1999; 18:473-80. [PMID: 10449044 DOI: 10.1023/a:1020696810938] [Citation(s) in RCA: 64] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022]
Abstract
The pair-coupled amino acid composition is introduced to predict the secondary structure contents of a protein. Compared with the existing methods all based on singlewise amino acid composition as defined in a 20D (dimensional) space, this represents a step forward to the consideration of the sequence coupling effect. The test results indicate that the introduction of the pair-coupled amino acid composition can significantly improve the prediction quality. It is anticipated that the concept of the pair-coupled amino acid composition can be used to simplify the formulation of sequence coupling (or sequence order) effects and to study many other features of proteins as well.
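To make the contrast in this abstract concrete, the sketch below computes both a conventional 20-component composition and a 400-component pairwise one. It assumes that "pair-coupled" can be approximated by normalized counts of adjacent residue pairs; the paper's exact definition may differ, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_composition(seq):
    """Conventional composition: fraction of each residue (20 components)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in AMINO_ACIDS}


def pair_coupled_composition(seq):
    """Assumed pairwise variant: fraction of each adjacent residue pair
    (20 x 20 = 400 components), which keeps some sequence-order information."""
    if len(seq) < 2:
        raise ValueError("need at least two residues to form a pair")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1  # number of adjacent pairs in the sequence
    return {a + b: pairs[a + b] / total
            for a, b in product(AMINO_ACIDS, repeat=2)}
```

Two sequences with identical single-residue composition, such as "AACC" and "ACAC", receive different pairwise vectors, which is the kind of order effect the abstract refers to.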
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Pharmacia & Upjohn, Kalamazoo, Michigan 49007-4940, USA
|
26
|
Gerstein M. How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold Des 1999; 3:497-512. [PMID: 9889159 DOI: 10.1016/s1359-0278(98)00066-2] [Citation(s) in RCA: 100] [Impact Index Per Article: 4.0] [Indexed: 01/20/2023]
Abstract
BACKGROUND Determining how representative the known structures are of the proteins encoded by a complete genome is important for assessing to what extent our current picture of protein stability and folding is overly influenced by biases in the structure databank (PDB). It is also important for improving database-based methods of structure prediction and genome annotation. RESULTS The known structures are compared to the proteins encoded by eight complete microbial genomes in terms of simple statistics such as sequence length, composition and secondary structure. The known structures are represented by a collection of nonhomologous domains from the PDB and a smaller list of 'biophysical proteins' on which folding experiments have concentrated. The proteins encoded by the genomes are considered as a whole and divided into various regions, such as known-structure homologue, low complexity (nonglobular), transmembrane or linker. Various tests are performed to assess the significance of the reported differences, in both a practical and a statistical sense. CONCLUSIONS The proteins encoded by the genomes are significantly different from those in the PDB. Their sequence lengths, which follow an extreme value distribution, are longer than the PDB proteins and much longer than the biophysical proteins. Their composition differs from the PDB proteins in having more Lys, Ile, Asn and Gln and less Cys and Trp. This is true overall and especially for the regions corresponding to soluble proteins of as yet unknown fold. Secondary-structure prediction on these uncharacterized regions indicates that they contain on average more helical structure than the PDB; differences about this mean are small, with yeast having slightly more sheet structure and Haemophilus influenzae and Helicobacter pylori more helical structure. Further information is available through the GeneCensus system at http://bioinfo.mbb.yale.edu/genome.
Affiliation(s)
- M Gerstein
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, USA.
|
27
|
Schneider G, Wrede P. Artificial neural networks for computer-based molecular design. Prog Biophys Mol Biol 1998; 70:175-222. [PMID: 9830312 DOI: 10.1016/s0079-6107(98)00026-1] [Citation(s) in RCA: 135] [Impact Index Per Article: 5.2] [Indexed: 12/01/2022]
Abstract
The theory of artificial neural networks is briefly reviewed focusing on supervised and unsupervised techniques which have great impact on current chemical applications. An introduction to molecular descriptors and representation schemes is given. In addition, worked examples of recent advances in this field are highlighted and pioneering publications are discussed. Applications of several types of artificial neural networks to compound classification, modelling of structure-activity relationships, biological target identification, and feature extraction from biopolymers are presented and compared to other techniques. Advantages and limitations of neural networks for computer-aided molecular design and sequence analysis are discussed.
Affiliation(s)
- G Schneider
- F. Hoffmann-La Roche Ltd., Pharmaceuticals Division, Basel, Switzerland.
|
28
|
Zhou GP. An intriguing controversy over protein structural class prediction. J Protein Chem 1998; 17:729-38. [PMID: 9988519 DOI: 10.1023/a:1020713915365] [Citation(s) in RCA: 290] [Impact Index Per Article: 11.2] [Indexed: 11/12/2022]
Abstract
A recent report by Bahar et al. [(1997), Proteins 29, 172-185] indicates that the coupling effects among different amino acid components as originally formulated by K. C. Chou [(1995), Proteins 21, 319-344] are important for improving the prediction of protein structural classes. These authors have further proposed a compact lattice model to illuminate the physical insight contained in the component-coupled algorithm. However, a completely opposite result was concluded by Eisenhaber et al. [(1996), Proteins 25, 169-179], using a different dataset constructed according to their definition. To address such an intriguing controversy, tests were conducted by various approaches for the datasets from an objective database, the SCOP database [Murzin et al. (1995), J. Mol. Biol. 247, 536-540]. The results obtained by both self-consistency and jackknife tests indicate that the overall rates of correct prediction by the algorithm incorporating the coupling effect among different amino acid components are significantly higher than those by the algorithms without counting such an effect. This is fully consistent with the physical reality that the folding of a protein is the result of a collective interaction among its constituent amino acid residues, and hence the coupling effects of different amino acid components must be incorporated in order to improve the prediction quality. It was found by revisiting the calculation procedures by Eisenhaber et al. that there was a conceptual mistake in constructing the structural class datasets and a systematic mistake in applying the component-coupled algorithm. These findings are informative for understanding and utilizing the component-coupled algorithm to study the structural classes of proteins.
Affiliation(s)
- G P Zhou
- Stanford Magnetic Resonance Lab, Stanford University, California 94305, USA
|
29
|
Diederichs K, Freigang J, Umhau S, Zeth K, Breed J. Prediction by a neural network of outer membrane beta-strand protein topology. Protein Sci 1998; 7:2413-20. [PMID: 9828008 PMCID: PMC2143870 DOI: 10.1002/pro.5560071119] [Citation(s) in RCA: 73] [Impact Index Per Article: 2.8] [Indexed: 11/12/2022]
Abstract
An artificial neural network (NN) was trained to predict the topology of bacterial outer membrane (OM) beta-strand proteins. Specifically, the NN predicts the z-coordinate of Calpha atoms in a coordinate frame with the outer membrane in the xy-plane, such that low z-values indicate periplasmic turns, medium z-values indicate transmembrane beta-strands, and high z-values indicate extracellular loops. To obtain a training set, seven OM proteins (porins) with structures known to high resolution were aligned with their pores along the z-axis. The relationship between Calpha z-values and topology was thereby established. To predict the topology of other OM proteins, all seven porins were used for the training set. Z-values (topologies) were predicted for two porins with hitherto unknown structure and for OM proteins not belonging to the porin family, all with insignificant sequence homology to the training set. The results of topology prediction compare favorably with experimental topology data.
|
30
|
|
31
|
Abstract
Artificial neural networks provide a unique computing architecture whose potential has attracted interest from researchers across different disciplines. As a technique for computational analysis, neural network technology is very well suited for the analysis of molecular sequence data. It has been applied successfully to a variety of problems, ranging from gene identification, to protein structure prediction and sequence classification. This article provides an overview of major neural network paradigms, discusses design issues, and reviews current applications in DNA/RNA and protein sequence analysis.
Affiliation(s)
- C H Wu
- Department of Epidemiology/Biomathematics, University of Texas Health Center at Tyler 75710, USA.
|
32
|
Dandekar T, König R. Computational methods for the prediction of protein folds. Biochim Biophys Acta 1997; 1343:1-15. [PMID: 9428653 DOI: 10.1016/s0167-4838(97)00132-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.2] [Indexed: 02/05/2023]
|
33
|
|
34
|
Ojasoo T, Doré JC. Taxonomy of nuclear receptors and SERPINS by multivariate analysis of amino-acid composition. J Steroid Biochem Mol Biol 1996; 58:167-81. [PMID: 8809198 DOI: 10.1016/0960-0760(96)00029-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.1] [Indexed: 02/02/2023]
Abstract
The global amino-acid composition of a protein, although a cruder variable than sequence, is nevertheless informative and has been correlated with protein structural class. In the present study, we have applied complementary multivariate methods based on chi-square metrics (correspondence factor analysis (CFA), minimum spanning tree (MST), ascending hierarchical classification (AHC)) to the analysis of the amino-acid frequency patterns of the C-terminal domain of 39 members of the nuclear receptor superfamily. The correlations we observed among receptors by this simple approach were, with few exceptions, in line with published phylogenetic dendrograms derived by sequence alignment. Further multivariate analyses were performed on the receptor population combined with 26 serine protease inhibitors (SERPINS) in view of the analogies detected between these superfamilies by hydrophobic cluster analysis (HCA), which were at the origin of the choice of alpha1-antitrypsin as a 3-dimensional (3D) model for the receptor hormone-binding domain. Both the MST and AHC identified two distinct protein populations which in the principal phi1-phi2 CFA plot showed virtually no overlap, thus suggesting that receptors and SERPINS have different overall folding patterns, although the lower-order phi3-phi4 plot did reveal some similarities, essentially in the use of hydrophobic amino acids, that might account for analogies in HCA patterns. Receptors had a preference for those amino acids that are more frequent in alpha-helices and SERPINS for those in beta-strands and also tended to use different amino acids in turns. We therefore propose that multivariate analysis of amino-acid composition may prove helpful in identifying proteins for subsequent HCA.
Affiliation(s)
- T Ojasoo
- Département Systèmes Moléculaires et Biologie Structurale, Université Pierre et Marie Curie, Paris, France
|
35
|
Zhang CT, Chou KC. An analysis of protein folding type prediction by seed-propagated sampling and jackknife test. J Protein Chem 1995; 14:583-93. [PMID: 8561854 DOI: 10.1007/bf01886884] [Citation(s) in RCA: 27] [Impact Index Per Article: 0.9] [Indexed: 01/31/2023]
Abstract
In the development of methodology for statistical prediction of protein folding types, how to test the predicted results is a crucial problem. In addition to the resubstitution test in which the folding type of each protein from a training set is predicted based on the rules derived from the same set, cross-validation tests are needed. Among them, the single-test-set method seems to be least reliable due to the arbitrariness in selecting the test set. Although the leaving-one-out (or jackknife) test is more objective and hence more reliable, it may cause a severe information loss by leaving a protein in turn out of the training set when its size is not large enough. In order to overcome the above drawback, a seed-propagated sampling approach is proposed that can be used to generate any number of simulated proteins with a desired type based on a given training set database. There is no need to make any predetermined assumption about the statistical distribution function of the amino acid frequencies. Combined with the existing cross-validation methods, the new technique may provide a more objective estimation for various protein-folding-type prediction methods.
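The difference between the resubstitution test and the leaving-one-out (jackknife) test described here can be sketched in a few lines. The example below is only an illustration: it uses a nearest-centroid rule as a stand-in classifier (not the paper's method), and the one-dimensional toy data and all names are made up.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def predict(train, query):
    """Assign the label whose class centroid is closest to the query."""
    by_label = {}
    for vec, label in train:
        by_label.setdefault(label, []).append(vec)
    return min(by_label,
               key=lambda lab: sq_dist(centroid(by_label[lab]), query))


def resubstitution_accuracy(data):
    """Each sample is predicted by rules derived from the full set,
    which includes the sample itself."""
    return sum(predict(data, vec) == label for vec, label in data) / len(data)


def jackknife_accuracy(data):
    """Each sample is left out in turn and predicted from the rest."""
    correct = 0
    for i, (vec, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # training set without sample i
        correct += predict(train, vec) == label
    return correct / len(data)


# Made-up data: (feature vector, folding-type label).
TOY_DATA = [((0.0,), "a"), ((0.1,), "a"),
            ((1.0,), "b"), ((0.9,), "b"), ((0.45,), "b")]
```

On this toy set the borderline sample at 0.45 is classified correctly only when it helps define its own class centroid, so the resubstitution score comes out higher than the jackknife score. The same effect is why the abstract treats cross-validation as necessary in addition to resubstitution.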
Affiliation(s)
- C T Zhang
- Department of Physics, Tianjin University, China
|
36
|
Dubchak I, Muchnik I, Holbrook SR, Kim SH. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci U S A 1995; 92:8700-4. [PMID: 7568000 PMCID: PMC41034 DOI: 10.1073/pnas.92.19.8700] [Citation(s) in RCA: 348] [Impact Index Per Article: 12.0] [Indexed: 01/26/2023]
Abstract
We present a method for predicting protein folding class based on global protein chain description and a voting process. Selection of the best descriptors was achieved by a computer-simulated neural network trained on a data base consisting of 83 folding classes. Protein-chain descriptors include overall composition, transition, and distribution of amino acid attributes, such as relative hydrophobicity, predicted secondary structure, and predicted solvent exposure. Cross-validation testing was performed on 15 of the largest classes. The test shows that proteins were assigned to the correct class (correct positive prediction) with an average accuracy of 71.7%, whereas the inverse prediction of proteins as not belonging to a particular class (correct negative prediction) was 90-95% accurate. When tested on 254 structures used in this study, the top two predictions contained the correct class in 91% of the cases.
Affiliation(s)
- I Dubchak
- Department of Chemistry, University of California, Berkeley 94720, USA
|
37
|
Abstract
A critical overview is given on the application of amino acid composition data for the establishment of the protein's identity (amino acids composition vs. protein identity, the AAC-PI method). Several criteria are used to measure the differences between the amino acid compositions of various proteins. The AAC-PI method unambiguously identifies proteins which belong to the families with a high phylogenetic conservancy of their sequences. The identification of pure proteins can be accomplished with a relatively high level of confidence. The AAC-PI method, however, sometimes needs the support of N-terminal or internal sequencing of proteins since, alone, it cannot distinguish whether the lack of finding a candidate protein in protein data bases is because the investigated amino acid composition corresponds to an unknown protein or its processed form or because it is a sum of at least two protein components, or whether it is due to other experimental errors. The identification of a few new proteins such as "arginine-rich protein", macrophage migration inhibitory factor (MIF) and the preformed neurotrophic factor present in the calf brain cytosol is also reported.
Affiliation(s)
- A Galat
- Département d'Ingénierie et d'Etudes des Protéines, D.S.V., C.E.A., C.E.-Saclay, Gif-sur-Yvette
|
38
|
Zhang CT, Chou KC. An eigenvalue-eigenvector approach to predicting protein folding types. J Protein Chem 1995; 14:309-26. [PMID: 8590599 DOI: 10.1007/bf01886788] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 01/31/2023]
Abstract
The accuracy of predicting protein folding types can be significantly enhanced by a recently developed algorithm in which the coupling effect among different amino acid components is taken into account [Chou and Zhang (1994) J. Biol. Chem. 269, 22014-22020]. However, in practical calculations using this powerful algorithm, one may sometimes face ill-conditioned matrices. To overcome such a difficulty, an effective eigenvalue-eigenvector approach is proposed. Furthermore, the new approach has been used to predict a recently constructed set of 76 proteins not included in the training set, and the accuracy of prediction is also much higher than those of other methods.
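The abstract does not spell out the eigenvalue-eigenvector procedure, so the following is only one plausible way to handle an ill-conditioned covariance matrix: decompose it, discard directions whose eigenvalues are numerically zero, and measure the distance in the remaining subspace. The function name and the tolerance are assumptions, not taken from the paper.

```python
import numpy as np


def eig_distance(x, mean, cov, tol=1e-10):
    """Squared covariance-weighted distance that stays finite when cov is
    singular or nearly so (an assumed stand-in for the paper's approach)."""
    cov = np.asarray(cov, dtype=float)
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep only directions with a non-negligible eigenvalue.
    keep = eigvals > tol * eigvals.max()
    # Project the deviation onto the retained eigenvectors.
    proj = eigvecs[:, keep].T @ diff
    return float(np.sum(proj ** 2 / eigvals[keep]))
```

For a well-conditioned matrix this reduces to the ordinary quadratic form with the inverse covariance; for a singular one it behaves like a pseudo-inverse and avoids the failed inversion.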
Affiliation(s)
- C T Zhang
- Department of Physics, Tianjin University, China
|
39
|
Abstract
Proteins of known structures are generally classified into one of the following four folding types: alpha, beta, alpha + beta, and alpha/beta proteins. Recent findings [Muskal and Kim (1992) J. Mol. Biol. 225, 713-727] suggested that the folding type of a protein might basically depend on its amino acid composition. If this is true, why is it that the predicted results of the protein folding type from amino acid composition have always failed to reach the desired accuracy? An examination of the prediction approach indicates that none of the previous algorithms has ever taken into account the coupling effect among different amino acid components. In view of this, a new algorithm has been developed which distinguishes itself from the previous ones by incorporating such a coupling effect. The very high rates, 99.2% and 95.3%, of correct predictions thus obtained for a recently constructed training set of 120 proteins and testing set of 64 proteins, respectively, provide confirmation of the above suggestion.
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Upjohn Laboratories, Kalamazoo, MI 49007-4940, USA
|
40
|
Chou KC. A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins 1995; 21:319-44. [PMID: 7567954 DOI: 10.1002/prot.340210406] [Citation(s) in RCA: 350] [Impact Index Per Article: 12.1] [Indexed: 01/26/2023]
Abstract
The development of prediction methods based on statistical theory generally consists of two parts: one is focused on the exploration of new algorithms, and the other on the improvement of a training database. The current study is devoted to improving the prediction of protein structural classes from both of the two aspects. To explore a new algorithm, a method has been developed that makes allowance for taking into account the coupling effect among different amino acid components of a protein by a covariance matrix. To improve the training database, the selection of proteins is carried out so that they have (1) as many non-homologous structures as possible, and (2) a good quality of structure. Thus, 129 representative proteins are selected. They are classified into 30 alpha, 30 beta, 30 alpha + beta, 30 alpha/beta, and 9 zeta (irregular) proteins according to a new criterion that better reflects the feature of the structural classes concerned. The average accuracy of prediction by the current method for the 4 x 30 regular proteins is 99.2%, and that for 64 independent testing proteins not included in the training database is 95.3%. To further validate its efficiency, a jackknife analysis has been performed for the current method as well as the previous ones, and the results are also much in favor of the current method. To complete the mathematical basis, a theorem is presented and proved in Appendix A that is instructive for understanding the novel method at a deeper level.
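A minimal sketch of the covariance-based idea follows. It assumes a plain Mahalanobis-type distance to each class and drops the last composition component before computing the covariance, since components that sum to one make the full matrix singular (hence a space of one dimension fewer). The three-component toy compositions and all names are made up for illustration; the paper's actual formulation may include further terms.

```python
import numpy as np

# Made-up compositions with three components instead of twenty.
TOY_TRAINING = {
    "alpha": [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1], [0.6, 0.2, 0.2], [0.7, 0.3, 0.0]],
    "beta":  [[0.2, 0.6, 0.2], [0.3, 0.7, 0.0], [0.2, 0.7, 0.1], [0.3, 0.6, 0.1]],
}


def fit_class(compositions):
    """Return the mean vector and inverse covariance of one class,
    computed after dropping the last (dependent) component."""
    reduced = np.asarray(compositions, dtype=float)[:, :-1]
    mean = reduced.mean(axis=0)
    cov = np.cov(reduced, rowvar=False)  # rows are proteins, columns components
    return mean, np.linalg.inv(cov)


def mahalanobis_sq(x, mean, cov_inv):
    """Squared distance that weights deviations by the class covariance,
    so correlated components are not counted independently."""
    diff = np.asarray(x, dtype=float)[:-1] - mean
    return float(diff @ cov_inv @ diff)


def predict_class(x, models):
    """Pick the class with the smallest covariance-weighted distance."""
    return min(models, key=lambda name: mahalanobis_sq(x, *models[name]))
```

A typical call would build `models = {name: fit_class(rows) for name, rows in TOY_TRAINING.items()}` and then pass a query composition to `predict_class`. With real 20-component data each class needs enough non-homologous proteins for its covariance matrix to be invertible.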
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Upjohn Laboratories, Kalamazoo, Michigan 49007-4940, USA
|
41
|
Chandonia JM, Karplus M. Neural networks for secondary structure and structural class predictions. Protein Sci 1995; 4:275-85. [PMID: 7757016 PMCID: PMC2143056 DOI: 10.1002/pro.5560040214] [Citation(s) in RCA: 82] [Impact Index Per Article: 2.8] [Indexed: 01/27/2023]
Abstract
A pair of neural network-based algorithms is presented for predicting the tertiary structural class and the secondary structure of proteins. Each algorithm realizes improvements in accuracy based on information provided by the other. Structural class prediction of proteins nonhomologous to any in the training set is improved significantly, from 62.3% to 73.9%, and secondary structure prediction accuracy improves slightly, from 62.26% to 62.64%. A number of aspects of neural network optimization and testing are examined. They include network overtraining and an output filter based on a rolling average. Secondary structure prediction results vary greatly depending on the particular proteins chosen for the training and test sets; consequently, an appropriate measure of accuracy reflects the more unbiased approach of "jackknife" cross-validation (testing each protein in the data-base individually).
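The "output filter based on a rolling average" can be pictured as smoothing the per-residue network outputs along the chain. The abstract gives no window size or edge handling, so the centered window and the truncation at the chain ends below are assumptions, and the function name is hypothetical.

```python
def rolling_average(values, window=3):
    """Smooth a sequence of per-residue scores with a centered window.
    Near the ends the window is truncated to the residues that exist."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)                 # first index inside the window
        hi = min(len(values), i + half + 1)   # one past the last index
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

Applied to, say, the helix output for each residue, such a filter suppresses isolated single-residue spikes, which is one reason a smoothing step might help segment-level predictions.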
Affiliation(s)
- J M Chandonia
- Biophysics Program, Harvard University, Cambridge, Massachusetts 02138, USA
|
42
|
Landale EC, Strong DD, Mohan S, Baylink DJ. Sequence comparison and predicted structure for the four exon-encoded regions of human insulin-like growth factor binding protein 4. Growth Factors 1995; 12:245-50. [PMID: 8930016 DOI: 10.3109/08977199509028963] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.3] [Indexed: 02/03/2023]
Abstract
The IGFBPs bind to and modulate the function of the IGFs in various ways. Human IGFBP-4 inhibits IGF mediated cell proliferation. The IGFBP exon-encoded regions were aligned and secondary structure predictions for hIGFBP-4 were developed yielding predicted 3D co-ordinates for each such region of hIGFBP-4. The exon 1 encoded region is the most conserved among the IGFBPs. That of hIGFBP-4 is predicted as an array of beta-strands that include the glycine and cysteine rich IGFBP consensus pattern and that terminate with a helix. The exon 2 encoded region is the most variable among the IGFBPs. That of hIGFBP-4 is predicted as mostly an amphipathic helix. The remaining regions are also conserved among the IGFBPs. Those of hIGFBP-4 are also predicted to contain helices. The predicted structure of hIGFBP-4 comprises amino terminal beta-strands with four helices in the carboxy terminal two thirds of the molecule.
Affiliation(s)
- E C Landale
- Department of Mineral Metabolism, Pettis VA Medical Center and Loma Linda University, CA, USA
|
43
|
Abstract
A protein is usually classified into one of the following five structural classes: alpha, beta, alpha + beta, alpha/beta, and zeta (irregular). The structural class of a protein is correlated with its amino acid composition. However, given the amino acid composition of a protein, how may one predict its structural class? Various efforts have been made in addressing this problem. This review addresses the progress in this field, with the focus on the state of the art, which is featured by a novel prediction algorithm and a recently developed database. The novel algorithm is characterized by a covariance matrix that takes into account the coupling effect among different amino acid components of a protein. The new database was established based on the requirement that the classes should have (1) as many nonhomologous structures as possible, (2) good quality structure, and (3) typical or distinguishable features for each of the structural classes concerned. The very high success rate for both the training-set proteins and the testing-set proteins, which has been further validated by a simulated analysis and a jackknife analysis, indicates that it is possible to predict the structural class of a protein according to its amino acid composition if an ideal and complete database can be established. It also suggests that the overall fold of a protein is basically determined by its amino acid composition.
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Upjohn Laboratories, Kalamazoo, MI 49007-4940, USA
|
44
|
Eisenhaber F, Persson B, Argos P. Protein structure prediction: recognition of primary, secondary, and tertiary structural features from amino acid sequence. Crit Rev Biochem Mol Biol 1995; 30:1-94. [PMID: 7587278 DOI: 10.3109/10409239509085139] [Citation(s) in RCA: 97] [Impact Index Per Article: 3.3] [Indexed: 01/26/2023]
Abstract
This review attempts a critical stock-taking of the current state of the science aimed at predicting structural features of proteins from their amino acid sequences. At the primary structure level, methods are considered for detection of remotely related sequences and for recognizing amino acid patterns to predict posttranslational modifications and binding sites. The techniques involving secondary structural features include prediction of secondary structure, membrane-spanning regions, and secondary structural class. At the tertiary structural level, methods for threading a sequence into a mainchain fold, homology modeling and assigning sequences to protein families with similar folds are discussed. A literature analysis suggests that, to date, threading techniques are not able to show their superiority over sequence pattern recognition methods. Recent progress in the state of ab initio structure calculation is reviewed in detail. The analysis shows that many structural features can be predicted from the amino acid sequence much better than just a few years ago and with attendant utility in experimental research. Best prediction can be achieved for new protein sequences that can be assigned to well-studied protein families. For single sequences without homologues, the folding problem has not yet been solved.
Affiliation(s)
- F Eisenhaber
- Institut für Biochemie der Charité, Medizinische Fakultät, Humboldt-Universität zu Berlin, Fed. Rep. Germany
|
45
|
Chou K, Zhang C. Predicting protein folding types by distance functions that make allowances for amino acid interactions. J Biol Chem 1994. [DOI: 10.1016/s0021-9258(17)31748-9] [Citation(s) in RCA: 92] [Impact Index Per Article: 3.1] [Indexed: 10/22/2022]
|
46
|
Anthonsen HW, Baptista A, Drabløs F, Martel P, Petersen SB. The blind watchmaker and rational protein engineering. J Biotechnol 1994; 36:185-220. [PMID: 7765263 PMCID: PMC7173218 DOI: 10.1016/0168-1656(94)90152-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.6] [Received: 03/07/1994] [Accepted: 04/23/1994] [Indexed: 01/27/2023]
Abstract
In the present review some scientific areas of key importance for protein engineering are discussed, such as problems involved in deducing protein sequence from DNA sequence (due to posttranscriptional editing, splicing and posttranslational modifications), modelling of protein structures by homology, NMR of large proteins (including probing the molecular surface with relaxation agents), simulation of protein structures by molecular dynamics and simulation of electrostatic effects in proteins (including pH-dependent effects). It is argued that all of these areas could be of key importance in most protein engineering projects, because they give access to increased and often unique information. In the last part of the review some potential areas for future applications of protein engineering approaches are discussed, such as non-conventional media, de novo design and nanotechnology.
|
47
|
Rost B, Sander C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 1994; 19:55-72. [PMID: 8066087 DOI: 10.1002/prot.340190108] [Citation(s) in RCA: 1157] [Impact Index Per Article: 38.6] [Indexed: 01/28/2023]
Abstract
Using evolutionary information contained in multiple sequence alignments as input to neural networks, secondary structure can be predicted at significantly increased accuracy. Here, we extend our previous three-level system of neural networks by using additional input information derived from multiple alignments. Using a position-specific conservation weight as part of the input increases performance. Using the number of insertions and deletions reduces the tendency for overprediction and increases overall accuracy. Addition of the global amino acid content yields a further improvement, mainly in predicting structural class. The final network system has sustained overall accuracy of 71.6% in a multiple cross-validation test on 126 unique protein chains. A test on a new set of 124 recently solved protein structures that have no significant sequence similarity to the learning set confirms the high level of accuracy. The average cross-validated accuracy for all 250 sequence-unique chains is above 72%. Using various data sets, the method is compared to alternative prediction methods, some of which also use multiple alignments: the performance advantage of the network system is at least 6 percentage points in three-state accuracy. In addition, the network estimates secondary structure content from multiple sequence alignments about as well as circular dichroism spectroscopy on a single protein and classifies 75% of the 250 proteins correctly into one of four protein structural classes. Of particular practical importance is the definition of a position-specific reliability index. For 40% of all residues the method has a sustained three-state accuracy of 88%, as high as the overall average for homology modelling. A further strength of the method is greatly increased accuracy in predicting the placement of secondary structure segments.
Affiliation(s)
- B Rost
- European Molecular Biology Laboratory, Heidelberg, Germany
|
48
|
Abstract
Neural networks were used to generalize common themes found in transmembrane-spanning protein helices. Various-sized databases were used containing nonoverlapping sequences, each 25 amino acids long. Training consisted of sorting these sequences into 1 of 2 groups: transmembrane helical peptides or nontransmembrane peptides. Learning was measured using a test set 10% the size of the training set. As training set size increased from 214 sequences to 1,751 sequences, learning increased in a nonlinear manner from 75% to a high of 98%, then declined to a low of 87%. The final training database consisted of roughly equal numbers of transmembrane (928) and nontransmembrane (1,018) sequences. All transmembrane sequences were entered into the database with respect to their lipid membrane orientation: from inside the membrane to outside. Generalized transmembrane helix and nontransmembrane peptides were constructed from the maximally weighted connecting strengths of fully trained networks. Four generalized transmembrane helices were found to contain 9 consensus residues: a K-R-F triplet was found at the inside lipid interface, 2 isoleucine and 2 other phenylalanine residues were present in the helical body, and 2 tryptophan residues were found near the outside lipid interface. As a test of the training method, bacteriorhodopsin was examined to determine the position of its 7 transmembrane helices.
Affiliation(s)
- G W Dombi
- Surgery Department, Wayne State University, Detroit, Michigan 48201
|