1
Ali S, Chourasia P, Patterson M. From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets. Med Biol Eng Comput 2024:10.1007/s11517-024-03074-3. [PMID: 38622438 DOI: 10.1007/s11517-024-03074-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Accepted: 03/13/2024] [Indexed: 04/17/2024]
Abstract
Understanding protein structures is crucial for many areas of bioinformatics research, including drug discovery, disease diagnosis, and evolutionary studies. Protein structure classification is a critical aspect of structural biology, where supervised machine learning algorithms classify structures based on data from databases such as the Protein Data Bank (PDB). However, the challenge lies in designing numerical embeddings for protein structures without losing essential information. To the best of our knowledge, although some effort has been made in the literature, researchers have not effectively and rigorously combined structural and sequence-based features for efficient protein classification. To this end, we propose numerical embeddings that extract relevant features from protein sequences fetched from PDB structures in popular datasets such as PDB Bind and STCRDAB. The features are physicochemical properties such as aromaticity, instability index, flexibility, Grand Average of Hydropathy (GRAVY), isoelectric point, charge at pH, secondary structure fraction, molar extinction coefficient, and molecular weight. We also incorporate scaling features for sliding windows (e.g., k-mers), which include the Kyte and Doolittle (KD) hydropathy scale, the Eisenberg hydrophobicity scale, a hydrophilicity scale, the flexibility of the amino acids, and a hydropathy scale. Multiple-feature selection aims to improve the accuracy of protein classification models. The results showed that the selected features significantly improved the predictive performance of existing embeddings.
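As a rough illustration of the sliding-window scaling features mentioned in this abstract, the sketch below computes GRAVY and a per-window Kyte and Doolittle hydropathy profile. It is a minimal sketch, not the authors' pipeline: the function names, the window size `k=3`, and the toy sequence are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand Average of Hydropathy: mean KD value over all residues."""
    return sum(KD_SCALE[aa] for aa in sequence) / len(sequence)


def window_hydropathy(sequence, k=3):
    """Mean KD hydropathy of every length-k sliding window (k-mer)."""
    return [gravy(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    seq = "MKVLAAGIV"  # hypothetical toy sequence, not taken from any dataset
    print(round(gravy(seq), 3))
    print([round(v, 3) for v in window_hydropathy(seq, k=3)])
```

The window profile has one value per k-mer, so a sequence of length n yields n - k + 1 features; other scales (Eisenberg, hydrophilicity, flexibility) could be swapped in by replacing the lookup table.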
Affiliation(s)
- Sarwan Ali
- Georgia State University, Atlanta, GA, USA.
2
Ali S, Chourasia P, Patterson M. When Protein Structure Embedding Meets Large Language Models. Genes (Basel) 2023; 15:25. [PMID: 38254915 PMCID: PMC10815811 DOI: 10.3390/genes15010025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 12/16/2023] [Accepted: 12/21/2023] [Indexed: 01/24/2024] Open
Abstract
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
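To make the contact-map idea concrete, the following sketch builds a binary contact map from residue coordinates and flattens its upper triangle into a vector. It only illustrates what a contact map is; how the paper turns contact maps into embeddings is not described in the abstract. The 8 Å threshold, the function names, and the toy coordinates are assumptions.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary matrix: 1 where two residues lie within `threshold` angstroms.

    `coords` is a list of (x, y, z) tuples, e.g. one C-alpha atom per residue.
    The 8.0 default is an assumed, commonly used cutoff, not the paper's value.
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def flatten_upper(cmap):
    """Upper triangle (excluding the diagonal) as a flat feature vector."""
    n = len(cmap)
    return [cmap[i][j] for i in range(n) for j in range(i + 1, n)]


if __name__ == "__main__":
    # Hypothetical coordinates for three residues along a line.
    toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
    print(contact_map(toy))
    print(flatten_upper(contact_map(toy)))
```

A flattened vector like this could, in principle, be concatenated with sequence-derived features, which is the kind of combination the abstract describes.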
Affiliation(s)
- Murray Patterson
- Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA; (S.A.); (P.C.)
3
Selvaraj MK, Thakur A, Kumar M, Pinnaka AK, Suri CR, Siddhardha B, Elumalai SP. Ion-pumping microbial rhodopsin protein classification by machine learning approach. BMC Bioinformatics 2023; 24:29. [PMID: 36707759 PMCID: PMC9881276 DOI: 10.1186/s12859-023-05138-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Accepted: 01/04/2023] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND Rhodopsin is a seven-transmembrane protein covalently linked with a retinal chromophore that absorbs photons for energy conversion and intracellular signaling in eukaryotes, bacteria, and archaea. Haloarchaeal rhodopsins are Type-I microbial rhodopsins that elicit various light-driven functions such as proton pumping, chloride pumping, and phototaxis behaviour. The industrial application of ion-pumping haloarchaeal rhodopsins is limited by the lack of full-length rhodopsin sequence-based classifications, which play an important role in ion-pumping activity. A well-studied haloarchaeal rhodopsin is the proton-pumping bacteriorhodopsin, which shows promising applications in optogenetics, biosensitized solar cells, security ink, data storage, artificial retinal implants, and biohydrogen generation. As a result, a low-cost computational approach is required to identify ion-pumping haloarchaeal rhodopsin sequences and their subtypes. RESULTS This study uses a support vector machine (SVM) technique to identify these ion-pumping haloarchaeal rhodopsin proteins. The haloarchaeal ion-pumping rhodopsins (bacteriorhodopsin, halorhodopsin, xanthorhodopsin, and sensory rhodopsin) and marine prokaryotic ion-pumping rhodopsins such as actinorhodopsin and proteorhodopsin were used to develop methods that accurately identified the ion-pumping haloarchaeal and other Type-I microbial rhodopsins. We achieved overall maximum accuracies of 97.78%, 97.84%, and 97.60% for amino acid composition, dipeptide composition, and the hybrid approach, respectively, on tenfold cross-validation using SVM. Predictive models for each class of rhodopsin performed equally well on an independent data set. In addition, similar results were achieved using another machine learning technique, namely random forest. The predictive models also performed equally well during five-fold cross-validation. Apart from this, we also tested our own, blank, and BLAST datasets and annotated whole-genome rhodopsin sequences of PWS haloarchaeal isolates with the developed methods. The developed web server ( https://bioinfo.imtech.res.in/servers/rhodopred ) can identify the ion-pumping haloarchaeal rhodopsin proteins and their subtypes. We expect this web tool to be useful for rhodopsin researchers. CONCLUSION The overall performance of the developed method shows that it accurately identifies ion-pumping haloarchaeal rhodopsins and their subtypes using known and unknown microbial rhodopsin sequences. We expect that this study will be useful for optogenetics researchers, molecular biologists, and rhodopsin researchers.
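The amino acid composition and dipeptide composition features named in this abstract can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, and treating the "hybrid approach" as a simple concatenation of the two vectors is an assumption rather than something the abstract states.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order


def amino_acid_composition(seq):
    """Fraction of each residue type: a 20-dimensional vector."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair: a 400-dimensional vector."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(a + b) / len(pairs) for a, b in product(AMINO_ACIDS, repeat=2)]


def hybrid_features(seq):
    """Assumed hybrid: concatenation of both compositions (420 dimensions)."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


if __name__ == "__main__":
    toy = "MLELLPTAVEGVSQAQITG"  # hypothetical toy fragment
    print(len(hybrid_features(toy)))
```

Both vectors have a fixed length regardless of sequence length, which is what lets full-length sequences of different sizes be fed to an SVM or random forest.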
Affiliation(s)
- Muthu Krishnan Selvaraj
- MTCC-Microbial Type Culture Collection and Gene Bank, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India (grid.418099.d)
- Anamika Thakur
- Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India (grid.418099.d)
- Manoj Kumar
- Virology Unit and Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India (grid.418099.d)
- Anil Kumar Pinnaka
- MTCC-Microbial Type Culture Collection and Gene Bank, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India (grid.418099.d)
- Chander Raman Suri
- Biosensor Department, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India (grid.418099.d)
- Busi Siddhardha
- Department of Microbiology, School of Life Sciences, Pondicherry University, Puducherry, 605014 India (grid.412517.4, 0000 0001 2152 9956)
- Senthil Prasad Elumalai
- Biochemical Engineering Research and Process Development Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research (CSIR-IMTECH), Chandigarh, 160036 India (grid.418099.d)
4
Lahorkar A, Bhosale H, Sane A, Ramakrishnan V, Jayaraman VK. Identification of Phase Separating Proteins With Distributed Reduced Alphabet Representations of Sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:410-420. [PMID: 35139023 DOI: 10.1109/tcbb.2022.3149310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Phase separation of proteins plays key roles in cellular physiology, including bacterial division and tumorigenesis. Consequently, understanding the molecular forces that drive phase separation has gained considerable attention, and several factors, including hydrophobicity and protein dynamics, have been implicated in phase separation. Data-driven identification of new phase separating proteins can enable in-depth understanding of cellular physiology and may pave the way towards developing novel methods of tackling disease progression. In this work, we exploit the existing wealth of data on phase separating proteins to develop a sequence-based machine learning method for the prediction of phase separating proteins. We use reduced alphabet schemes based on hydrophobicity and conformational similarity, along with distributed representations of protein sequences and biochemical properties, as input features to Support Vector Machine (SVM) and Random Forest (RF) machine learning algorithms. We used both curated and balanced datasets for building the models. RF trained on the balanced dataset with hydropathy and conformational similarity embeddings and biochemical properties achieved an accuracy of 97%. Our work highlights the use of conformational similarity, a feature that reflects amino acid flexibility, and hydrophobicity for predicting phase separating proteins. Use of such "interpretable" features obtained from the ever-growing knowledgebase of phase separation is likely to improve prediction performance further.
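A reduced alphabet maps the 20 amino acids onto a handful of group symbols before the sequence is cut into k-mer "words". The sketch below shows that first step only; the three-group hydrophobicity scheme and all names are hypothetical and are not the scheme used in the paper, and the distributed (embedding) step that would consume these words is omitted.

```python
from collections import Counter

# Hypothetical three-group alphabet, chosen only for illustration.
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "GSTYNQPH",  # polar or neutral
    "c": "DEKR",      # charged
}
REDUCE = {aa: label for label, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in seq)


def kmer_counts(seq, k=2):
    """Count overlapping k-mers of the reduced sequence."""
    reduced = reduce_sequence(seq)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


if __name__ == "__main__":
    toy = "MKDGAVLR"  # hypothetical toy sequence
    print(reduce_sequence(toy))
    print(kmer_counts(toy, k=2))
```

With three symbols there are only 3^k possible k-mers instead of 20^k, which keeps the vocabulary small for any downstream embedding or count-based model.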
5
Villalobos-Alva J, Ochoa-Toledo L, Villalobos-Alva MJ, Aliseda A, Pérez-Escamirosa F, Altamirano-Bustamante NF, Ochoa-Fernández F, Zamora-Solís R, Villalobos-Alva S, Revilla-Monsalve C, Kemper-Valverde N, Altamirano-Bustamante MM. Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field. Front Bioeng Biotechnol 2022; 10:788300. [PMID: 35875501 PMCID: PMC9301016 DOI: 10.3389/fbioe.2022.788300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2021] [Accepted: 05/25/2022] [Indexed: 11/23/2022] Open
Abstract
Proteins are some of the most fascinating and challenging molecules in the universe, and they pose a big challenge for artificial intelligence. The implementation of machine learning/AI in protein science gives rise to a world of knowledge adventures in the workhorse of the cell and proteome homeostasis, which are essential for making life possible. This opens up epistemic horizons thanks to a coupling of human tacit–explicit knowledge with machine learning power, the benefits of which are already tangible, such as important advances in protein structure prediction. Moreover, the driving force behind the protein processes of self-organization, adjustment, and fitness requires a space corresponding to gigabytes of life data in its order of magnitude. There are many tasks such as novel protein design, protein folding pathways, and synthetic metabolic routes, as well as protein-aggregation mechanisms, pathogenesis of protein misfolding and disease, and proteostasis networks that are currently unexplored or unrevealed. In this systematic review and biochemical meta-analysis, we aim to contribute to bridging the gap between what we call binomial artificial intelligence (AI) and protein science (PS), a growing research enterprise with exciting and promising biotechnological and biomedical applications. We undertake our task by exploring “the state of the art” in AI and machine learning (ML) applications to protein science in the scientific literature to address some critical research questions in this domain, including What kind of tasks are already explored by ML approaches to protein sciences? What are the most common ML algorithms and databases used? What is the situational diagnostic of the AI–PS inter-field? What do ML processing steps have in common? We also formulate novel questions such as Is it possible to discover what the rules of protein evolution are with the binomial AI–PS? How do protein folding pathways evolve? What are the rules that dictate the folds? 
What are the minimal nuclear protein structures? How do protein aggregates form and why do they exhibit different toxicities? What are the structural properties of amyloid proteins? How can we design an effective proteostasis network to deal with misfolded proteins? We are a cross-functional group of scientists from several academic disciplines, and we have conducted the systematic review using a variant of the PICO and PRISMA approaches. The search was carried out in four databases (PubMed, Bireme, OVID, and EBSCO Web of Science), resulting in 144 research articles. After three rounds of quality screening, 93 articles were finally selected for further analysis. A summary of our findings is as follows: regarding AI applications, there are mainly four types: 1) genomics, 2) protein structure and function, 3) protein design and evolution, and 4) drug design. In terms of the ML algorithms and databases used, supervised learning was the most common approach (85%). As for the databases used for the ML models, PDB and UniprotKB/Swissprot were the most common ones (21 and 8%, respectively). Moreover, we identified that approximately 63% of the articles organized their results into three steps, which we labeled pre-process, process, and post-process. A few studies combined data from several databases or created their own databases after the pre-process. Our main finding is that, as of today, there are no research road maps serving as guides to address gaps in our knowledge of the AI–PS binomial. All research efforts to collect, integrate multidimensional data features, and then analyze and validate them are, so far, uncoordinated and scattered throughout the scientific literature without a clear epistemic goal or connection between the studies. 
Therefore, our main contribution to the scientific literature is to offer a road map to help solve problems in drug design, protein structures, design, and function prediction while also presenting the “state of the art” on research in the AI–PS binomial until February 2021. Thus, we pave the way toward future advances in the synthetic redesign of novel proteins and protein networks and artificial metabolic pathways, learning lessons from nature for the welfare of humankind. Many of the novel proteins and metabolic pathways are currently non-existent in nature, nor are they used in the chemical industry or biomedical field.
Affiliation(s)
- Jalil Villalobos-Alva
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Luis Ochoa-Toledo
- Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Mario Javier Villalobos-Alva
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Atocha Aliseda
- Instituto de Investigaciones Filosóficas, Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Fernando Pérez-Escamirosa
- Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Francine Ochoa-Fernández
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Ricardo Zamora-Solís
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Sebastián Villalobos-Alva
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Cristina Revilla-Monsalve
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Nicolás Kemper-Valverde
- Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Myriam M. Altamirano-Bustamante
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- *Correspondence: Myriam M. Altamirano-Bustamante,
6
PupStruct: Prediction of Pupylated Lysine Residues Using Structural Properties of Amino Acids. Genes (Basel) 2020; 11:genes11121431. [PMID: 33260770 PMCID: PMC7761138 DOI: 10.3390/genes11121431] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/22/2020] [Revised: 11/23/2020] [Accepted: 11/23/2020] [Indexed: 12/23/2022] Open
Abstract
Post-translational modification (PTM) is a critical biological reaction which adds to the diversification of the proteome. With numerous known modifications being studied, pupylation has gained focus in the scientific community due to its significant role in regulating biological processes. The traditional experimental practice for detecting pupylation sites has proved expensive and requires a lot of time and resources. Thus, many computational predictors have been developed to address this issue. However, performance is still limited. In this study, we propose another computational method, named PupStruct, which uses the structural information of amino acids with a radial basis kernel function Support Vector Machine (SVM) to predict pupylated lysine residues. We compared PupStruct with three state-of-the-art predictors from the literature, and PupStruct showed a significant improvement in performance over them on statistical metrics such as sensitivity (0.9234), specificity (0.9359), accuracy (0.9296), precision (0.9349), and Matthews correlation coefficient (0.8616) on a benchmark dataset.
7
Zhang L, Kong L. A Novel Amino Acid Properties Selection Method for Protein Fold Classification. Protein Pept Lett 2020; 27:287-294. [PMID: 32207399 DOI: 10.2174/0929866526666190718151753] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/14/2019] [Revised: 04/17/2019] [Accepted: 06/10/2019] [Indexed: 12/21/2022]
Abstract
BACKGROUND Amino acid physicochemical properties encoded in protein primary structure play a crucial role in protein folding. However, it is not yet clear which of the properties are the most suitable for protein fold classification. OBJECTIVE To avoid exhaustively searching the entire property space, an amino acid properties selection method was proposed in this study to rapidly obtain a suitable combination of properties for protein fold classification. METHODS The proposed amino acid properties selection method was based on a sequential floating forward selection strategy. Beginning with an empty set, a variable number of features was added iteratively until the termination condition was met. RESULTS The experimental results indicate that the proposed method improved prediction accuracies by 0.26-5% on a widely used benchmark dataset with appropriately selected amino acid properties. CONCLUSION The proposed properties selection method can be extended to other biomolecule property related classification problems in bioinformatics.
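Sequential floating forward selection alternates a greedy "add the best feature" step with conditional "drop a feature if that beats the best subset of that size seen so far" steps. The sketch below is a generic textbook-style version, not the authors' variant: the `score` callback, the stopping rule (a target subset size `k`), and the toy scoring function are all assumptions.

```python
def sffs(features, score, k):
    """Sequential floating forward selection up to subset size k.

    `score` maps a list of features to a number (higher is better), e.g. a
    cross-validated accuracy. It is a caller-supplied, hypothetical callback.
    """
    selected = []
    best_by_size = {}  # subset size -> (best score seen, subset)
    while len(selected) < k:
        # Forward step: add the single feature that helps most.
        candidates = [f for f in features if f not in selected]
        best_f = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [best_f]
        s = score(selected)
        if len(selected) not in best_by_size or s > best_by_size[len(selected)][0]:
            best_by_size[len(selected)] = (s, list(selected))
        # Floating step: drop features while that improves on the best
        # previously seen subset of the smaller size.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_red = score(reduced)
            if s_red > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (s_red, list(reduced))
            else:
                break
    return best_by_size[k][1]


# Toy score: additive weights with a redundancy penalty between "a" and "b".
WEIGHTS = {"a": 3.0, "b": 2.9, "c": 1.0, "d": 0.5}


def toy_score(subset):
    total = sum(WEIGHTS[f] for f in subset)
    if "a" in subset and "b" in subset:
        total -= 2.5  # the two features carry overlapping information
    return total


if __name__ == "__main__":
    print(sffs(list(WEIGHTS), toy_score, k=2))
```

In the toy example the individually strong but redundant feature "b" is passed over in favour of "c", which is the behaviour that motivates subset-level rather than feature-level scoring.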
Affiliation(s)
- Lichao Zhang
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, China.,College of Sciences, Northeastern University, Shenyang, China
- Liang Kong
- School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao, China
8
Patil K, Chouhan U. Relevance of Machine Learning Techniques and Various Protein Features in Protein Fold Classification: A Review. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190204154038] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 12/11/2022]
Abstract
Background:
Protein fold prediction is a fundamental step in structural bioinformatics. The tertiary structure of a protein determines its function, and fold prediction plays an important role in predicting that tertiary structure. A protein fold is simply the arrangement of the secondary structure elements relative to each other in space. To date, research groups worldwide have carried out a number of studies in this field using combinations of different benchmark datasets, descriptors, features, and classification techniques.
Objective:
In this study, we bring these contributions together, analyze them, and compare the techniques they use.
Methods:
Different features are derived from the protein sequence, its secondary structure, different physicochemical properties of amino acids, domain composition, Position Specific Scoring Matrix, profile, and threading techniques.
Conclusion:
Combining these different features can improve classification accuracy to a large extent. With the help of this survey, one can identify the most suitable feature/attribute set and classification technique for this multi-class protein fold classification problem.
Affiliation(s)
- Komal Patil
- Department of Mathematics, Maulana Azad National Institute of Technology (MANIT), Bhopal, 462003 M.P, India
- Usha Chouhan
- Department of Mathematics, Maulana Azad National Institute of Technology (MANIT), Bhopal, 462003 M.P, India
9
Meng C, Jin S, Wang L, Guo F, Zou Q. AOPs-SVM: A Sequence-Based Classifier of Antioxidant Proteins Using a Support Vector Machine. Front Bioeng Biotechnol 2019; 7:224. [PMID: 31620433 PMCID: PMC6759716 DOI: 10.3389/fbioe.2019.00224] [Citation(s) in RCA: 49] [Impact Index Per Article: 9.8] [Received: 07/21/2019] [Accepted: 09/03/2019] [Indexed: 01/03/2023] Open
Abstract
Antioxidant proteins play important roles in countering oxidative damage in organisms. Because biological experiments are time-consuming and costly, accurately identifying antioxidant proteins experimentally is a challenging task. For these reasons, we propose a machine-learning model, named AOPs-SVM, which is based on sequence features and a support vector machine. Using a testing dataset, we conducted a jackknife cross-validation test with the proposed AOPs-SVM classifier and obtained 0.68 in sensitivity, 0.985 in specificity, 0.942 in average accuracy, 0.741 in MCC, and 0.832 in AUC. This outperformed existing classifiers. The experimental results demonstrate that AOPs-SVM is an effective classifier and contributes to research related to antioxidant proteins. A web server was built at http://server.malab.cn/AOPs-SVM/index.jsp to provide open access.
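A jackknife (leave-one-out) test holds out each sample in turn, trains on the rest, and predicts the held-out sample. The sketch below shows that loop with a 1-nearest-neighbour classifier standing in for the SVM, so it illustrates the validation protocol only; the function names and toy data are assumptions.

```python
import math


def nearest_neighbour_predict(train_X, train_y, x):
    """Stand-in classifier (not an SVM): label of the closest training point."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


def jackknife_accuracy(X, y, predict=nearest_neighbour_predict):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


if __name__ == "__main__":
    # Hypothetical two-cluster toy data.
    X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
    y = [0, 0, 1, 1]
    print(jackknife_accuracy(X, y))
```

Any classifier with the same `predict(train_X, train_y, x)` shape could be passed in; the per-sample predictions collected this way are what sensitivity, specificity, and MCC would then be computed from.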
Affiliation(s)
- Chaolu Meng
- College of Intelligence and Computing, Tianjin University, Tianjin, China.,College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Shunshan Jin
- Department of Neurology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Lei Wang
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
- Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou
- College of Intelligence and Computing, Tianjin University, Tianjin, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.,Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
10
Identifying anticancer peptides by using a generalized chaos game representation. J Math Biol 2018; 78:441-463. [PMID: 30291366 DOI: 10.1007/s00285-018-1279-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/25/2017] [Revised: 08/01/2018] [Indexed: 10/28/2022]
Abstract
We generalize chaos game representation (CGR) to higher dimensional spaces while maintaining its bijection, keeping the method sufficiently representative and mathematically rigorous compared with previous attempts. We first state and prove the asymptotic property of CGR and of our generalized chaos game representation (GCGR) method. It follows that the dissimilarity of sequences which possess identical subsequences at distinct positions is lowered exponentially in the length of the identical subsequence; this effect had gone unnoticed by researchers. By highlighting it, we show that the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques: GCGR-Centroid and GCGR-Variance. We use GCGR-Centroid to analyze the similarity between protein sequences using the datasets 9 ND5, 24 TF, and 50 beta-globin proteins. We obtain results consistent with previous studies, which supports the significance of the method. Finally, by utilizing support vector machines, we train an anticancer peptide prediction model using both GCGR-Centroid and GCGR-Variance, and achieve a significantly higher prediction performance on the 3 well-studied anticancer peptide datasets.
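The exponential effect described in this abstract can be seen in the classic two-dimensional CGR for nucleotide sequences, sketched below. This is the textbook four-corner CGR, not the paper's higher-dimensional GCGR; the corner assignment, starting point, and function name are assumptions made for illustration.

```python
import math

# Classic CGR corner assignment for nucleotides on the unit square (assumed).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Iterate x_i = (x_{i-1} + corner(s_i)) / 2 and return every point."""
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    # Two sequences with different prefixes but the same 3-symbol suffix.
    p = cgr_points("AAAACGT")[-1]
    q = cgr_points("TTTTCGT")[-1]
    # Each shared symbol halves the gap, so the final distance is at most
    # sqrt(2) * 2**-3 for a shared suffix of length 3.
    print(math.dist(p, q), math.sqrt(2) / 8)
```

Because every step halves the distance between two trajectories that read the same symbol, a shared run of length n shrinks their separation by a factor of 2^n, which is the one-dimensional intuition behind the asymptotic property stated above.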
11
Rajendran S, Jothi A. Sequentially distant but structurally similar proteins exhibit fold specific patterns based on their biophysical properties. Comput Biol Chem 2018; 75:143-153. [PMID: 29783123 DOI: 10.1016/j.compbiolchem.2018.05.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/25/2017] [Revised: 05/06/2018] [Accepted: 05/07/2018] [Indexed: 11/25/2022]
Abstract
The three-dimensional structure of a protein depends on the interactions between its amino acid residues. These interactions are in turn influenced by various biophysical properties of the amino acids. There are several examples of proteins that share the same fold but are very dissimilar at the sequence level. For proteins to share a common fold, some crucial interactions should be maintained despite insignificant sequence similarity. Since the interactions arise from the biophysical properties of the amino acids, we should be able to detect descriptive patterns for folds at the property level. Accordingly, the main focus of our research is to analyze such proteins and to characterize them in terms of their biophysical properties. Protein structures with sequence similarity of less than 40% were selected for ten different subfolds from three different mainfolds (according to the CATH classification) and were used for this analysis. We used the normalized values of 49 physicochemical, energetic, and conformational properties of amino acids. We characterize the folds based on the average biophysical property values. We also observed fold-specific correlational behavior of biophysical properties despite very low sequence similarity in our data. We further trained three different binary classification models (Naive Bayes, NB; Support Vector Machines, SVM; and Bayesian Generalized Linear Model, BGLM) which could discriminate mainfolds based on the biophysical properties. We also show that, among the three generated models, the BGLM classifier was able to discriminate protein sequences in the all-beta category with 81.43% accuracy and all-alpha and alpha-beta proteins with 83.37% accuracy.
Affiliation(s)
- Senthilnathan Rajendran
- Department of Bioinformatics, School of Chemical and Biotechnology, SASTRA Deemed University, Thanjavur, Tamil Nadu, 613401, India.
- Arunachalam Jothi
- Department of Bioinformatics, School of Chemical and Biotechnology, SASTRA Deemed University, Thanjavur, Tamil Nadu, 613401, India.
12
Kumar R, Kumari B, Kumar M. Prediction of endoplasmic reticulum resident proteins using fragmented amino acid composition and support vector machine. PeerJ 2017; 5:e3561. [PMID: 28890846 PMCID: PMC5588793 DOI: 10.7717/peerj.3561] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 04/25/2017] [Accepted: 06/20/2017] [Indexed: 12/15/2022] Open
Abstract
Background The endoplasmic reticulum plays an important role in many cellular processes, including protein synthesis, folding, and post-translational processing of newly synthesized proteins. It is also the site of quality control for misfolded proteins and the entry point of extracellular proteins into the secretory pathway. Hence, at any given time, the endoplasmic reticulum contains two different cohorts of proteins: (i) proteins involved in endoplasmic reticulum-specific functions, which reside in the lumen of the endoplasmic reticulum and are called endoplasmic reticulum resident proteins, and (ii) proteins in the process of moving to the extracellular space. Thus, endoplasmic reticulum resident proteins must somehow be distinguished from newly synthesized secretory proteins, which pass through the endoplasmic reticulum on their way out of the cell. Only about 50% of the proteins used as training data in this study had an endoplasmic reticulum retention signal, which shows that these signals are not necessarily present in all endoplasmic reticulum resident proteins. This also strongly indicates that additional factors play a role in retaining endoplasmic reticulum-specific proteins inside the endoplasmic reticulum. Methods This is a support vector machine based method in which different forms of protein features were used as inputs to a support vector machine to develop the prediction models. During training, the leave-one-out approach to cross-validation was used. Maximum performance was obtained with a combination of amino acid compositions of different parts of the proteins. Results In this study, we report a novel support vector machine based method for predicting endoplasmic reticulum resident proteins, named ERPred. During training, we achieved a maximum accuracy of 81.42% with the leave-one-out approach to cross-validation. When evaluated on an independent dataset, ERPred achieved a sensitivity of 72.31% and a specificity of 83.69%. We have also annotated six different proteomes to predict the candidate endoplasmic reticulum resident proteins in them. A webserver, ERPred, was developed to make the method available to the scientific community; it can be accessed at http://proteininformatics.org/mkumar/erpred/index.html. Discussion We found that, of the 124 proteins in the training dataset, only 66 had endoplasmic reticulum retention signals, which shows that these signals are not an absolute necessity for endoplasmic reticulum resident proteins to remain inside the endoplasmic reticulum. This observation also strongly indicates that additional factors play a role in retaining proteins inside the endoplasmic reticulum. Our proposed predictor, ERPred, is a signal-independent tool. It is tuned for the prediction of endoplasmic reticulum resident proteins even if the query protein does not contain a specific ER-retention signal.
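As a rough illustration of what "amino acid compositions of different parts of proteins" can look like in practice, the following minimal sketch splits a sequence into contiguous fragments and concatenates the 20-dimensional composition of each. The function names, the default of three fragments, and the near-equal split are assumptions made for this example; the abstract does not state how ERPred fragments its sequences.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order (assumed for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(fragment):
    """Return the fraction of each standard amino acid in a fragment."""
    counts = Counter(fragment)
    n = len(fragment)
    return [counts[a] / n if n else 0.0 for a in AMINO_ACIDS]


def fragmented_composition(sequence, n_parts=3):
    """Concatenate the compositions of n_parts contiguous fragments.

    The near-equal split used here is a hypothetical choice, not
    necessarily the fragmentation scheme used by ERPred.
    """
    sequence = sequence.upper()
    size, extra = divmod(len(sequence), n_parts)
    features, start = [], 0
    for i in range(n_parts):
        # Spread any leftover residues over the first fragments.
        end = start + size + (1 if i < extra else 0)
        features.extend(composition(sequence[start:end]))
        start = end
    return features


if __name__ == "__main__":
    vector = fragmented_composition("MKTAYIAKQRQISFVKSHFSRQ")
    print(len(vector))  # 3 fragments x 20 amino acids
```

A vector built this way could be passed to any standard classifier; keeping the fragments separate preserves positional differences that a single whole-sequence composition would average out.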
Affiliation(s)
- Ravindra Kumar
  - Department of Biophysics, University of Delhi South Campus, New Delhi, India. Current affiliation: Newe-Ya'ar Research Center, Agricultural Research Organization, Ramat Yishay, Israel
- Bandana Kumari
  - Department of Biophysics, University of Delhi South Campus, New Delhi, India
- Manish Kumar
  - Department of Biophysics, University of Delhi South Campus, New Delhi, India
|
13
|
Tripathi P, Pandey PN. A novel alignment-free method to classify protein folding types by combining spectral graph clustering with Chou's pseudo amino acid composition. J Theor Biol 2017; 424:49-54. [DOI: 10.1016/j.jtbi.2017.04.027] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 07/08/2015] [Revised: 04/24/2017] [Accepted: 04/27/2017] [Indexed: 10/19/2022]
|
14
|
Yan K, Xu Y, Fang X, Zheng C, Liu B. Protein fold recognition based on sparse representation based classification. Artif Intell Med 2017; 79:1-8. [DOI: 10.1016/j.artmed.2017.03.006] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Received: 12/31/2016] [Revised: 03/06/2017] [Accepted: 03/07/2017] [Indexed: 12/13/2022]
|
15
|
Xia J, Peng Z, Qi D, Mu H, Yang J. An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier. Bioinformatics 2016; 33:863-870. [DOI: 10.1093/bioinformatics/btw768] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 08/26/2016] [Accepted: 12/02/2016] [Indexed: 11/14/2022]
|
16
|
Arana-Daniel N, Gallegos AA, López-Franco C, Alanís AY, Morales J, López-Franco A. Support Vector Machines Trained with Evolutionary Algorithms Employing Kernel Adatron for Large Scale Classification of Protein Structures. Evol Bioinform Online 2016; 12:285-302. [PMID: 27980384 PMCID: PMC5140013 DOI: 10.4137/ebo.s40912] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 09/01/2016] [Revised: 10/19/2016] [Accepted: 10/20/2016] [Indexed: 11/05/2022]
Abstract
With the increasing power of computers, the amount of data that can be processed in small periods of time has grown exponentially, as has the importance of classifying large-scale data efficiently. Support vector machines have shown good results classifying large amounts of high-dimensional data, such as data generated by protein structure prediction, spam recognition, medical diagnosis, optical character recognition and text classification, etc. Most state of the art approaches for large-scale learning use traditional optimization methods, such as quadratic programming or gradient descent, which makes the use of evolutionary algorithms for training support vector machines an area to be explored. The present paper proposes an approach that is simple to implement based on evolutionary algorithms and Kernel-Adatron for solving large-scale classification problems, focusing on protein structure prediction. The functional properties of proteins depend upon their three-dimensional structures. Knowing the structures of proteins is crucial for biology and can lead to improvements in areas such as medicine, agriculture and biofuels.
Affiliation(s)
- Nancy Arana-Daniel
  - Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Alberto A Gallegos
  - Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Carlos López-Franco
  - Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Alma Y Alanís
  - Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Jacob Morales
  - Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Adriana López-Franco
  - Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
|
17
|
Wei L, Zou Q. Recent Progress in Machine Learning-Based Methods for Protein Fold Recognition. Int J Mol Sci 2016; 17:ijms17122118. [PMID: 27999256 PMCID: PMC5187918 DOI: 10.3390/ijms17122118] [Citation(s) in RCA: 67] [Impact Index Per Article: 8.4] [Received: 11/16/2016] [Revised: 12/03/2016] [Accepted: 12/11/2016] [Indexed: 01/22/2023]
Abstract
Knowledge on protein folding has a profound impact on understanding the heterogeneity and molecular function of proteins, further facilitating drug design. Predicting the 3D structure (fold) of a protein is a key problem in molecular biology. Determination of the fold of a protein mainly relies on molecular experimental methods. With the development of next-generation sequencing techniques, the discovery of new protein sequences has been rapidly increasing. With such a great number of proteins, the use of experimental techniques to determine protein folding is extremely difficult because these techniques are time consuming and expensive. Thus, developing computational prediction methods that can automatically, rapidly, and accurately classify unknown protein sequences into specific fold categories is urgently needed. Computational recognition of protein folds has been a recent research hotspot in bioinformatics and computational biology. Many computational efforts have been made, generating a variety of computational prediction methods. In this review, we conduct a comprehensive survey of recent computational methods, especially machine learning-based methods, for protein fold recognition. This review is anticipated to assist researchers in their pursuit to systematically understand the computational recognition of protein folds.
Affiliation(s)
- Leyi Wei
  - School of Computer Science and Technology, Tianjin University, Tianjin 300354, China
- Quan Zou
  - School of Computer Science and Technology, Tianjin University, Tianjin 300354, China
|
18
|
ProFold: Protein Fold Classification with Additional Structural Features and a Novel Ensemble Classifier. Biomed Res Int 2016; 2016:6802832. [PMID: 27660761 PMCID: PMC5021882 DOI: 10.1155/2016/6802832] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 06/01/2016] [Revised: 07/15/2016] [Accepted: 08/07/2016] [Indexed: 11/17/2022]
Abstract
Protein fold classification plays an important role in both protein functional analysis and drug design. The number of proteins in PDB is very large, but only a very small part is categorized and stored in the SCOPe database. Therefore, it is necessary to develop an efficient method for protein fold classification. In recent years, a variety of classification methods have been used in many protein fold classification studies. In this study, we propose a novel classification method called ProFold. We incorporate protein tertiary structure during feature extraction and employ a novel ensemble strategy during classifier training. Compared with existing similar ensemble classifiers on the same widely used dataset (DD-dataset), ProFold achieves 76.2% overall accuracy. Two other commonly used datasets, the EDD-dataset and the TG-dataset, were also tested, with accuracies of 93.2% and 94.3%, respectively, higher than those of existing methods. ProFold is available to the public as a web server.
|
19
|
Improving protein fold recognition and structural class prediction accuracies using physicochemical properties of amino acids. J Theor Biol 2016; 402:117-28. [PMID: 27164998 DOI: 10.1016/j.jtbi.2016.05.002] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 01/21/2016] [Revised: 04/20/2016] [Accepted: 05/02/2016] [Indexed: 11/24/2022]
Abstract
Predicting the three-dimensional (3-D) structure of a protein is an important task in the field of bioinformatics and biological sciences. However, directly predicting the 3-D structure from the primary structure is hard to achieve. Therefore, predicting the fold or structural class of a protein sequence is generally used as an intermediate step in determining the protein's 3-D structure. For protein fold recognition (PFR) and structural class prediction (SCP), two steps are required - feature extraction step and classification step. Feature extraction techniques generally utilize syntactical-based information, evolutionary-based information and physicochemical-based information to extract features. In this study, we explore the importance of utilizing the physicochemical properties of amino acids for improving PFR and SCP accuracies. For this, we propose a Forward Consecutive Search (FCS) scheme which aims to strategically select physicochemical attributes that will supplement the existing feature extraction techniques for PFR and SCP. An exhaustive search is conducted on all the existing 544 physicochemical attributes using the proposed FCS scheme and a subset of physicochemical attributes is identified. Features extracted from these selected attributes are then combined with existing syntactical-based and evolutionary-based features, to show an improvement in the recognition and prediction performance on benchmark datasets.
|
20
|
Lyons J, Paliwal KK, Dehzangi A, Heffernan R, Tsunoda T, Sharma A. Protein fold recognition using HMM–HMM alignment and dynamic programming. J Theor Biol 2016; 393:67-74. [DOI: 10.1016/j.jtbi.2015.12.018] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 10/26/2015] [Revised: 12/17/2015] [Accepted: 12/18/2015] [Indexed: 10/22/2022]
|
21
|
Feng Z, Hu X, Jiang Z, Song H, Ashraf MA. The recognition of multi-class protein folds by adding average chemical shifts of secondary structure elements. Saudi J Biol Sci 2016; 23:189-97. [PMID: 26980999 PMCID: PMC4778582 DOI: 10.1016/j.sjbs.2015.10.008] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 09/11/2015] [Revised: 10/08/2015] [Accepted: 10/12/2015] [Indexed: 11/28/2022]
Abstract
The recognition of protein folds is an important step in the prediction of protein structure and function. Recently, an increasing number of researchers have sought to improve the methods for protein fold recognition. Following the construction of a dataset consisting of 27 protein fold classes by Ding and Dubchak in 2001, prediction algorithms, parameters and the construction of new datasets have improved for the prediction of protein folds. In this study, we reorganized a dataset consisting of 76 fold classes constructed by Liu et al. and used the values of the increment of diversity, average chemical shifts of secondary structure elements and secondary structure motifs as feature parameters in the recognition of multi-class protein folds. With the combined feature vector as the input parameter for the Random Forests algorithm and an ensemble classification strategy, we propose a novel method to identify the 76 protein fold classes. The overall accuracy on the test dataset using an independent test was 66.69%; when the training and test sets were combined, with 5-fold cross-validation, the overall accuracy was 73.43%. This method was further used to predict the test dataset and the corresponding structural classification of the first 27-class protein fold dataset, resulting in overall accuracies of 79.66% and 93.40%, respectively. Moreover, when the training and test sets were combined, the accuracy using 5-fold cross-validation was 81.21%. Additionally, this approach resulted in improved prediction results using the 27-class protein fold dataset constructed by Ding and Dubchak.
Affiliation(s)
- Zhenxing Feng
  - Department of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Xiuzhen Hu
  - Department of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Zhuo Jiang
  - Department of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Hangyu Song
  - Department of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Muhammad Aqeel Ashraf
  - Water Research Unit, Faculty of Science and Natural Resources, University Malaysia Sabah, 88400 Kota Kinabalu, Sabah, Malaysia
|
22
|
Sharma R, Dehzangi A, Lyons J, Paliwal K, Tsunoda T, Sharma A. Predict Gram-Positive and Gram-Negative Subcellular Localization via Incorporating Evolutionary Information and Physicochemical Features Into Chou's General PseAAC. IEEE Trans Nanobioscience 2015; 14:915-26. [DOI: 10.1109/tnb.2015.2500186] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.9] [Indexed: 11/08/2022]
|
23
|
Lyons J, Dehzangi A, Heffernan R, Yang Y, Zhou Y, Sharma A, Paliwal K. Advancing the Accuracy of Protein Fold Recognition by Utilizing Profiles From Hidden Markov Models. IEEE Trans Nanobioscience 2015. [DOI: 10.1109/tnb.2015.2457906] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Indexed: 11/10/2022]
|
24
|
Wei L, Liao M, Gao X, Zou Q. Enhanced Protein Fold Prediction Method Through a Novel Feature Extraction Technique. IEEE Trans Nanobioscience 2015; 14:649-59. [DOI: 10.1109/tnb.2015.2450233] [Citation(s) in RCA: 81] [Impact Index Per Article: 9.0] [Indexed: 11/06/2022]
|
25
|
|
26
|
Paliwal KK, Sharma A, Lyons J, Dehzangi A. Improving protein fold recognition using the amalgamation of evolutionary-based and structural based information. BMC Bioinformatics 2014; 15 Suppl 16:S12. [PMID: 25521502 PMCID: PMC4290640 DOI: 10.1186/1471-2105-15-s16-s12] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 11/30/2022]
Abstract
Deciphering the three-dimensional structure of a protein sequence is a challenging task in biological science. Protein fold recognition and protein secondary structure prediction are transitional steps in identifying the three-dimensional structure of a protein. For protein fold recognition, evolutionary-based information of amino acid sequences from the position specific scoring matrix (PSSM) has recently been applied with improved results. On the other hand, the SPINE-X predictor has been developed and applied for protein secondary structure prediction. Several reported methods for protein fold recognition have only limited accuracy. In this paper, we have developed a strategy of combining evolutionary-based information (from PSSM) and secondary structure predicted using SPINE-X to improve protein fold recognition. The strategy is based on finding the probabilities of amino acid pairs (AAP). The proposed method has been tested on several protein benchmark datasets, and an improvement of 8.9% in recognition accuracy has been achieved. We have achieved, for the first time, over 90% and 75% prediction accuracies for sequence similarity values below 40% and 25%, respectively. We also obtain 90.6% and 77.0% prediction accuracies, respectively, for the Extended Ding and Dubchak and Taguchi and Gromiha benchmark protein fold recognition datasets widely used in the literature.
|
27
|
Paliwal KK, Sharma A, Lyons J, Dehzangi A. A tri-gram based feature extraction technique using linear probabilities of position specific scoring matrix for protein fold recognition. IEEE Trans Nanobioscience 2014; 13:44-50. [PMID: 24594513 DOI: 10.1109/tnb.2013.2296050] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Indexed: 11/10/2022]
Abstract
In biological sciences, deciphering the three-dimensional structure of a protein sequence is considered an important and challenging task. The identification of protein folds from primary protein sequences is an intermediate step in discovering the three-dimensional structure of a protein. This can be done by using a feature extraction technique to accurately extract all the relevant information, followed by a suitable classifier to label an unknown protein. In the past, several feature extraction techniques have been developed, but with only limited recognition accuracy. In this study, we have developed a feature extraction technique based on tri-grams computed directly from Position Specific Scoring Matrices. The effectiveness of the feature extraction technique has been shown on two benchmark datasets. The proposed technique exhibits up to 4.4% improvement in protein fold recognition accuracy compared to state-of-the-art feature extraction techniques.
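One plausible reading of "tri-grams computed directly from Position Specific Scoring Matrices" is sketched below: for every ordered triple of columns (m, n, r), sum the product of the column-m value at position i, the column-n value at position i+1, and the column-r value at position i+2 over all positions. This sum-of-products form, the function name, and the flat output layout are assumptions made for illustration; the abstract itself does not give the formula. With 20 columns, the result has 20^3 = 8000 entries.

```python
def trigram_features(pssm):
    """Compute tri-gram features from a PSSM given as a list of rows.

    pssm: list of L rows, each holding one value per amino acid column
    (assumed here to be linear probabilities rather than log-odds).
    Returns a flat list of n_cols**3 values, where entry
    (m * n_cols + n) * n_cols + r accumulates
    pssm[i][m] * pssm[i + 1][n] * pssm[i + 2][r] over all i.
    """
    n_cols = len(pssm[0])
    feats = [0.0] * (n_cols ** 3)
    # Slide a window of three consecutive sequence positions.
    for i in range(len(pssm) - 2):
        first, second, third = pssm[i], pssm[i + 1], pssm[i + 2]
        for m in range(n_cols):
            for n in range(n_cols):
                pair = first[m] * second[n]
                base = (m * n_cols + n) * n_cols
                for r in range(n_cols):
                    feats[base + r] += pair * third[r]
    return feats


if __name__ == "__main__":
    # Toy 4-position, 2-column matrix standing in for a real PSSM.
    toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    print(trigram_features(toy))
```

The feature length depends only on the number of columns, not on the protein length, which is what lets variable-length sequences feed a fixed-input classifier.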
|
28
|
Lyons J, Biswas N, Sharma A, Dehzangi A, Paliwal KK. Protein fold recognition by alignment of amino acid residues using kernelized dynamic time warping. J Theor Biol 2014; 354:137-45. [DOI: 10.1016/j.jtbi.2014.03.033] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 12/26/2013] [Revised: 03/05/2014] [Accepted: 03/21/2014] [Indexed: 01/21/2023]
|
29
|
Feng Z, Hu X. Recognition of 27-class protein folds by adding the interaction of segments and motif information. Biomed Res Int 2014; 2014:262850. [PMID: 25136571 PMCID: PMC4127253 DOI: 10.1155/2014/262850] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 11/27/2013] [Accepted: 06/28/2014] [Indexed: 01/31/2023]
Abstract
The recognition of protein folds is an important step for the prediction of protein structure and function. After the recognition of 27-class protein folds in 2001 by Ding and Dubchak, prediction algorithms, prediction parameters, and new datasets for the prediction of protein folds have been improved. However, the influences of interactions from predicted secondary structure segments and motif information on protein folding have not been considered. Therefore, the recognition of 27-class protein folds with the interaction of segments and motif information is very important. Based on the 27-class folds dataset built by Liu et al., amino acid composition, the interactions of secondary structure segments, motif frequency, and predicted secondary structure information were extracted. Using the Random Forest algorithm and the ensemble classification strategy, 27-class protein folds and the corresponding structural classification were identified by independent test. The overall accuracies on the testing set and for structural classification reached 78.38% and 92.55%, respectively. When the training set and testing set were combined, the overall accuracy by 5-fold cross-validation was 81.16%. To compare with the results of previous researchers, the method above was tested on Ding and Dubchak's dataset, which has been widely used by many previous researchers, and an improved overall accuracy of 70.24% was obtained.
Affiliation(s)
- Zhenxing Feng
  - Department of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Xiuzhen Hu
  - Department of Sciences, Inner Mongolia University of Technology, Hohhot, China
|
30
|
Das Roy R, Bhardwaj M, Bhatnagar V, Chakraborty K, Dash D. How do eubacterial organisms manage aggregation-prone proteome? F1000Res 2014; 3:137. [PMID: 25339987 PMCID: PMC4193397 DOI: 10.12688/f1000research.4307.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 06/24/2014] [Indexed: 11/20/2022]
Abstract
Eubacterial genomes vary considerably in their nucleotide composition. The percentage of genetic material constituted by guanosine and cytosine (GC) nucleotides ranges from 20% to 70%. It has been posited that GC-poor organisms are more dependent on protein folding machinery. Previous studies have ascribed this to the accumulation of mildly deleterious mutations in these organisms due to population bottlenecks. This phenomenon has been supported by protein folding simulations, which showed that proteins encoded by GC-poor organisms are more prone to aggregation than proteins encoded by GC-rich organisms. To test this proposition using a genome-wide approach, we classified different eubacterial proteomes in terms of their aggregation propensity and chaperone-dependence using multiple machine learning models. In contrast to the expected decrease in protein aggregation with an increase in GC richness, we found that the aggregation propensity of proteomes increases with GC content. A similar and even more significant correlation was obtained with the GroEL-dependence of proteomes: GC-poor proteomes have evolved to be less dependent on GroEL than GC-rich proteomes. We thus propose that a decrease in eubacterial GC content may have been selected in organisms facing proteostasis problems.
Affiliation(s)
- Rishi Das Roy
  - GNR Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, Council of Scientific and Industrial Research, Delhi, 110007, India; Department of Biotechnology, University of Pune, Pune, 411007, India
- Manju Bhardwaj
  - Department of Computer Science, Maitreyi College, Chanakyapuri, Delhi, 110021, India
- Vasudha Bhatnagar
  - Department of Computer Science, Faculty of Mathematical Sciences, University of Delhi, Delhi, 110007, India
- Kausik Chakraborty
  - GNR Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, Council of Scientific and Industrial Research, Delhi, 110007, India
- Debasis Dash
  - GNR Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, Council of Scientific and Industrial Research, Delhi, 110007, India; Department of Biotechnology, University of Pune, Pune, 411007, India
|
31
|
Dehzangi A, Paliwal K, Lyons J, Sharma A, Sattar A. A Segmentation-Based Method to Extract Structural and Evolutionary Features for Protein Fold Recognition. IEEE/ACM Trans Comput Biol Bioinform 2014; 11:510-519. [PMID: 26356019 DOI: 10.1109/tcbb.2013.2296317] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Indexed: 06/05/2023]
Abstract
Protein fold recognition (PFR) is considered an important step towards solving the protein structure prediction problem. Despite all the efforts that have been made so far, finding an accurate and fast computational approach to PFR remains a challenging problem for bioinformatics and computational biology. In this study, we propose a segmentation-based feature extraction technique to provide local evolutionary information embedded in the position specific scoring matrix (PSSM) and structural information embedded in the secondary structure of proteins predicted using SPINE-X. We also employ the concept of occurrence features to extract global discriminatory information from PSSM and SPINE-X. By applying a support vector machine (SVM) to our extracted features, we enhance the protein fold prediction accuracy by 7.4 percent over the best results reported in the literature. We also report 73.8 percent prediction accuracy for a data set consisting of proteins with less than 25 percent sequence similarity and 80.7 percent prediction accuracy for a data set with proteins belonging to 110 folds with less than 40 percent sequence similarity. We also investigate the relation between the number of folds and the number of features being used and show that the number of features should be increased to get better protein fold prediction results when the number of folds is relatively large.
|
32
|
Das Roy R, Dash D. Selection of relevant features from amino acids enables development of robust classifiers. Amino Acids 2014; 46:1343-51. [PMID: 24604165 DOI: 10.1007/s00726-014-1697-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 07/03/2013] [Accepted: 02/14/2014] [Indexed: 12/30/2022]
Abstract
Machine learning (ML) has been extensively applied to develop models and to understand high-throughput data of biological processes. However, new ML models, trained with novel experimental results, must be built regularly for more precise predictions. ML methods can build models from numeric data, whereas biological data are generally textual (DNA, protein sequences) or images and need feature calculation algorithms to generate quantitative features. Programming skills along with domain knowledge are required to develop these algorithms. Therefore, the process of knowledge discovery through ML is slowed by the lack of generic tools to construct features and to build models directly from the data. Hence, we developed a schema that calculates about 5,000 features, selects relevant features and develops protein classifiers from the training data. To demonstrate the general applicability and robustness of our method, fungal adhesins and nuclear receptor proteins were used for building classifiers, which outperformed existing classifiers when tested on independent data. Next, we built a classifier for mitochondrial proteins of Plasmodium falciparum, which causes human malaria, because the latest corresponding classifiers are not publicly accessible. Our classifier attained 98.18% accuracy and a Matthews correlation coefficient of 0.95 by fivefold cross-validation and outperformed existing classifiers on an independent test set. We implemented this schema as a user-friendly, open-source application, Pro-Gyan (http://code.google.com/p/pro-gyan/), to build and share executable classifiers without programming knowledge.
Affiliation(s)
- Rishi Das Roy
  - GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Mall Road, Delhi, 110007, India
|
33
|
Sharma A, Lyons J, Dehzangi A, Paliwal KK. A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J Theor Biol 2013; 320:41-6. [DOI: 10.1016/j.jtbi.2012.12.008] [Citation(s) in RCA: 118] [Impact Index Per Article: 10.7] [Received: 09/25/2012] [Revised: 12/04/2012] [Accepted: 12/05/2012] [Indexed: 11/26/2022]
|
34
|
Panwar B, Gupta S, Raghava GPS. Prediction of vitamin interacting residues in a vitamin binding protein using evolutionary information. BMC Bioinformatics 2013; 14:44. [PMID: 23387468 PMCID: PMC3577447 DOI: 10.1186/1471-2105-14-44] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 07/18/2012] [Accepted: 01/31/2013] [Indexed: 02/07/2023]
Abstract
BACKGROUND Vitamins are important cofactors in various enzymatic reactions. In the past, many inhibitors have been designed against vitamin binding pockets in order to inhibit vitamin-protein interactions. Thus, it is important to identify vitamin interacting residues in a protein. It is possible to detect vitamin-binding pockets on a protein if its tertiary structure is known. Unfortunately, tertiary structures are available for only a limited number of proteins. Therefore, it is important to develop in-silico models for predicting vitamin interacting residues in a protein from its primary structure. RESULTS In this study, we first compared protein-interacting residues of vitamins with other ligands using Two Sample Logo (TSL). It was observed that ATP, GTP, NAD, FAD and mannose preferred {G,R,K,S,H}, {G,K,T,S,D,N}, {T,G,Y}, {G,Y,W} and {Y,D,W,N,E} residues, respectively, whereas vitamins preferred {Y,F,S,W,T,G,H} residues for the interaction with proteins. Furthermore, compositional information of preferred and non-preferred residues along with pattern specificity was also observed within different vitamin classes. Vitamins A, B and B6 preferred {F,I,W,Y,L,V}, {S,Y,G,T,H,W,N,E} and {S,T,G,H,Y,N} interacting residues, respectively. This suggested that protein-binding patterns of vitamins are different from those of other ligands, and motivated us to develop separate predictors for vitamins and their sub-classes. Four different prediction modules have been developed: (i) vitamin interacting residues (VIRs), (ii) vitamin-A interacting residues (VAIRs), (iii) vitamin-B interacting residues (VBIRs) and (iv) pyridoxal-5-phosphate (vitamin B6) interacting residues (PLPIRs). We applied various classifiers, including SVM, BayesNet, NaiveBayes, ComplementNaiveBayes, NaiveBayesMultinomial, RandomForest and IBk, as machine learning techniques, using binary and Position-Specific Scoring Matrix (PSSM) features of protein sequences. Finally, we selected the best performing SVM modules and obtained the highest MCC of 0.53, 0.48, 0.61 and 0.81 for VIRs, VAIRs, VBIRs and PLPIRs, respectively, using PSSM-based evolutionary information. All the modules developed in this study have been trained and tested on non-redundant datasets and evaluated using the five-fold cross-validation technique. The performances were also evaluated on balanced and different independent datasets. CONCLUSIONS This study demonstrates that it is possible to predict VIRs, VAIRs, VBIRs and PLPIRs from evolutionary information of a protein sequence. In order to provide a service to the scientific community, we have developed a web server and standalone software, VitaPred (http://crdd.osdd.net/raghava/vitapred/).
Affiliation(s)
- Bharat Panwar
- Bioinformatics Centre, Institute of Microbial Technology (CSIR), Chandigarh, India
|
35
|
Sui T, Yang Y, Wang X. Sequence-Based Feature Extraction for Type III Effector Prediction. International Journal of Bioscience, Biochemistry and Bioinformatics 2013. [DOI: 10.7763/ijbbb.2013.v3.206] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/01/2022]
|
36
|
Dehzangi A, Paliwal K, Lyons J, Sharma A, Sattar A. Enhancing Protein Fold Prediction Accuracy Using Evolutionary and Structural Features. Pattern Recognition in Bioinformatics 2013. [DOI: 10.1007/978-3-642-39159-0_18] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 12/24/2022]
|
37
|
Zeng J, Kirk BD, Gou Y, Wang Q, Ma J. Genome-wide polycomb target gene prediction in Drosophila melanogaster. Nucleic Acids Res 2012; 40:5848-63. [PMID: 22416065 PMCID: PMC3401425 DOI: 10.1093/nar/gks209] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Indexed: 01/12/2023]
Abstract
As key epigenetic regulators, polycomb group (PcG) proteins are responsible for the control of cell proliferation and differentiation as well as stem cell pluripotency and self-renewal. Aberrant epigenetic modification by PcG is strongly correlated with the severity and invasiveness of many types of cancers. Unfortunately, the molecular mechanism of PcG-mediated epigenetic regulation remained elusive, partly due to the extremely limited pool of experimentally confirmed PcG target genes. In order to facilitate experimental identification of PcG target genes, here we propose a novel computational method, EpiPredictor, that achieved significantly higher matching ratios with several recent chromatin immunoprecipitation studies than jPREdictor, an existing computational method. We further validated a subset of genes that were uniquely predicted by EpiPredictor by cross-referencing existing literature and by experimental means. Our data suggest that multiple transcription factor networking at the cis-regulatory elements is critical for PcG recruitment, while high GC content and high conservation level are also important features of PcG target genes. EpiPredictor should substantially expedite experimental discovery of PcG target genes by providing an effective initial screening tool. From a computational standpoint, our strategy of modelling transcription factor interaction with a non-linear kernel is original, effective and transferable to many other applications.
Affiliation(s)
- Jia Zeng
- Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
|
38
|
Mohammad TAS, Nagarajaram HA. SVM-based method for protein structural class prediction using secondary structural content and structural information of amino acids. J Bioinform Comput Biol 2011; 9:489-502. [PMID: 21776605 DOI: 10.1142/s0219720011005422] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 07/30/2010] [Revised: 09/15/2010] [Accepted: 01/07/2011] [Indexed: 11/18/2022]
Abstract
Knowledge collated from known protein structures has revealed that proteins usually fold into four structural classes: all-α, all-β, α/β and α + β. A number of methods have been proposed to predict a protein's structural class from its primary structure; however, these methods fail or perform poorly on distantly related sequences. In this paper, we propose a new method for protein structural class prediction using a dataset of low-homology (twilight-zone) protein sequences. Since protein structural class prediction is a typical classification problem, we have developed a Support Vector Machine (SVM)-based method that uses features derived from the predicted secondary structure and the predicted burial information of amino acid residues. Examination of individual features as well as feature combinations revealed that the combination of secondary structural content with the secondary structural and solvent accessibility state frequencies of amino acids gave the best leave-one-out cross-validation accuracy of ~81%, which is comparable to the best accuracy reported in the literature so far.
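The feature types named in this abstract can be illustrated with a small sketch. It assumes a three-state secondary structure string (H, E, C) aligned residue by residue with the sequence. The function names and the exact feature definitions are assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter

STATES = "HEC"  # helix, strand, coil (three-state alphabet, assumed)


def ss_content(ss: str) -> dict:
    """Fraction of residues in each secondary structure state."""
    if not ss:
        raise ValueError("empty secondary structure string")
    counts = Counter(ss)
    return {state: counts.get(state, 0) / len(ss) for state in STATES}


def state_frequencies(seq: str, ss: str) -> dict:
    """Frequency of each observed (amino acid, state) pair in the chain."""
    if not seq or len(seq) != len(ss):
        raise ValueError("sequence and structure must be non-empty and aligned")
    pairs = Counter(zip(seq, ss))
    return {f"{aa}_{state}": n / len(seq) for (aa, state), n in pairs.items()}


if __name__ == "__main__":
    # Toy sequence with a made-up predicted structure.
    print(ss_content("HHHHEECC"))
    print(state_frequencies("AAGG", "HHEC"))
```

The same pair-counting pattern would apply to solvent accessibility states (for example buried vs. exposed) by swapping the state string.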
Affiliation(s)
- Tabrez Anwar Shamim Mohammad
- Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics (CDFD), Nampally, Hyderabad 500001, India.
|
39
|
Sequence-based classification using discriminatory motif feature selection. PLoS One 2011; 6:e27382. [PMID: 22102890 PMCID: PMC3213122 DOI: 10.1371/journal.pone.0027382] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 05/06/2011] [Accepted: 10/16/2011] [Indexed: 11/19/2022]
Abstract
Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length at most k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters.
A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.
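The three-level data partitioning described in this abstract (discovery, training, validation) can be sketched as a single shuffled split. This is not the authors' pipeline. The function name, the default fractions and the fixed seed are assumptions chosen for illustration.

```python
import random


def three_way_split(items, fractions=(0.2, 0.5, 0.3), seed=0):
    """Shuffle items and split them into discovery, training and validation.

    fractions are hypothetical defaults, not values from the paper.
    Any rounding remainder goes to the validation partition.
    """
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be three values summing to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_discovery = int(len(shuffled) * fractions[0])
    n_training = int(len(shuffled) * fractions[1])
    discovery = shuffled[:n_discovery]
    training = shuffled[n_discovery:n_discovery + n_training]
    validation = shuffled[n_discovery + n_training:]
    return discovery, training, validation


if __name__ == "__main__":
    disc, train, valid = three_way_split(range(100))
    # Motifs would be found on `disc`, a classifier fitted on `train`,
    # and performance assessed on `valid`.
    print(len(disc), len(train), len(valid))
```

Keeping the discovery partition disjoint from training and validation is what prevents the motif-derived features from leaking label information into the performance estimate.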
|
40
|
Zhao N, Pang B, Shyu CR, Korkin D. Feature-based classification of native and non-native protein-protein interactions: Comparing supervised and semi-supervised learning approaches. Proteomics 2011; 11:4321-30. [PMID: 22002942 DOI: 10.1002/pmic.201100217] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 04/25/2011] [Revised: 07/26/2011] [Accepted: 08/18/2011] [Indexed: 12/12/2022]
Abstract
Structural knowledge about protein-protein interactions can provide insights into the basic processes underlying cell function. Recent progress in experimental and computational structural biology has led to a rapid growth of experimentally resolved structures and computationally determined near-native models of protein-protein interactions. However, determining whether a protein-protein interaction is physiological or an artifact of an experimental or computational method remains a challenging problem. In this work, we have addressed two related problems. The first problem is distinguishing between experimentally obtained physiological and crystal-packing protein-protein interactions. The second problem is the classification of near-native and inaccurate docking models. We first defined a universal set of interface features and employed a support vector machine (SVM)-based approach to classify the interactions for both problems, with the accuracy, precision, and recall of the classifier for the first problem reaching 93%. To improve the classification, we next developed a semi-supervised learning approach for the second problem, using a transductive SVM (TSVM). We applied both classifiers to a commonly used protein docking benchmark of 124 complexes. We found that, although we reached classification accuracies of 78.9% for the SVM classifier and 80.3% for the TSVM classifier, improving protein-docking methods by model re-ranking remains a challenging problem.
Affiliation(s)
- Nan Zhao
- Informatics Institute and Department of Computer Science, University of Missouri, Columbia, MO, USA
|
41
|
Muda HM, Saad P, Othman RM. Remote protein homology detection and fold recognition using two-layer support vector machine classifiers. Comput Biol Med 2011; 41:687-99. [PMID: 21704312 DOI: 10.1016/j.compbiomed.2011.06.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 07/10/2009] [Revised: 03/16/2011] [Accepted: 06/05/2011] [Indexed: 02/07/2023]
Abstract
Remote protein homology detection and fold recognition refer to the detection of structural homology in proteins that share little or no sequence similarity. To detect protein structural classes from primary sequence information, homology-based methods have been developed, which can be divided into three types: discriminative classifiers, generative models for protein families, and pairwise sequence comparisons. Support Vector Machines (SVM) and Neural Networks (NN) are two popular discriminative methods. Recent studies have shown that SVMs train faster and are more accurate and efficient than NNs. We present a comprehensive method based on two-layer classifiers. The first layer detects up to the superfamily and family levels of the SCOP hierarchy using optimized binary SVM classification rules. It uses a kernel function known as the Bio-kernel, which incorporates biological information into the classification process. The second layer uses a discriminative SVM algorithm with a string kernel to detect up to the protein fold level of the SCOP hierarchy. The results were evaluated using mean ROC and mean MRFP, and their significance was assessed with a pairwise t-test. Experimental results show that our approach significantly improves the performance of remote protein homology detection and fold recognition on all three versions of the SCOP dataset (1.53, 1.67 and 1.73). We achieved improvements in mean ROC of 4.19% on SCOP 1.53, 4.75% on SCOP 1.67 and 4.03% on SCOP 1.73 compared with the results produced by well-known methods. The combination of the first and second layers of BioSVM-2L performs well in remote homology detection and fold recognition across all three dataset versions.
Affiliation(s)
- Hilmi M Muda
- Laboratory of Computational Intelligence and Biology, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 UTM Skudai, Malaysia
|
42
|
Zhao N, Pang B, Shyu CR, Korkin D. Structural similarity and classification of protein interaction interfaces. PLoS One 2011; 6:e19554. [PMID: 21589874 PMCID: PMC3093400 DOI: 10.1371/journal.pone.0019554] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 02/02/2011] [Accepted: 04/05/2011] [Indexed: 11/25/2022]
Abstract
Interactions between proteins play a key role in many cellular processes. Studying protein-protein interactions that share similar interaction interfaces may shed light on their evolution and could be helpful in elucidating the mechanisms behind stability and dynamics of the protein complexes. When two complexes share structurally similar subunits, the similarity of the interaction interfaces can be found through a structural superposition of the subunits. However, an accurate detection of similarity between the protein complexes containing subunits of unrelated structure remains an open problem. Here, we present an alignment-free machine learning approach to measure interface similarity. The approach relies on the feature-based representation of protein interfaces and does not depend on the superposition of the interacting subunit pairs. Specifically, we develop an SVM classifier of similar and dissimilar interfaces and derive a feature-based interface similarity measure. Next, the similarity measure is applied to a set of 2,806×2,806 binary complex pairs to build a hierarchical classification of protein-protein interactions. Finally, we explore case studies of similar interfaces from each level of the hierarchy, considering cases when the subunits forming interactions are either homologous or structurally unrelated. The analysis has suggested that the positions of charged residues in the homologous interfaces are not necessarily conserved and may exhibit more complex conservation patterns.
Affiliation(s)
- Nan Zhao
- Informatics Institute and Department of Computer Science, University of Missouri, Columbia, Missouri, United States of America
- Bin Pang
- Informatics Institute and Department of Computer Science, University of Missouri, Columbia, Missouri, United States of America
- Chi-Ren Shyu
- Informatics Institute and Department of Computer Science, University of Missouri, Columbia, Missouri, United States of America
- Dmitry Korkin
- Informatics Institute and Department of Computer Science, University of Missouri, Columbia, Missouri, United States of America
- Bond Life Science Center, University of Missouri, Columbia, Missouri, United States of America
|
43
|
Yang JY, Chen X. Improving taxonomy-based protein fold recognition by using global and local features. Proteins 2011; 79:2053-64. [DOI: 10.1002/prot.23025] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Received: 12/12/2010] [Revised: 02/05/2011] [Accepted: 03/03/2011] [Indexed: 11/05/2022]
|
44
|
Day R, Lennox KP, Dahl DB, Vannucci M, Tsai JW. Characterizing the regularity of tetrahedral packing motifs in protein tertiary structure. Bioinformatics 2010; 26:3059-66. [PMID: 21047817 DOI: 10.1093/bioinformatics/btq573] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Abstract
MOTIVATION While protein secondary structure is well understood, representing the repetitive nature of tertiary packing in proteins remains difficult. We have developed a construct called the relative packing group (RPG) that applies the clique concept from graph theory as a natural basis for defining the packing motifs in proteins. An RPG is defined as a clique of residues, where every member contacts all others as determined by the Delaunay tessellation. Geometrically similar RPGs define a regular element of tertiary structure or tertiary motif (TerMo). This intuitive construct provides a simple approach to characterize general repetitive elements of tertiary structure. RESULTS A dataset of over 4 million tetrahedral RPGs was clustered using different criteria to characterize the various aspects of regular tertiary structure in TerMos. Grouping this data within the SCOP classification levels of Family, Superfamily, Fold, Class and PDB showed that similar packing is shared across different folds. Classification of RPGs based on residue sequence locality reveals topological preferences according to protein sizes and secondary structure. We find that larger proteins favor RPGs with three local residues packed against a non-local residue. Classifying by secondary structure, helices prefer mostly local residues, sheets favor at least two local residues, while turns and coil populate with more local residues. To depict these TerMos, we have developed 2 complementary and intuitive representations: (i) Dirichlet process mixture density estimation of the torsion angle distributions and (ii) kernel density estimation of the Cartesian coordinate distribution. The TerMo library and representations software are available upon request.
Affiliation(s)
- Ryan Day
- Department of Chemistry, University of the Pacific, Stockton, CA 95211, USA
|
45
|
iFC²: an integrated web-server for improved prediction of protein structural class, fold type, and secondary structure content. Amino Acids 2010; 40:963-73. [PMID: 20730460 DOI: 10.1007/s00726-010-0721-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 03/09/2010] [Accepted: 08/06/2010] [Indexed: 10/19/2022]
Abstract
Several descriptors of protein structure at the sequence and residue levels have been recently proposed. They are widely adopted in the analysis and prediction of structural and functional characteristics of proteins. Numerous in silico methods have been developed for sequence-based prediction of these descriptors. However, many of them do not have a public web-server and only a few integrate multiple descriptors to improve the predictions. We introduce the iFC² (integrated prediction of fold, class, and content) server, the first to integrate three modern predictors of sequence-level descriptors. They concern fold type (PFRES), structural class (SCEC), and secondary structure content (PSSC-core). The server exploits relations between the three descriptors to implement a cross-evaluation procedure that improves over the predictions of the individual methods. The iFC² server annotates fold and class predictions as potentially correct/incorrect. When tested on datasets with low-similarity chains, for the fold prediction iFC² labels 82% of the PFRES predictions as correct and the accuracy of these predictions equals 72%. The accuracy of the remaining 28% of the PFRES predictions equals 38%. Similarly, our server assigns correct labels for over 79% of SCEC predictions, which are shown to be 98% accurate, while the remaining SCEC predictions are only 15% accurate. These results are shown to be competitive when contrasted against recent relevant web-servers. Predictions on CASP8 targets show that the content predicted by iFC² is competitive when compared with the content computed from the tertiary structures predicted by the three best-performing methods in CASP8. The iFC² server is available at http://biomine.ece.ualberta.ca/1D/1D.html.
|
46
|
Jain P, Hirst JD. Automatic structure classification of small proteins using random forest. BMC Bioinformatics 2010; 11:364. [PMID: 20594334 PMCID: PMC2916923 DOI: 10.1186/1471-2105-11-364] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 02/25/2010] [Accepted: 07/01/2010] [Indexed: 11/29/2022]
Abstract
Background Random forest, an ensemble-based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs. Results Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP Class, Fold, Super-family or Family levels, the predictive quality of the model in terms of the Matthews correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases, the MCC for classification to different structural levels decreases. Conclusions The utility of random forest in classifying domains from the place-holder classes of SCOP to the true Class, Fold, Super-family or Family levels is demonstrated. Issues such as the introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB that are yet to be assigned to the SCOP classification hierarchy.
Affiliation(s)
- Pooja Jain
- School of Chemistry, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK
|
47
|
Yang Y, Lu BL. Protein subcellular multi-localization prediction using a min-max modular support vector machine. Int J Neural Syst 2010; 20:13-28. [PMID: 20180250 DOI: 10.1142/s0129065710002206] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Indexed: 02/04/2023]
Abstract
Prediction of protein subcellular localization is an important issue in computational biology because it provides important clues for the characterization of protein functions. Currently, much research has been dedicated to developing automatic prediction tools. Most, however, focus on mono-locational proteins, i.e., they assume that proteins exist in only one location. It should be noted that many proteins bear multi-locational characteristics and carry out crucial functions in biological processes. This work aims to develop a general pattern classifier for predicting multiple subcellular locations of proteins. We use an ensemble classifier, called the min-max modular support vector machine (M(3)-SVM), to solve protein subcellular multi-localization problems, and propose a module decomposition method based on gene ontology (GO) semantic information for M(3)-SVM. The amino acid composition with secondary structure and solvent accessibility information is adopted to represent features of protein sequences. We apply our method to two multi-locational protein data sets. The M(3)-SVMs show higher accuracy and efficiency than traditional SVMs using the same feature vectors, and the GO decomposition also helps to improve prediction accuracy. Moreover, our method is considerably more accurate than existing subcellular localization predictors in predicting protein multi-localization.
Affiliation(s)
- Yang Yang
- Department of Computer Science and Engineering, Shanghai Maritime University, 1550 Haigang Ave., Shanghai, 201306, China.
|
48
|
Yang Y, Zhao J, Morgan RL, Ma W, Jiang T. Computational prediction of type III secreted proteins from gram-negative bacteria. BMC Bioinformatics 2010; 11 Suppl 1:S47. [PMID: 20122221 PMCID: PMC3009519 DOI: 10.1186/1471-2105-11-s1-s47] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.7] [Indexed: 01/29/2023]
Abstract
Background The type III secretion system (T3SS) is a specialized protein delivery system in gram-negative bacteria that injects proteins (called effectors) directly into the eukaryotic host cytosol and facilitates bacterial infection. For many plant and animal pathogens, T3SS is indispensable for disease development. Recently, T3SS has also been found in rhizobia, where it plays a crucial role in the nodulation process. Although a great deal of effort has been devoted to understanding type III secretion, the precise mechanism underlying the secretion and translocation process is not fully understood. In particular, defined secretion and translocation signals enabling the secretion have not been identified in the type III secreted effectors (T3SEs), which makes the identification of these important virulence factors notoriously challenging. The availability of a large number of sequenced genomes for plant- and animal-associated bacteria demands the development of efficient and effective prediction methods for the identification of T3SEs using bioinformatics approaches. Results We have developed a machine learning method based on the N-terminal amino acid sequences to predict novel type III effectors in the plant pathogen Pseudomonas syringae and the microsymbiont rhizobia. The extracted features used in the learning model (or classifier) include amino acid composition, secondary structure and solvent accessibility information. The method achieved a precision of over 90% on P. syringae in a cross-validation study. In combination with a promoter screen for type III-specific promoters, this classifier trained on the P. syringae data was applied to predict novel T3SEs from the genomic sequences of four rhizobial strains. This application resulted in 57 candidate type III secreted proteins, 17 of which are confirmed effectors.
Conclusion Our experimental results demonstrate that the machine learning method based on N-terminal amino acid sequences combined with a promoter screen could prove to be a very effective computational approach for predicting novel type III effectors in gram-negative bacteria. Our method and data are available to the public upon request.
Affiliation(s)
- Yang Yang
- Department of Computer Science and Engineering, Information Engineering College, Shanghai Maritime University, 1550 Haigang Ave, Shanghai 201306, PR China.
|
49
|
Chen P, Liu C, Burge L, Mahmood M, Southerland W, Gloster C. Protein fold classification with genetic algorithms and feature selection. J Bioinform Comput Biol 2009; 7:773-88. [PMID: 19785045 DOI: 10.1142/s0219720009004321] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 11/04/2008] [Revised: 01/21/2009] [Accepted: 03/23/2009] [Indexed: 11/18/2022]
Abstract
Protein fold classification is a key step in predicting protein tertiary structures. This paper proposes a novel approach to classifying protein folds based on genetic algorithms and feature selection. Our dataset is divided into a training dataset and a test dataset. Each individual in the genetic algorithm represents a selection function over the feature vectors of the training dataset. A support vector machine is applied to each individual to evaluate its fitness value (fold classification rate). The aim of the genetic algorithm is to search for the best individual, i.e., the one that produces the highest fold classification rate. The best individual is then applied to the feature vectors of the test dataset, and a support vector machine is built to classify protein folds based on the selected features. Our experimental results on Ding and Dubchak's benchmark dataset of 27-class folds show that our approach achieves an accuracy of 71.28%, which outperforms current state-of-the-art protein fold predictors.
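The search loop described in this abstract can be sketched as a small genetic algorithm over binary feature masks. This is not the authors' implementation. In the paper the fitness of a mask is the fold classification rate of an SVM trained on the selected features, whereas here a toy fitness function stands in for it. All names, the operators (tournament selection, one-point crossover, bit-flip mutation, elitism) and the parameter values are assumptions.

```python
import random


def ga_select(n_features, fitness, pop_size=20, generations=30,
              mutation_rate=0.05, seed=0):
    """Search for a high-fitness binary feature mask (requires n_features >= 2)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def tournament(population):
        # Pick two distinct individuals and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            cut = rng.randrange(1, n_features)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]                # bit-flip mutation
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


def toy_fitness(mask, informative=(0, 3, 5), penalty=0.1):
    """Stand-in for an SVM classification rate (hypothetical):
    rewards selecting 'informative' features, penalizes mask size."""
    return sum(mask[i] for i in informative) - penalty * sum(mask)


if __name__ == "__main__":
    mask = ga_select(8, toy_fitness)
    print(mask, toy_fitness(mask))
```

To mirror the abstract more closely, `toy_fitness` would be replaced by a function that trains a classifier on the training set restricted to the masked features and returns its classification rate. The final mask would then be applied to the test set.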
Affiliation(s)
- Peng Chen
- Department of Systems and Computer Science, Howard University, 2300 Sixth Street, NW, Washington, DC 20059, USA.
|
50
|
Dong Q, Zhou S, Guan J. A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics 2009; 25:2655-62. [DOI: 10.1093/bioinformatics/btp500] [Citation(s) in RCA: 150] [Impact Index Per Article: 10.0] [Indexed: 11/13/2022]
|