1. Dao FY, Yang H, Su ZD, Yang W, Wu Y, Hui D, Chen W, Tang H, Lin H. Recent Advances in Conotoxin Classification by Using Machine Learning Methods. Molecules 2017; 22:1057. [PMID: 28672838 PMCID: PMC6152242 DOI: 10.3390/molecules22071057] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Received: 05/17/2017] [Revised: 06/12/2017] [Accepted: 06/19/2017] [Indexed: 11/16/2022] Open access
Abstract
Conotoxins are small, disulfide-rich peptides that target ion channels and neuronal receptors. They have been demonstrated as potent pharmaceuticals in the treatment of a series of diseases, such as Alzheimer's disease, Parkinson's disease, and epilepsy. Conotoxins are also ideal molecular templates for the development of new drug lead compounds and play important roles in neurobiological research. Thus, the accurate identification of conotoxin types will provide key clues for biological research and clinical medicine. Generally, conotoxin types are confirmed when their sequence, structure, and function are experimentally validated. However, acquiring structural and functional information through biochemical experiments is time-consuming and costly. Therefore, it is important to develop computational tools that efficiently and effectively recognize conotoxin types from sequence information. In this work, we review the current progress in the computational identification of conotoxins in the following aspects: (i) construction of benchmark datasets; (ii) strategies for extracting sequence features; (iii) feature selection techniques; (iv) machine learning methods for classifying conotoxins; (v) the results obtained by these methods and the published tools; and (vi) future perspectives on conotoxin classification. The paper provides a basis for in-depth study of conotoxins and drug therapy research.
Affiliation(s)
- Fu-Ying Dao
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
- Hui Yang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
- Zhen-Dong Su
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
- Wuritu Yang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
  - Development and Planning Department, Inner Mongolia University, Hohhot 010021, China.
- Yun Wu
  - College of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China.
- Ding Hui
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
- Wei Chen
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
  - Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China.
- Hua Tang
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China.
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
2. Identifying the Types of Ion Channel-Targeted Conotoxins by Incorporating New Properties of Residues into Pseudo Amino Acid Composition. BioMed Research International 2016; 2016:3981478. [PMID: 27631006 PMCID: PMC5008028 DOI: 10.1155/2016/3981478] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 07/13/2016] [Accepted: 07/31/2016] [Indexed: 12/31/2022]
Abstract
Conotoxins are neurotoxins that interact specifically with potassium, sodium, and calcium channels. They have become potential drug candidates for treating conditions such as chronic pain, epilepsy, and cardiovascular diseases. Thus, correctly identifying the types of ion channel-targeted conotoxins will provide important clues for understanding their function and finding potential drugs. Based on this consideration, we developed a new computational method to rapidly and accurately predict the types of ion channel-targeted conotoxins. Three new kinds of residue properties were incorporated into the pseudo amino acid composition to formulate conotoxin samples. A support vector machine was used as the classifier, and a feature selection technique based on the F-score was used to optimize the features. Jackknife cross-validation gave an overall accuracy of 94.6%, which is higher than that of other published methods. Hence the current method may play a complementary role to existing methods for recognizing the types of ion channel-targeted conotoxins.
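As a rough illustration of F-score-based feature ranking, the sketch below computes the commonly used two-class F-score for each feature column and sorts the features by it. This is not the authors' implementation: the abstract does not give the exact (multi-class) formula they used, so the two-class definition, the function name `f_score`, and the toy feature matrix are all assumptions made for illustration.

```python
import numpy as np

def f_score(X, y):
    """Two-class F-score per feature (assumed definition, not taken from the paper).

    X: (n_samples, n_features) feature matrix; y: 0/1 class labels.
    A higher score means the feature separates the two classes better.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    mean_all = X.mean(axis=0)
    # Between-class term: squared distance of each class mean from the overall mean.
    numerator = (pos.mean(axis=0) - mean_all) ** 2 + (neg.mean(axis=0) - mean_all) ** 2
    # Within-class term: sum of the per-class sample variances.
    denominator = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    # A small constant guards against division by zero for constant features.
    return numerator / (denominator + 1e-12)

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = np.array([
    [0.0, 5.0],
    [0.1, 1.0],
    [1.0, 4.0],
    [1.1, 2.0],
])
y = np.array([0, 0, 1, 1])

scores = f_score(X, y)
ranking = np.argsort(scores)[::-1]  # feature indices, best first
print(scores, ranking)
```

In a pipeline like the one described, the top-ranked features would then be fed to the classifier, with the number of retained features chosen by cross-validation.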
3. Shatnawi M, Abdallah S. Improving Handwritten Arabic Character Recognition by Modeling Human Handwriting Distortions. ACM T ASIAN LOW-RESO 2016. [DOI: 10.1145/2764456] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
Abstract
Handwritten Arabic character recognition systems face several challenges, including the unlimited variation in human handwriting and the unavailability of large public databases of handwritten characters and words. Using synthetic data to train and test handwritten character recognition systems is one possible way to provide variations of these characters and to overcome the lack of large databases. While this can be done using arbitrary distortions, such as image noise and randomized affine transformations, such distortions are not realistic. In this work, we model real distortions in handwriting using real handwritten Arabic character examples and then use these distortion models to synthesize handwritten examples that are more realistic. We show that our proposed approach leads to significant improvements across different machine-learning classification algorithms.
4. Bioinformatics-Aided Venomics. Toxins (Basel) 2015; 7:2159-87. [PMID: 26110505 PMCID: PMC4488696 DOI: 10.3390/toxins7062159] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Received: 05/01/2015] [Revised: 06/03/2015] [Accepted: 06/05/2015] [Indexed: 12/12/2022] Open access
Abstract
Venomics is a modern approach that combines transcriptomics and proteomics to explore the toxin content of venoms. This review gives an overview of computational approaches that have been created to classify and consolidate venomics data, as well as algorithms that have aided the discovery and analysis of toxin nucleic acid and protein sequences, toxin three-dimensional structures, and toxin functions. Bioinformatics is used to tackle specific challenges associated with the identification and annotation of toxins. Recognizing toxin transcript sequences among second-generation sequencing data cannot rely only on basic sequence similarity because toxins are highly divergent. Mass spectrometry sequencing of mature toxins is challenging because toxins can display a large number of post-translational modifications. Identifying the mature toxin region in toxin precursor sequences requires predicting the cleavage sites of proprotein convertases, most of which are unknown or not well characterized. Tracing the evolutionary relationships between toxins should consider specific mechanisms of rapid evolution as well as interactions between predatory animals and prey. Rapidly determining the activity of toxins is the main bottleneck in venomics discovery, but some recent bioinformatics and molecular modeling approaches give hope that accurate predictions of toxin specificity could be made in the near future.
5. Dai HL. Imbalanced Protein Data Classification Using Ensemble FTM-SVM. IEEE Trans Nanobioscience 2015; 14:350-359. [DOI: 10.1109/tnb.2015.2431292] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 11/08/2022]
6. Chen Y, Zhou W, Wang H, Yuan Z. Prediction of O-glycosylation sites based on multi-scale composition of amino acids and feature selection. Med Biol Eng Comput 2015; 53:535-44. [PMID: 25752770 DOI: 10.1007/s11517-015-1268-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/22/2014] [Accepted: 03/02/2015] [Indexed: 12/21/2022]
Abstract
Protein glycosylation is one of the most important and complex post-translational modifications, providing greater proteomic diversity than any other post-translational modification. Fast and reliable computational methods to identify glycosylation sites are in great demand. Two key issues, feature encoding and feature selection, can critically affect the accuracy of a computational method. We present a new O-glycosylation site prediction method that uses only amino acid sequence information. The method includes the following components: (1) on the basis of multi-scale theory, features based on the multi-scale composition of amino acids are extracted from training sequences with identified glycosylation sites; (2) a two-stage feature selection removes features that have adverse effects on the prediction, consisting of a preliminary filtering stage using Student's t test and a second screening stage of iterative elimination using novel pairwise comparisons conducted in random subspaces with a support vector machine. The important features retained are used to build the prediction model. The method is evaluated with sequence-based tenfold cross-validation tests on balanced datasets. The results of our experiments show that our method significantly outperforms those reported in the literature in terms of sensitivity, specificity, accuracy, and Matthews correlation coefficient. The prediction accuracy for serine and threonine sites reached 95.7 and 92.7%, respectively. The Matthews correlation coefficient of our method for S and T sites is 0.914 and 0.873, respectively. This method evaluates each feature in interaction with the other features still included in the model, and has the advantage of high efficiency.
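For readers unfamiliar with the evaluation measures named in this abstract, the following sketch computes sensitivity, specificity, accuracy, and the Matthews correlation coefficient from a binary confusion matrix. The function name `binary_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy, and MCC for a binary classifier."""
    sensitivity = tp / (tp + fn)   # fraction of true sites that are recovered
    specificity = tn / (tn + fp)   # fraction of non-sites correctly rejected
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts from a balanced test set of 100 sites and 100 non-sites.
metrics = binary_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy even on balanced datasets.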
Affiliation(s)
- Yuan Chen
  - Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha, 410128, China
7. Bouktif S, Hanna EM, Zaki N, Khousa EA. Ant colony optimization algorithm for interpretable Bayesian classifiers combination: application to medical predictions. PLoS One 2014; 9:e86456. [PMID: 24498276 PMCID: PMC3911928 DOI: 10.1371/journal.pone.0086456] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 06/07/2013] [Accepted: 12/14/2013] [Indexed: 11/30/2022] Open access
Abstract
Prediction and classification techniques have been well studied by machine learning researchers and developed for several real-world problems. However, the acceptance and success of prediction models are still below expectations because of difficulties such as the low performance of prediction models when they are applied in different environments. This problem has been addressed by many researchers, mainly from the machine learning community. A second problem, principally raised by model users in different communities, such as managers, economists, engineers, biologists, and medical practitioners, is the interpretability of prediction models, that is, the ability of a model to explain its predictions and exhibit the causal relationships between the inputs and the outputs. In the case of classification, a successful way to alleviate low performance is to use ensemble classifiers. It is an intuitive strategy that makes different classifiers collaborate to achieve better performance than any individual classifier. Unfortunately, ensemble classifier methods do not take into account the interpretability of the final classification outcome, and they even worsen the original interpretability of the individual classifiers. In this paper we propose a novel implementation of the classifier combination approach that not only improves the overall performance but also preserves the interpretability of the resulting model. We propose a solution based on Ant Colony Optimization and tailored for the case of Bayesian classifiers. We validate our proposed solution with case studies from the medical domain, namely heart disease and cardiotocography-based predictions, problems where interpretability is critical to making appropriate clinical decisions. Availability: The datasets, prediction models, and software tool, together with supplementary materials, are available at http://faculty.uaeu.ac.ae/salahb/ACO4BC.htm.
Affiliation(s)
- Salah Bouktif
  - Software Development, College of Information Technology, United Arab Emirates University (UAEU), Al-Ain, UAE
- Eileen Marie Hanna
  - Intelligent Systems, College of Information Technology, United Arab Emirates University (UAEU), Al-Ain, UAE
- Nazar Zaki
  - Intelligent Systems, College of Information Technology, United Arab Emirates University (UAEU), Al-Ain, UAE
- Eman Abu Khousa
  - Enterprise Systems, College of Information Technology, United Arab Emirates University (UAEU), Al-Ain, UAE
8. Prediction of the types of ion channel-targeted conotoxins based on radial basis function network. Toxicol In Vitro 2013; 27:852-6. [DOI: 10.1016/j.tiv.2012.12.024] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.6] [Received: 09/12/2012] [Revised: 12/06/2012] [Accepted: 12/22/2012] [Indexed: 11/20/2022]
9. Koua D, Laht S, Kaplinski L, Stöcklin R, Remm M, Favreau P, Lisacek F. Position-specific scoring matrix and hidden Markov model complement each other for the prediction of conopeptide superfamilies. Biochimica et Biophysica Acta - Proteins and Proteomics 2013; 1834:717-24. [PMID: 23352837 DOI: 10.1016/j.bbapap.2012.12.015] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 03/27/2012] [Revised: 12/01/2012] [Accepted: 12/26/2012] [Indexed: 10/27/2022]
Abstract
Classified into 16 superfamilies, conopeptides are the main component of cone snail venoms and attract growing interest in pharmacology and drug discovery. The conventional approach to assigning a conopeptide to a superfamily is based on a consensus signal peptide of the precursor sequence. While this information is available at the genomic or transcriptomic levels, it is not present in the amino acid sequences of mature bioactives generated by proteomic studies. As the number of conopeptide sequences is increasing exponentially with the improvement in sequencing techniques, there is a growing need for automating superfamily elucidation. To face this challenge we have defined distinct models of the signal sequence, propeptide region, and mature peptides for each of the superfamilies containing more than 5 members (14 out of 16). These models rely on two robust techniques, namely Position-Specific Scoring Matrices (PSSM, also named generalized profiles) and hidden Markov models (HMM). A total of 50 PSSMs and 47 HMM profiles were generated. We confirm that propeptide and mature regions can be used to efficiently classify conopeptides lacking a signal sequence. Furthermore, combining the models of all three regions improved the classification rates, and the results emphasise how PSSM and HMM approaches complement each other for superfamily determination. The 97 models were validated and offer a straightforward method applicable to large sequence datasets.
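To make the PSSM idea concrete, the sketch below builds a simple log-odds scoring matrix per family from pre-aligned, gap-free sequences and assigns a query to the best-scoring family. It is a simplified illustration under strong assumptions (uniform background frequencies, a fixed pseudocount, equal-length sequences), not the generalized-profile models built in the paper. The family names and sequences are invented toy data, not real conopeptides.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log-odds PSSM from equal-length, gap-free sequences (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        # Pseudocounts keep unseen residues from producing log(0).
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, seq):
    """Sum the per-position log-odds scores; seq must match the model length."""
    return sum(column[aa] for column, aa in zip(pssm, seq))

def classify(models, seq):
    """Return the name of the family whose PSSM scores the sequence highest."""
    return max(models, key=lambda name: score(models[name], seq))

# Hypothetical toy families; real models would come from curated alignments.
models = {
    "family_A": build_pssm(["CCSNPAC", "CCSHPAC", "CCSDPRC"]),
    "family_B": build_pssm(["GKCTPDG", "GRCTPNG", "GKCSPDG"]),
}
predicted = classify(models, "CCSNPRC")
print(predicted)
```

A production classifier would also need alignment of the query to each model and a score threshold for rejecting sequences that belong to no family; both are omitted here.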
Affiliation(s)
- Dominique Koua
  - Atheris Laboratories, Case Postale 314, CH-1233 Bernex-Geneva, Switzerland.
10. Zaki N, Berengueres J, Efimov D. Detection of protein complexes using a protein ranking algorithm. Proteins 2012; 80:2459-68. [PMID: 22685080 DOI: 10.1002/prot.24130] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.5] [Received: 02/29/2012] [Revised: 05/31/2012] [Accepted: 06/01/2012] [Indexed: 12/24/2022]
Abstract
Detecting protein complexes from protein-protein interaction (PPI) networks is becoming a difficult challenge in computational biology. There is ample evidence that many disease mechanisms involve protein complexes, and being able to predict these complexes is important to the characterization of the relevant disease for diagnostic and treatment purposes. This article introduces a novel method for detecting protein complexes from PPI networks by using a protein ranking algorithm (ProRank). ProRank quantifies the importance of each protein based on the interaction structure and the evolutionary relationships between proteins in the network. A novel way of identifying essential proteins, which are known for their critical role in mediating cellular processes and constructing protein complexes, is proposed and analyzed. We evaluate the performance of ProRank using two PPI networks on two reference sets of protein complexes created from the Munich Information Center for Protein Sequence, containing 81 and 162 known complexes, respectively. We compare the performance of ProRank to some of the well-known protein complex prediction methods (ClusterONE, CMC, CFinder, MCL, MCode and Core) in terms of precision and recall. We show that ProRank predicts more complexes correctly at a competitive level of precision and recall. The level of accuracy achieved using ProRank in comparison to other recent methods for detecting protein complexes is a strong argument in favor of the proposed method.
Affiliation(s)
- Nazar Zaki
  - Faculty of Information Technology, UAEU, Al Ain, UAE.
11. Laht S, Koua D, Kaplinski L, Lisacek F, Stöcklin R, Remm M. Identification and classification of conopeptides using profile Hidden Markov Models. Biochimica et Biophysica Acta - Proteins and Proteomics 2011; 1824:488-92. [PMID: 22244925 DOI: 10.1016/j.bbapap.2011.12.004] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 06/15/2011] [Revised: 12/13/2011] [Accepted: 12/19/2011] [Indexed: 10/14/2022]
Abstract
Conopeptides are small toxins produced by predatory marine snails of the genus Conus. They are studied with increasing intensity due to their potential in neurosciences and pharmacology. The number of existing conopeptides is estimated to be 1 million, but only about 1000 have been described to date. Thanks to new high-throughput sequencing technologies, the number of known conopeptides is likely to increase exponentially in the near future. There is therefore a need for a fast and accurate computational method for identifying and classifying novel conopeptides in large data sets. We built 62 profile Hidden Markov Models (pHMMs) for the prediction and classification of all described conopeptide superfamilies and families, based on the different parts of the corresponding protein sequences. These models showed very high specificity in the detection of new peptides: 56 out of 62 models do not give a single false positive in a test against the entire UniProtKB/Swiss-Prot protein sequence database. Our study demonstrates the usefulness of mature peptide models for automatic classification, with an accuracy of 96% for the mature peptide models and 100% for the pro- and signal peptide models. Our conopeptide profile HMMs can be used for finding and annotating new conopeptides in large datasets generated by transcriptome or genome sequencing. To our knowledge this is the first time this kind of computational method has been applied to predict all known conopeptide superfamilies and some conopeptide families.