1
Dall'Alba G, Casa PL, Abreu FPD, Notari DL, de Avila E Silva S. A Survey of Biological Data in a Big Data Perspective. Big Data 2022; 10:279-297. [PMID: 35394342 DOI: 10.1089/big.2020.0383] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/14/2023]
Abstract
The amount of available data is continuously growing, a phenomenon that has given rise to the concept of big data. The key technologies related to big data are cloud computing (infrastructure) and Not Only SQL (NoSQL; data storage). In addition, for data analysis, machine learning algorithms such as decision trees, support vector machines, artificial neural networks, and clustering techniques show promising results. In a biological context, big data has many applications due to the large number of biological databases available. Some limitations of biological big data are related to the inherent features of these data, such as high degrees of complexity and heterogeneity, since biological systems provide information from the atomic level up to interactions between organisms or with their environment. Such characteristics make most bioinformatics-based applications difficult to build, configure, and maintain. Although the rise of big data is relatively recent, it has contributed to a better understanding of the underlying mechanisms of life. The main goal of this article is to provide a concise and reliable survey of the application of big data-related technologies in biology. As such, some fundamental concepts of information technology, including storage resources, analysis, and data sharing, are described along with their relation to biological data.
Affiliation(s)
- Gabriel Dall'Alba
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
  - Genome Science and Technology Program, Faculty of Science, The University of British Columbia, Vancouver, Canada
- Pedro Lenz Casa
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Fernanda Pessi de Abreu
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Daniel Luis Notari
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Scheila de Avila E Silva
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil

2
Kumar V, Lalotra GS, Sasikala P, Rajput DS, Kaluri R, Lakshmanna K, Shorfuzzaman M, Alsufyani A, Uddin M. Addressing Binary Classification over Class Imbalanced Clinical Datasets Using Computationally Intelligent Techniques. Healthcare (Basel) 2022; 10:1293. [PMID: 35885819 PMCID: PMC9322725 DOI: 10.3390/healthcare10071293] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 05/18/2022] [Revised: 07/03/2022] [Accepted: 07/07/2022] [Indexed: 11/16/2022] [Open Access]
Abstract
Nowadays, healthcare is a prime need of every human being, and clinical datasets play an important role in developing intelligent healthcare systems for monitoring people's health. Most real-world datasets are inherently class imbalanced; clinical datasets also suffer from this problem, and imbalanced class distributions pose several issues when training classifiers. Consequently, classifiers suffer from low accuracy, precision, and recall, and from a high degree of misclassification. We performed a brief literature review on the class imbalanced learning scenario. This study presents an empirical performance evaluation of six classifiers, namely Decision Tree, k-Nearest Neighbor, Logistic Regression, Artificial Neural Network, Support Vector Machine, and Gaussian Naïve Bayes, over five imbalanced clinical datasets (Breast Cancer Disease, Coronary Heart Disease, Indian Liver Patient, Pima Indians Diabetes Database, and Coronary Kidney Disease) with respect to seven class balancing techniques, namely Undersampling, Random oversampling, SMOTE, ADASYN, SVM-SMOTE, SMOTEEN, and SMOTETOMEK. In addition, explanations for the superiority of particular classifiers and data-balancing techniques are explored. Furthermore, we discuss recommendations on how to tackle class imbalanced datasets when training different supervised machine learning methods. Result analysis demonstrates that the SMOTEEN balancing method often performed better than the other six data-balancing techniques with all six classifiers and for all five clinical datasets. The other six balancing techniques performed almost equally to one another, but moderately worse than SMOTEEN.
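To make the idea of class balancing concrete, the following is a minimal sketch of random oversampling, the simplest of the seven techniques named in the abstract. It is not the authors' code: the function name `random_oversample` and the `seed` parameter are hypothetical, and the sketch only duplicates minority-class rows until every class matches the largest one. SMOTE-style methods, by contrast, synthesize new points rather than copying existing ones.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equal in size.

    Hypothetical helper for illustration; rows and labels are parallel lists.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels
```

Oversampling of this kind should only be applied to the training split; duplicating rows before splitting would leak copies of training examples into the test set.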
Affiliation(s)
- Vinod Kumar
  - Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram 522302, India
- Gotam Singh Lalotra
  - Government Degree College Basohli, University of Jammu, Basohli 184201, India
- Ponnusamy Sasikala
  - New Media Technology, Makhanlal Chaturvedi National University of Journalism and Communication, Bhopal 462011, India
- Dharmendra Singh Rajput
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, India
  - Correspondence: (D.S.R.); (M.U.)
- Rajesh Kaluri
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, India
- Kuruva Lakshmanna
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, India
- Mohammad Shorfuzzaman
  - Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
- Abdulmajeed Alsufyani
  - Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
- Mueen Uddin
  - College of Computing and IT, University of Doha for Science and Technology, Doha P.O. Box 24449, Qatar
  - Correspondence: (D.S.R.); (M.U.)

3
Tamposis IA, Sarantopoulou D, Theodoropoulou MC, Stasi EA, Kontou PI, Tsirigos KD, Bagos PG. Hidden neural networks for transmembrane protein topology prediction. Comput Struct Biotechnol J 2021; 19:6090-6097. [PMID: 34849210 PMCID: PMC8606341 DOI: 10.1016/j.csbj.2021.11.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/28/2021] [Revised: 11/05/2021] [Accepted: 11/06/2021] [Indexed: 11/21/2022] [Open Access]
Abstract
Hidden Markov Models (HMMs) are amongst the most successful methods for predicting protein features in biological sequence analysis. However, there are biological problems where the Markovian assumption is not sufficient, since the sequence context can provide useful information for prediction purposes. Several extensions of HMMs have appeared in the literature in order to overcome their limitations. We apply here a hybrid method that combines HMMs and Neural Networks (NNs), termed Hidden Neural Networks (HNNs), for biological sequence analysis in a straightforward manner. In this framework, the traditional HMM probability parameters are replaced by NN outputs. As a case study, we focus on the topology prediction of alpha-helical and beta-barrel membrane proteins. The HNNs show performance gains compared to standard HMMs, and the respective predictors outperform the top-scoring methods in the field. The implementation of HNNs can be found in the package JUCHMME, downloadable from http://www.compgen.org/tools/juchmme and https://github.com/pbagos/juchmme. The updated PRED-TMBB2 and HMM-TM prediction servers can be accessed at www.compgen.org.
Key Words
- CHMM, Class Hidden Markov Models
- CML, Conditional Maximum Likelihood
- EM, Expectation-Maximization
- HMM, Hidden Markov Models
- HNN, Hidden Neural Networks
- Hidden Markov Models
- Hidden Neural Networks
- JUCHMME, Java Utility for Class Hidden Markov Models and Extensions
- MCC, Matthews Correlation Coefficient
- ML, Maximum Likelihood
- MSA, Multiple Sequence Alignment
- Membrane proteins
- NN, Neural Networks
- Neural Networks
- Protein structure prediction
- SOV, segment overlap
- Sequence analysis
Affiliation(s)
- Ioannis A. Tamposis
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35100 Lamia, Greece
- Dimitra Sarantopoulou
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Present address: National Institute on Aging, National Institutes of Health, Baltimore, Maryland, USA
- Evangelia A. Stasi
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35100 Lamia, Greece
- Panagiota I. Kontou
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35100 Lamia, Greece
- Pantelis G. Bagos
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, 35100 Lamia, Greece

4

5
Splice sites detection using chaos game representation and neural network. Genomics 2019; 112:1847-1852. [PMID: 31704313 DOI: 10.1016/j.ygeno.2019.10.018] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 04/18/2018] [Revised: 03/18/2019] [Accepted: 10/29/2019] [Indexed: 11/23/2022]
Abstract
A novel method is proposed to detect the acceptor and donor splice sites using chaos game representation and artificial neural network. In order to achieve high accuracy, inputs to the neural network, or feature vector, shall reflect the true nature of the DNA segments. Therefore it is important to have one-to-one numerical representation, i.e. a feature vector should be able to represent the original data. Chaos game representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane in a one-to-one manner. Using CGR, a DNA sequence can be mapped to a numerical sequence that reflects the true nature of the original sequence. In this research, we propose to use CGR as feature input to a neural network to detect splice sites on the NN269 dataset. Computational experiments indicate that this approach gives good accuracy while being simpler than other methods in the literature, with only one neural network component. The code and data for our method can be accessed from this link: https://github.com/thoang3/portfolio/tree/SpliceSites_ANN_CGR.
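The iterative mapping described in this abstract can be sketched in a few lines. The example below assumes the conventional CGR layout, with A, C, G and T at the corners of the unit square and a starting point at the centre; the paper may assign corners differently, and the names `CORNERS` and `cgr_points` are hypothetical. Each nucleotide moves the current point halfway towards its corner, so every prefix of the sequence lands on a distinct coordinate.

```python
# Conventional CGR corner assignment (an assumption; the paper may use another layout).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return one (x, y) point per nucleotide using the midpoint rule."""
    x, y = start
    points = []
    for base in sequence.upper():
        cx, cy = CORNERS[base]  # raises KeyError for symbols other than A, C, G, T
        # Move halfway from the current point towards the nucleotide's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points
```

The resulting coordinate list (or a histogram of it over a grid) is the kind of numerical feature vector that could then be passed to a neural network.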

6
Dall'Alba G, Casa PL, Notari DL, Adami AG, Echeverrigaray S, de Avila E Silva S. Analysis of the nucleotide content of Escherichia coli promoter sequences related to the alternative sigma factors. J Mol Recognit 2018; 32:e2770. [PMID: 30458580 DOI: 10.1002/jmr.2770] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 08/08/2018] [Revised: 10/23/2018] [Accepted: 10/24/2018] [Indexed: 01/26/2023]
Abstract
Promoters are DNA sequences located upstream of the transcription start site of genes. In bacteria, the RNA polymerase enzyme requires additional subunits, called sigma factors (σ), to begin specific gene transcription in distinct environmental conditions. Currently, promoter prediction still poses many challenges due to the characteristics of these sequences. In this paper, the nucleotide content of Escherichia coli promoter sequences, related to five alternative σ factors, was analyzed by a machine learning technique in order to provide profiles according to the σ factor which recognizes them. For this, a clustering technique was applied, since it is a viable method for finding hidden patterns in a data set. As a result, 20 groups of sequences were formed, and, aided by the Weblogo tool, it was possible to determine sequence profiles. The patterns found should be considered when implementing computational prediction tools. In addition, evidence was found of an overlap between the functions of the genes regulated by different σ factors, suggesting that DNA structural properties are also essential parameters for further studies.
Affiliation(s)
- Gabriel Dall'Alba
  - Department of Life Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
- Pedro Lenz Casa
  - Department of Life Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
- Daniel Luis Notari
  - Department of Exact Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
- Andre Gustavo Adami
  - Department of Exact Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
- Sergio Echeverrigaray
  - Department of Life Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil
- Scheila de Avila E Silva
  - Department of Exact Sciences, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul, Brazil

7
Maleki E, Babashah H, Koohi S, Kavehvash Z. High-speed all-optical DNA local sequence alignment based on a three-dimensional artificial neural network. Journal of the Optical Society of America. A, Optics, Image Science, and Vision 2017; 34:1173-1186. [PMID: 29036127 DOI: 10.1364/josaa.34.001173] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/21/2017] [Accepted: 05/26/2017] [Indexed: 06/07/2023]
Abstract
This paper presents an optical processing approach for exploring a large number of genome sequences. Specifically, we propose an optical correlator for global alignment and an extended moiré matching technique for local analysis of spatially coded DNA, whose output is fed to a novel three-dimensional artificial neural network for local DNA alignment. All-optical implementation of the proposed 3D artificial neural network is developed and its accuracy is verified in Zemax. Thanks to its parallel processing capability, the proposed structure performs local alignment of 4 million sequences of 150 base pairs in a few seconds, which is much faster than its electrical counterparts, such as the basic local alignment search tool.

8
Kasperski A, Kasperska R. A new approach to the automatic identification of organism evolution using neural networks. Biosystems 2016; 142-143:32-42. [PMID: 26975238 DOI: 10.1016/j.biosystems.2016.03.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/30/2015] [Revised: 01/20/2016] [Accepted: 03/08/2016] [Indexed: 12/30/2022]
Abstract
Automatic identification of organism evolution remains a challenging task, which is especially exciting when human evolution is considered. The main aim of this work is to present a new idea that allows organism evolution analysis using neural networks. Here we show that it is possible to identify the evolution of any organism in a fully automatic way using the designed EvolutionXXI program, which contains an implemented neural network. The neural network has been trained using cytochrome b sequences of selected organisms. Analyses have then been carried out for various exemplary organisms in order to demonstrate the capabilities of the EvolutionXXI program. It is shown that the presented idea supports existing hypotheses concerning evolutionary relationships between selected organisms, among others, Sirenia and elephants, hippopotami and whales, scorpions and spiders, and dolphins and whales. Moreover, primate (including human), tree shrew and yeast evolution has been reconstructed.
Affiliation(s)
- Andrzej Kasperski
  - Faculty of Biological Sciences, Department of Biotechnology, University of Zielona Gora, ul. Szafrana 1, 65-516 Zielona Gora, Poland
- Renata Kasperska
  - Institute of Occupational Safety Engineering and Work Science, University of Zielona Gora, ul. Szafrana 4, 65-516 Zielona Gora, Poland

9
Reaching optimized parameter set: protein secondary structure prediction using neural network. Neural Comput Appl 2016. [DOI: 10.1007/s00521-015-2150-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 11/26/2022]

10
Leow LK, Chew LL, Chong VC, Dhillon SK. Automated identification of copepods using digital image processing and artificial neural network. BMC Bioinformatics 2015; 16 Suppl 18:S4. [PMID: 26678287 PMCID: PMC4682403 DOI: 10.1186/1471-2105-16-s18-s4] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 11/10/2022] [Open Access]
Abstract
Background: Copepods are planktonic organisms that play a major role in the marine food chain. Studying the community structure and abundance of copepods in relation to the environment is essential to evaluate their contribution to mangrove trophodynamics and coastal fisheries. The routine identification of copepods can be very technical, requiring taxonomic expertise, experience and much effort, which can be very time-consuming. Hence, there is an urgent need to introduce novel methods and approaches to automate identification and classification of copepod specimens. This study aims to apply digital image processing and machine learning methods to build an automated identification and classification technique.
Results: We developed an automated technique to extract morphological features of copepod specimens from captured images using digital image processing techniques. An Artificial Neural Network (ANN) was used to classify the copepod specimens from the species Acartia spinicauda, Bestiolina similis, Oithona aruensis, Oithona dissimilis, Oithona simplex, Parvocalanus crassirostris, Tortanus barbatus and Tortanus forcipatus based on the extracted features. 60% of the dataset was used for training a two-layer feed-forward network, and the remaining 40% was used as the testing dataset for system evaluation. Our approach demonstrated an overall classification accuracy of 93.13% (100% for A. spinicauda, B. similis and O. aruensis, 95% for T. barbatus, 90% for O. dissimilis and P. crassirostris, 85% for O. simplex and T. forcipatus).
Conclusions: The methods presented in this study enable fast classification of copepods to the species level. Future studies should include more classes in the model, improve the selection of features, and reduce the time needed to capture the copepod images.

11
Ashrafi P, Moss GP, Wilkinson SC, Davey N, Sun Y. The application of machine learning to the modelling of percutaneous absorption: an overview and guide. SAR and QSAR in Environmental Research 2015; 26:181-204. [PMID: 25783869 DOI: 10.1080/1062936x.2015.1018941] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 06/04/2023]
Abstract
Machine learning (ML) methods have been applied to the analysis of a range of biological systems. This paper reviews the application of these methods to the problem domain of skin permeability and addresses critically some of the key issues. Specifically, ML methods offer great potential in both predictive ability and their ability to provide mechanistic insight to, in this case, the phenomena of skin permeation. However, they are beset by perceptions of a lack of transparency and, often, once a ML or related method has been published there is little impetus from other researchers to adopt such methods. This is usually due to the lack of transparency in some methods and the lack of availability of specific coding for running advanced ML methods. This paper reviews critically the application of ML methods to percutaneous absorption and addresses the key issue of transparency by describing in detail - and providing the detailed coding for - the process of running a ML method (in this case, a Gaussian process regression method). Although this method is applied here to the field of percutaneous absorption, it may be applied more broadly to any biological system.
Affiliation(s)
- P Ashrafi
  - School of Computer Science, University of Hertfordshire, Hatfield, UK

12
Hernández-Serna A, Jiménez-Segura LF. Automatic identification of species with neural networks. PeerJ 2014; 2:e563. [PMID: 25392749 PMCID: PMC4226643 DOI: 10.7717/peerj.563] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Received: 03/07/2014] [Accepted: 08/16/2014] [Indexed: 11/22/2022] [Open Access]
Abstract
A new automatic identification system using photographic images has been designed to recognize fish, plant, and butterfly species from Europe and South America. The automatic classification system integrates multiple image processing tools to extract the geometry, morphology, and texture of the images. Artificial neural networks (ANNs) were used as the pattern recognition method. We tested a data set that included 740 species and 11,198 individuals. Our results show that the system performed with high accuracy, reaching 91.65% of true positive fish identifications, 92.87% of plants and 93.25% of butterflies. Our results highlight how the neural networks are complementary to species identification.
Affiliation(s)
- Andrés Hernández-Serna
  - Grupo de Ictiología, Instituto de Biología, Universidad de Antioquia, Medellín, Colombia
  - Department of Biology, University of Puerto Rico-Río Piedras, San Juan, PR, USA

13
Goel N, Singh S, Aseri TC. A comparative analysis of soft computing techniques for gene prediction. Anal Biochem 2013; 438:14-21. [PMID: 23529114 DOI: 10.1016/j.ab.2013.03.015] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 07/27/2012] [Revised: 03/05/2013] [Accepted: 03/14/2013] [Indexed: 11/17/2022]
Abstract
The rapid growth of genomic sequence data for both human and nonhuman species has made analyzing these sequences, especially predicting genes in them, very important, and this is currently the focus of many research efforts. Besides its scientific interest to the molecular biology and genomics community, gene prediction is of considerable importance in human health and medicine. A variety of gene prediction techniques have been developed for eukaryotes over the past few years. This article reviews and analyzes the application of certain soft computing techniques in gene prediction. First, the problem of gene prediction and its challenges are described. These are followed by different soft computing techniques along with their application to gene prediction. In addition, a comparative analysis of different soft computing techniques for gene prediction is given. Finally, some limitations of current research activities and future research directions are provided.
Affiliation(s)
- Neelam Goel
  - Department of Computer Science and Engineering, PEC University of Technology, Sector-12, Chandigarh 160 012, UT, India

14
Abstract
In the past decade, various genomes have been sequenced in both plants and animals. The falling cost of genome sequencing manifests a great impact on the research community with respect to annotation of genomes. Genome annotation helps in understanding the biological functions of the sequences of these genomes. Gene prediction is one of the most important aspects of genome annotation and it is an open research problem in bioinformatics. A large number of techniques for gene prediction have been developed over the past few years. In this paper a theoretical review of soft computing techniques for gene prediction is presented. The problem of gene prediction, along with the issues involved in it, is first described. A brief description of soft computing techniques, before discussing their application to gene prediction, is then provided. In addition, a list of different soft computing techniques for gene prediction is compiled. Finally some limitations of the current research and future research directions are presented.

15
Volpato V, Adelfio A, Pollastri G. Accurate prediction of protein enzymatic class by N-to-1 Neural Networks. BMC Bioinformatics 2013; 14 Suppl 1:S11. [PMID: 23368876 PMCID: PMC3548677 DOI: 10.1186/1471-2105-14-s1-s11] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Indexed: 11/10/2022] [Open Access]
Abstract
We present a novel ab initio predictor of protein enzymatic class. The predictor can classify proteins, solely based on their sequences, into one of six classes extracted from the enzyme commission (EC) classification scheme and is trained on a large, curated database of over 6,000 non-redundant proteins which we have assembled in this work. The predictor is powered by an ensemble of N-to-1 Neural Networks, a novel architecture which we have recently developed. N-to-1 Neural Networks operate on the full sequence and not on predefined features. All motifs of a predefined length (31 residues in this work) are considered and are compressed by an N-to-1 Neural Network into a feature vector which is automatically determined during training. We test our predictor in 10-fold cross-validation and obtain state-of-the-art results, with 96% correct classification and 86% generalized correlation. All six classes are predicted with a specificity of at least 80% and false positive rates never exceeding 7%. We are currently investigating enhanced input encoding schemes which include structural information, and are analyzing trained networks to mine motifs that are most informative for the prediction and hence likely to be functionally relevant.
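As an illustration of what "all motifs of a predefined length" means in practice, the sketch below enumerates every contiguous window of a protein sequence. Only the default length of 31 comes from the abstract; the function name `motif_windows` is hypothetical, and the sketch covers the enumeration step only, not the network that compresses the motifs into a feature vector.

```python
def motif_windows(sequence, length=31):
    """Return every contiguous motif of the given length, in sequence order.

    Sequences shorter than `length` yield an empty list.
    """
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]
```

A sequence of N residues yields N - 30 motifs at the default length, so the number of inputs to the compression step grows linearly with protein length.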
Affiliation(s)
- Viola Volpato
  - School of Computer Science and Informatics, University College Dublin, Ireland

16
Zhang AB, Feng J, Ward RD, Wan P, Gao Q, Wu J, Zhao WZ. A new method for species identification via protein-coding and non-coding DNA barcodes by combining machine learning with bioinformatic methods. PLoS One 2012; 7:e30986. [PMID: 22363527 PMCID: PMC3282726 DOI: 10.1371/journal.pone.0030986] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Received: 07/01/2011] [Accepted: 12/29/2011] [Indexed: 11/19/2022] [Open Access]
Abstract
Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also out-performed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. The new methods also obtained a 96.29% success rate (95%CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95%CI: 96.60-99.37%) for 1094 brown algae queries, both using ITS barcodes.
Affiliation(s)
- Ai-bing Zhang
  - College of Life Sciences, Capital Normal University, Beijing, People's Republic of China

17
Amin A, Mahmoud-Ghoneim D, Syam MI, Daoud S. Neural network assessment of herbal protection against chemotherapeutic-induced reproductive toxicity. Theor Biol Med Model 2012; 9:1. [PMID: 22272939 PMCID: PMC3293062 DOI: 10.1186/1742-4682-9-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 11/16/2011] [Accepted: 01/24/2012] [Indexed: 12/17/2022] [Open Access]
Abstract
The aim of this study is to assess the protective effects of Ginkgo biloba (GB) extract against chemotherapeutic-induced reproductive toxicity using a data mining tool, namely Neural Network Clustering (NNC), on two types of data: biochemical and fertility indicators, and Texture Analysis (TA) parameters. GB extract (1 g/kg/day) was given orally to male albino rats for 26 days. This period began 21 days before a single cisplatin (CIS) intraperitoneal injection (10 mg/kg body weight). GB given orally significantly restored reproductive function. The tested extract also notably reduced the CIS-induced reproductive toxicity, as evidenced by the restoration of normal testis morphology. With GB, the attenuation of CIS-induced damage was associated with less apoptotic cell death both in the testicular tissue and in the sperm. CIS-induced alterations of testicular lipid peroxidation were markedly improved by the examined plant extract. NNC has been used for classifying animal groups based on the quantified biochemical and fertility indicators and the microscopic image texture parameters extracted by TA. NNC showed the separation of two clusters and the distribution of groups among them in a way that signifies the dose-dependent protective effect of GB. The present study introduces the neural network as a powerful tool to assess both biochemical and histopathological data. We also show here that herbal protection against CIS-induced reproductive toxicity, established using classic methodologies, is validated by neural network analysis.
Affiliation(s)
- Amr Amin
- Biology Department, UAE University, University St, Al-Ain 17551, UAE.
18
Ma Q, Wang JTL. Biological Data Mining Using Bayesian Neural Networks: A Case Study. Int J Artif Intell T 2011. [DOI: 10.1142/s0218213099000294] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Biological data mining is the activity of finding significant information in biomolecular data. The significant information may refer to motifs, clusters, genes, and protein signatures. This paper presents an example of biological data mining: the recognition of promoters in DNA. We propose a two-level ensemble of classifiers to recognize E. coli promoter sequences. The first-level classifiers include three Bayesian neural networks that learn from three different feature sets. The outputs of the first-level classifiers are combined at the second level to give the final result. An empirical study shows that a precision rate of 92.2% is achieved, indicating excellent performance of the proposed approach.
Affiliation(s)
- QICHENG MA
- Department of Computer and Information Science, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
- JASON T. L. WANG
- Department of Computer and Information Science, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
19
Rodrigues TDS, Cardoso FC, Teixeira SMR, Oliveira SC, Braga AP. Protein classification with Extended-Sequence Coding by sliding window. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1721-1726. [PMID: 21519118 DOI: 10.1109/tcbb.2011.78] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
A large number of unclassified sequences is still found in public databases, which suggests that there is still a need for new investigations in the area. In this contribution, we present a methodology based on Artificial Neural Networks for protein functional classification. A new protein coding scheme, called here Extended-Sequence Coding by Sliding Windows, is presented with the goal of overcoming some of the difficulties of the well-known method Sequence Coding by Sliding Window. The new protein coding scheme uses more than one sliding window length with a weight factor that is proportional to the window length, avoiding the ambiguity problem without ignoring the identity of small subsequences. Accuracy for Sequence Coding by Sliding Windows ranged from 60.1 to 77.7 percent for the first bacterium protein set and from 61.9 to 76.7 percent for the second one, whereas the accuracy for the proposed Extended-Sequence Coding by Sliding Windows scheme ranged from 70.7 to 97.1 percent for the first bacterium protein set and from 61.1 to 93.3 percent for the second one. Additionally, protein sequences classified inconsistently by the Artificial Neural Networks were analyzed by CD-Search, revealing that there is some disagreement in public repositories and calling attention to the relevant issue of error propagation in annotated databases due to incorrectly transferred annotations.
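The multi-window coding idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract only says that several window lengths are used with a weight proportional to the window length, so the choice of weight (here simply the window length itself), the window lengths, the function names, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter


def sliding_window_counts(sequence, window):
    """Count every contiguous subsequence of length `window` in `sequence`."""
    return Counter(
        sequence[i:i + window] for i in range(len(sequence) - window + 1)
    )


def extended_sliding_window_features(sequence, windows=(1, 2, 3)):
    """Combine subsequence counts from several window lengths.

    Assumption: each count is multiplied by its window length, which is one
    simple way to make the weight proportional to the window length.
    """
    features = {}
    for window in windows:
        for subsequence, count in sliding_window_counts(sequence, window).items():
            features[subsequence] = count * window
    return features


# Hypothetical toy sequence, not taken from the paper's protein sets.
features = extended_sliding_window_features("MKVMK", windows=(1, 2))
print(features)
```

In this sketch, longer subsequences contribute more per occurrence than single residues, while the single-residue counts are still kept, which mirrors the stated goal of not ignoring small subsequences.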
Affiliation(s)
- Thiago de Souza Rodrigues
- Computer Department, Federal Center of Technological Education of Minas Gerais, Av. Amazonas 5253, Nova Suiça, Belo Horizonte 30421-169, MG, Brazil
20
Guo J, Rao N. Predicting protein folding rate from amino acid sequence. J Bioinform Comput Biol 2011; 9:1-13. [PMID: 21328704 DOI: 10.1142/s0219720011005306] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2010] [Revised: 10/19/2010] [Accepted: 10/19/2010] [Indexed: 11/18/2022]
Abstract
Predicting protein folding rate from amino acid sequence is an important challenge in computational and molecular biology. Over the past few years, many methods have been developed to reflect the correlation between the folding rates and protein structures and sequences. In this paper, we present an effective method, a combined neural network-genetic algorithm approach, to predict protein folding rates only from amino acid sequences, without any explicit structural information. The originality of this paper is that, for the first time, it tackles the effect of sequence order. The proposed method provides a good correlation between the predicted and experimental folding rates. The correlation coefficient is 0.80 and the standard error is 2.65 for 93 proteins, the largest such database of proteins yet studied, when evaluated with a leave-one-out jackknife test. The comparative results demonstrate that this correlation is better than that of most other methods, and suggest the important contribution of sequence order information to the determination of protein folding rates.
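The leave-one-out jackknife evaluation mentioned in this abstract can be sketched as follows. This is only an illustration of the evaluation protocol: the paper's neural network and genetic algorithm predictor is replaced here by a plain least-squares line, and the data points, function names, and the single input feature are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def leave_one_out_predictions(xs, ys):
    """Predict each sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    return predictions


# Made-up values standing in for a sequence-derived feature and a folding rate.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
preds = leave_one_out_predictions(xs, ys)
r = pearson(preds, ys)
print(f"leave-one-out correlation: {r:.3f}")
```

Because every prediction comes from a model that never saw the held-out sample, the resulting correlation is a less optimistic estimate than one computed on the training data.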
Affiliation(s)
- Jianxiu Guo
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China.
21
Abstract
Both supervised and unsupervised neural networks have been applied to the prediction of protein structure and function. Here, we focus on feedforward neural networks and describe how these learning machines can be applied to protein prediction. We discuss how to select an appropriate data set, how to choose and encode protein features into the neural network input, and how to assess the predictor's performance.
Affiliation(s)
- Marco Punta
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
22
Abstract
As extensive mass spectrometry-based mapping of the phosphoproteome progresses, computational analysis of phosphorylation-dependent signaling becomes increasingly important. The linear sequence motifs that surround phosphorylated residues have successfully been used to characterize kinase-substrate specificity. Here, we briefly describe the available resources for predicting kinase-specific phosphorylation from sequence properties. We address the strengths and weaknesses of these resources, which are based on methods ranging from simple consensus patterns to more advanced machine-learning algorithms. Furthermore, a protocol for the use of the artificial neural network based predictors, NetPhos and NetPhosK, is provided. Finally, we point to possible developments with the intention of providing the community with improved and additional phosphorylation predictors for large-scale modeling of cellular signaling networks.
Affiliation(s)
- Martin L Miller
- Technical University of Denmark, Center for Biological Sequence Analysis, Lyngby, Denmark
23
Zhang AB, Savolainen P. BPSI2.0: a C/C++ interface program for species identification via DNA barcoding with a BP-neural network by calling the Matlab engine. Mol Ecol Resour 2008; 9:104-6. [PMID: 21564572 DOI: 10.1111/j.1755-0998.2008.02372.x] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
Abstract
BP-Species Identification (BPSI2.0) is a computer program that performs species identification by training a Back-Propagation Neural Network. A short DNA barcoding segment is used as input for training a three-layer BP network. The trained network can assign an unknown query sequence to a known species in the user's database, and provide the corresponding subvector value of the output vector as a relative probability value.
Affiliation(s)
- A B Zhang
- Albanova University Center, KTH - Royal Institute of Biotechnology, SE-106 91 Stockholm, Sweden
24
Arav-Boger R, Boger YS, Foster CB, Boger Z. The use of artificial neural networks in prediction of congenital CMV outcome from sequence data. Bioinform Biol Insights 2008; 2:281-9. [PMID: 19812782 PMCID: PMC2735958 DOI: 10.4137/bbi.s764] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
A large number of CMV strains has been reported to circulate in the human population, and the biological significance of these strains is currently an active area of research. The analysis of complex genetic information may be limited using conventional phylogenetic techniques. We constructed artificial neural networks to determine their feasibility in predicting the outcome of congenital CMV disease (defined as presence of CMV symptoms at birth) based on two data sets: 54 sequences of CMV gene UL144 obtained from 54 amniotic fluids of women who contracted acute CMV infection during their pregnancy, and 80 sequences of 4 genes (US28, UL144, UL146 and UL147) obtained from urine, saliva or blood of 20 congenitally infected infants that displayed different outcomes at birth. When data from all four genes were used in the 20-infant set, the artificial neural network model accurately identified outcome in 90% of cases. While US28 and UL147 had low yield in predicting outcome, UL144 and UL146 predicted outcome in 80% and 85% of cases, respectively, when used separately. The model identified specific nucleotide positions that were highly relevant to prediction of outcome. The artificial neural network classified genotypes in agreement with classic phylogenetic analysis. We suggest that artificial neural networks can accurately and efficiently analyze sequences obtained from larger cohorts to determine specific outcomes. The ANN training and analysis code is commercially available from Optimal Neural Informatics (Pikesville, MD).
Affiliation(s)
- Ravit Arav-Boger
- Department of Pediatrics, Division of Infectious Diseases, Johns Hopkins Hospital, Baltimore, MD 21287, USA.
25
Zhang AB, Sikes DS, Muster C, Li SQ. Inferring Species Membership Using DNA Sequences with Back-Propagation Neural Networks. Syst Biol 2008; 57:202-15. [DOI: 10.1080/10635150802032982] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022] Open
Affiliation(s)
- A. B. Zhang
- Institute of Zoology, Chinese Academy of Sciences, Beijing 100080, P. R. China
- Current Address: Albanova University Center, Royal Institute of Biotechnology, SE-106 91 Stockholm, Sweden
- D. S. Sikes
- University of Alaska Museum, 907 Yukon Drive, Fairbanks, Alaska 99775-6960, USA
- C. Muster
- Molecular Evolution and Animal Systematics, University of Leipzig, Talstrasse 33, D-04103 Leipzig, Germany
- S. Q. Li
- Institute of Zoology, Chinese Academy of Sciences, Beijing 100080, P. R. China
26
Chang EJ, Begum R, Chait BT, Gaasterland T. Prediction of cyclin-dependent kinase phosphorylation substrates. PLoS One 2007; 2:e656. [PMID: 17668044 PMCID: PMC1924601 DOI: 10.1371/journal.pone.0000656] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2007] [Accepted: 06/24/2007] [Indexed: 11/18/2022] Open
Abstract
Protein phosphorylation, mediated by a family of enzymes called cyclin-dependent kinases (Cdks), plays a central role in the cell-division cycle of eukaryotes. Phosphorylation by Cdks directs the cell cycle by modifying the function of regulators of key processes such as DNA replication and mitotic progression. Here, we present a novel computational procedure to predict substrates of the cyclin-dependent kinase Cdc28 (Cdk1) in Saccharomyces cerevisiae. Currently, most computational phosphorylation site prediction procedures focus solely on local sequence characteristics. In the present procedure, we model Cdk substrates based on both local and global characteristics of the substrates. Thus, we define the local sequence motifs that represent the Cdc28 phosphorylation sites and subsequently model clustering of these motifs within the protein sequences. This restraint reflects the observation that many known Cdk substrates contain multiple clustered phosphorylation sites. The present strategy defines a subset of the proteome that is highly enriched for Cdk substrates, as validated by comparing it to a set of bona fide, published, experimentally characterized Cdk substrates which was, to our knowledge, comprehensive at the time of writing. To corroborate our model, we compared its predictions with three experimentally independent Cdk proteomic datasets and found significant overlap. Finally, we directly detected in vivo phosphorylation at Cdk motifs for selected putative substrates using mass spectrometry.
Affiliation(s)
- Emmanuel J Chang
- Department of Chemistry, York College of the City University of New York, Jamaica, New York, United States of America; Laboratory of Mass Spectrometry and Gaseous Ion Chemistry, Rockefeller University, New York, New York, United States of America.
27
Santos-Méndez J, Hernández S. Effect of Recycle Streams on the Closed-Loop Dynamics of Thermally Coupled Distillation Sequences. Chem Eng Commun 2007. [DOI: 10.1080/009864490522669] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
28
Hjerrild M, Gammeltoft S. Phosphoproteomics toolbox: Computational biology, protein chemistry and mass spectrometry. FEBS Lett 2006; 580:4764-70. [PMID: 16914146 DOI: 10.1016/j.febslet.2006.07.068] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2006] [Revised: 07/21/2006] [Accepted: 07/25/2006] [Indexed: 11/29/2022]
Abstract
Protein phosphorylation is important for regulation of most biological functions and up to 50% of all proteins are thought to be modified by protein kinases. Increased knowledge about potential phosphorylation of a protein may increase our understanding of the molecular processes in which it takes part. Despite the importance of protein phosphorylation, identification of phosphoproteins and localization of phosphorylation sites is still a major challenge in proteomics. However, high-throughput methods for identification of phosphoproteins are being developed, in particular within the fields of bioinformatics and mass spectrometry. In this review, we present a toolbox of current technology applied in phosphoproteomics including computational prediction, chemical approaches and mass spectrometry-based analysis, and propose an integrated strategy for experimental phosphoproteomics.
Affiliation(s)
- Majbrit Hjerrild
- Department of Clinical Biochemistry, Glostrup Hospital, Nordre Ringvej, DK-2600 Glostrup, Denmark.
29
Ferraro E, Via A, Ausiello G, Helmer-Citterich M. A novel structure-based encoding for machine-learning applied to the inference of SH3 domain specificity. Bioinformatics 2006; 22:2333-9. [PMID: 16870929 DOI: 10.1093/bioinformatics/btl403] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Unravelling the rules underlying protein-protein and protein-ligand interactions is a crucial step in understanding cell machinery. Peptide recognition modules (PRMs) are globular protein domains which focus their binding targets on short protein sequences and play a key role in the frame of protein-protein interactions. High-throughput techniques permit the whole proteome scanning of each domain, but they are characterized by a high incidence of false positives. In this context, there is a pressing need for the development of in silico experiments to validate experimental results and of computational tools for the inference of domain-peptide interactions. RESULTS We focused on the SH3 domain family and developed a machine-learning approach for inferring interaction specificity. SH3 domains are well-studied PRMs which typically bind proline-rich short sequences characterized by the PxxP consensus. The binding information is known to be held in the conformation of the domain surface and in the short sequence of the peptide. Our method relies on interaction data from high-throughput techniques and benefits from the integration of sequence and structure data of the interacting partners. Here, we propose a novel encoding technique aimed at representing binding information on the basis of the domain-peptide contact residues in complexes of known structure. Remarkably, the new encoding requires few variables to represent an interaction, thus avoiding the 'curse of dimension'. Our results display an accuracy >90% in detecting new binders of known SH3 domains, thus outperforming neural models on standard binary encodings, profile methods and recent statistical predictors. The method, moreover, shows a generalization capability, inferring specificity of unknown SH3 domains displaying some degree of similarity with the known data.
Affiliation(s)
- E Ferraro
- Centre of Molecular Bioinformatics, Department of Biology, University of Tor Vergata Rome, Italy.
30
Wang H, Azuaje F, Black N. Improving biomolecular pattern discovery and visualization with hybrid self-adaptive networks. IEEE Trans Nanobioscience 2006; 1:146-66. [PMID: 16689206 DOI: 10.1109/tnb.2003.809465] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
There is an increasing need to develop powerful techniques to improve biomedical pattern discovery and visualization. This paper presents an automated approach, based on hybrid self-adaptive neural networks, to pattern identification and visualization for biomolecular data. The methods are tested on two datasets: leukemia expression data and DNA splice-junction sequences. Several supervised and unsupervised models are implemented and compared. A comprehensive evaluation study of some of their intrinsic mechanisms is presented. The results suggest that these tools may be useful to support biological knowledge discovery based on advanced classification and visualization tasks.
Affiliation(s)
- Haiying Wang
- School of Computing and Mathematics, University of Ulster, Jordanstown BT37 0QB, UK.
31
Zhao XM, Du JX, Wang HQ, Zhu Y, Li Y. A New Technique for Selecting Features from Protein Sequences. Int J Pattern Recogn 2006. [DOI: 10.1142/s021800140600465x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
A new method for selecting features from protein sequences is proposed in this paper. First, the protein sequences are converted into fixed-dimensional feature vectors. Then, a subset of features is selected using a relative entropy method and used as the input to a Support Vector Machine (SVM). Finally, the trained SVM classifier is utilized to classify protein sequences into certain known protein families. Experimental results over proteins obtained from the PIR database and GPCRs have shown that our proposed approach is effective and efficient in selecting features from protein sequences.
Affiliation(s)
- XING-MING ZHAO
- Institute of Intelligent Machines, Chinese Academy of Sciences, P. O. Box.1130, Hefei, Anhui 230031, P. R. China
- JI-XIANG DU
- Institute of Intelligent Machines, Chinese Academy of Sciences, P. O. Box.1130, Hefei, Anhui 230031, P. R. China
- HONG-QIANG WANG
- Institute of Intelligent Machines, Chinese Academy of Sciences, P. O. Box.1130, Hefei, Anhui 230031, P. R. China
- YUNPING ZHU
- Beijing Institute of Radiation Medicine, Taiping Road 27, Beijing 100850, P. R. China
- YIXUE LI
- Bioinformatics Center, Shanghai Institutes for Biological Sciences, CAS, 320 Yue Yang Road, Shanghai, 200031, P. R. China
32
Ferraro E, Via A, Ausiello G, Helmer-Citterich M. A neural strategy for the inference of SH3 domain-peptide interaction specificity. BMC Bioinformatics 2005; 6 Suppl 4:S13. [PMID: 16351739 PMCID: PMC1866395 DOI: 10.1186/1471-2105-6-s4-s13] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Background The SH3 domain family is one of the most representative and widely studied cases of so-called Peptide Recognition Modules (PRM). The polyproline II motif PxxP that generally characterizes its ligands does not reflect the complex interaction spectrum of the over 1500 different SH3 domains, and the requirement of a more refined knowledge of their specificity implies the setting up of appropriate experimental and theoretical strategies. Due to the limitations of the current technology for peptide synthesis, several experimental high-throughput approaches have been devised to elucidate protein-protein interaction mechanisms. Such approaches can rely on and take advantage of computational techniques, such as regular expressions or position specific scoring matrices (PSSMs) to pre-process entire proteomes in the search for putative SH3 targets. In this regard, a reliable inference methodology to be used for reducing the sequence space of putative binding peptides represents a valuable support for molecular and cellular biologists. Results Using as benchmark the peptide sequences obtained from in vitro binding experiments, we set up a neural network model that performs better than PSSM in the detection of SH3 domain interactors. In particular our model is more precise in its predictions, even if its performance can vary among different SH3 domains and is strongly dependent on the number of binding peptides in the benchmark. Conclusion We show that a neural network can be more effective than standard methods in SH3 domain specificity detection. Neural classifiers identify general SH3 domain binders and domain-specific interactors from a PxxP peptide population, provided that there is a sufficient proportion of true positives in the training sets. This capability can also improve peptide selection for library definition in array experiments. Further advances can be achieved, including properly encoded domain sequences and structural information as input for a global neural network.
Affiliation(s)
- Enrico Ferraro
- Centre for Molecular Bioinformatics, Department of Biology, University of Tor Vergata, Rome, Italy
- Allegra Via
- Centre for Molecular Bioinformatics, Department of Biology, University of Tor Vergata, Rome, Italy
- Gabriele Ausiello
- Centre for Molecular Bioinformatics, Department of Biology, University of Tor Vergata, Rome, Italy
- Manuela Helmer-Citterich
- Centre for Molecular Bioinformatics, Department of Biology, University of Tor Vergata, Rome, Italy
33
Hjerrild M, Stensballe A, Rasmussen TE, Kofoed CB, Blom N, Sicheritz-Ponten T, Larsen MR, Brunak S, Jensen ON, Gammeltoft S. Identification of phosphorylation sites in protein kinase A substrates using artificial neural networks and mass spectrometry. J Proteome Res 2004; 3:426-33. [PMID: 15253423 DOI: 10.1021/pr0341033] [Citation(s) in RCA: 60] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Protein phosphorylation plays a key role in cell regulation and identification of phosphorylation sites is important for understanding their functional significance. Here, we present an artificial neural network algorithm: NetPhosK (http://www.cbs.dtu.dk/services/NetPhosK/) that predicts protein kinase A (PKA) phosphorylation sites. The neural network was trained with a positive set of 258 experimentally verified PKA phosphorylation sites. The predictions by NetPhosK were validated using four novel PKA substrates: Necdin, RFX5, En-2, and Wee 1. The four proteins were phosphorylated by PKA in vitro and 13 PKA phosphorylation sites were identified by mass spectrometry. NetPhosK was 100% sensitive and 41% specific in predicting PKA sites in the four proteins. These results demonstrate the potential of using integrated computational and experimental methods for detailed investigations of the phosphoproteome.
Affiliation(s)
- Majbrit Hjerrild
- Department of Clinical Biochemistry, Glostrup Hospital, Nordre Ringvej 57, DK-2600 Glostrup, Denmark.
34
A neural network based multi-classifier system for gene identification in DNA sequences. Neural Comput Appl 2004. [DOI: 10.1007/s00521-004-0447-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
35
Burden S, Lin YX, Zhang R. Improving promoter prediction for the NNPP2.2 algorithm: a case study using Escherichia coli DNA sequences. Bioinformatics 2004; 21:601-7. [PMID: 15454410 DOI: 10.1093/bioinformatics/bti047] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Although a great deal of research has been undertaken in the area of promoter prediction, prediction techniques are still not fully developed. Many algorithms tend to exhibit poor specificity, generating many false positives, or poor sensitivity. The neural network prediction program NNPP2.2 is one such example. RESULTS To improve the NNPP2.2 prediction technique, the distance between the transcription start site (TSS) associated with the promoter and the translation start site (TLS) of the subsequent gene coding region has been studied for Escherichia coli K12 bacteria. An empirical probability distribution that is consistent for all E. coli promoters has been established. This information is combined with the results from NNPP2.2 to create a new technique called TLS-NNPP, which improves the specificity of promoter prediction. The technique is shown to be effective using E. coli DNA sequences; however, it is applicable to any organism for which a set of promoters has been experimentally defined. AVAILABILITY The data used in this project and the prediction results for the tested sequences can be obtained from http://www.uow.edu.au/~yanxia/E_Coli_paper/SBurden_Results.xls CONTACT alh98@uow.edu.au.
Affiliation(s)
- S Burden
- Department of Mathematics and Applied Statistics, University of Wollongong Wollongong, NSW 2522, Australia.
36
Berry EA, Dalby AR, Yang ZR. Reduced bio basis function neural network for identification of protein phosphorylation sites: comparison with pattern recognition algorithms. Comput Biol Chem 2004; 28:75-85. [PMID: 15022646 DOI: 10.1016/j.compbiolchem.2003.11.005] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
Abstract
Protein phosphorylation is a post-translational modification performed by a group of enzymes known as the protein kinases or phosphotransferases (Enzyme Commission classification 2.7). It is essential to the correct functioning of both proteins and cells, being involved with enzyme control, cell signalling and apoptosis. The major problem when attempting prediction of these sites is the broad substrate specificity of the enzymes. This study employs back-propagation neural networks (BPNNs), the decision tree algorithm C4.5 and the reduced bio-basis function neural network (rBBFNN) to predict phosphorylation sites. The aim is to compare prediction efficiency of the three algorithms for this problem, and examine knowledge extraction capability. All three algorithms are effective for phosphorylation site prediction. Results indicate that rBBFNN is the fastest and most sensitive of the algorithms. BPNN has the highest area under the ROC curve and is therefore the most robust, and C4.5 has the highest prediction accuracy. C4.5 also reveals that the amino acid 2 residues upstream from the phosphorylation site is important for serine/threonine phosphorylation, whilst the amino acid 3 residues upstream is important for tyrosine phosphorylation.
Affiliation(s)
- Emily A Berry
- Department of Computer Science, School of Engineering, Computer Science and Mathematics, University of Exeter, UK.
37
Oakley BA, Hanna DM. A Review of Nanobioscience and Bioinformatics Initiatives in North America. IEEE Trans Nanobioscience 2004; 3:74-84. [PMID: 15382648 DOI: 10.1109/tnb.2003.820259] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Affiliation(s)
- Barbara A Oakley
- School of Engineering and Computer Science, Oakland University, Rochester, MI 48309, USA.
38
Kalate RN, Tambe SS, Kulkarni BD. Artificial neural networks for prediction of mycobacterial promoter sequences. Comput Biol Chem 2004; 27:555-64. [PMID: 14667783 DOI: 10.1016/j.compbiolchem.2003.09.004] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
Abstract
A multilayered feed-forward ANN architecture trained using the error-back-propagation (EBP) algorithm has been developed for predicting whether a given nucleotide sequence is a mycobacterial promoter sequence. Owing to the high prediction capability (≈97%) of the developed network model, it has been further used in conjunction with the caliper randomization (CR) approach for determining the structurally/functionally important regions in the promoter sequences. The results obtained thereby indicate that: (i) upstream region of -35 box, (ii) -35 region, (iii) spacer region and (iv) -10 box are important for mycobacterial promoters. The CR approach also suggests that the -38 to -29 region plays a significant role in determining whether a given sequence is a mycobacterial promoter. In essence, the present study establishes ANNs as a tool for predicting mycobacterial promoter sequences and determining structurally/functionally important sub-regions therein.
Affiliation(s)
- Rupali N Kalate
- Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan.
39
Ma Q, Wang J, Shasha D, Wu C. DNA sequence classification via an expectation maximization algorithm and neural networks: a case study. IEEE Trans Syst Man Cybern C Appl Rev 2001. [DOI: 10.1109/5326.983930] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
40
Frimurer TM, Bywater R, Naerum L, Lauritsen LN, Brunak S. Improving the odds in discriminating "drug-like" from "non drug-like" compounds. J Chem Inf Comput Sci 2000; 40:1315-24. [PMID: 11128089 DOI: 10.1021/ci0003810] [Citation(s) in RCA: 82] [Impact Index Per Article: 3.3] [Indexed: 11/29/2022]
Abstract
We have used a feed-forward neural network technique to classify chemical compounds into potentially "drug-like" and "non drug-like" candidates. The neural network was trained to distinguish between a set of "drug-like" and "non drug-like" chemical compounds taken from the MACCS-II Drug Data Report (MDDR) and the Available Chemicals Directory (ACD). The 2D atom types (of the full atomic representation) were assigned and applied as descriptors to encode numerically each compound. There are four main conclusions: First the method performs well, correctly assigning 88% of the compounds in both MDDR and ACD. Improved discrimination was achieved by a more critical selection of training sets. Second, the method gives much better prediction performance than the widely used "Rule of Five", which accepts as many as 74% of the ACD compounds but only 66% of those in MDDR, resulting in a correlation coefficient which is effectively zero, compared to a value of 0.63 for the neural network prediction. Third, based on a standard Tanimoto similarity search the selection of drug-like compounds in the evaluation set is not biased toward compounds similar to those in the training set. Fourth, the trained neural network was applied to evaluate the drug-likeness of 136 GABA uptake inhibitors with impressive results. The implications of applying a neural network to characterize chemical compounds are discussed.
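The near-zero correlation reported for the "Rule of Five" can be checked with a short calculation. The sketch below assumes the coefficient is a Matthews-style correlation and that the drug-like (MDDR) and non drug-like (ACD) sets are equally sized; neither assumption is stated in the abstract, and the function name `matthews_cc` is only illustrative.

```python
import math


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix entries."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # coefficient is undefined here; 0 is a common convention
    return (tp * tn - fp * fn) / denom


# Assumption: both classes have the same size, so fractions can stand in
# for counts. The rule accepts 66% of MDDR (drug-like) compounds and
# 74% of ACD (non drug-like) compounds, as quoted in the abstract.
rule_of_five_mcc = matthews_cc(tp=0.66, fp=0.74, tn=0.26, fn=0.34)
print(round(rule_of_five_mcc, 3))
```

Under these assumptions the value is slightly negative and close to zero, which is consistent with the abstract's description of the rule as having effectively no discriminating power on these two sets.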
|
41
|
Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J Pharm Biomed Anal 2000. [PMID: 10815714 DOI: 10.1016/s0731-7085(99)00272-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 11/17/2022]
|
42
|
Agatonovic-Kustrin S, Beresford R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J Pharm Biomed Anal 2000; 22:717-27. [PMID: 10815714 DOI: 10.1016/s0731-7085(99)00272-1] [Citation(s) in RCA: 493] [Impact Index Per Article: 19.7] [Indexed: 12/26/2022]
Abstract
Artificial neural networks (ANNs) are biologically inspired computer programs designed to simulate the way in which the human brain processes information. ANNs gather their knowledge by detecting the patterns and relationships in data and learn (or are trained) through experience, not from programming. An ANN is formed from hundreds of single units, artificial neurons or processing elements (PE), connected with coefficients (weights), which constitute the neural structure and are organised in layers. The power of neural computations comes from connecting neurons in a network. Each PE has weighted inputs, a transfer function, and one output. The behavior of a neural network is determined by the transfer functions of its neurons, by the learning rule, and by the architecture itself. The weights are the adjustable parameters and, in that sense, a neural network is a parameterized system. The weighted sum of the inputs constitutes the activation of the neuron. The activation signal is passed through the transfer function to produce a single output of the neuron. The transfer function introduces non-linearity to the network. During training, the inter-unit connections are optimized until the error in predictions is minimized and the network reaches the specified level of accuracy. Once the network is trained and tested it can be given new input information to predict the output. Many types of neural networks have been designed already and new ones are invented every week but all can be described by the transfer functions of their neurons, by the learning rule, and by the connection formula. ANN represents a promising modeling technique, especially for data sets having non-linear relationships which are frequently encountered in pharmaceutical processes. In terms of model specification, artificial neural networks require no knowledge of the data source but, since they often contain many weights that must be estimated, they require large training sets.
In addition, ANNs can combine and incorporate both literature-based and experimental data to solve problems. The various applications of ANNs can be summarised into classification or pattern recognition, prediction and modeling. Supervised associating networks can be applied in pharmaceutical fields as an alternative to conventional response surface methodology. Unsupervised feature-extracting networks represent an alternative to principal component analysis. Non-adaptive unsupervised networks are able to reconstruct their patterns when presented with noisy samples and can be used for image recognition. The potential applications of ANN methodology in the pharmaceutical sciences range from interpretation of analytical data, drug and dosage form design through biopharmacy to clinical pharmacy.
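The description of a single processing element (weighted inputs, an activation formed from their weighted sum, and a transfer function producing one output) can be illustrated with a minimal sketch. The logistic transfer function, the optional bias term, and the name `neuron_output` are assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def neuron_output(inputs, weights, bias=0.0):
    """Output of one artificial neuron (processing element).

    The activation is the weighted sum of the inputs (plus an optional
    bias, which is an assumption here). A logistic transfer function,
    chosen only as an example, maps the activation to a single output.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-activation))


# Two inputs whose weighted contributions cancel give an activation of 0,
# which the logistic function maps to 0.5.
print(neuron_output([1.0, 2.0], [0.5, -0.25]))
```

Training, as the abstract describes it, would adjust `weights` until the prediction error is minimized; that step is not shown here.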
|
43
|
Blom N, Gammeltoft S, Brunak S. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 1999; 294:1351-62. [PMID: 10600390 DOI: 10.1006/jmbi.1999.3310] [Citation(s) in RCA: 2378] [Impact Index Per Article: 91.5] [Indexed: 11/22/2022]
Abstract
Protein phosphorylation at serine, threonine or tyrosine residues affects a multitude of cellular signaling processes. How is specificity in substrate recognition and phosphorylation by protein kinases achieved? Here, we present an artificial neural network method that predicts phosphorylation sites in independent sequences with a sensitivity in the range from 69% to 96%. As an example, we predict novel phosphorylation sites in the p300/CBP protein that may regulate interaction with transcription factors and histone acetyltransferase activity. In addition, serine and threonine residues in p300/CBP that can be modified by O-linked glycosylation with N-acetylglucosamine are identified. Glycosylation may prevent phosphorylation at these sites, a mechanism named yin-yang regulation. The prediction server is available on the Internet at http://www.cbs.dtu.dk/services/NetPhos/ or via e-mail to NetPhos@cbs.dtu.dk.
Affiliation(s)
- N Blom
- Department of Biotechnology, The Technical University of Denmark, Lyngby, DK-2800, Denmark
|
44
|
|
45
|
Gupta R, Jung E, Gooley AA, Williams KL, Brunak S, Hansen J. Scanning the available Dictyostelium discoideum proteome for O-linked GlcNAc glycosylation sites using neural networks. Glycobiology 1999; 9:1009-22. [PMID: 10521537 DOI: 10.1093/glycob/9.10.1009] [Citation(s) in RCA: 82] [Impact Index Per Article: 3.2] [Indexed: 11/12/2022]
Abstract
Dictyostelium discoideum has been suggested as a eukaryotic model organism for glycobiology studies. Presently, the characteristics of acceptor sites for the N-acetylglucosaminyl-transferases in Dictyostelium discoideum, which link GlcNAc in an alpha linkage to hydroxyl residues, are largely unknown. This motivates the development of a species-specific method for prediction of O-linked GlcNAc glycosylation sites in secreted and membrane proteins of D. discoideum. The method presented here employs a jury of artificial neural networks. These networks were trained to recognize the sequence context and protein surface accessibility in 39 experimentally determined O-alpha-GlcNAc sites found in D. discoideum glycoproteins expressed in vivo. Cross-validation of the data revealed a correlation in which 97% of the glycosylated and nonglycosylated sites were correctly identified. Based on the currently limited data set, an abundant periodicity of two (positions -3, -1, +1, +3, etc.) in proline residues alternating with hydroxyl amino acids was observed upstream and downstream of the acceptor site. This was a consequence of the spacing of the glycosylated residues themselves, which were peculiarly found to be situated only at even positions with respect to each other, indicating that these may be located within beta-strands. The method has been used for a rapid and ranked scan of the fraction of the Dictyostelium proteome available in public databases, remarkably 25-30% of which were predicted glycosylated. The scan revealed acceptor sites in several proteins known experimentally to be O-glycosylated at unmapped sites. The available proteome was classified into functional and cellular compartments to study any preferential patterns of glycosylation. A sequence-based prediction server for GlcNAc O-glycosylations in D. discoideum proteins has been made available through the WWW at http://www.cbs.dtu.dk/services/DictyOGlyc/ and via e-mail to DictyOGlyc@cbs.dtu.dk.
Affiliation(s)
- R Gupta
- Department of Biotechnology, Technical University of Denmark, Lyngby, Denmark
|
46
|
Wang HC, Dopazo J, de la Fraga LG, Zhu YP, Carazo JM. Self-organizing tree-growing network for the classification of protein sequences. Protein Sci 1998; 7:2613-22. [PMID: 9865956 PMCID: PMC2143887 DOI: 10.1002/pro.5560071215] [Citation(s) in RCA: 20] [Impact Index Per Article: 0.7] [Indexed: 11/07/2022]
Abstract
The self-organizing tree algorithm (SOTA) was recently introduced to construct phylogenetic trees from biological sequences, based on the principles of Kohonen's self-organizing maps and on Fritzke's growing cell structures. SOTA is designed in such a way that the generation of new nodes can be stopped when the sequences assigned to a node are already above a certain similarity threshold. In this way a phylogenetic tree resolved at a high taxonomic level can be obtained. This capability is especially useful to classify sets of diversified sequences. SOTA was originally designed to analyze pre-aligned sequences. It is now adapted to be able to analyze patterns associated with the frequency of residues along a sequence, such as protein dipeptide composition and other n-gram compositions. In this work we show that the algorithm applied to these data is able to not only successfully construct phylogenetic trees of protein families, such as cytochrome c, triosephosphate isomerase, and hemoglobin alpha chains, but also classify very diversified sequence data sets, such as a mixture of interleukins and their receptors.
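The dipeptide and n-gram compositions mentioned above are alignment-free descriptions of a sequence. The following is a minimal sketch of how such a composition might be computed, assuming overlapping n-grams and normalization by the total n-gram count; the function name `ngram_composition` is hypothetical, and this is not the SOTA implementation itself.

```python
from collections import Counter


def ngram_composition(sequence, n=2):
    """Relative frequency of each overlapping n-gram in a sequence.

    With n=2 on a protein sequence this gives a dipeptide composition.
    Returns an empty dict when the sequence is shorter than n.
    """
    seq = sequence.upper()
    total = len(seq) - n + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


# "AAAB" contains the dipeptides AA, AA, AB.
print(ngram_composition("AAAB"))
```

Vectors built this way have the same length for every sequence once a fixed n-gram alphabet is chosen, which is what allows unaligned sequences of different lengths to be compared.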
Affiliation(s)
- H C Wang
- Centro Nacional de Biotecnologia-CSIC, Universidad Autonoma, Madrid, Spain
|