1
Jing X, Dong Q, Hong D, Lu R. Amino Acid Encoding Methods for Protein Sequences: A Comprehensive Review and Assessment. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1918-1931. [PMID: 30998480 DOI: 10.1109/tcbb.2019.2911677] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Indexed: 05/20/2023]
Abstract
As the first step of machine-learning-based protein structure and function prediction, amino acid encoding plays a fundamental role in the final success of those methods. Unlike protein sequence encoding, amino acid encoding can be used in both residue-level and sequence-level prediction of protein properties when combined with different algorithms. However, it has not attracted enough attention in the past decades, and there have been no comprehensive reviews and assessments of encoding methods so far. In this article, we make a systematic classification and present a comprehensive review and assessment of various amino acid encoding methods. Those methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from the five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks using large-scale benchmark datasets. The results show that the evolution-based position-dependent encoding method PSSM achieved the best performance, and that the structure-based and machine-learning encoding methods also show some potential for further application; the neural-network-based distributed representation of amino acids in particular may bring new light to this area. We hope that the review and assessment are useful for future studies in amino acid encoding.
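As a rough illustration of the simplest category named in this abstract, binary encoding, the sketch below maps each residue to a 20-dimensional 0/1 vector. The alphabet ordering, the function name `one_hot_encode`, and the all-zero treatment of unknown symbols are assumptions made for this example, not details taken from the paper.

```python
# Illustrative sketch only: one-hot ("binary") encoding of a protein sequence.
# The residue ordering and the handling of unknown symbols are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional 0/1 vector per residue of `sequence`."""
    encoded = []
    for residue in sequence.upper():
        vector = [0] * len(AMINO_ACIDS)
        if residue in INDEX:
            vector[INDEX[residue]] = 1
        # Non-standard symbols (for example "X") are left as all zeros.
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    for residue, vector in zip("ACX", one_hot_encode("ACX")):
        print(residue, vector)
```

Because each residue is encoded independently, the same vectors can feed residue-level predictors (one vector per position) or be concatenated or pooled for sequence-level predictors.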
2
Yang X, Wang Y, Byrne R, Schneider G, Yang S. Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery. Chem Rev 2019; 119:10520-10594. [PMID: 31294972 DOI: 10.1021/acs.chemrev.8b00728] [Citation(s) in RCA: 340] [Impact Index Per Article: 68.0] [Indexed: 02/05/2023]
Abstract
Artificial intelligence (AI), and, in particular, deep learning as a subcategory of AI, provides opportunities for the discovery and development of innovative drugs. Various machine learning approaches have recently (re)emerged, some of which may be considered instances of domain-specific AI that have been successfully employed for drug discovery and design. This review provides a comprehensive portrayal of these machine learning techniques and of their applications in medicinal chemistry. After introducing the basic principles, alongside some application notes, of the various machine learning algorithms, the current state of the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects. Finally, several challenges and limitations of the current methods are summarized, with a view to potential future directions for AI-assisted drug discovery and design.
Affiliation(s)
- Xin Yang
  - State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Yifei Wang
  - State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Ryan Byrne
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
- Gisbert Schneider
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
- Shengyong Yang
  - State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
3
Müller AT, Kaymaz AC, Gabernet G, Posselt G, Wessler S, Hiss JA, Schneider G. Sparse Neural Network Models of Antimicrobial Peptide-Activity Relationships. Mol Inform 2016; 35:606-614. [DOI: 10.1002/minf.201600029] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 02/22/2016] [Accepted: 06/13/2016] [Indexed: 01/07/2023]
Affiliation(s)
- Alex T. Müller
  - Swiss Federal Institute of Technology (ETH); Department of Chemistry and Applied Biosciences; Vladimir-Prelog-Weg 4 CH-8093 Zurich Switzerland
- Aral C. Kaymaz
  - Swiss Federal Institute of Technology (ETH); Department of Chemistry and Applied Biosciences; Vladimir-Prelog-Weg 4 CH-8093 Zurich Switzerland
- Gisela Gabernet
  - Swiss Federal Institute of Technology (ETH); Department of Chemistry and Applied Biosciences; Vladimir-Prelog-Weg 4 CH-8093 Zurich Switzerland
- Gernot Posselt
  - Department of Molecular Biology, Division of Microbiology; Paris Lodron University of Salzburg; Billrothstr. 11 A-5020 Salzburg Austria
- Silja Wessler
  - Department of Molecular Biology, Division of Microbiology; Paris Lodron University of Salzburg; Billrothstr. 11 A-5020 Salzburg Austria
- Jan A. Hiss
  - Swiss Federal Institute of Technology (ETH); Department of Chemistry and Applied Biosciences; Vladimir-Prelog-Weg 4 CH-8093 Zurich Switzerland
- Gisbert Schneider
  - Swiss Federal Institute of Technology (ETH); Department of Chemistry and Applied Biosciences; Vladimir-Prelog-Weg 4 CH-8093 Zurich Switzerland
4
Fong Y, Datta S, Georgiev IS, Kwong PD, Tomaras GD. Kernel-based logistic regression model for protein sequence without vectorialization. Biostatistics 2014; 16:480-92. [PMID: 25532524 DOI: 10.1093/biostatistics/kxu056] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/24/2014] [Accepted: 11/13/2014] [Indexed: 11/12/2022] [Open Access]
Abstract
Protein sequence data arise more and more often in vaccine and infectious disease research. These types of data are discrete, high-dimensional, and complex. We propose to study the impact of protein sequences on binary outcomes using a kernel-based logistic regression model, which models the effect of protein through a random effect whose variance-covariance matrix is mostly determined by a kernel function. We propose a novel, biologically motivated, profile hidden Markov model (HMM)-based mutual information (MI) kernel. Hypothesis testing can be carried out using the maximum of the score statistics and a parametric bootstrap procedure. To improve the power of testing, we propose intuitive modifications to the test statistic. We show through simulation studies that the profile HMM-based MI kernel can be substantially more powerful than competing kernels, and that the modified test statistics bring incremental gains in power. We use these proposed methods to investigate two problems from HIV-1 vaccine research: (1) identifying segments of HIV-1 envelope (Env) protein that confer resistance to neutralizing antibody and (2) identifying segments of Env that are associated with attenuation of protective vaccine effect by antibodies of isotype A in the RV144 vaccine trial.
Affiliation(s)
- Youyi Fong
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98006, USA
- Saheli Datta
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98006, USA
- Ivelin S Georgiev
  - Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Peter D Kwong
  - Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Georgia D Tomaras
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC 27710, USA
5
Sadovskaya NS, Sutormin RA, Gelfand MS. Recognition of Transmembrane Segments in Proteins: Review and Consistency-Based Benchmarking of Internet Servers. J Bioinform Comput Biol 2011; 4:1033-56. [PMID: 17099940 DOI: 10.1142/s0219720006002326] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 02/28/2006] [Revised: 06/21/2006] [Accepted: 06/22/2006] [Indexed: 11/18/2022]
Abstract
Membrane proteins perform a number of crucial functions as transporters, receptors, and components of enzyme complexes. Identification of membrane proteins and prediction of their topology is thus an important part of genome annotation. We present here an overview of transmembrane segments in protein sequences, summarize data from large-scale genome studies, and report results of benchmarking of several popular internet servers.
Affiliation(s)
- Nataliya S Sadovskaya
  - Institute for Information Transmission Problems, Russian Academy of Science, Bolshoi Karetny per. 19, Moscow 127994, Russia.
6
Suitable transmembrane domain significantly increase the surface-expression level of Fc(epsilon)RIalpha in 293T cells. J Biotechnol 2008; 139:195-202. [PMID: 19110016 DOI: 10.1016/j.jbiotec.2008.11.007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/20/2008] [Revised: 10/03/2008] [Accepted: 11/23/2008] [Indexed: 11/20/2022]
Abstract
Evidence showed that the extracellular part of Fc(epsilon)RIalpha (FCR) with its own transmembrane domain (TMF) cannot be expressed as a transmembrane form in the CHO cell line. However, FCR could be displayed on the cell surface with the transmembrane domain (TM) of human IL2Ralpha (TMI). Theoretical analysis of TMF and TMI using TM prediction methods showed that TMI possessed a strong orientation tendency to form an "outside to inside" transmembrane mode from N-terminal to C-terminal, while TMF was prone to form an "inside to outside" mode. Based on these analysis results, the TM of Her2 (TMH) was studied and showed a transmembrane mode similar to that of TMI, which implied that TMH might be a novel TM for obtaining the surface display of FCR. Then, DNA sequences encoding TMH and TMF were fused to the 3'-end of the FCR gene, respectively. Fluorescence microscopy indicated that FCR_TMH seemed to be located mainly on the cell surface, while FCR_TMF appeared in endochylema. Flow cytometry analysis and Western blot also showed that the surface expression of FCR was enhanced significantly by TMH, while FCR_TMF could not be surface displayed in 293T cells. The experimental results were consistent with the theoretical predictions and demonstrated that the orientation tendency of the TM may be very important in the subcellular location of proteins.
7
Huang RB, Du QS, Wei YT, Pang ZW, Wei H, Chou KC. Physics and chemistry-driven artificial neural network for predicting bioactivity of peptides and proteins and their design. J Theor Biol 2008; 256:428-35. [PMID: 18835398 DOI: 10.1016/j.jtbi.2008.08.028] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 06/02/2008] [Revised: 08/25/2008] [Accepted: 08/25/2008] [Indexed: 10/21/2022]
Abstract
Predicting the bioactivity of peptides and proteins is an important challenge in drug development and protein engineering. In this study we introduce a novel approach, the so-called "physics and chemistry-driven artificial neural network (Phys-Chem ANN)", to deal with such a problem. Unlike the existing ANN approaches, which were designed under the inspiration of biological neural system, the Phys-Chem ANN approach is based on the physical and chemical principles, as well as the structural features of proteins. In the Phys-Chem ANN model the "hidden layers" are no longer virtual "neurons", but real structural units of proteins and peptides. It is a hybridization approach, which combines the linear free energy concept of quantitative structure-activity relationship (QSAR) with the advanced mathematical technique of ANN. The Phys-Chem ANN approach has adopted an iterative and feedback procedure, incorporating both machine-learning and artificial intelligence capabilities. In addition to making more accurate predictions for the bioactivities of proteins and peptides than is possible with the traditional QSAR approach, the Phys-Chem ANN approach can also provide more insights about the relationship between bioactivities and the structures involved than the ANN approach does. As an example of the application of the Phys-Chem ANN approach, a predictive model for the conformational stability of human lysozyme is presented.
Affiliation(s)
- Ri-Bo Huang
  - Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530004, China
8
Yoo PD, Ho YS, Zhou BB, Zomaya AY. SiteSeek: post-translational modification analysis using adaptive locality-effective kernel methods and new profiles. BMC Bioinformatics 2008; 9:272. [PMID: 18541042 PMCID: PMC2442102 DOI: 10.1186/1471-2105-9-272] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Received: 12/05/2007] [Accepted: 06/10/2008] [Indexed: 11/10/2022] [Open Access]
Abstract
Background: Post-translational modifications have a substantial influence on the structure and function of proteins. Post-translational phosphorylation is one of the most common modifications that occur in intracellular proteins. Accurate prediction of protein phosphorylation sites is of great importance for the understanding of diverse cellular signalling processes in both the human body and in animals. In this study, we propose a new machine-learning-based protein phosphorylation site predictor, SiteSeek. SiteSeek is trained using a novel compact evolutionary and hydrophobicity profile to detect possible protein phosphorylation sites for a target sequence. The newly proposed method proves to be more accurate and exhibits a much more stable predictive performance than currently existing phosphorylation site predictors. Results: The performance of the proposed model was compared with that of nine existing machine learning models and four widely known phosphorylation site predictors on the newly proposed PS-Benchmark_1 dataset to contrast their accuracy, sensitivity, specificity and correlation coefficient. SiteSeek showed better predictive performance, with 86.6% accuracy, 83.8% sensitivity, 92.5% specificity and a correlation coefficient of 0.77 on the four main kinase families (CDK, CK2, PKA, and PKC). Conclusion: The newly proposed methods used in SiteSeek were shown to be useful for the identification of protein phosphorylation sites, as SiteSeek performed much better than widely known predictors on the newly built PS-Benchmark_1 dataset.
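The four figures of merit quoted in this abstract can be computed from a binary confusion matrix, as in the sketch below. It assumes that "correlation coefficient" means the Matthews correlation coefficient, which the abstract does not state; the function name `binary_metrics` and the convention of returning 0.0 for an undefined ratio are also choices made for this example.

```python
import math


def _ratio(numerator, denominator):
    """Safe division; returns 0.0 when the denominator is zero (a convention chosen here)."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation from confusion counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),   # true-positive rate (recall)
        "specificity": _ratio(tn, tn + fp),   # true-negative rate
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Hypothetical counts, not data from the SiteSeek study.
    print(binary_metrics(tp=40, fp=10, tn=40, fn=10))
```

Reporting the correlation coefficient alongside accuracy is useful because phosphorylation-site datasets are often imbalanced, and accuracy alone can look high for a predictor that rarely calls a site positive.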
Affiliation(s)
- Paul D Yoo
  - Advanced Networks Research Group, School of Information Technologies (J12), The University of Sydney, Sydney, NSW 2006, Australia.
9
Yang JY, Yang MQ, Dunker AK, Deng Y, Huang X. Investigation of transmembrane proteins using a computational approach. BMC Genomics 2008; 9 Suppl 1:S7. [PMID: 18366620 PMCID: PMC2386072 DOI: 10.1186/1471-2164-9-s1-s7] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Indexed: 11/24/2022] [Open Access]
Abstract
BACKGROUND An important subfamily of membrane proteins are the transmembrane alpha-helical proteins, in which the membrane-spanning regions are made up of alpha-helices. Given the obvious biological and medical significance of these proteins, it is of tremendous practical importance to identify the location of transmembrane segments. The difficulty of inferring the secondary or tertiary structure of transmembrane proteins using experimental techniques has led to a surge of interest in applying techniques from machine learning and bioinformatics to infer secondary structure from primary structure in these proteins. We are therefore interested in determining which physicochemical properties are most useful for discriminating transmembrane segments from non-transmembrane segments in transmembrane proteins, and for discriminating intrinsically unstructured segments from intrinsically structured segments in transmembrane proteins, and in using the results of these investigations to develop classifiers to identify transmembrane segments in transmembrane proteins. RESULTS We determined that the most useful properties for discriminating transmembrane segments from non-transmembrane segments and for discriminating intrinsically unstructured segments from intrinsically structured segments in transmembrane proteins were hydropathy, polarity, and flexibility, and used the results of this analysis to construct classifiers to discriminate transmembrane segments from non-transmembrane segments using four classification techniques: two variants of the Self-Organizing Global Ranking algorithm, a decision tree algorithm, and a support vector machine algorithm. All four techniques exhibited good performance, with out-of-sample accuracies of approximately 75%. 
CONCLUSIONS Several interesting observations emerged from our study: intrinsically unstructured segments and transmembrane segments tend to have opposite properties; transmembrane proteins appear to be much richer in intrinsically unstructured segments than other proteins; and, in approximately 70% of transmembrane proteins that contain intrinsically unstructured segments, the intrinsically unstructured segments are close to transmembrane segments.
Affiliation(s)
- Jack Y Yang
  - Department of Radiology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Mary Qu Yang
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- A Keith Dunker
  - Center for Computational Biology and Bioinformatics, Indiana University Schools of Medicine and Informatics, 410 W. 10th Street, Indianapolis, IN 46202, USA
- Youping Deng
  - Department of Biological Sciences, University of Southern Mississippi, Hattiesburg, 39406, USA
- Xudong Huang
  - Department of Radiology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
10
Plewczynski D, Tkacz A, Wyrwicz LS, Rychlewski L, Ginalski K. AutoMotif Server for prediction of phosphorylation sites in proteins using support vector machine: 2007 update. J Mol Model 2007; 14:69-76. [PMID: 17994256 DOI: 10.1007/s00894-007-0250-3] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Received: 07/19/2007] [Accepted: 10/12/2007] [Indexed: 10/22/2022]
Abstract
We present here the recent update of AutoMotif Server (AMS 2.0), which predicts post-translational modification sites in protein sequences. The support vector machine (SVM) algorithm was trained on data gathered in 2007 from various sets of proteins containing experimentally verified chemical modifications of proteins. Short sequence segments around a modification site were dissected from a parent protein and represented in the training set as binary or profile vectors. The updated efficiency of the SVM classification for each type of modification and the predictive power of both representations were estimated using leave-one-out tests for a model of general phosphorylation and for modifications catalyzed by several specific protein kinases. The accuracy of the method was improved in comparison to the previous version of the service (Plewczynski et al., "AutoMotif server: prediction of single residue post-translational modifications in proteins", Bioinformatics 21: 2525-7, 2005). The precision of the updated version reached over 90% for selected types of phosphorylation and was optimized at the cost of a lower recall value of the classification model. The AutoMotif Server version 2007 is freely available at http://ams2.bioinfo.pl/. Additionally, the reference dataset for optimization of prediction of phosphorylation sites, collected from UniProtKB, is also provided and can be accessed at http://ams2.bioinfo.pl/data/.
Affiliation(s)
- Dariusz Plewczynski
  - Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Pawinskiego 5a, 02-106, Warsaw, Poland.
11
A Strategy for the Identification of Canonical and Non-canonical MHC I-binding Epitopes Using an ANN-based Epitope Prediction Algorithm. QSAR Comb Sci 2006. [DOI: 10.1002/qsar.200510154] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/07/2022]
12
Plewczynski D, Tkacz A, Wyrwicz LS, Godzik A, Kloczkowski A, Rychlewski L. Support-vector-machine classification of linear functional motifs in proteins. J Mol Model 2005; 12:453-61. [PMID: 16341901 DOI: 10.1007/s00894-005-0070-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 02/03/2005] [Accepted: 10/18/2005] [Indexed: 10/25/2022]
Abstract
Our algorithm predicts short linear functional motifs in proteins using only sequence information. Statistical models for short linear functional motifs in proteins are built using the database of short sequence fragments taken from proteins in the current release of the Swiss-Prot database. Those segments are confirmed by experiments to have single-residue post-translational modification. The sensitivities of the classification for various types of short linear motifs are in the range of 70%. The query protein sequence is dissected into short overlapping fragments. All segments are represented as vectors. Each vector is then classified by a machine learning algorithm (Support Vector Machine) as potentially modifiable or not. The resulting list of plausible post-translational sites in the query protein is returned to the user. We also present a study of the human protein kinase C family as a biological application of our method.
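The step in which the query protein is dissected into short overlapping fragments can be sketched as a sliding window. The fragment length of 9 and step of 1 below are assumptions for illustration (the abstract gives neither), and `overlapping_fragments` is a hypothetical helper name; each fragment would then be vectorized and passed to the classifier.

```python
def overlapping_fragments(sequence, length=9, step=1):
    """Return (start_index, fragment) pairs covering `sequence` with a sliding window.

    `length` and `step` are illustrative defaults, not values from the paper.
    Sequences shorter than `length` yield an empty list.
    """
    last_start = len(sequence) - length
    return [
        (start, sequence[start:start + length])
        for start in range(0, last_start + 1, step)
    ]


if __name__ == "__main__":
    # Hypothetical query sequence used only to show the output shape.
    for start, fragment in overlapping_fragments("MKRSPTLQAYDE", length=9):
        print(start, fragment)
```

With a step of 1, consecutive fragments share all but one residue, so every position that is far enough from the termini appears at the centre of exactly one fragment.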
13
Milac AL, Avram S, Petrescu AJ. Evaluation of a neural networks QSAR method based on ligand representation using substituent descriptors. Application to HIV-1 protease inhibitors. J Mol Graph Model 2005; 25:37-45. [PMID: 16325439 DOI: 10.1016/j.jmgm.2005.09.014] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 04/01/2005] [Revised: 06/17/2005] [Accepted: 09/29/2005] [Indexed: 11/18/2022]
Abstract
We present here a neural networks method designed to predict biological activity based on a local representation of the ligand. The compounds of the series are represented by a vector mapping for each of four substituent properties: volume, log P, dipole moment and a simple 'steric' parameter relating to its shape. This ligand representation was tested using neural networks on a set of 42 cyclic-urea derivatives, inhibiting HIV-1 protease. The leave-one-out cross-validation using all descriptors in the input gave a correlation factor between prediction and experiment of 0.76 for the overall set and 0.88 when three outliers were left out. To rank the significance of the four descriptors, we further tested all combinations of two and three parameters for each substituent, using two disjunctive testing sets of five inhibitors. In these sets, vectors with extreme descriptor values were used either in the training or the testing set (sets A and B, respectively). The method is a very good interpolator (set A, 95±2% accuracy) but a less effective extrapolator (set B, 85±2% accuracy). Generally, the combinations including the 'steric' parameter predict better than average, while those containing the volume are less effective. The best prediction, 98.8±1.2%, was obtained when log P, the dipole and the steric parameter were used on set A. At the opposite end, the lowest ranked descriptor set was obtained when replacing log P with the volume, giving 92.3±6.7% accuracy over the set A.
Affiliation(s)
- Adina-Luminiţa Milac
  - Institute of Biochemistry, Splaiul Independenţei 296, Sector 6, Bucharest, Romania
14
Beiko RG, Charlebois RL. GANN: genetic algorithm neural networks for the detection of conserved combinations of features in DNA. BMC Bioinformatics 2005; 6:36. [PMID: 15725347 PMCID: PMC553964 DOI: 10.1186/1471-2105-6-36] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Received: 08/05/2004] [Accepted: 02/22/2005] [Indexed: 11/16/2022] [Open Access]
Abstract
Background The multitude of motif detection algorithms developed to date have largely focused on the detection of patterns in primary sequence. Since sequence-dependent DNA structure and flexibility may also play a role in protein-DNA interactions, the simultaneous exploration of sequence- and structure-based hypotheses about the composition of binding sites and the ordering of features in a regulatory region should be considered as well. The consideration of structural features requires the development of new detection tools that can deal with data types other than primary sequence. Results GANN (available at ) is a machine learning tool for the detection of conserved features in DNA. The software suite contains programs to extract different regions of genomic DNA from flat files and convert these sequences to indices that reflect sequence and structural composition or the presence of specific protein binding sites. The machine learning component allows the classification of different types of sequences based on subsamples of these indices, and can identify the best combinations of indices and machine learning architecture for sequence discrimination. Another key feature of GANN is the replicated splitting of data into training and test sets, and the implementation of negative controls. In validation experiments, GANN successfully merged important sequence and structural features to yield good predictive models for synthetic and real regulatory regions. Conclusion GANN is a flexible tool that can search through large sets of sequence and structural feature combinations to identify those that best characterize a set of sequences.
Affiliation(s)
- Robert G Beiko
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane 4072, Australia
  - Department of Biology, University of Ottawa, Ottawa, ON, K1N 6N5, Canada
- Robert L Charlebois
  - Genome Atlantic, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS, B3H 1X5, Canada
15
Qiu J, Liang R, Zou X, Mo J. Prediction of Transmembrane Proteins Based on the Continuous Wavelet Transform. J Chem Inf Comput Sci 2004; 44:741-7. [PMID: 15032556 DOI: 10.1021/ci0303868] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Abstract
A novel method based on the continuous wavelet transform (CWT) for predicting the number and location of helices in membrane proteins is presented. Two bacterial proteins are chosen as examples to describe the prediction of transmembrane helices (HTM) using this method. The selection of an appropriate dilation and of hydrophobicity data types is discussed in the text. The results indicate that CWT is a promising approach for the prediction of HTM.
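To make the idea concrete, the sketch below applies a single-dilation wavelet transform to a hydrophobicity profile. It is only an illustration under stated assumptions: the abstract names neither the hydrophobicity scale nor the mother wavelet, so the Kyte-Doolittle scale and an unnormalized Ricker ("Mexican hat") wavelet are substituted here, and `hydropathy_cwt` and the default dilation of 5 residues are hypothetical choices.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed choice of hydrophobicity data).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def ricker(t, scale):
    """Ricker wavelet at offset t for a given dilation (constant prefactor omitted)."""
    x = t / scale
    return (1.0 - x * x) * math.exp(-x * x / 2.0)


def hydropathy_cwt(sequence, scale=5.0):
    """Wavelet coefficients of the hydropathy profile at one dilation."""
    profile = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in sequence.upper()]
    half = int(4 * scale)  # truncate the wavelet where it is close to zero
    kernel = [ricker(t, scale) / math.sqrt(scale) for t in range(-half, half + 1)]
    coefficients = []
    for i in range(len(profile)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(profile):  # positions beyond the termini contribute zero
                total += profile[j] * weight
        coefficients.append(total)
    return coefficients


if __name__ == "__main__":
    # Synthetic sequence: a 20-residue hydrophobic stretch flanked by lysines.
    coeffs = hydropathy_cwt("K" * 20 + "L" * 20 + "K" * 20)
    print(max(range(len(coeffs)), key=coeffs.__getitem__))
```

Large positive coefficients mark hydrophobic stretches whose width is comparable to the dilation, which is why the choice of dilation matters for locating helices of typical membrane-spanning length.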
Affiliation(s)
- Jianding Qiu
  - School of Chemistry and Chemical Engineering, Zhongshan University, Guangzhou 510275, People's Republic of China
16
Abstract
In this article, a membrane-propensity scale for amino acids is derived using only two ingredients: (i) a set of transmembrane helix segments from membrane protein crystal structures and (ii) the requirement that each component of the set has a free energy lower than that of a typical soluble protein sequence of the same length. Although the most widely used hydropathy scales satisfy this requirement, we use an optimization procedure that allows for extraction of an optimal scale, which correlates equally well with those scales. We show that, if the choice of the sequence database is accurate, significant knowledge-based scales, which are robust with respect to changes in the learning set, can be easily derived. The obtained scales can be used for transmembrane helix prediction. The predictive power of one of these scales is tested on membrane protein, soluble protein, and signal peptide databases, finding that its performance is comparable with that of the hydropathy scales.
Affiliation(s)
- Marco Punta
  - International School for Advanced Studies (SISSA), and Istituto Nazionale di Fisica della Materia, Via Beirut 2-4, 34014 Trieste, Italy
17
Smith AE, Nugent CD, McClean SI. Evaluation of inherent performance of intelligent medical decision support systems: utilising neural networks as an example. Artif Intell Med 2003; 27:1-27. [PMID: 12473389 DOI: 10.1016/s0933-3657(02)00088-x] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.9] [Indexed: 11/24/2022]
Abstract
Researchers who design intelligent systems for medical decision support are aware of the need to respond to real clinical issues, in particular the need to address the specific ethical problems that the medical domain has in using black boxes. This means such intelligent systems have to be thoroughly evaluated for acceptability. Attempts at compliance, however, are hampered by a lack of guidelines. This paper addresses the issue of inherent performance evaluation, which researchers have addressed in part; however, a Medline search, using neural networks as an example of intelligent systems, indicated that only about 12.5% evaluated inherent performance adequately. This paper aims to address this issue by concentrating on the possible evaluation methodology, giving a framework and specific suggestions for each type of classification problem. This should allow the developers of intelligent systems to produce evidence of a sufficiency of output performance evaluation.
Affiliation(s)
- A E Smith
  - Medical Informatics, Faculty of Informatics, University of Ulster, Jordanstown, Newtownabbey, BT37 0QB, Northern Ireland, Antrim, UK.
18
Smith AE, Nugent CD, McClean SI. Implementation of intelligent decision support systems in health care. Journal of Management in Medicine 2002; 16:206-18. [PMID: 12211346 DOI: 10.1108/02689230210434943] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
Abstract
The full implementation of any intelligent system in health care, which is designed for decision support, has several stages, from initial problem identification through development and, finally, cost-benefit analysis. Central to this is formal objectivist evaluation with its core component of inherent performance of the outputs from these systems. A Medline survey of one type of intelligent system is presented, which demonstrates that this issue is not being addressed adequately. Lack of criteria for dealing with the outputs from these "black box" systems to prescribe adequate levels of inherent performance may be preventing their being accepted by those in the health-care domain and, thus, their being applied widely in the field.
|
19
|
Simon I, Fiser A, Tusnády GE. Predicting protein conformation by statistical methods. Biochimica et Biophysica Acta 2001; 1549:123-36. [PMID: 11690649 DOI: 10.1016/s0167-4838(01)00253-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 10/18/2022]
Abstract
The unique folded structure makes a polypeptide a functional protein. The number of known sequences is about a hundred times larger than the number of known structures and the gap is increasing rapidly. The primary goal of all structure prediction methods is to obtain structure-related information on proteins, whose structures have not been determined experimentally. Besides this goal, the development of accurate prediction methods helps to reveal principles of protein folding. Here we present a brief survey of protein structure predictions based on statistical analyses of known sequence and structure data. We discuss the background of these methods and attempt to elucidate principles, which govern structure formation of soluble and membrane proteins.
Affiliation(s)
- I Simon
- Institute of Enzymology, BRC, Hungarian Academy of Sciences, Budapest, Hungary.
|
20
|
Kövesdi I, Dominguez-Rodriguez MF, Orfi L, Náray-Szabó G, Varró A, Papp JG, Mátyus P. Application of neural networks in structure-activity relationships. Med Res Rev 1999; 19:249-69. [PMID: 10232652 DOI: 10.1002/(sici)1098-1128(199905)19:3<249::aid-med4>3.0.co;2-0] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.2] [Indexed: 11/12/2022]
Abstract
Methodology and application of artificial neural networks in structure-activity relationships are reviewed focusing on the most frequently used three-layer feedforward back-propagation procedure. Two applications of neural networks are presented and a comparison of the performance with those of CoMFA and a classical QSAR analysis is also discussed.
Affiliation(s)
- I Kövesdi
- EGIS Pharmaceuticals Ltd., Budapest, Hungary
|
21
|
Schneider G, Wrede P. Artificial neural networks for computer-based molecular design. Progress in Biophysics and Molecular Biology 1998; 70:175-222. [PMID: 9830312 DOI: 10.1016/s0079-6107(98)00026-1] [Citation(s) in RCA: 135] [Impact Index Per Article: 5.2] [Indexed: 12/01/2022]
Abstract
The theory of artificial neural networks is briefly reviewed focusing on supervised and unsupervised techniques which have great impact on current chemical applications. An introduction to molecular descriptors and representation schemes is given. In addition, worked examples of recent advances in this field are highlighted and pioneering publications are discussed. Applications of several types of artificial neural networks to compound classification, modelling of structure-activity relationships, biological target identification, and feature extraction from biopolymers are presented and compared to other techniques. Advantages and limitations of neural networks for computer-aided molecular design and sequence analysis are discussed.
Affiliation(s)
- G Schneider
- F. Hoffmann-La Roche Ltd., Pharmaceuticals Division, Basel, Switzerland.
|
22
|
Tusnády GE, Simon I. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol 1998; 283:489-506. [PMID: 9769220 DOI: 10.1006/jmbi.1998.2107] [Citation(s) in RCA: 806] [Impact Index Per Article: 31.0] [Indexed: 12/22/2022]
Abstract
A new method is suggested here for topology prediction of helical transmembrane proteins. The method is based on the hypothesis that the localizations of the transmembrane segments and the topology are determined by the difference in the amino acid distributions in various structural parts of these proteins rather than by specific amino acid compositions of these parts. A hidden Markov model with special architecture was developed to search transmembrane topology corresponding to the maximum likelihood among all the possible topologies of a given protein. The prediction accuracy was tested on 158 proteins and was found to be higher than that found using prediction methods already available. The method successfully predicted all the transmembrane segments in 143 proteins out of the 158, and for 135 of these proteins both the membrane spanning regions and the topologies were predicted correctly. The observed level of accuracy is a strong argument in favor of our hypothesis.
Affiliation(s)
- G E Tusnády
- Institute of Enzymology. Biological Research Center, Hungarian Academy of Sciences, H-1518 Budapest, Hungary
|
23
|
Milik M, Sauer D, Brunmark AP, Yuan L, Vitiello A, Jackson MR, Peterson PA, Skolnick J, Glass CA. Application of an artificial neural network to predict specific class I MHC binding peptide sequences. Nat Biotechnol 1998; 16:753-6. [PMID: 9702774 DOI: 10.1038/nbt0898-753] [Citation(s) in RCA: 59] [Impact Index Per Article: 2.3] [Indexed: 02/08/2023]
Abstract
Computational methods were used to predict the sequences of peptides that bind to the MHC class I molecule, K(b). The rules for predicting binding sequences, which are limited, are based on preferences for certain amino acids in certain positions of the peptide. It is apparent, though, that binding can be influenced by the amino acids in all of the positions of the peptide. An artificial neural network (ANN) has the ability to simultaneously analyze the influence of all of the amino acids of the peptide and thus may improve binding predictions. ANNs were compared to statistically analyzed peptides for their abilities to predict the sequences of K(b) binding peptides. ANN systems were trained on a library of binding and nonbinding peptide sequences from a phage display library. Statistical and ANN methods identified strong binding peptides with preferred amino acids. ANNs detected more subtle binding preferences, enabling them to predict medium binding peptides. The ability to predict class I MHC molecule binding peptides is useful for immunological therapies involving cytotoxic T cells.
Affiliation(s)
- M Milik
- R.W. Johnson Pharmaceutical Research Institute, San Diego, CA 92121, USA
|
24
|
Abstract
Artificial neural networks provide a unique computing architecture whose potential has attracted interest from researchers across different disciplines. As a technique for computational analysis, neural network technology is very well suited for the analysis of molecular sequence data. It has been applied successfully to a variety of problems, ranging from gene identification, to protein structure prediction and sequence classification. This article provides an overview of major neural network paradigms, discusses design issues, and reviews current applications in DNA/RNA and protein sequence analysis.
Affiliation(s)
- C H Wu
- Department of Epidemiology/Biomathematics, University of Texas Health Center at Tyler 75710, USA.
|
25
|
Juretić D, Lučić B, Zucić D, Trinajstić N. Protein transmembrane structure: recognition and prediction by using hydrophobicity scales through preference functions. Theoretical and Computational Chemistry 1998. [DOI: 10.1016/s1380-7323(98)80015-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.7] [Indexed: 01/15/2023]
|
26
|
Abstract
In recent years, much effort has been put into the development of new methodologies and algorithms for the prediction of protein secondary and tertiary structures from (sequence) data; this is reviewed in detail. New approaches for these predictions such as neural network methods, genetic algorithms, machine learning, and graph theoretical methods are discussed. Secondary structure prediction algorithms were improved mostly by considering families of related proteins; however, for the reliable tertiary structure modeling of proteins, knowledge-based techniques are still preferred. Methods and examples with more or less successful results are described. Also, programs and parameterizations for energy minimisations, molecular dynamics, and electrostatic interactions have been improved, especially with respect to their former limits of applicability. Other topics discussed in this review include the use of traditional and on-line databases, the docking problem and surface properties of biomolecules, packing of protein cores, de novo design and protein engineering, prediction of membrane protein structures, the verification and reliability of model structures, and progress made with currently available software and computer hardware. In summary, the prediction of the structure, function, and other properties of a protein is still possible only within limits, but these limits continue to be moved.
Affiliation(s)
- G Böhm
- Institut für Biotechnologie, Martin-Luther-Universität Halle-Wittenberg, Germany
|
27
|
Lohmann R, Schneider G, Wrede P. Structure optimization of an artificial neural filter detecting membrane-spanning amino acid sequences. Biopolymers 1996. [DOI: 10.1002/(sici)1097-0282(199601)38:1<13::aid-bip2>3.0.co;2-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Indexed: 11/06/2022]
|
28
|
Lohmann R, Schneider G, Wrede P. Structure optimization of an artificial neural filter detecting membrane-spanning amino acid sequences. Biopolymers 1996; 38:13-29. [PMID: 8679941 DOI: 10.1002/(sici)1097-0282(199601)38:1<13::aid-bip2>3.0.co;2-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 02/01/2023]
Abstract
An artificial neural network has been developed for the recognition and prediction of transmembrane regions in the amino acid sequences of human integral membrane proteins. It provides an additional prediction method besides the common hydrophobicity analysis by statistical means. Membrane/nonmembrane transition regions are predicted with 92% accuracy in both training and independent test data. The method used for the development of the neural filter is the algorithm of structure evolution. It subjects both the architecture and parameters of the system to a systematic optimization process and carries out local search in the respective structure and parameter spaces. The training technique of incomplete induction as part of the structure evolution provides for a comparatively general solution of the problem that is described by input-output relations only. Seven physicochemical side-chain properties were used to encode the amino acid sequences. It was found that geometric parameters like side-chain volume, bulkiness, or surface area are of minor importance. The properties polarity, refractivity, and hydrophobicity, however, turned out to support feature extraction. It is concluded that membrane transition regions in proteins are encoded in sequences as a characteristic feature based on the respective side-chain properties. The method of structure evolution is described in detail for this particular application and suggestions for further development of amino acid sequence filters are made.
Affiliation(s)
- R Lohmann
- Gesellschaft zur Förderung angewandter Informatik (GFal), Berlin, Germany
|
29
|
Schneider G, Schuchhardt J, Wrede P. Development of simple fitness landscapes for peptides by artificial neural filter systems. Biological Cybernetics 1995; 73:245-254. [PMID: 7548312 DOI: 10.1007/bf00201426] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.5] [Indexed: 05/21/2023]
Abstract
The applicability of artificial neural filter systems as fitness functions for sequence-oriented peptide design was evaluated. Two example applications were selected: classification of dipeptides according to their hydrophobicity and classification of proteolytic cleavage-sites of protein precursor sequences according to their mean hydrophobicities and mean side-chain volumes. The cleavage-sites covered 12 residues. In the dipeptide experiments the objective was to separate a selected set of molecules from all other possible dipeptide sequences. Perceptrons, feedforward networks with one hidden layer, and a hybrid network were applied. The filters were trained by a (1, lambda) evolution strategy. Two types of network units employing either a sigmoidal or a unimodal transfer function were used in the feedforward filters, and their influence on classification was investigated. The two-layer hybrid network employed gaussian activation functions. To analyze classification of the different filter systems, their output was plotted in the two-dimensional sequence space. The diagrams were interpreted as fitness landscapes qualifying the markedness of a characteristic peptide feature which can be used as a guide through sequence space for rational peptide design. It is demonstrated that the applicability of neural filter systems as a heuristic method for sequence optimization depends on both the appropriate network architecture and selection of representative sequence data. The networks with unimodal activation functions and the hybrid networks both led to a number of local optima. However, the hybrid networks produced the best prediction results. In contrast, the filters with sigmoidal activation produced good reclassification results leading to fitness landscapes lacking unreasonable local optima. Similar results were obtained for classification of both dipeptides and cleavage-site sequences.
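The (1, lambda) evolution strategy mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed mutation step `sigma`, the offspring count `lam`, the function name `one_comma_lambda_es`, and the toy `sphere` objective (standing in for a filter's classification error) are all assumptions made for illustration.

```python
import random


def one_comma_lambda_es(fitness, parent, lam=10, sigma=0.1, generations=200, seed=0):
    """Minimise `fitness` with a basic (1, lambda) evolution strategy.

    Assumption: a fixed Gaussian mutation step `sigma` is used; the
    abstract does not say how step sizes were handled.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    for _ in range(generations):
        # Create `lam` mutated copies of the single parent.
        offspring = [
            [w + rng.gauss(0.0, sigma) for w in parent]
            for _ in range(lam)
        ]
        # Comma selection: the parent is discarded and the best
        # offspring becomes the next parent, even if it is worse.
        parent = min(offspring, key=fitness)
    return parent


def sphere(weights):
    """Hypothetical toy objective: sum of squared weights."""
    return sum(w * w for w in weights)


best = one_comma_lambda_es(sphere, [1.0, -1.0])
```

In a setting like the one described, `fitness` would presumably score a network's weight vector on the training data rather than a toy function.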
Affiliation(s)
- G Schneider
- Freie Universität Berlin, Universitätsklinikum Benjamin Franklin, Institut für Medizinische/Technische Physik und Lasermedizin (WE 19), Germany
|
30
|
Schneider G, Schuchhardt J, Wrede P. Peptide design in machina: development of artificial mitochondrial protein precursor cleavage sites by simulated molecular evolution. Biophys J 1995; 68:434-47. [PMID: 7696497 PMCID: PMC1281708 DOI: 10.1016/s0006-3495(95)80205-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 0.9] [Indexed: 01/26/2023]
Abstract
Artificial neural networks were used for extraction of characteristic physicochemical features from mitochondrial matrix metalloprotease target sequences. The amino acid properties hydrophobicity and volume were used for sequence encoding. A window of 12 residues was employed, encompassing positions -7 to +5 of precursors with cleavage sites. Two sets of noncleavage site examples were selected for network training which was performed by an evolution strategy. The weight vectors of the optimized networks were visualized and interpreted by Hinton diagrams. A neural filter system consisting of 13 perceptron-type networks accurately classified the data. It served as the fitness function in a simulated molecular evolution procedure for sequence-oriented de novo design of idealized cleavage sites. A detailed description of the strategy is given. Several putative high-quality cleavage sites were obtained revealing the critical nature of the residues in the positions -2 and -5. Charged residues seem to have a major influence on cleavage site function.
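The window encoding described in this abstract (12 residues spanning positions -7 to +5, each residue replaced by property values) might look roughly like the sketch below. It rests on several assumptions: the abstract does not name the hydrophobicity scale, so the Kyte-Doolittle scale is used as a stand-in; `site` is taken to be the 0-based index of the +1 residue; and the names `encode_window` and `KD_HYDROPHOBICITY` are hypothetical. A side-chain volume table could be passed as a second entry in `tables` in the same way.

```python
# Kyte-Doolittle hydropathy values (assumed stand-in scale; the
# abstract does not state which hydrophobicity scale was used).
KD_HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_window(seq, site, tables, upstream=7, downstream=5):
    """Encode the residues around a cleavage site as a flat feature list.

    `site` is the 0-based index of the +1 residue (an assumption), so the
    window covers positions -7..-1 and +1..+5, i.e. 12 residues.
    For each residue, one value per property table is emitted.
    """
    start, end = site - upstream, site + downstream
    if start < 0 or end > len(seq):
        raise ValueError("window extends past the sequence ends")
    window = seq[start:end]
    return [table[res] for res in window for table in tables]


# Made-up sequence for illustration only: 7 alanines, then 5 leucines.
features = encode_window("AAAAAAALLLLL", 7, [KD_HYDROPHOBICITY])
```

With two property tables the vector would hold 24 values, interleaved per residue, which matches the idea of feeding a fixed-length input to a perceptron-type network.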
Affiliation(s)
- G Schneider
- Freie Universität Berlin, Institut für Medizinische/Technische Physik und Lasermedizin, AG Molekulare Bioinformatik, Germany
|