1
Tamposis IA, Tsirigos KD, Theodoropoulou MC, Kontou PI, Bagos PG. Semi-supervised learning of Hidden Markov Models for biological sequence analysis. Bioinformatics 2020; 35:2208-2215. [PMID: 30445435 DOI: 10.1093/bioinformatics/bty910] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 06/04/2018] [Revised: 10/29/2018] [Accepted: 11/09/2018] [Indexed: 11/14/2022]
Abstract
MOTIVATION Hidden Markov Models (HMMs) are probabilistic models widely used in applications in computational sequence analysis. HMMs are basically unsupervised models. However, in the most important applications, they are trained in a supervised manner. Training examples accompanied by labels corresponding to different classes are given as input and the set of parameters that maximize the joint probability of sequences and labels is estimated. A main problem with this approach is that, in the majority of the cases, labels are hard to find and thus the amount of training data is limited. On the other hand, there are plenty of unclassified (unlabeled) sequences deposited in the public databases that could potentially contribute to the training procedure. This approach is called semi-supervised learning and could be very helpful in many applications.
RESULTS We propose here a method for semi-supervised learning of HMMs that can incorporate labeled, unlabeled and partially labeled data in a straightforward manner. The algorithm is based on a variant of the Expectation-Maximization (EM) algorithm, where the missing labels of the unlabeled or partially labeled data are considered as the missing data. We apply the algorithm to several biological problems, namely the prediction of transmembrane protein topology for alpha-helical and beta-barrel membrane proteins and the prediction of archaeal signal peptides. The results are very promising, since the algorithms presented here can significantly improve the prediction performance of even the top-scoring classifiers.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
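The idea in the abstract above, treating missing labels as missing data inside EM, can be illustrated with a generic label-constrained Baum-Welch step. This is only a minimal sketch and not the authors' implementation: it assumes each hidden state carries exactly one class label, and the names `constrained_forward_backward`, `em_step`, `normalise`, `state_label` and the toy data are all hypothetical. A labelled position admits only states with a matching label, while an unlabelled position (`None`) admits every state.

```python
import math


def constrained_forward_backward(obs, labels, pi, A, B, state_label):
    """Scaled forward-backward where a labelled position only admits states
    whose label matches; unlabelled positions (None) admit all states.
    Returns (gamma, xi_sum, log_likelihood)."""
    T, K = len(obs), len(pi)
    ok = [[1.0 if labels[t] is None or state_label[k] == labels[t] else 0.0
           for k in range(K)] for t in range(T)]
    # e[t][k]: emission probability, zeroed for label-incompatible states
    e = [[B[k][obs[t]] * ok[t][k] for k in range(K)] for t in range(T)]

    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[k] * e[0][k] for k in range(K)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][k] for i in range(K)) * e[t][k]
                   for k in range(K)]
        c = sum(row)
        scale.append(c)
        alpha.append([x / c for x in row])

    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * e[t + 1][j] * beta[t + 1][j] for j in range(K))
                   / scale[t + 1] for i in range(K)]

    # Posterior state probabilities and summed expected transition counts
    gamma = [[alpha[t][k] * beta[t][k] for k in range(K)] for t in range(T)]
    xi_sum = [[sum(alpha[t][i] * A[i][j] * e[t + 1][j] * beta[t + 1][j]
                   / scale[t + 1] for t in range(T - 1))
               for j in range(K)] for i in range(K)]
    return gamma, xi_sum, sum(math.log(c) for c in scale)


def normalise(row, eps=1e-12):
    """Turn expected counts into probabilities; eps avoids exact zeros."""
    total = sum(row) + eps * len(row)
    return [(x + eps) / total for x in row]


def em_step(data, pi, A, B, state_label):
    """One EM iteration over labelled, partially labelled and unlabelled data.
    Returns updated (pi, A, B) and the log-likelihood under the OLD parameters."""
    K, M = len(pi), len(B[0])
    pi_n = [0.0] * K
    A_n = [[0.0] * K for _ in range(K)]
    B_n = [[0.0] * M for _ in range(K)]
    loglik = 0.0
    for obs, labels in data:
        gamma, xi_sum, ll = constrained_forward_backward(
            obs, labels, pi, A, B, state_label)
        loglik += ll
        for k in range(K):
            pi_n[k] += gamma[0][k]
            for j in range(K):
                A_n[k][j] += xi_sum[k][j]
            for t, o in enumerate(obs):
                B_n[k][o] += gamma[t][k]
    return (normalise(pi_n), [normalise(r) for r in A_n],
            [normalise(r) for r in B_n], loglik)


# Toy data (hypothetical): symbols 0/1, labels 0/1, None marks a missing label.
state_label = [0, 1]
data = [
    ([0, 0, 1, 1, 1], [0, 0, 1, 1, 1]),           # fully labelled
    ([0, 1, 1, 0, 0], [None, 1, None, None, 0]),  # partially labelled
    ([1, 1, 0, 0, 1], [None] * 5),                # unlabelled
]
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.6, 0.4], [0.3, 0.7]]
history = []
for _ in range(10):
    pi, A, B, ll = em_step(data, pi, A, B, state_label)
    history.append(ll)
```

In this sketch a fully labelled sequence behaves like ordinary supervised counting, because only one state per position survives the mask, and a fully unlabelled sequence behaves like standard Baum-Welch. Partially labelled sequences fall in between, which is why all three kinds of data can share one update.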
Affiliation(s)
- Ioannis A Tamposis
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Konstantinos D Tsirigos
  - Department of Bio and Health Informatics, Technical University of Denmark, Kgs Lyngby, Denmark
- Panagiota I Kontou
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Pantelis G Bagos
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
2
Partial proteolysis improves the identification of the extracellular segments of transmembrane proteins by surface biotinylation. Sci Rep 2020; 10:8880. [PMID: 32483232 PMCID: PMC7264363 DOI: 10.1038/s41598-020-65831-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/06/2020] [Accepted: 05/08/2020] [Indexed: 01/11/2023]
Abstract
Transmembrane proteins (TMPs) play a crucial role in several physiological processes. Despite their importance and diversity, only a few TMP structures have been determined by high-resolution protein structure characterization methods so far. Due to the low number of determined TMP structures, the parallel development of various bioinformatics and experimental methods was necessary for their topological characterization. The combination of these methods is a powerful approach in the determination of TMP topology, as in the Constrained Consensus TOPology prediction. To support the prediction, we previously developed a high-throughput topology characterization method based on primary amino group labelling, but it is still limited in identifying all TMPs and their extracellular segments on the surface of a particular cell type. To generate more topology information, we have introduced a new step into our method: partial proteolysis of the cell surface. This step results in new primary amino groups in the proteins that can be biotinylated with a membrane-impermeable agent while the cells remain intact. Pre-digestion also promotes the emergence of modified peptides that are more suitable for MS/MS analysis. The modified sites can be utilized as extracellular constraints in topology predictions and may contribute to the refined topology of these proteins.
3
Predicting Alpha Helical Transmembrane Proteins Using HMMs. Methods Mol Biol 2018; 1552:63-82. [PMID: 28224491 DOI: 10.1007/978-1-4939-6753-7_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2023]
Abstract
Alpha helical transmembrane (TM) proteins constitute an important structural class of membrane proteins involved in a wide variety of cellular functions. The prediction of their transmembrane topology, as well as their discrimination in newly sequenced genomes, is of great importance for the elucidation of their structure and function. Several methods have been applied for the prediction of the transmembrane segments and the topology of alpha helical transmembrane proteins utilizing different algorithmic techniques. Hidden Markov Models (HMMs) have been efficiently used in the development of several computational methods used for this task. In this chapter we give a brief review of different available prediction methods for alpha helical transmembrane proteins pointing out sequence and structural features that should be incorporated in a prediction method. We then describe the procedure of the design and development of a Hidden Markov Model capable of predicting the transmembrane alpha helices in proteins and discriminating them from globular proteins.
4
Membrane proteins structures: A review on computational modeling tools. Biochimica et Biophysica Acta-Biomembranes 2017; 1859:2021-2039. [DOI: 10.1016/j.bbamem.2017.07.008] [Citation(s) in RCA: 62] [Impact Index Per Article: 8.9] [Received: 03/23/2017] [Revised: 07/04/2017] [Accepted: 07/13/2017] [Indexed: 01/02/2023]
5
Abstract
In sessile plants, dynamic protein secretion pathways orchestrate the cellular responses to internal signals and external environmental changes in almost every aspect of plant development. The cohort of plant proteins secreted from plant cells into the extracellular matrix has been annotated as the plant secretome. The identification and characterization of secreted proteins can therefore reveal novel secretory potentials and establish the functional connection between cellular protein secretion and plant physiological phenomena. Notably, an increasing number of bioinformatics databases and tools have been developed for computational prediction of either secreted proteins or secretory pathways. This chapter summarizes currently accessible databases and tools for protein secretion analysis in Arabidopsis thaliana and higher plants, and provides feasible methodologies for bioinformatics analysis in secretome studies for the plant research community.
Affiliation(s)
- Liyuan Chen
  - RGC-AoE Centre for Organelle Biogenesis and Function, School of Life Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, China.
6
7
Holliday GL, Bairoch A, Bagos PG, Chatonnet A, Craik DJ, Finn RD, Henrissat B, Landsman D, Manning G, Nagano N, O’Donovan C, Pruitt KD, Rawlings ND, Saier M, Sowdhamini R, Spedding M, Srinivasan N, Vriend G, Babbitt PC, Bateman A. Key challenges for the creation and maintenance of specialist protein resources. Proteins 2015; 83:1005-13. [PMID: 25820941 PMCID: PMC4446195 DOI: 10.1002/prot.24803] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 01/12/2015] [Revised: 03/06/2015] [Accepted: 03/20/2015] [Indexed: 11/12/2022]
Abstract
As the volume of data relating to proteins increases, researchers rely more and more on the analysis of published data, thus increasing the importance of good access to these data that vary from the supplemental material of individual articles, all the way to major reference databases with professional staff and long-term funding. Specialist protein resources fill an important middle ground, providing interactive web interfaces to their databases for a focused topic or family of proteins, using specialized approaches that are not feasible in the major reference databases. Many are labors of love, run by a single lab with little or no dedicated funding and there are many challenges to building and maintaining them. This perspective arose from a meeting of several specialist protein resources and major reference databases held at the Wellcome Trust Genome Campus (Cambridge, UK) on August 11 and 12, 2014. During this meeting some common key challenges involved in creating and maintaining such resources were discussed, along with various approaches to address them. In laying out these challenges, we aim to inform users about how these issues impact our resources and illustrate ways in which our working together could enhance their accuracy, currency, and overall value.
Affiliation(s)
- Gemma L Holliday
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, 94158
- Amos Bairoch
  - SIB Swiss Institute of Bioinformatics, University of Geneva, Geneva, Switzerland
- Pantelis G Bagos
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, 35100, Greece
- Arnaud Chatonnet
  - INRA, UMR866 Dynamique Musculaire et Métabolisme, Montpellier, F-34000, France
  - Université Montpellier, Montpellier, F-34000, France
- David J Craik
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland, 4072, Australia
- Robert D Finn
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Bernard Henrissat
  - Architecture et Fonction des Macromolécules Biologiques, CNRS, Aix-Marseille Université, Marseille, 13288, France
  - Department of Biological Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
- David Landsman
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, 20892
- Gerard Manning
  - Department of Bioinformatics & Computational Biology, Genentech, 1 DNA Way, South San Francisco, California, 98010
- Nozomi Nagano
  - Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, 135-0064, Japan
- Claire O’Donovan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Kim D Pruitt
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, 20892
- Neil D Rawlings
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Milton Saier
  - Department of Molecular Biology, University of California at San Diego, La Jolla, California, 92093
- Ramanathan Sowdhamini
  - National Centre for Biological Sciences, TIFR, GKVK Campus, Bellary Road, Bangalore, 560065, India
- Michael Spedding
  - Chair NC-IUPHAR, Spedding Research Solutions SARL, 6 Rue Ampere, Le Vesinet, 78110, France
- Gert Vriend
  - Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Medical Center, Geert Grooteplein Zuid 26-28, 6525 GA Nijmegen, The Netherlands
- Patricia C Babbitt
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, 94158
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
8
Leman JK, Ulmschneider MB, Gray JJ. Computational modeling of membrane proteins. Proteins 2015; 83:1-24. [PMID: 25355688 PMCID: PMC4270820 DOI: 10.1002/prot.24703] [Citation(s) in RCA: 81] [Impact Index Per Article: 9.0] [Received: 06/17/2014] [Revised: 10/01/2014] [Accepted: 10/18/2014] [Indexed: 02/06/2023]
Abstract
The determination of membrane protein (MP) structures has always trailed that of soluble proteins due to difficulties in their overexpression, reconstitution into membrane mimetics, and subsequent structure determination. The percentage of MP structures in the protein databank (PDB) has been at a constant 1-2% for the last decade. In contrast, over half of all drugs target MPs, only highlighting how little we understand about drug-specific effects in the human body. To reduce this gap, researchers have attempted to predict structural features of MPs even before the first structure was experimentally elucidated. In this review, we present current computational methods to predict MP structure, starting with secondary structure prediction, prediction of trans-membrane spans, and topology. Even though these methods generate reliable predictions, challenges such as predicting kinks or precise beginnings and ends of secondary structure elements are still waiting to be addressed. We describe recent developments in the prediction of 3D structures of both α-helical MPs as well as β-barrels using comparative modeling techniques, de novo methods, and molecular dynamics (MD) simulations. The increase of MP structures has (1) facilitated comparative modeling due to availability of more and better templates, and (2) improved the statistics for knowledge-based scoring functions. Moreover, de novo methods have benefited from the use of correlated mutations as restraints. Finally, we outline current advances that will likely shape the field in the forthcoming decade.
Affiliation(s)
- Julia Koehler Leman
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Martin B. Ulmschneider
  - Department of Materials Science and Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Jeffrey J. Gray
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
9
Fluman N, Navon S, Bibi E, Pilpel Y. mRNA-programmed translation pauses in the targeting of E. coli membrane proteins. eLife 2014; 3. [PMID: 25135940 PMCID: PMC4359368 DOI: 10.7554/elife.03440] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.6] [Received: 05/21/2014] [Accepted: 08/16/2014] [Indexed: 02/05/2023]
Abstract
In all living organisms, ribosomes translating membrane proteins are targeted to membrane translocons early in translation, by the ubiquitous signal recognition particle (SRP) system. In eukaryotes, the SRP Alu domain arrests translation elongation of membrane proteins until targeting is complete. Curiously, however, the Alu domain is lacking in most eubacteria. In this study, by analyzing genome-wide data on translation rates, we identified a potential compensatory mechanism in E. coli that serves to slow down translation during membrane protein targeting. The underlying mechanism is likely programmed into the coding sequence, where Shine-Dalgarno-like elements trigger elongation pauses at strategic positions during the early stages of translation. We provide experimental evidence that slow translation during targeting improves membrane protein production fidelity, as it correlates with better folding of overexpressed membrane proteins. Thus, slow elongation is important for membrane protein targeting in E. coli, which utilizes mechanisms different from the eukaryotic one to control the translation speed.
Affiliation(s)
- Nir Fluman
  - Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
- Sivan Navon
  - Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
- Eitan Bibi
  - Department of Biological Chemistry, Weizmann Institute of Science, Rehovot, Israel
- Yitzhak Pilpel
  - Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
10
HMMpTM: improving transmembrane protein topology prediction using phosphorylation and glycosylation site prediction. Biochimica et Biophysica Acta-Proteins and Proteomics 2013; 1844:316-22. [PMID: 24225132 DOI: 10.1016/j.bbapap.2013.11.001] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 09/17/2013] [Revised: 11/02/2013] [Accepted: 11/04/2013] [Indexed: 11/22/2022]
Abstract
During the last two decades a large number of computational methods have been developed for predicting transmembrane protein topology. Current predictors rely on topogenic signals in the protein sequence, such as the distribution of positively charged residues in extra-membrane loops and the existence of N-terminal signals. However, phosphorylation and glycosylation are post-translational modifications (PTMs) that occur in a compartment-specific manner and therefore the presence of a phosphorylation or glycosylation site in a transmembrane protein provides topological information. We examine the combination of phosphorylation and glycosylation site prediction with transmembrane protein topology prediction. We report the development of a Hidden Markov Model based method, capable of predicting the topology of transmembrane proteins and the existence of kinase specific phosphorylation and N/O-linked glycosylation sites along the protein sequence. Our method integrates a novel feature in transmembrane protein topology prediction, which results in improved performance for topology prediction and reliable prediction of phosphorylation and glycosylation sites. The method is freely available at http://bioinformatics.biol.uoa.gr/HMMpTM.
11
Abstract
Background Membrane proteins perform essential roles in diverse cellular functions and are regarded as major pharmaceutical targets. The significance of membrane proteins has led to the development of dozens of resources related to membrane proteins. However, most of these resources are built for specific well-known membrane protein groups, making it difficult to find common and specific features of various membrane protein groups.
Methods We collected human membrane proteins from the dispersed resources and predicted novel membrane protein candidates by using ortholog information and our membrane protein classifiers. The membrane proteins were classified according to the type of interaction with the membrane, subcellular localization, and molecular function. We also built a new feature dataset to characterize the membrane proteins in various aspects including membrane protein topology, domain, biological process, disease, and drug. Moreover, protein structure and ICD-10-CM based integrated disease and drug information was newly included. To analyze the comprehensive information of membrane proteins, we implemented analysis tools to identify novel sequence and functional features of the classified membrane protein groups and to extract features from protein sequences.
Results We constructed HMPAS with 28,509 collected known membrane proteins and 8,076 newly predicted candidates. This system provides integrated information on human membrane proteins individually and in groups organized by 45 subcellular locations and 1,401 molecular functions. As a case study, we identified associations between the membrane proteins and diseases and show that membrane proteins are promising targets for diseases related to the nervous system and circulatory system. A web-based interface to this system was constructed to allow researchers not only to retrieve organized information on individual proteins but also to use the tools to analyze the membrane proteins.
Conclusions HMPAS provides comprehensive information about human membrane proteins including specific features of certain membrane protein groups. In this system, users can acquire information on individual proteins and specified groups, focused on their conserved sequence features, involved cellular processes, and diseases. HMPAS may serve as a valuable resource for the inference of novel cellular mechanisms and pharmaceutical targets associated with the human membrane proteins. HMPAS is freely available at http://fcode.kaist.ac.kr/hmpas.
12
Gypas F, Tsaousis GN, Hamodrakas SJ. mpMoRFsDB: a database of molecular recognition features in membrane proteins. Bioinformatics 2013; 29:2517-8. [DOI: 10.1093/bioinformatics/btt427] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/14/2023]
13
Naganathan S, Ye S, Sakmar TP, Huber T. Site-specific epitope tagging of G protein-coupled receptors by bioorthogonal modification of a genetically encoded unnatural amino acid. Biochemistry 2013; 52:1028-36. [PMID: 23317030 DOI: 10.1021/bi301292h] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Indexed: 01/31/2023]
Abstract
We developed a general strategy for labeling expressed membrane proteins with a peptide epitope tag and detecting the tagged proteins in native cellular membranes. First, we genetically encoded the unnatural amino acid p-azido-L-phenylalanine (azF) at various specific sites in a G protein-coupled receptor (GPCR), C-C chemokine receptor 5 (CCR5). The reactive azido moiety facilitates Staudinger ligation to a triarylphosphine-conjugated FLAG peptide. We then developed a whole-cell-based enzyme-linked immunosorbent assay approach to detect the modified azF-CCR5 using anti-FLAG mAb. We optimized conditions to achieve labeling and detection of low-abundance GPCRs in live cells. We also performed an accessibility screen to identify azF positions on CCR5 amenable to labeling. Finally, we demonstrate a preparative strategy for obtaining pure bioorthogonally modified GPCRs suitable for single-molecule detection fluorescence experiments. This peptide epitope tagging strategy, which employs genetic encoding and bioorthogonal labeling of azF in live cells, should be useful for studying biogenesis of polytopic membrane proteins and GPCR signaling mechanisms.
Affiliation(s)
- Saranga Naganathan
  - Laboratory of Chemical Biology and Signal Transduction, The Rockefeller University, 1230 York Avenue, New York, NY 10065, USA
14
Andreopoulos B, Labudde D. Efficient unfolding pattern recognition in single molecule force spectroscopy data. Algorithms Mol Biol 2011; 6:16. [PMID: 21645400 PMCID: PMC3126767 DOI: 10.1186/1748-7188-6-16] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 07/09/2010] [Accepted: 06/06/2011] [Indexed: 11/20/2022]
Abstract
Background Single-molecule force spectroscopy (SMFS) is a technique that measures the force necessary to unfold a protein. SMFS experiments generate Force-Distance (F-D) curves. A statistical analysis of a set of F-D curves reveals different unfolding pathways. Information on protein structure, conformation, functional states, and inter- and intra-molecular interactions can be derived.
Results In the present work, we propose a pattern recognition algorithm and apply our algorithm to datasets from SMFS experiments on the membrane protein bacteriorhodopsin (bR). We discuss the unfolding pathways found in bR, which are characterised by main peaks and side peaks. A main peak is the result of the pairwise unfolding of the transmembrane helices. In contrast, a side peak is an unfolding event in the alpha-helix or other secondary structural element. The algorithm is capable of detecting side peaks along with main peaks. Therefore, we can detect the individual unfolding pathway as the sequence of events labeled with their occurrences and co-occurrences special to bR's unfolding pathway. We find that side peaks do not co-occur with one another in curves as frequently as main peaks do, which may imply a synergistic effect occurring between helices. While main peaks co-occur as pairs in at least 50% of curves, the side peaks co-occur with one another in less than 10% of curves. Moreover, the algorithm runtime scales well as the dataset size increases.
Conclusions Our algorithm satisfies the requirements of an automated methodology that combines high accuracy with efficiency in analyzing SMFS datasets. The algorithm tackles the force spectroscopy analysis bottleneck leading to more consistent and reproducible results.