1
Gillani M, Pollastri G. Protein subcellular localization prediction tools. Comput Struct Biotechnol J 2024; 23:1796-1807. [PMID: 38707539 PMCID: PMC11066471 DOI: 10.1016/j.csbj.2024.04.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/11/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024] Open
Abstract
Protein subcellular localization prediction is of great significance in bioinformatics and biological research. Because most proteins do not have experimentally determined localization information, computational prediction methods and tools have been an active research area for more than two decades. Knowledge of the subcellular location of a protein provides valuable information about its function, the functioning of the cell, and its possible interactions with other proteins. Fast, reliable, and accurate predictors provide platforms to harness the abundance of sequence data to predict subcellular locations. During the last decade, considerable research effort has been aimed at developing subcellular localization predictors. This paper reviews recent subcellular localization prediction tools in the Eukaryotic, Prokaryotic, and Virus-based categories, followed by a detailed analysis. Each predictor is discussed based on its main features, strengths, weaknesses, algorithms used, prediction techniques, and analysis. This review is supported by taxonomies of prediction tools that highlight their relevant areas and examples, for straightforward categorization and ease of understanding. These taxonomies help users find suitable tools according to their needs. Furthermore, recent research gaps and challenges are discussed to cover areas that need the utmost attention. This survey provides an in-depth analysis of the most recent prediction tools and can serve as a quick guide for researchers to identify and explore recent advances in the literature.
Affiliation(s)
- Maryam Gillani
  - School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
- Gianluca Pollastri
  - School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland

2
Gillani M, Pollastri G. SCLpred-ECL: Subcellular Localization Prediction by Deep N-to-1 Convolutional Neural Networks. Int J Mol Sci 2024; 25:5440. [PMID: 38791479 PMCID: PMC11121631 DOI: 10.3390/ijms25105440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Revised: 05/09/2024] [Accepted: 05/11/2024] [Indexed: 05/26/2024] Open
Abstract
The subcellular location of a protein provides valuable insights to bioinformaticians in terms of drug design and discovery, genomics, and various other aspects of medical research. Experimental methods for determining protein subcellular localization are time-consuming and expensive, whereas computational methods, if accurate, would represent a much more efficient alternative. This article introduces an ab initio protein subcellular localization predictor based on an ensemble of Deep N-to-1 Convolutional Neural Networks. Our predictor is trained and tested on strictly redundancy-reduced datasets and achieves 63% accuracy across a diverse set of classes. This predictor is a step towards bridging the gap between a protein sequence and the protein's function. It can potentially provide information about protein-protein interactions to facilitate drug design and processes such as vaccine production that are essential to disease prevention.
Affiliation(s)
- Maryam Gillani
  - School of Computer Science, University College Dublin (UCD), D04 V1W8 Dublin, Ireland

3
Özsarı G, Rifaioglu AS, Atakan A, Doğan T, Martin MJ, Çetin Atalay R, Atalay V. SLPred: a multi-view subcellular localization prediction tool for multi-location human proteins. Bioinformatics 2022; 38:4226-4229. [PMID: 35801913 DOI: 10.1093/bioinformatics/btac458] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/23/2022] [Revised: 06/08/2022] [Accepted: 07/07/2022] [Indexed: 12/24/2022] Open
Abstract
SUMMARY: Accurate prediction of the subcellular locations (SLs) of proteins is a critical topic in protein science. In this study, we present SLPred, an ensemble-based multi-view and multi-label protein subcellular localization prediction tool. For a query protein sequence, SLPred provides predictions for nine main SLs using independent machine-learning models trained for each location. We used UniProtKB/Swiss-Prot human protein entries and their curated SL annotations as our source data. We connected all disjoint terms in the UniProt SL hierarchy based on the corresponding term relationships in the cellular component category of Gene Ontology and constructed a training dataset that is both reliable and large scale using the re-organized hierarchy. We tested SLPred on multiple benchmarking datasets, including our in-house sets, and compared its performance against six state-of-the-art methods. Results indicated that SLPred outperforms other tools in the majority of cases. AVAILABILITY AND IMPLEMENTATION: SLPred is available both as an open-access and user-friendly web-server (https://slpred.kansil.org) and a stand-alone tool (https://github.com/kansil/SLPred). All datasets used in this study are also available at https://slpred.kansil.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
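The one-model-per-location design described in this abstract can be illustrated with a minimal Python sketch. It is not SLPred's implementation: the location names, the toy scoring functions, and the 0.5 threshold are all hypothetical stand-ins for independently trained models.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for independently trained per-location models.
# Each takes a protein sequence and returns a score in [0, 1].
toy_models: Dict[str, Callable[[str], float]] = {
    "Nucleus": lambda seq: 0.9,
    "Cytoplasm": lambda seq: 0.2,
    "Mitochondrion": lambda seq: 0.5,
}


def predict_locations(
    sequence: str,
    models: Dict[str, Callable[[str], float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[str]:
    """Multi-label prediction: keep every location whose score passes the threshold."""
    return sorted(loc for loc, model in models.items() if model(sequence) >= threshold)


# The sequence below is an arbitrary placeholder.
predicted = predict_locations("MKTAYIAKQR", toy_models)
print(predicted)
```

Because each location has its own model and its own decision, a protein can be assigned to several locations at once, which is what makes the scheme multi-label rather than multi-class.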
Affiliation(s)
- Gökhan Özsarı
  - Department of Computer Engineering, Middle East Technical University, Ankara 06800, Turkey; Department of Computer Engineering, Niğde Ömer Halisdemir University, Niğde 51240, Turkey
- Ahmet Sureyya Rifaioglu
  - Department of Computer Engineering, İskenderun Technical University, Hatay 31200, Turkey; Faculty of Medicine, Institute for Computational Biomedicine, Heidelberg University and Heidelberg University Hospital, Heidelberg 69120, Germany
- Ahmet Atakan
  - Department of Computer Engineering, Middle East Technical University, Ankara 06800, Turkey; Department of Computer Engineering, Erzincan Binali Yıldırım University, Erzincan 24002, Turkey
- Tunca Doğan
  - Department of Computer Engineering, Hacettepe University, Ankara 06800, Turkey
- Maria Jesus Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, Hinxton CB10 1SD, UK
- Rengül Çetin Atalay
  - Graduate School of Informatics, Middle East Technical University, Ankara 06800, Turkey; Section of Pulmonary and Critical Care Medicine, the University of Chicago, Chicago, IL 60637, USA
- Volkan Atalay
  - Department of Computer Engineering, Middle East Technical University, Ankara 06800, Turkey

4
Kliuchnikov E, Klyshko E, Kelly MS, Zhmurov A, Dima RI, Marx KA, Barsegov V. Microtubule assembly and disassembly dynamics model: Exploring dynamic instability and identifying features of Microtubules' Growth, Catastrophe, Shortening, and Rescue. Comput Struct Biotechnol J 2022; 20:953-974. [PMID: 35242287 PMCID: PMC8861655 DOI: 10.1016/j.csbj.2022.01.028] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/26/2021] [Revised: 01/26/2022] [Accepted: 01/27/2022] [Indexed: 12/21/2022] Open
Abstract
Microtubules (MTs), a cellular structure element, exhibit dynamic instability and can switch stochastically from growth to shortening; but the factors that trigger these processes at the molecular level are not understood. We developed a 3D Microtubule Assembly and Disassembly DYnamics (MADDY) model, based upon a bead-per-monomer representation of the αβ-tubulin dimers forming an MT lattice, stabilized by the lateral and longitudinal interactions between tubulin subunits. The model was parameterized against the experimental rates of MT growth and shortening, and pushing forces on the Dam1 protein complex due to protofilaments splaying out. Using the MADDY model, we carried out GPU-accelerated Langevin simulations to access dynamic instability behavior. By applying Machine Learning techniques, we identified the MT characteristics that distinguish simultaneously all four kinetic states: growth, catastrophe, shortening, and rescue. At the cellular 25 μM tubulin concentration, the most important quantities are the MT length $L$, average longitudinal curvature $\kappa_{long}$, MT tip width $w$, total energy of longitudinal interactions in MT lattice $U_{long}$, and the energies of longitudinal and lateral interactions required to complete MT to full cylinder $U_{long}^{add}$ and $U_{lat}^{add}$. At high 250 μM tubulin concentration, the most important characteristics are $L$, $\kappa_{long}$, number of hydrolyzed αβ-tubulin dimers $n_{hyd}$ and number of lateral interactions per helical pitch $n_{lat}$ in MT lattice, energy of lateral interactions in MT lattice $U_{lat}$, and energy of longitudinal interactions in MT tip $u_{long}$. These results allow greater insights into what brings about kinetic state stability and the transitions between states involved in MT dynamic instability behavior.
Affiliation(s)
- Eugene Klyshko
  - Department of Chemistry, University of Massachusetts, Lowell, MA 01854, USA
- Maria S. Kelly
  - Department of Chemistry, University of Cincinnati, Cincinnati, OH 45221, USA
- Artem Zhmurov
  - KTH Royal Institute of Technology, Stockholm, Sweden
- Ruxandra I. Dima
  - Department of Chemistry, University of Cincinnati, Cincinnati, OH 45221, USA
- Kenneth A. Marx
  - Department of Chemistry, University of Massachusetts, Lowell, MA 01854, USA
- Valeri Barsegov
  - Department of Chemistry, University of Massachusetts, Lowell, MA 01854, USA

5
Jiang Y, Wang D, Wang W, Xu D. Computational methods for protein localization prediction. Comput Struct Biotechnol J 2021; 19:5834-5844. [PMID: 34765098 PMCID: PMC8564054 DOI: 10.1016/j.csbj.2021.10.023] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 03/31/2021] [Revised: 10/12/2021] [Accepted: 10/13/2021] [Indexed: 12/16/2022] Open
Abstract
The accurate annotation of protein localization is crucial in understanding protein function in tandem with a broad range of applications such as pathological analysis and drug design. Since most proteins do not have experimentally-determined localization information, the computational prediction of protein localization has been an active research area for more than two decades. In particular, recent machine-learning advancements have fueled the development of new methods in protein localization prediction. In this review paper, we first categorize the main features and algorithms used for protein localization prediction. Then, we summarize a list of protein localization prediction tools in terms of their coverage, characteristics, and accessibility to help users find suitable tools based on their needs. Next, we evaluate some of these tools on a benchmark dataset. Finally, we provide an outlook on the future exploration of protein localization methods.
Affiliation(s)
- Yuexu Jiang
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Duolin Wang
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Weiwei Wang
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Dong Xu
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA

6
Cui T, Dou Y, Tan P, Ni Z, Liu T, Wang D, Huang Y, Cai K, Zhao X, Xu D, Lin H, Wang D. RNALocate v2.0: an updated resource for RNA subcellular localization with increased coverage and annotation. Nucleic Acids Res 2021; 50:D333-D339. [PMID: 34551440 PMCID: PMC8728251 DOI: 10.1093/nar/gkab825] [Citation(s) in RCA: 67] [Impact Index Per Article: 16.8] [Received: 07/07/2021] [Revised: 09/03/2021] [Accepted: 09/07/2021] [Indexed: 12/16/2022] Open
Abstract
Resolving the spatial distribution of the transcriptome at a subcellular level can increase our understanding of biology and diseases. To facilitate studies of biological functions and molecular mechanisms in the transcriptome, we updated RNALocate, a resource for RNA subcellular localization analysis that is freely accessible at http://www.rnalocate.org/ or http://www.rna-society.org/rnalocate/. Compared to RNALocate v1.0, the new features in version 2.0 include (i) expansion of the data sources and the coverage of species; (ii) incorporation and integration of RNA-seq datasets containing information about subcellular localization; (iii) addition and reorganization of RNA information (RNA subcellular localization conditions and descriptive figures for method, RNA homology information, RNA interaction and ncRNA disease information) and (iv) three additional prediction tools: DM3Loc, iLoc-lncRNA and iLoc-mRNA. Overall, RNALocate v2.0 provides a comprehensive RNA subcellular localization resource for researchers to deconvolute the highly complex architecture of the cell.
Affiliation(s)
- Tianyu Cui
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
- Yiying Dou
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
- Puwen Tan
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
- Zhen Ni
  - Department of Thoracic Surgery, Nanfang Hospital, Southern Medical University, Guangzhou 510515, China
- Tianyuan Liu
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
- DuoLin Wang
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, USA
- Yan Huang
  - Shunde Hospital, Southern Medical University (The First People's Hospital of Shunde Foshan), Foshan 528308, China
- Kaican Cai
  - Department of Thoracic Surgery, Nanfang Hospital, Southern Medical University, Guangzhou 510515, China
- Xiaoyang Zhao
  - State Key Laboratory of Organ Failure Research, Department of Developmental Biology, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
- Dong Xu
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, USA
- Hao Lin
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Dong Wang
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China; Dermatology Hospital, Southern Medical University, Guangzhou 510091, China

7
Li J, Zhang L, He S, Guo F, Zou Q. SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning. Brief Bioinform 2021; 22:6059770. [PMID: 33388743 DOI: 10.1093/bib/bbaa401] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 09/02/2020] [Revised: 11/28/2020] [Accepted: 12/08/2020] [Indexed: 01/23/2023] Open
Abstract
MOTIVATION: mRNA location corresponds to the location of protein translation and contributes to precise spatial and temporal management of protein function. However, current assignment of subcellular localization of eukaryotic mRNA has important limitations: (1) turning multiple classifications into multiple dichotomies makes the training process tedious; (2) the majority of the models trained by classical algorithms are based on the extraction of single-sequence information; (3) the existing state-of-the-art models have not reached an ideal level in terms of prediction and generalization ability. To achieve better assignment of subcellular localization of eukaryotic mRNA, a better and more comprehensive model must be developed. RESULTS: In this paper, SubLocEP is proposed as a two-layer integrated prediction model for accurate prediction of the location of sequence samples. Unlike the existing models based on limited features, SubLocEP comprehensively considers additional feature attributes and is combined with LightGBM to generate single-feature classifiers. The initial integration model (single-layer model) is generated according to the categories of a feature. Subsequently, two single-layer integration models are weighted (sequence-based : physicochemical properties = 3:2) to produce the final two-layer model. The performance of SubLocEP on independent datasets indicates that it is an accurate and stable prediction model with strong generalization ability. Additionally, an online tool has been developed that contains experimental data and maximizes user convenience for estimating the subcellular localization of eukaryotic mRNA.
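The 3:2 weighting of the two single-layer models can be read as a weighted average of their per-class outputs. The Python sketch below illustrates that reading only; the assumption that the weights are applied to class probabilities, the function names, and the example numbers are all hypothetical and not taken from the SubLocEP paper.

```python
from typing import List, Sequence


def combine_layers(
    seq_probs: Sequence[float],
    phys_probs: Sequence[float],
    w_seq: float = 3.0,   # weight of the sequence-based single-layer model
    w_phys: float = 2.0,  # weight of the physicochemical single-layer model
) -> List[float]:
    """Weighted average of two per-class probability vectors (assumed scheme)."""
    if len(seq_probs) != len(phys_probs):
        raise ValueError("both models must score the same set of classes")
    total = w_seq + w_phys
    return [(w_seq * s + w_phys * p) / total for s, p in zip(seq_probs, phys_probs)]


def predict_class(probs: Sequence[float]) -> int:
    """Index of the class with the highest combined probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Made-up outputs of the two single-layer models for a two-class toy problem.
combined = combine_layers([0.5, 0.5], [1.0, 0.0])
best = predict_class(combined)
print(combined, best)
```

With weights 3 and 2, the sequence-based model contributes 60% of the final score and the physicochemical model 40%.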
Affiliation(s)
- Lichao Zhang
  - School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology

8
Imai K, Nakai K. Tools for the Recognition of Sorting Signals and the Prediction of Subcellular Localization of Proteins From Their Amino Acid Sequences. Front Genet 2020; 11:607812. [PMID: 33324450 PMCID: PMC7723863 DOI: 10.3389/fgene.2020.607812] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 09/18/2020] [Accepted: 11/03/2020] [Indexed: 12/13/2022] Open
Abstract
At the time of translation, nascent proteins are thought to be sorted into their final subcellular localization sites based on parts of their amino acid sequences (i.e., sorting or targeting signals). Thus, it is interesting to computationally recognize these signals from the amino acid sequence of any given protein and to predict its final subcellular localization from such information, supplemented with additional information (e.g., k-mer frequency). This field has a long history, and many prediction tools have been released. Even in this era of proteomic atlases at the single-cell level, researchers continue to develop new algorithms, aiming, for example, at assessing the impact of disease-causing mutations or cell type-specific alternative splicing. In this article, we overview the entire field and discuss its future direction.
Affiliation(s)
- Kenichiro Imai
  - Cellular and Molecular Biotechnology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
- Kenta Nakai
  - The Institute of Medical Science, The University of Tokyo, Tokyo, Japan

9
Xu YY, Zhou H, Murphy RF, Shen HB. Consistency and variation of protein subcellular location annotations. Proteins 2020; 89:242-250. [PMID: 32935893 DOI: 10.1002/prot.26010] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/16/2020] [Revised: 07/09/2020] [Accepted: 09/13/2020] [Indexed: 11/09/2022]
Abstract
A major challenge for protein databases is reconciling information from diverse sources. This is especially difficult when some information consists of secondary, human-interpreted rather than primary data. For example, the Swiss-Prot database contains curated annotations of subcellular location that are based on predictions from protein sequence, statements in scientific articles, and published experimental evidence. The Human Protein Atlas (HPA) consists of millions of high-resolution microscopic images that show protein spatial distribution on a cellular and subcellular level. These images are manually annotated with protein subcellular locations by trained experts. The image annotations in HPA can capture the variation of subcellular location across different cell lines, tissues, or tissue states. Systematic investigation of the consistency between HPA and Swiss-Prot assignments of subcellular location, which is important for understanding and utilizing protein location data from the two databases, has not been described previously. In this paper, we quantitatively evaluate the consistency of subcellular location annotations between HPA and Swiss-Prot at multiple levels, as well as variation of protein locations across cell lines and tissues. Our results show that annotations of these two databases differ significantly in many cases, leading to proposed procedures for deriving and integrating the protein subcellular location data. We also find that proteins having highly variable locations are more likely to be biomarkers of diseases, providing support for incorporating analysis of subcellular location in protein biomarker identification and screening.
Affiliation(s)
- Ying-Ying Xu
  - School of Biomedical Engineering, Southern Medical University, Guangzhou, China; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China; Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Hang Zhou
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
- Robert F Murphy
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Hong-Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China

10
Giordana L, Nowicki C. Two phylogenetically divergent isocitrate dehydrogenases are encoded in Leishmania parasites. Molecular and functional characterization of Leishmania mexicana isoenzymes with specificity towards NAD+ and NADP+. Mol Biochem Parasitol 2020; 240:111320. [PMID: 32980452 DOI: 10.1016/j.molbiopara.2020.111320] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/10/2020] [Revised: 08/26/2020] [Accepted: 08/27/2020] [Indexed: 10/23/2022]
Abstract
Leishmania parasites are of great relevance to public health because they are the causative agents of various long-term and health-threatening diseases in humans. Depending on the manifestation, drugs either require difficult and lengthy administration, are toxic, expensive, not very effective, or have lost efficacy due to the resistance developed by these pathogens against clinical treatments. The intermediary metabolism of Leishmania parasites is characterized by several unusual features, including the open question of whether the Krebs cycle operates in a cyclic and/or a non-cyclic mode. Our survey of the genomes of Leishmania species and monoxenous parasites such as those of the genera Crithidia and Leptomonas (http://www.tritrypdb.org) revealed that two genes encoding putative isocitrate dehydrogenases (IDHs), with distantly related sequences, are strictly conserved among these parasites. Thus, in this study, we aimed to functionally characterize the two leishmanial IDH isoenzymes, for which we selected the genes LmxM10.0290 (Lmex_IDH-90) and LmxM32.2550 (Lmex_IDH-50) from L. mexicana. Phylogenetic analysis showed that Lmex_IDH-50 clustered with members of Subfamily I, which contains mainly archaeal and bacterial IDHs, and that Lmex_IDH-90 was a close relative of eukaryotic enzymes within Subfamily II IDHs. 3-D homology modeling predicted that both IDHs exhibited the typical folding motifs recognized as canonical for prokaryotic and eukaryotic counterparts, respectively. Both IDH isoforms displayed dual subcellular localization, in the cytosol and the mitochondrion. Kinetic studies showed that Lmex_IDH-50 exclusively catalyzed the reduction of NAD+, while Lmex_IDH-90 solely used NADP+ as coenzyme. Besides, Lmex_IDH-50 differed from Lmex_IDH-90 by exhibiting a nearly 20-fold lower apparent Km value towards isocitrate (2.0 μM vs 43 μM).
Our findings showed, for the first time, that the genus Leishmania differs not only from other trypanosomatids such as Trypanosoma cruzi and Trypanosoma brucei, but also from most living organisms, by exhibiting two functional homo-dimeric IDHs, highly specific towards NAD+ and NADP+, respectively. It is tempting to argue that either or both types of IDHs might be directly or indirectly linked to the Krebs cycle and/or to the de novo synthesis of glutamate. Our results about the biochemical and structural features of leishmanial IDHs show the relevance of deepening our knowledge of the metabolic processes in these pathogenic parasites to potentially identify new therapeutic targets.
Affiliation(s)
- Lucila Giordana
  - Universidad de Buenos Aires, Facultad de Farmacia y Bioquímica, Instituto de Química y Fisicoquímica Biológica (IQUIFIB-CONICET), Junín 956, C1113AAD, Buenos Aires, Argentina
- Cristina Nowicki
  - Universidad de Buenos Aires, Facultad de Farmacia y Bioquímica, Instituto de Química y Fisicoquímica Biológica (IQUIFIB-CONICET), Junín 956, C1113AAD, Buenos Aires, Argentina

11
Large-scale prediction and analysis of protein sub-mitochondrial localization with DeepMito. BMC Bioinformatics 2020; 21:266. [PMID: 32938368 PMCID: PMC7493403 DOI: 10.1186/s12859-020-03617-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/15/2020] [Accepted: 06/18/2020] [Indexed: 12/31/2022] Open
Abstract
Background: The prediction of protein subcellular localization is a key step in the broader effort towards protein functional annotation. Many computational methods exist to identify high-level protein subcellular compartments such as nucleus, cytoplasm or organelles. However, many organelles, like mitochondria, have their own internal compartmentalization. Knowing the precise location of a protein inside mitochondria is crucial for its accurate functional characterization. We recently developed DeepMito, a new method based on a 1-Dimensional Convolutional Neural Network (1D-CNN) architecture that outperforms other similar approaches available in the literature. Results: Here, we explore the adoption of DeepMito for the large-scale annotation of four sub-mitochondrial localizations on the mitochondrial proteomes of five different species: human, mouse, fly, yeast and Arabidopsis thaliana. A significant fraction of the proteins from these organisms lacked experimental information about sub-mitochondrial localization. We adopted DeepMito to fill the gap, providing complete characterization of protein localization at the sub-mitochondrial level for each protein of the five proteomes. Moreover, we identified novel mitochondrial proteins by screening the set of proteins lacking any subcellular localization annotation with available state-of-the-art subcellular localization predictors. We finally performed additional functional characterization of proteins predicted by DeepMito as localized into the four different sub-mitochondrial compartments using both available experimental and predicted GO terms. All data generated in this study were collected into a database called DeepMitoDB (available at http://busca.biocomp.unibo.it/deepmitodb), providing complete functional characterization of 4307 mitochondrial proteins from the five species.
Conclusions: DeepMitoDB offers a comprehensive view of mitochondrial proteins, including experimental and predicted fine-grain sub-cellular localization and annotated and predicted functional annotations. The database complements other similar resources by providing characterization of new proteins. Furthermore, it is also unique in including localization information at the sub-mitochondrial level. For this reason, we believe that DeepMitoDB can be a valuable resource for mitochondrial research.

12
Pan X, Lu L, Cai YD. Predicting protein subcellular location with network embedding and enrichment features. Biochim Biophys Acta Proteins Proteom 2020; 1868:140477. [PMID: 32593761 DOI: 10.1016/j.bbapap.2020.140477] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/31/2020] [Revised: 06/17/2020] [Accepted: 06/22/2020] [Indexed: 02/06/2023]
Abstract
The subcellular location of a protein is highly related to its function. Identifying the location of a given protein is an essential step for investigating its related problems. Traditional experimental methods can produce solid determinations. However, their limitations, such as high cost and low efficiency, are evident. Computational methods provide an alternative means to address these problems. Most previous methods extract features from protein sequences or structures to build prediction models. In this study, we use two types of features and combine them to construct the model. The first feature type is extracted from a protein-protein interaction network to abstract the relationship between the encoded protein and other proteins. The second type is obtained from gene ontology and biological pathways to indicate the existing functions of the encoded protein. These features are analyzed using several feature selection methods. The final optimum features are adopted to build the model, with a recurrent neural network as the classification algorithm. This model yields good performance, with a Matthews correlation coefficient of 0.844. A decision tree is used as a rule-learning classifier to extract decision rules. Although the performance of the decision rules is poor, they are valuable in revealing the molecular mechanism of proteins with different subcellular locations. The final analysis confirms the reliability of the extracted rules. The source code of the proposed method is freely available at https://github.com/xypan1232/rnnloc.
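For readers unfamiliar with the Matthews correlation coefficient reported above, the following Python sketch computes it for the binary case from confusion-matrix counts. It is only an illustration of the metric: the abstract does not say how the authors computed their value (a multi-class generalization is likely for a multi-location task), and the counts in the example are made up.

```python
import math


def matthews_corrcoef_binary(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical confusion-matrix counts, for illustration only.
example_mcc = matthews_corrcoef_binary(tp=90, tn=85, fp=15, fn=10)
print(round(example_mcc, 3))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), and it stays informative when classes are imbalanced, which is why it is often preferred to plain accuracy in localization benchmarks.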
Affiliation(s)
- Xiaoyong Pan
  - School of Life Sciences, Shanghai University, Shanghai 200444, People's Republic of China; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
- Lin Lu
  - Department of Radiology, Columbia University Medical Center, New York, NY, 10032, USA
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai 200444, People's Republic of China

13
Muthye V, Kandoi G, Lavrov DV. MMPdb and MitoPredictor: Tools for facilitating comparative analysis of animal mitochondrial proteomes. Mitochondrion 2020; 51:118-125. [PMID: 31972373 DOI: 10.1016/j.mito.2020.01.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/02/2019] [Revised: 12/09/2019] [Accepted: 01/02/2020] [Indexed: 11/24/2022]
Abstract
Data on experimentally-characterized animal mitochondrial proteomes (mt-proteomes) are limited to a few model organisms and are scattered across multiple databases, impeding a comparative analysis. We developed two resources to address these problems. First, we re-analyzed proteomic data from six species with experimentally characterized mt-proteomes: animals (Homo sapiens, Mus musculus, Caenorhabditis elegans, and Drosophila melanogaster), and outgroups (Acanthamoeba castellanii and Saccharomyces cerevisiae) and created the Metazoan Mitochondrial Proteome Database (MMPdb) to host the results. Second, we developed a novel pipeline, "MitoPredictor" that uses a Random Forest classifier to infer mitochondrial localization of proteins based on orthology, mitochondrial targeting signal prediction, and protein domain analyses. Both tools generate an R Shiny applet that can be used to visualize and interact with the results and can be used on a personal computer. MMPdb is also available online at https://mmpdb.eeob.iastate.edu/.
Affiliation(s)
- Viraj Muthye
- Bioinformatics and Computational Biology Program, Iowa State University, 2014 Molecular Biology Building, Ames, Iowa 50011, USA; Department of Ecology, Evolution and Organismal Biology, Iowa State University, 251 Bessey Hall, 2200 Osborne Drive, Ames, Iowa 50011, USA.
- Gaurav Kandoi
- Bioinformatics and Computational Biology Program, Iowa State University, 2014 Molecular Biology Building, Ames, Iowa 50011, USA; Department of Electrical and Computer Engineering, Iowa State University, 2520 Osborn Drive, Ames, IA 50011, USA
- Dennis V Lavrov
- Bioinformatics and Computational Biology Program, Iowa State University, 2014 Molecular Biology Building, Ames, Iowa 50011, USA; Department of Ecology, Evolution and Organismal Biology, Iowa State University, 251 Bessey Hall, 2200 Osborne Drive, Ames, Iowa 50011, USA
|
14
|
Nielsen H, Petsalaki EI, Zhao L, Stühler K. Predicting eukaryotic protein secretion without signals. Biochimica et Biophysica Acta - Proteins and Proteomics 2019; 1867:140174. [DOI: 10.1016/j.bbapap.2018.11.011] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 07/01/2018] [Revised: 10/30/2018] [Accepted: 11/29/2018] [Indexed: 10/27/2022]
|
15
|
Yang F, Liu Y, Wang Y, Yin Z, Yang Z. MIC_Locator: a novel image-based protein subcellular location multi-label prediction model based on multi-scale monogenic signal representation and intensity encoding strategy. BMC Bioinformatics 2019; 20:522. [PMID: 31655541 PMCID: PMC6815465 DOI: 10.1186/s12859-019-3136-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 06/14/2019] [Accepted: 10/09/2019] [Indexed: 12/20/2022] Open
Abstract
Background: Protein subcellular localization plays a crucial role in understanding cell function. Proteins need to be in the right place at the right time, and combine with the corresponding molecules, to fulfill their functions. Furthermore, prediction of protein subcellular location not only guides drug design and development by pointing to potential molecular targets but is also essential to genome annotation. Current image-based protein subcellular localization has three common drawbacks: obsolete datasets whose label information has not been updated, stereotypical feature descriptors in the spatial domain or on grey levels, and single-function prediction algorithms that can handle only single-label databases. Results: In this paper, a novel human protein subcellular localization prediction model, MIC_Locator, is proposed. Firstly, the latest datasets are collected and collated as our benchmark dataset, instead of obsolete data, for training the prediction model. Secondly, Fourier transformation, Riesz transformation, a Log-Gabor filter and an intensity coding strategy are employed to obtain frequency features based on the three components of the monogenic signal at different frequency scales. Thirdly, a chained prediction model is proposed to handle multi-label instead of single-label datasets. The experimental results showed that MIC_Locator achieves 60.56% subset accuracy and outperforms most existing prediction models, and that the frequency features and intensity coding strategy help improve classification accuracy. Conclusions: Our results demonstrate that frequency features are more beneficial for improving model performance than features extracted from the spatial domain, and that the MIC_Locator proposed in this paper can speed up validation of protein annotation, knowledge of protein function and proteomics research.
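The subset accuracy reported in the abstract above is a strict multi-label metric: a sample counts as correct only when its predicted label set matches the true label set exactly. A minimal sketch follows. The function name and the example location labels are illustrative assumptions and do not come from the MIC_Locator code or data.

```python
from typing import Iterable, Sequence


def subset_accuracy(
    y_true: Sequence[Iterable[str]], y_pred: Sequence[Iterable[str]]
) -> float:
    """Fraction of samples whose predicted label set equals the true set exactly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of samples")
    if not y_true:
        raise ValueError("at least one sample is required")
    exact_matches = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return exact_matches / len(y_true)


# Hypothetical labels for three images; the second prediction misses one location.
true_labels = [{"nucleus"}, {"cytosol", "mitochondria"}, {"golgi"}]
pred_labels = [{"nucleus"}, {"cytosol"}, {"golgi"}]
print(subset_accuracy(true_labels, pred_labels))
```

Because a partially correct prediction scores zero, subset accuracy is usually much lower than per-label accuracy on the same predictions.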
Affiliation(s)
- Fan Yang
- School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China; Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA, 02115, USA.
- Yang Liu
- School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
- Yanbin Wang
- School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
- Zhijian Yin
- School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
- Zhen Yang
- School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
|
16
|
A Bi-LSTM Based Ensemble Algorithm for Prediction of Protein Secondary Structure. Applied Sciences-Basel 2019. [DOI: 10.3390/app9173538] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/17/2022]
Abstract
The prediction of protein secondary structure continues to be an active area of research in bioinformatics. In this paper, a Bi-LSTM based ensemble model is developed for the prediction of protein secondary structure. The ensemble model with a dual loss function consists of five sub-models, which are finally joined by a Bi-LSTM layer. In contrast to existing ensemble methods, which generally train each sub-model and then join them as a whole, this ensemble model and its sub-models can be trained simultaneously, and the performance of each model can be observed and compared during the training process. Three independent test sets (data1199, the 513-protein Cuff & Barton set (CB513) and 203 proteins from the Critical Assessment of protein Structure Prediction (CASP203)) are employed to test the method. On average, the ensemble model achieved 84.3% Q3 accuracy and an 81.9% segment overlap measure (SOV) score using 10-fold cross-validation. This is an improvement of up to 1% over some state-of-the-art methods for protein secondary structure prediction.
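Q3 accuracy, as used in the abstract above, is the fraction of residues whose three-state secondary-structure label (helix H, strand E or coil C) is predicted correctly. The sketch below illustrates only this per-residue metric, not the segment overlap measure. The function name and the example strings are illustrative assumptions.

```python
def q3_accuracy(true_ss: str, pred_ss: str) -> float:
    """Fraction of residues whose three-state label (H, E or C) is predicted correctly."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("observed and predicted strings must have the same length")
    if not true_ss:
        raise ValueError("at least one residue is required")
    correct = sum(t == p for t, p in zip(true_ss, pred_ss))
    return correct / len(true_ss)


# Hypothetical eight-residue example with one mispredicted residue.
observed = "CHHHHEEC"
predicted = "CHHHCEEC"
print(q3_accuracy(observed, predicted))
```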
|
17
|
Orioli T, Vihinen M. Benchmarking subcellular localization and variant tolerance predictors on membrane proteins. BMC Genomics 2019; 20:547. [PMID: 31307390 PMCID: PMC6631444 DOI: 10.1186/s12864-019-5865-0] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 02/01/2023] Open
Abstract
Background: Membrane proteins constitute up to 30% of the human proteome. These proteins have special properties because the transmembrane segments are embedded in the lipid bilayer while the extramembranous parts are in different environments. Membrane proteins have several functions and are involved in numerous diseases. A large number of prediction methods have been introduced to predict protein subcellular localization as well as the tolerance or pathogenicity of amino acid substitutions. Results: We tested the performance of 22 tolerance predictors by collecting information on membrane proteins and the variants in them. The analysis indicated that the best tools had similar prediction performance on the transmembrane, inside and outside regions of transmembrane proteins, comparable to their overall prediction performance for all types of proteins. PON-P2 had the highest performance, followed by REVEL, MetaSVM and VEST3. Further, using the high-quality dataset, we also tested the performance of seven subcellular localization predictors on membrane proteins. We assessed the performance separately for single-pass and multi-pass membrane proteins. Predictions for multi-pass proteins were more reliable than those for single-pass proteins. Conclusions: The predictors for variant effects performed better than the subcellular localization tools. The best tolerance predictors are highly reliable. As there are large differences in the performance of the tools, end-users have to be cautious in method selection. Electronic supplementary material: The online version of this article (10.1186/s12864-019-5865-0) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Tommaso Orioli
- International Master in Bioinformatics, School of Science, University of Bologna, Bologna, Italy; Department of Experimental Medical Science, BMC B13, Lund University, SE-22184, Lund, Sweden
- Mauno Vihinen
- Department of Experimental Medical Science, BMC B13, Lund University, SE-22184, Lund, Sweden.
|
18
|
Abstract
Background: Revealing the subcellular location of a newly discovered protein can bring insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. Thus, it is highly desirable to develop computational methods for efficiently and effectively identifying protein subcellular locations. In particular, the rapidly increasing number of protein sequences entering the genome databases has called for the development of automated analysis methods. Methods: In this review, we describe recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) protein subcellular location benchmark dataset construction, ii) protein feature representation and feature descriptors, iii) common machine learning algorithms, iv) cross-validation test methods and assessment metrics, v) web servers. Result & Conclusion: Concomitant with the large number of protein sequences generated by high-throughput technologies, four future directions for predicting protein subcellular locations with machine learning deserve attention. One direction is the selection of novel and effective features (e.g., statistical, physicochemical, evolutionary) from the sequences and structures of proteins. Another is the feature fusion strategy. The third is the design of a powerful predictor, and the fourth is the prediction of proteins with multiple location sites.
Affiliation(s)
- Ting-He Zhang
- School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
- Shao-Wu Zhang
- School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
|
19
|
Pankow S, Martínez-Bartolomé S, Bamberger C, Yates JR. Understanding molecular mechanisms of disease through spatial proteomics. Curr Opin Chem Biol 2018; 48:19-25. [PMID: 30308467 DOI: 10.1016/j.cbpa.2018.09.016] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 08/11/2018] [Revised: 09/17/2018] [Accepted: 09/19/2018] [Indexed: 02/07/2023]
Abstract
Mammalian cells are organized into different compartments that separate and facilitate physiological processes by providing specialized local environments and allowing different, otherwise incompatible biological processes to be carried out simultaneously. Proteins are targeted to these subcellular locations where they fulfill specialized, compartment-specific functions. Spatial proteomics aims to localize and quantify proteins within subcellular structures.
Affiliation(s)
- Sandra Pankow
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, United States
- Casimir Bamberger
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, United States
- John R Yates
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, United States.
|
20
|
Savojardo C, Martelli P, Fariselli P, Profiti G, Casadio R. BUSCA: an integrative web server to predict subcellular localization of proteins. Nucleic Acids Res 2018; 46:W459-W466. [PMID: 29718411 PMCID: PMC6031068 DOI: 10.1093/nar/gky320] [Citation(s) in RCA: 280] [Impact Index Per Article: 40.0] [Received: 01/24/2018] [Revised: 04/12/2018] [Accepted: 04/17/2018] [Indexed: 12/28/2022] Open
Abstract
Here, we present BUSCA (http://busca.biocomp.unibo.it), a novel web server that integrates different computational tools for predicting protein subcellular localization. BUSCA combines methods for identifying signal and transit peptides (DeepSig and TPpred3), GPI-anchors (PredGPI) and transmembrane domains (ENSEMBLE3.0 and BetAware) with tools for discriminating subcellular localization of both globular and membrane proteins (BaCelLo, MemLoci and SChloro). Outcomes from the different tools are processed and integrated for annotating subcellular localization of both eukaryotic and bacterial protein sequences. We benchmark BUSCA against protein targets derived from recent CAFA experiments and other specific data sets, reporting state-of-the-art performance. BUSCA scores better than all other evaluated methods on 2732 targets from CAFA2, with an F1 value equal to 0.49, and is among the best methods when predicting targets from CAFA3. We propose BUSCA as an integrated and accurate resource for the annotation of protein subcellular localization.
Affiliation(s)
- Castrense Savojardo
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40100, Italy
- Pier Luigi Martelli
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40100, Italy
- Piero Fariselli
- Department of Comparative Biomedicine and Food Science, University of Padova, Padova 35020, Italy
- Giuseppe Profiti
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40100, Italy
- Institute of Biomembrane, Bioenergetics and Molecular Biotechnologies, Italian National Research Council (CNR), Bari 70126, Italy
- Rita Casadio
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40100, Italy
- Institute of Biomembrane, Bioenergetics and Molecular Biotechnologies, Italian National Research Council (CNR), Bari 70126, Italy
|
21
|
Salvatore M, Shu N, Elofsson A. The SubCons webserver: A user friendly web interface for state-of-the-art subcellular localization prediction. Protein Sci 2018; 27:195-201. [PMID: 28901589 PMCID: PMC5734273 DOI: 10.1002/pro.3297] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 06/29/2017] [Revised: 09/10/2017] [Accepted: 09/11/2017] [Indexed: 12/21/2022]
Abstract
SubCons is a recently developed method that predicts the subcellular localization of a protein. It combines predictions from four predictors using a Random Forest classifier. Here, we present the user-friendly web-interface implementation of SubCons. Starting from a protein sequence, the server rapidly predicts the subcellular localization of an individual protein. In addition, the server accepts the submission of sets of proteins, either by uploading files or programmatically by using command-line WSDL API scripts. This makes SubCons ideal for proteome-wide analyses, allowing the user to scan a whole proteome in a few days. From the web page, it is also possible to download precalculated predictions for several eukaryotic organisms. To evaluate the performance of SubCons, we present a benchmark of LocTree3 and SubCons using two recent mass-spectrometry based datasets of mouse and Drosophila proteins. The server is available at http://subcons.bioinfo.se/.
Affiliation(s)
- M. Salvatore
- Science for Life Laboratory, Stockholm University, 171 21 Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, 106 91 Stockholm, Sweden
- N. Shu
- Science for Life Laboratory, Stockholm University, 171 21 Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, 106 91 Stockholm, Sweden
- Sweden Bioinformatics Infrastructure for Life Sciences (BILS), Stockholm University, Stockholm, Sweden
- A. Elofsson
- Science for Life Laboratory, Stockholm University, 171 21 Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, 106 91 Stockholm, Sweden
|