1
Jang YJ, Qin QQ, Huang SY, Peter ATJ, Ding XM, Kornmann B. Accurate prediction of protein function using statistics-informed graph networks. Nat Commun 2024; 15:6601. [PMID: 39097570 PMCID: PMC11297950 DOI: 10.1038/s41467-024-50955-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 07/15/2024] [Indexed: 08/05/2024] Open
Abstract
Understanding protein function is pivotal to comprehending the mechanisms that underlie many crucial biological activities, with far-reaching implications for medicine, biotechnology, and drug development. However, more than 200 million proteins remain uncharacterized, and computational efforts rely heavily on protein structural information to predict annotations of varying quality. Here, we present PhiGnet, a method that uses statistics-informed graph networks to predict protein function solely from sequence. The method inherently characterizes evolutionary signatures, allowing a quantitative assessment of the significance of residues that carry out specific functions. PhiGnet not only outperforms alternative approaches but also narrows the sequence-function gap, even in the absence of structural information. Our findings indicate that applying deep learning to evolutionary data can highlight functional sites at the residue level, providing valuable support for interpreting both existing properties and new functionalities of proteins in research and biomedicine.
Affiliation(s)
- Yaan J Jang
  - Department of Biochemistry, University of Oxford, Oxford, UK
  - AmoAi Technologies, Oxford, UK
- Qi-Qi Qin
  - AmoAi Technologies, Oxford, UK
  - School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Si-Yu Huang
  - AmoAi Technologies, Oxford, UK
  - Oxford Martin School, University of Oxford, Oxford, UK
  - School of Systems Science, Beijing Normal University, Beijing, China
- Xue-Ming Ding
  - School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Benoît Kornmann
  - Department of Biochemistry, University of Oxford, Oxford, UK
2
Blake KS, Kumar H, Loganathan A, Williford EE, Diorio-Toth L, Xue YP, Tang WK, Campbell TP, Chong DD, Angtuaco S, Wencewicz TA, Tolia NH, Dantas G. Sequence-structure-function characterization of the emerging tetracycline destructase family of antibiotic resistance enzymes. Commun Biol 2024; 7:336. [PMID: 38493211 PMCID: PMC10944477 DOI: 10.1038/s42003-024-06023-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 03/07/2024] [Indexed: 03/18/2024] Open
Abstract
Tetracycline destructases (TDases) are flavin monooxygenases which can confer resistance to all generations of tetracycline antibiotics. The recent increase in the number and diversity of reported TDase sequences enables a deep investigation of the TDase sequence-structure-function landscape. Here, we evaluate the sequence determinants of TDase function through two complementary approaches: (1) constructing profile hidden Markov models to predict new TDases, and (2) using multiple sequence alignments to identify conserved positions important to protein function. Using the HMM-based approach we screened 50 high-scoring candidate sequences in Escherichia coli, leading to the discovery of 13 new TDases. The X-ray crystal structures of two new enzymes from Legionella species were determined, and the ability of anhydrotetracycline to inhibit their tetracycline-inactivating activity was confirmed. Using the MSA-based approach we identified 31 amino acid positions 100% conserved across all known TDase sequences. The roles of these positions were analyzed by alanine-scanning mutagenesis in two TDases, to study the impact on cell and in vitro activity, structure, and stability. These results expand the diversity of TDase sequences and provide valuable insights into the roles of important residues in TDases, and flavin monooxygenases more broadly.
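The MSA-based step described in this abstract (finding positions that are 100% conserved across all sequences) can be illustrated with a minimal sketch. The function name `fully_conserved_columns` and the toy alignment are hypothetical and are not taken from the paper; the sketch assumes the alignment is already computed and that a gap in any sequence disqualifies a column.

```python
from typing import List


def fully_conserved_columns(alignment: List[str], gap: str = "-") -> List[int]:
    """Return 0-based column indices where every sequence has the same non-gap residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    conserved = []
    for col in range(length):
        # Collect the distinct characters observed in this column.
        residues = {seq[col] for seq in alignment}
        # Assumption: a column containing a gap is not counted as conserved.
        if len(residues) == 1 and gap not in residues:
            conserved.append(col)
    return conserved


# Toy alignment (hypothetical sequences, not real TDase data).
toy_alignment = [
    "MKG-AFD",
    "MRGLAFD",
    "MKGLSFD",
]
print(fully_conserved_columns(toy_alignment))  # prints [0, 2, 5, 6]
```

Positions returned this way would be natural candidates for a mutagenesis scan such as the alanine scanning mentioned in the abstract.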
Affiliation(s)
- Kevin S Blake
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
- Hirdesh Kumar
  - Host-Pathogen Interactions and Structural Vaccinology section (HPISV), National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health (NIH), Bethesda, MD, USA
- Anisha Loganathan
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
- Emily E Williford
  - Department of Chemistry, Washington University in St. Louis, St. Louis, MO, USA
- Luke Diorio-Toth
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
- Yao-Peng Xue
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
- Wai Kwan Tang
  - Host-Pathogen Interactions and Structural Vaccinology section (HPISV), National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health (NIH), Bethesda, MD, USA
- Tayte P Campbell
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
- David D Chong
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
- Steven Angtuaco
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
- Timothy A Wencewicz
  - Department of Chemistry, Washington University in St. Louis, St. Louis, MO, USA
- Niraj H Tolia
  - Host-Pathogen Interactions and Structural Vaccinology section (HPISV), National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health (NIH), Bethesda, MD, USA
- Gautam Dantas
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Pathology and Immunology, Division of Laboratory and Genomic Medicine, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Molecular Microbiology, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Biomedical Engineering, Washington University School of Medicine, St. Louis, MO, USA
  - Department of Pediatrics, Washington University School of Medicine, St. Louis, MO, USA
3
Zhang X, Wang L, Liu H, Zhang X, Liu B, Wang Y, Li J. Prot2GO: Predicting GO Annotations From Protein Sequences and Interactions. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:2772-2780. [PMID: 34971539 DOI: 10.1109/tcbb.2021.3139841] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/14/2023]
Abstract
Proteins are the main material basis of living organisms and play a crucial role in life activities. Understanding protein function is of great significance for drug discovery, disease treatment and vaccine development. In recent years, with the widespread application of deep learning in bioinformatics, researchers have proposed many deep learning models to predict protein functions. However, existing deep learning methods usually consider only protein sequences and thus cannot effectively integrate multi-source data to annotate protein functions. In this article, we propose the Prot2GO model, which integrates protein sequence and protein-protein interaction (PPI) network data to predict protein functions. We use an improved biased random walk algorithm to extract features of the PPI network. For sequence data, we use a convolutional neural network to obtain local features of the sequence and a recurrent neural network to capture long-range associations between amino acid residues. Moreover, Prot2GO adopts an attention mechanism to identify protein motifs and structural domains. Experiments show that the Prot2GO model achieves state-of-the-art performance on multiple metrics.
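To make the idea of a biased random walk over a PPI network concrete, the following is a minimal sketch in the style of node2vec, where a return parameter `p` and an in-out parameter `q` bias the choice of the next node. This is an assumption for illustration only: the abstract does not specify the details of the "improved" walk used by Prot2GO, and the graph, function name and parameter values below are hypothetical.

```python
import random
from typing import Dict, List, Optional, Set


def biased_random_walk(
    graph: Dict[str, Set[str]],
    start: str,
    length: int,
    p: float = 1.0,
    q: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Second-order biased walk: 1/p to step back, 1 to stay near, 1/q to move away."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph.get(current, ()))
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # The first step has no previous node, so it is unbiased.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)   # return to the previous node
            elif candidate in graph.get(previous, ()):
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # explore further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical undirected toy PPI network (adjacency sets).
toy_ppi = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}
walk = biased_random_walk(toy_ppi, "A", length=10, p=2.0, q=0.5, rng=random.Random(0))
print(walk)
```

Walks generated this way are typically fed to an embedding model to obtain per-protein network features; that step is outside the scope of this sketch.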
4
Roterman I, Stapor K, Konieczny L. Engagement of intrinsic disordered proteins in protein-protein interaction. Front Mol Biosci 2023; 10:1230922. [PMID: 37583961 PMCID: PMC10423874 DOI: 10.3389/fmolb.2023.1230922] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Accepted: 07/11/2023] [Indexed: 08/17/2023] Open
Abstract
Intrinsically disordered proteins (IDPs) attract the attention of many researchers engaged in protein structure analysis. The main criteria used to identify them are a lack of secondary structure and significant structural variability. This variability takes forms that cannot be resolved by X-ray crystallography. In the present study, different criteria were used to assess the status of IDPs and of their fragments recognized as intrinsically disordered regions (IDRs). The status of the hydrophobic core in proteins identified as IDPs and in their complexes was assessed, as was the status of IDRs as components of the ordered structure that results from the construction of the hydrophobic core. The hydrophobic core is understood here as a structure encompassing the entire molecule: a centrally located high concentration of hydrophobicity surrounded by a shell in which hydrophobicity decreases gradually until it reaches a level close to zero on the protein surface. The model assumes that protein folding follows a micellization pattern that exposes polar residues on the surface while isolating hydrophobic amino acids from the polar aqueous environment. Describing the hydrophobicity distribution in a protein as a 3D Gaussian distribution spanning the protein molecule makes it possible to assess the degree of similarity to the assumed micelle-like distribution and to identify deviations and mismatches between the actual and the idealized distributions. The FOD (fuzzy oil drop) model and its modified version, FOD-M, allow these differences to be assessed quantitatively and related to protein function. In the present work, the IDR sections in protein complexes classified as IDPs are analyzed. The classification "disordered" in the structural sense (lack of secondary structure or high flexibility) does not always entail a mismatch with the structure of the hydrophobic core. In particular, the interface area, which often consists of IDRs, shows in many of the analyzed complexes a hydrophobicity distribution that complies with the idealized distribution, which shows that matching the structure of the hydrophobic core does not require secondary structure ordering.
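The idealized 3D Gaussian hydrophobicity distribution described in this abstract can be sketched numerically. The example below assumes residue positions are already expressed relative to the molecule's center, with one sigma per axis, and it compares distributions with a Kullback-Leibler divergence. The divergence measure, the function names and the toy coordinates are assumptions made here for illustration; the abstract itself only states that the differences are assessed quantitatively.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def idealized_hydrophobicity(coords: Sequence[Point], sigmas: Point) -> List[float]:
    """Evaluate a centered 3D Gaussian at each position and normalize to sum to 1."""
    sx, sy, sz = sigmas
    raw = [
        math.exp(-(x * x) / (2 * sx * sx) - (y * y) / (2 * sy * sy) - (z * z) / (2 * sz * sz))
        for x, y, z in coords
    ]
    total = sum(raw)
    return [value / total for value in raw]


def kl_divergence(observed: Sequence[float], idealized: Sequence[float]) -> float:
    """KL divergence (base 2) of the observed from the idealized distribution."""
    return sum(o * math.log2(o / t) for o, t in zip(observed, idealized) if o > 0)


# Hypothetical residue positions (angstroms, relative to the center) and sigmas.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 6.0, 0.0), (5.0, 5.0, 5.0)]
theoretical = idealized_hydrophobicity(toy_coords, sigmas=(4.0, 4.0, 4.0))

# A made-up "observed" profile that places too much hydrophobicity on the surface.
observed = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(observed, theoretical))
```

A larger divergence indicates a stronger departure from the micelle-like pattern; a value of zero means the observed profile matches the idealized one exactly.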
Affiliation(s)
- Irena Roterman
  - Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Kraków, Poland
- Katarzyna Stapor
  - Department of Applied Informatics, Faculty of Automatic, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Leszek Konieczny
  - Chair of Medical Biochemistry, Medical College, Jagiellonian University, Kraków, Poland
5
Zhang C, Shine M, Pyle AM, Zhang Y. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods 2022; 19:1109-1115. [PMID: 36038728 DOI: 10.1038/s41592-022-01585-1] [Citation(s) in RCA: 164] [Impact Index Per Article: 54.7] [Received: 01/24/2022] [Accepted: 07/19/2022] [Indexed: 11/09/2022]
Abstract
Structure comparison and alignment are of fundamental importance in structural biology studies. We developed the first universal platform, US-align, to uniformly align monomer and complex structures of different macromolecules: proteins, RNAs and DNAs. The pipeline is built on a uniform TM-score objective function coupled with a heuristic alignment searching algorithm. Large-scale benchmarks demonstrated consistent advantages of US-align over state-of-the-art methods in pairwise and multiple structure alignments of different molecules. Detailed analyses showed that the main advantage of US-align lies in the extensive optimization of the unified objective function powered by efficient heuristic search iterations, which substantially improve the accuracy and speed of the structural alignment process. Meanwhile, the universal protocol fusing different molecular and structural types facilitates heterogeneous oligomer structure comparison and template-based protein-protein and protein-RNA/DNA docking.
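The TM-score objective mentioned in this abstract can be written down compactly for the protein case. The sketch below evaluates the standard protein TM-score for one fixed superposition, given the distances between aligned residue pairs. It does not perform the heuristic search over superpositions that an aligner carries out, it does not cover the nucleic-acid variant, and the 0.5 angstrom floor on `d0` for very short chains is an assumption of this sketch. The function names are hypothetical.

```python
from typing import Sequence


def protein_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (angstroms) for proteins."""
    if target_length <= 21:
        # Assumption: floor d0 at 0.5 A for very short chains, where the
        # formula below would otherwise be too small or undefined.
        return 0.5
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score of one superposition: (1/L) * sum of 1 / (1 + (d_i / d0)^2)."""
    d0 = protein_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    # Normalizing by the target length penalizes unaligned residues.
    return total / target_length


# Hypothetical distances (angstroms) for 4 aligned pairs in a 100-residue target.
print(tm_score([0.5, 1.0, 2.0, 8.0], target_length=100))
```

Because each term is bounded by 1 and the sum is divided by the target length, the score lies between 0 and 1, and a perfect full-length match scores exactly 1.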
Affiliation(s)
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  - Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Morgan Shine
  - Yale Combined Program in the Biological and Biomedical Sciences, Yale University, New Haven, CT, USA
- Anna Marie Pyle
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
  - Yale Combined Program in the Biological and Biomedical Sciences, Yale University, New Haven, CT, USA
  - Department of Chemistry, Yale University, New Haven, CT, USA
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
6
Vicedomini R, Bouly JP, Laine E, Falciatore A, Carbone A. Multiple profile models extract features from protein sequence data and resolve functional diversity of very different protein families. Mol Biol Evol 2022; 39:6556147. [PMID: 35353898 PMCID: PMC9016551 DOI: 10.1093/molbev/msac070] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 11/14/2022] Open
Abstract
Functional classification of proteins from sequences alone has become a critical bottleneck in understanding the myriad of protein sequences that accumulate in our databases. The great diversity of homologous sequences hides, in many cases, a variety of functional activities that cannot be anticipated. Their identification appears critical for a fundamental understanding of the evolution of living organisms and for biotechnological applications. ProfileView is a sequence-based computational method designed to functionally classify sets of homologous sequences. It relies on two main ideas: the use of multiple profile models whose construction explores evolutionary information in available databases, and a novel definition of a representation space in which to analyse sequences with multiple profile models combined together. ProfileView classifies protein families by enriching known functional groups with new sequences and discovering new groups and subgroups. We validate ProfileView on seven classes of widespread proteins involved in the interaction with nucleic acids, amino acids and small molecules, and in a large variety of functions and enzymatic reactions. ProfileView agrees with the large set of functional data collected for these proteins from the literature regarding the organisation into functional subgroups and the residues that characterise the functions. In addition, ProfileView resolves undefined functional classifications and extracts the molecular determinants underlying protein functional diversity, showing its potential to select sequences towards accurate experimental design and discovery of novel biological functions. On protein families with complex domain architecture, ProfileView functional classification reconciles domain combinations, unlike phylogenetic reconstruction. ProfileView outperforms the functional classification approach PANTHER, the two k-mer based methods CUPP and eCAMI, and a neural network approach based on Restricted Boltzmann Machines. It overcomes the time complexity limitations of the latter.
Affiliation(s)
- R Vicedomini
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 place Jussieu, 75005 Paris, France
  - Sorbonne Université, Institut des Sciences du Calcul et des Données
- J P Bouly
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 place Jussieu, 75005 Paris, France
  - CNRS, Sorbonne Université Institut de Biologie Physico-Chimique, Laboratory of Chloroplast Biology and Light Sensing in Microalgae - UMR7141, Paris, France
- E Laine
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 place Jussieu, 75005 Paris, France
- A Falciatore
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 place Jussieu, 75005 Paris, France
  - CNRS, Sorbonne Université Institut de Biologie Physico-Chimique, Laboratory of Chloroplast Biology and Light Sensing in Microalgae - UMR7141, Paris, France
- A Carbone
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 place Jussieu, 75005 Paris, France
  - Institut Universitaire de France, Paris 75005, France
7
Wehrspan ZJ, McDonnell RT, Elcock AH. Identification of Iron-Sulfur (Fe-S) Cluster and Zinc (Zn) Binding Sites Within Proteomes Predicted by DeepMind's AlphaFold2 Program Dramatically Expands the Metalloproteome. J Mol Biol 2022; 434:167377. [PMID: 34838520 PMCID: PMC8785651 DOI: 10.1016/j.jmb.2021.167377] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 10/12/2021] [Revised: 11/17/2021] [Accepted: 11/18/2021] [Indexed: 02/01/2023]
Abstract
DeepMind's AlphaFold2 software has ushered in a revolution in high quality, 3D protein structure prediction. In very recent work by the DeepMind team, structure predictions have been made for entire proteomes of twenty-one organisms, with >360,000 structures made available for download. Here we show that thousands of novel binding sites for iron-sulfur (Fe-S) clusters and zinc (Zn) ions can be identified within these predicted structures by exhaustive enumeration of all potential ligand-binding orientations. We demonstrate that AlphaFold2 routinely makes highly specific predictions of ligand binding sites: for example, binding sites that are comprised exclusively of four cysteine sidechains fall into three clusters, representing binding sites for 4Fe-4S clusters, 2Fe-2S clusters, or individual Zn ions. We show further: (a) that the majority of known Fe-S cluster and Zn binding sites documented in UniProt are recovered by the AlphaFold2 structures, (b) that there are occasional disputes between AlphaFold2 and UniProt with AlphaFold2 predicting highly plausible alternative binding sites, (c) that the Fe-S cluster binding sites that we identify in E. coli agree well with previous bioinformatics predictions, (d) that cysteines predicted here to be part of ligand binding sites show little overlap with those shown via chemoproteomics techniques to be highly reactive, and (e) that AlphaFold2 occasionally appears to build erroneous disulfide bonds between cysteines that should instead coordinate a ligand. These results suggest that AlphaFold2 could be an important tool for the functional annotation of proteomes, and the methodology presented here is likely to be useful for predicting other ligand-binding sites.
Affiliation(s)
- Adrian H Elcock
  - Department of Biochemistry, University of Iowa, Iowa City, IA, USA
8
Mansoor M, Nauman M, Ur Rehman H, Benso A. Gene Ontology GAN (GOGAN): a novel architecture for protein function prediction. Soft Comput 2022. [DOI: 10.1007/s00500-021-06707-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/21/2022]
9
Li F, Dong S, Leier A, Han M, Guo X, Xu J, Wang X, Pan S, Jia C, Zhang Y, Webb GI, Coin LJM, Li C, Song J. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform 2021; 23:6415313. [PMID: 34729589 DOI: 10.1093/bib/bbab461] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 08/30/2021] [Revised: 09/27/2021] [Accepted: 10/07/2021] [Indexed: 12/14/2022] Open
Abstract
Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and negative samples may be mislabeled because of the limited sensitivity of the experimental equipment. The positive-unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from a limited number of positive samples and a large number of unlabeled samples (i.e. a mixture of positive and negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications that address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and for further developing next-generation PU learning frameworks for critical biological applications.
Affiliation(s)
- Fuyi Li
  - Monash University, Australia
- André Leier
  - Department of Genetics, UAB School of Medicine, USA
- Meiya Han
  - Department of Biochemistry and Molecular Biology, Monash University, Australia
- Jing Xu
  - Computer Science and Technology, Nankai University, China
- Xiaoyu Wang
  - Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia
- Shirui Pan
  - University of Technology Sydney (UTS), Ultimo, NSW, Australia
- Cangzhi Jia
  - College of Science, Dalian Maritime University, China
- Yang Zhang
  - Northwestern Polytechnical University, China
- Geoffrey I Webb
  - Faculty of Information Technology, Monash University, Australia
- Lachlan J M Coin
  - Department of Clinical Pathology, University of Melbourne, Australia
- Chen Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Australia
- Jiangning Song
  - Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia
10
TwinCons: Conservation score for uncovering deep sequence similarity and divergence. PLoS Comput Biol 2021; 17:e1009541. [PMID: 34714829 PMCID: PMC8580257 DOI: 10.1371/journal.pcbi.1009541] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/16/2021] [Revised: 11/10/2021] [Accepted: 10/06/2021] [Indexed: 11/19/2022] Open
Abstract
We have developed the program TwinCons, to detect noisy signals of deep ancestry of proteins or nucleic acids. As input, the program uses a composite alignment containing pre-defined groups, and mathematically determines a 'cost' of transforming one group to the other at each position of the alignment. The output distinguishes conserved, variable and signature positions. A signature is conserved within groups but differs between groups. The method automatically detects continuous characteristic stretches (segments) within alignments. TwinCons provides a convenient representation of conserved, variable and signature positions as a single score, enabling the structural mapping and visualization of these characteristics. Structure is more conserved than sequence. TwinCons highlights alternative sequences of conserved structures. Using TwinCons, we detected highly similar segments between proteins from the translation and transcription systems. TwinCons detects conserved residues within regions of high functional importance for the ribosomal RNA (rRNA) and demonstrates that signatures are not confined to specific regions but are distributed across the rRNA structure. The ability to evaluate both nucleic acid and protein alignments allows TwinCons to be used in combined sequence and structural analysis of signatures and conservation in rRNA and in ribosomal proteins (rProteins). TwinCons detects a strong sequence conservation signal between bacterial and archaeal rProteins related by circular permutation. This conserved sequence is structurally colocalized with conserved rRNA, indicated by TwinCons scores of rRNA alignments of bacterial and archaeal groups. This combined analysis revealed deep co-evolution of rRNA and rProtein buried within the deepest branching points in the tree of life.
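The distinction this abstract draws between conserved, variable and signature positions can be illustrated with a deliberately simplified sketch. TwinCons itself reports a continuous per-position score derived from a transformation cost between groups; the discrete three-way labeling below, the function name and the toy sequences are assumptions made only to show what "conserved within groups but different between groups" means.

```python
from typing import List


def classify_columns(group_a: List[str], group_b: List[str]) -> List[str]:
    """Label each alignment column as 'conserved', 'signature' or 'variable'."""
    sequences = group_a + group_b
    if not group_a or not group_b:
        raise ValueError("both groups must contain at least one sequence")
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    labels = []
    # zip(*group) iterates over the columns of each group's sub-alignment.
    for col_a, col_b in zip(zip(*group_a), zip(*group_b)):
        set_a, set_b = set(col_a), set(col_b)
        if len(set_a) == 1 and len(set_b) == 1:
            # Invariant inside each group: same residue overall means conserved,
            # different residues between the groups means a signature.
            labels.append("conserved" if set_a == set_b else "signature")
        else:
            labels.append("variable")
    return labels


# Hypothetical two-group composite alignment.
bacterial = ["ACDK", "ACEK"]
archaeal = ["ACDR", "ACFR"]
print(classify_columns(bacterial, archaeal))
# prints ['conserved', 'conserved', 'variable', 'signature']
```

A real scoring scheme would weigh residue similarity rather than exact identity, so that chemically similar substitutions are penalized less than dissimilar ones.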
11
An integrated deep learning and dynamic programming method for predicting tumor suppressor genes, oncogenes, and fusion from PDB structures. Comput Biol Med 2021; 133:104323. [PMID: 33934067 DOI: 10.1016/j.compbiomed.2021.104323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2020] [Revised: 02/18/2021] [Accepted: 03/07/2021] [Indexed: 11/20/2022]
Abstract
Mutations in proto-oncogenes (ONGO) and the loss of regulatory function of tumor suppressor genes (TSG) are the common underlying mechanisms of uncontrolled tumor growth. While cancer is a heterogeneous complex of distinct diseases, determining through computational studies whether a gene's function is related to ONGO or TSG can help develop drugs that target the disease. This paper proposes a classification method that starts with a preprocessing stage to extract feature map sets from the input 3D protein structural information. The next stage is a deep convolutional neural network (DCNN) that outputs the probability of each functional class for a gene. We explored and tested two approaches: in Approach 1, all filtered and cleaned 3D protein structures (PDB) are pooled together, whereas in Approach 2, the primary structures and their corresponding PDBs are separated according to the genes' primary structural information. Following the DCNN stage, a dynamic programming-based method determines the final prediction of the primary structures' functionality. We validated the proposed method using the COSMIC online database. For the ONGO vs TSG classification problem, the AUROCs of the DCNN stage for Approach 1 and Approach 2 are 0.978 and 0.765, respectively. The AUROCs of the final classification of the genes' primary structure functionality for Approach 1 and Approach 2 are 0.989 and 0.879, respectively. For comparison, the current state-of-the-art reported AUROC is 0.924. Our results warrant further study to apply the deep learning models to human (GRCh38) genes, to predict their probabilities of functioning as cancer drivers.
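The AUROC figures quoted in this abstract summarize how well predicted probabilities rank one class above the other. As a minimal sketch of the metric itself (not of the paper's pipeline), AUROC can be computed as the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as half. The labels, scores and the choice of 1 for ONGO and 0 for TSG below are hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # positive ranked above negative
            elif pos_score == neg_score:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = ONGO, 0 = TSG, scores = predicted probability of ONGO.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2]
print(auroc(toy_labels, toy_scores))  # 8 of 9 pairs are ranked correctly
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; a rank-based formulation would be preferable for large evaluation sets.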
12
Mohamed SK. Predicting tissue-specific protein functions using multi-part tensor decomposition. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2019.08.061] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 02/02/2023]
13
Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, Lewis KA, Georghiou G, Nguyen HN, Hamid MN, Davis L, Dogan T, Atalay V, Rifaioglu AS, Dalkıran A, Cetin Atalay R, Zhang C, Hurto RL, Freddolino PL, Zhang Y, Bhat P, Supek F, Fernández JM, Gemovic B, Perovic VR, Davidović RS, Sumonja N, Veljkovic N, Asgari E, Mofrad MRK, Profiti G, Savojardo C, Martelli PL, Casadio R, Boecker F, Schoof H, Kahanda I, Thurlby N, McHardy AC, Renaux A, Saidi R, Gough J, Freitas AA, Antczak M, Fabris F, Wass MN, Hou J, Cheng J, Wang Z, Romero AE, Paccanaro A, Yang H, Goldberg T, Zhao C, Holm L, Törönen P, Medlar AJ, Zosa E, Borukhov I, Novikov I, Wilkins A, Lichtarge O, Chi PH, Tseng WC, Linial M, Rose PW, Dessimoz C, Vidulin V, Dzeroski S, Sillitoe I, Das S, Lees JG, Jones DT, Wan C, Cozzetto D, Fa R, Torres M, Warwick Vesztrocy A, Rodriguez JM, Tress ML, Frasca M, Notaro M, Grossi G, Petrini A, Re M, Valentini G, Mesiti M, Roche DB, Reeb J, Ritchie DW, Aridhi S, Alborzi SZ, Devignes MD, Koo DCE, Bonneau R, Gligorijević V, Barot M, Fang H, Toppo S, Lavezzo E, Falda M, Berselli M, Tosatto SCE, Carraro M, Piovesan D, Ur Rehman H, Mao Q, Zhang S, Vucetic S, Black GS, Jo D, Suh E, Dayton JB, Larsen DJ, Omdahl AR, McGuffin LJ, Brackenridge DA, Babbitt PC, Yunes JM, Fontana P, Zhang F, Zhu S, You R, Zhang Z, Dai S, Yao S, Tian W, Cao R, Chandler C, Amezola M, Johnson D, Chang JM, Liao WH, Liu YW, Pascarelli S, Frank Y, Hoehndorf R, Kulmanov M, Boudellioua I, Politano G, Di Carlo S, Benso A, Hakala K, Ginter F, Mehryary F, Kaewphan S, Björne J, Moen H, Tolvanen MEE, Salakoski T, Kihara D, Jain A, Šmuc T, Altenhoff A, Ben-Hur A, Rost B, Brenner SE, Orengo CA, Jeffery CJ, Bosco G, Hogan DA, Martin MJ, O'Donovan C, Mooney SD, Greene CS, Radivojac P, Friedberg I. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol 2019; 20:244. [PMID: 31744546 PMCID: PMC6864930 DOI: 10.1186/s13059-019-1835-8] [Citation(s) in RCA: 219] [Impact Index Per Article: 36.5] [Received: 05/16/2019] [Accepted: 09/24/2019] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. RESULTS Here, we report on the results of the third CAFA challenge, CAFA3, which featured an expanded analysis over the previous CAFA rounds, both in the volume of data analyzed and in the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster that we suspected of being involved in long-term memory. CONCLUSION We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
Affiliation(s)
- Naihui Zhou
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.,Program in Bioinformatics and Computational Biology, Ames, IA, USA
| | - Yuxiang Jiang
- Indiana University Bloomington, Bloomington, Indiana, USA
| | - Timothy R Bergquist
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
| | - Alexandra J Lee
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Balint Z Kacsoh
- Geisel School of Medicine at Dartmouth, Hanover, NH, USA.,Department of Molecular and Systems Biology, Hanover, NH, USA
| | - Alex W Crocker
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Kimberley A Lewis
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - George Georghiou
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Huy N Nguyen
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.,Program in Computer Science, Ames, IA, USA
| | - Md Nafiz Hamid
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.,Program in Bioinformatics and Computational Biology, Ames, IA, USA
| | - Larry Davis
- Program in Bioinformatics and Computational Biology, Ames, IA, USA
| | - Tunca Dogan
- Department of Computer Engineering, Hacettepe University, Ankara, Turkey.,European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Volkan Atalay
- Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey
| | - Ahmet S Rifaioglu
- Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey.,Department of Computer Engineering, Iskenderun Technical University, Hatay, Turkey
| | - Alperen Dalkıran
- Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey
| | - Rengul Cetin Atalay
- CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Rebecca L Hurto
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Peter L Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.,Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.,Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
| | | | - Fran Supek
- Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain.,Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
| | - José M Fernández
- INB Coordination Unit, Life Sciences Department, Barcelona Supercomputing Center, Barcelona, Catalonia, Spain.,(former) INB GN2, Structural and Computational Biology Programme, Spanish National Cancer Research Centre, Barcelona, Catalonia, Spain
| | - Branislava Gemovic
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Vladimir R Perovic
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Radoslav S Davidović
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Neven Sumonja
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Nevena Veljkovic
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Ehsaneddin Asgari
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering, University of California Berkeley, Berkeley, CA, USA.,Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Berkeley, CA, USA
| | | | - Giuseppe Profiti
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy.,National Research Council, IBIOM, Bologna, Italy
| | - Castrense Savojardo
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Pier Luigi Martelli
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Rita Casadio
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Florian Boecker
- University of Bonn: INRES Crop Bioinformatics, Bonn, North Rhine-Westphalia, Germany
| | - Heiko Schoof
- INRES Crop Bioinformatics, University of Bonn, Bonn, Germany
| | - Indika Kahanda
- Gianforte School of Computing, Montana State University, Bozeman, Montana, USA
| | - Natalie Thurlby
- University of Bristol, Computer Science, Bristol, Bristol, United Kingdom
| | - Alice C McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, Germany.,RESIST, DFG Cluster of Excellence 2155, Brunswick, Germany
| | - Alexandre Renaux
- Interuniversity Institute of Bioinformatics in Brussels, Université libre de Bruxelles - Vrije Universiteit Brussel, Brussels, Belgium.,Machine Learning Group, Université libre de Bruxelles, Brussels, Belgium.,Artificial Intelligence lab, Vrije Universiteit Brussel, Brussels, Belgium
| | - Rabie Saidi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Julian Gough
- MRC Laboratory of Molecular Biology, Cambridge, United Kingdom
| | - Alex A Freitas
- University of Kent, School of Computing, Canterbury, United Kingdom
| | - Magdalena Antczak
- School of Biosciences, University of Kent, Canterbury, Kent, United Kingdom
| | - Fabio Fabris
- University of Kent, School of Computing, Canterbury, United Kingdom
| | - Mark N Wass
- School of Biosciences, University of Kent, Canterbury, Kent, United Kingdom
| | - Jie Hou
- University of Missouri, Computer Science, Columbia, Missouri, USA.,Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
| | - Zheng Wang
- University of Miami, Coral Gables, Florida, USA
| | - Alfonso E Romero
- Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
| | - Alberto Paccanaro
- Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
| | - Haixuan Yang
- School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Galway, Ireland.,Technical University of Munich, Garching, Germany
| | - Tatyana Goldberg
- Department of Informatics, Bioinformatics & Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
| | - Chenguang Zhao
- Faculty for Informatics, Garching, Germany.,Department for Bioinformatics and Computational Biology, Garching, Germany.,School of Computing Sciences and Computer Engineering, Hattiesburg, Mississippi, USA
| | - Liisa Holm
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
| | - Petri Törönen
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
| | - Alan J Medlar
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
| | - Elaine Zosa
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | | | - Ilya Novikov
- Baylor College of Medicine, Department of Biochemistry and Molecular Biology, Houston, TX, USA
| | - Angela Wilkins
- Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX, USA
| | - Olivier Lichtarge
- Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX, USA
| | - Po-Han Chi
- National TsingHua University, Hsinchu, Taiwan
| | - Wei-Cheng Tseng
- Department of Electrical Engineering in National Tsing Hua University, Hsinchu City, Taiwan
| | - Michal Linial
- The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Peter W Rose
- University of California San Diego, San Diego Supercomputer Center, La Jolla, California, USA
| | - Christophe Dessimoz
- Department of Computational Biology and Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland.,Department of Genetics, Evolution & Environment, and Department of Computer Science, University College London, London, UK.,Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Vedrana Vidulin
- Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia
| | - Saso Dzeroski
- Jozef Stefan Institute, Ljubljana, Slovenia.,Jozef Stefan International Postgraduate School, Ljubljana, Slovenia
| | - Ian Sillitoe
- Research Department of Structural and Molecular Biology, University College London, London, England
| | - Sayoni Das
- Research Department of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Jonathan Gill Lees
- Research Department of Structural and Molecular Biology, University College London, London, United Kingdom.,Department of Health and Life Sciences, Oxford Brookes University, London, UK
| | - David T Jones
- The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom.,Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, United Kingdom
| | - Cen Wan
- Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
| | - Domenico Cozzetto
- Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
| | - Rui Fa
- Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
| | - Mateo Torres
- Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
| | - Alex Warwick Vesztrocy
- Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, United Kingdom.,SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
| | - Jose Manuel Rodriguez
- Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid, Spain
| | - Michael L Tress
- Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Marco Frasca
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Marco Notaro
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Giuliano Grossi
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Alessandro Petrini
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Matteo Re
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Giorgio Valentini
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Marco Mesiti
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy.,Institut de Biologie Computationnelle, LIRMM, CNRS-UMR 5506, Universite de Montpellier, Montpellier, France
| | - Daniel B Roche
- Department of Informatics, Bioinformatics and Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
| | - Jonas Reeb
- Department of Informatics, Bioinformatics and Computational Biology-i12, Technische Universitat Munchen, Munich, Germany
| | - David W Ritchie
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France
| | - Sabeur Aridhi
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France
| | | | - Marie-Dominique Devignes
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France.,University of Lorraine, Nancy, Lorraine, France.,Inria, Nancy, France
| | | | - Richard Bonneau
- NYU Center for Data Science, New York, 10010, NY, USA.,Flatiron Institute, CCB, New York, 10010, NY, USA
| | - Vladimir Gligorijević
- Center for Computational Biology (CCB), Flatiron Institute, Simons Foundation, New York, New York, USA
| | - Meet Barot
- Center for Data Science, New York University, New York, 10011, NY, USA
| | - Hai Fang
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Stefano Toppo
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Enrico Lavezzo
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Marco Falda
- Department of Biology, University of Padova, Padova, Italy
| | - Michele Berselli
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Silvio C E Tosatto
- CNR Institute of Neuroscience, Padova, Italy.,Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Marco Carraro
- Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Hafeez Ur Rehman
- Department of Computer Science, National University of Computer and Emerging Sciences, Peshawar, Khyber Pakhtoonkhwa, Pakistan
| | - Qizhong Mao
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA.,University of California, Riverside, Philadelphia, PA, USA
| | - Shanshan Zhang
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Slobodan Vucetic
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Gage S Black
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Dane Jo
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Erica Suh
- Department of Biology, Brigham Young University, Provo, UT, USA
| | - Jonathan B Dayton
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Dallas J Larsen
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Ashton R Omdahl
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Liam J McGuffin
- School of Biological Sciences, University of Reading, Reading, England, United Kingdom
| | | | - Patricia C Babbitt
- Department of Pharmaceutical Chemistry, San Francisco, CA, USA.,Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 94158, CA, USA
| | - Jeffrey M Yunes
- UC Berkeley - UCSF Graduate Program in Bioengineering, University of California, San Francisco, 94158, CA, USA.,Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 94158, CA, USA
| | - Paolo Fontana
- Research and Innovation Center, Edmund Mach Foundation, San Michele all'Adige, Italy
| | - Feng Zhang
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, Shanghai, China.,Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, Shanghai, China
| | - Shanfeng Zhu
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Ronghui You
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Zihan Zhang
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Suyang Dai
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Shuwei Yao
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China
| | - Weidong Tian
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, Shanghai, China.,Department of Pediatrics, Brain Tumor Center, Division of Experimental Hematology and Cancer Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Caleb Chandler
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Miguel Amezola
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Devon Johnson
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Jia-Ming Chang
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | - Wen-Hung Liao
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | - Yi-Wei Liu
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | | | | | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
| | - Maxat Kulmanov
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
| | - Imane Boudellioua
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.,Computer, Electrical and Mathematical Sciences Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Gianfranco Politano
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Stefano Di Carlo
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Alfredo Benso
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Kai Hakala
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland
| | - Filip Ginter
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku, Turku, Finland
| | - Farrokh Mehryary
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland
| | - Suwisa Kaewphan
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland.,Turku Centre for Computer Science (TUCS), Turku, Finland
| | - Jari Björne
- Department of Future Technologies, Faculty of Science and Engineering, University of Turku, Turku, FI-20014, Finland.,Turku Centre for Computer Science (TUCS), Agora, Vesilinnantie 3, Turku, FI-20500, Finland
| | | | | | - Tapio Salakoski
- Department of Future Technologies, Faculty of Science and Engineering, University of Turku, Turku, FI-20014, Finland.,Turku Centre for Computer Science (TUCS), Agora, Vesilinnantie 3, Turku, FI-20500, Finland
| | - Daisuke Kihara
- Department of Biological Sciences, Department of Computer Science, Purdue University, 47907, IN, USA.,Department of Pediatrics, University of Cincinnati, Cincinnati, 45229, OH, USA
| | - Aashish Jain
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
| | - Tomislav Šmuc
- Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia
| | - Adrian Altenhoff
- Department of Computer Science, ETH Zurich, Zurich, Switzerland.,SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | - Burkhard Rost
- Department of Informatics, Bioinformatics & Computational Biology-i12, Technische Universitat Munchen, Munich, Germany.,Institute for Food and Plant Sciences WZW, Technische Universität München, Freising, Germany
| | | | - Christine A Orengo
- Research Department of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Constance J Jeffery
- Biological Sciences, University of Illinois at Chicago, Chicago, Illinois, USA
| | - Giovanni Bosco
- Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Deborah A Hogan
- Geisel School of Medicine at Dartmouth, Hanover, NH, USA.,Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Maria J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Claire O'Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.,Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, Pennsylvania, USA
| | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA.
| | - Iddo Friedberg
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.
| |
|
14
|
Saldaño TE, Tosatto SCE, Parisi G, Fernandez-Alberti S. Network analysis of dynamically important residues in protein structures mediating ligand-binding conformational changes. European Biophysics Journal: EBJ 2019; 48:559-568. [PMID: 31273390 DOI: 10.1007/s00249-019-01384-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 02/11/2019] [Revised: 05/31/2019] [Accepted: 07/01/2019] [Indexed: 11/26/2022]
Abstract
According to the generalized conformational selection model, ligand binding involves the co-existence of at least two conformers with different ligand affinities in dynamical equilibrium. Conformational transitions between them should be guaranteed by the intramolecular vibrational dynamics associated with each conformation. These motions are therefore related to the biological function of a protein. Positions whose mutations alter these vibrations the most can be defined as key positions, that is, dynamically important residues that mediate the ligand-binding conformational change. In a previous study, we showed that these positions are evolutionarily conserved. They correspond to buried aliphatic residues mostly localized in regularly structured regions of the protein, such as β-sheets and α-helices. In the present paper, we perform a network analysis of these key positions for a large dataset of paired protein structures in the ligand-free and ligand-bound forms. We observe that interactions between these key positions form larger and more integrated networks with faster transmission of information. In addition, the resulting residue networks are robust to conformational changes. Our results reveal that the conformational diversity of proteins seems to be guaranteed by a network of strongly interconnected key positions rather than by individual residues.
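The network properties named in this abstract (network size, integration, speed of information transmission) can be illustrated with a minimal sketch. It assumes, purely for illustration, that size is read as the largest connected component and transmission speed as the mean shortest-path length; the abstract does not state which measures the authors used. The residue indices and contacts are hypothetical toy data, and all function names are invented for this example.

```python
from collections import deque


def build_adjacency(nodes, edges):
    """Undirected adjacency map restricted to the given nodes."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a in adj and b in adj:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def largest_component(adj):
    """Node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp = set(bfs_distances(adj, start))
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def mean_path_length(adj, component):
    """Average shortest-path length over all ordered pairs in one component."""
    n = len(component)
    if n < 2:
        return 0.0
    total = 0
    for node in component:
        dist = bfs_distances(adj, node)
        total += sum(dist[other] for other in component)
    return total / (n * (n - 1))


# Hypothetical toy data: residue indices flagged as key positions,
# and pairwise contacts between them (not taken from the paper).
key_positions = [10, 11, 25, 40, 41, 77]
contacts = [(10, 11), (11, 25), (25, 40), (40, 41), (10, 40)]

adj = build_adjacency(key_positions, contacts)
component = largest_component(adj)
avg_len = mean_path_length(adj, component)
print(sorted(component), round(avg_len, 2))
```

Under these assumptions, a larger component and a shorter mean path length would correspond to the "larger and more integrated networks with faster transmission" described above; comparing the two values between ligand-free and ligand-bound structures is one way such a contrast could be made.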
Affiliation(s)
- Tadeo E Saldaño
- Universidad Nacional de Quilmes/CONICET, Roque Saenz Peña 352, B1876BXD, Bernal, Argentina
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padova, Viale G. Colombo 3, 5131, Padua, Italy
| | - Gustavo Parisi
- Universidad Nacional de Quilmes/CONICET, Roque Saenz Peña 352, B1876BXD, Bernal, Argentina
| | | |
|
15
|
Huang G. Computational Models or Methods for Protein Function Prediction. Curr Proteomics 2019. [DOI: 10.2174/157016461605190510114117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Affiliation(s)
- Guohua Huang
- Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang, Hunan 422000, China
| |
|
16
|
Taha K, Iraqi Y, Al Aamri A. Predicting protein functions by applying predicate logic to biomedical literature. BMC Bioinformatics 2019; 20:71. [PMID: 30736739 PMCID: PMC6368809 DOI: 10.1186/s12859-019-2594-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/11/2018] [Accepted: 01/03/2019] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND A large number of computational methods have been proposed for predicting protein functions. Most of these methods predict the functions of an unannotated protein p from already annotated proteins that have similar characteristics to p. Recent Information Extraction methods take advantage of the huge growth of biomedical literature to predict protein functions. They extract biological molecule terms that directly describe protein functions from biomedical texts. However, they consider only explicitly mentioned terms that co-occur with proteins in texts. We observe that some important biological molecule terms pertaining to functional categories may implicitly co-occur with proteins in texts. Therefore, methods that rely solely on explicitly mentioned terms may miss vital functional information that is only implicit in the texts. RESULTS To overcome this limitation, we propose in this paper an Information Extraction system called PL-PPF. The proposed system predicts the functions of proteins based on their co-occurrences with explicitly and implicitly mentioned biological molecule terms that pertain to functional categories in biomedical literature. That is, PL-PPF combines statistical-based explicit term extraction techniques with logic-based implicit term extraction techniques. The statistical component of PL-PPF predicts some of the functions of a protein by extracting the explicitly mentioned functional terms that directly describe the functions of the protein from the biomedical texts associated with the protein. The logic-based component of PL-PPF predicts additional functions of the protein by inferring the functional terms that co-occur implicitly with the protein in the biomedical texts associated with it. First, the system employs its statistical-based component to extract the explicitly mentioned functional terms. Then, it employs its logic-based component to infer additional functions of the protein. Our hypothesis is that important biological molecule terms pertaining to functional categories of proteins are likely to co-occur implicitly with the proteins in biomedical texts. We evaluated PL-PPF experimentally and compared it with five systems. Results revealed better prediction performance. CONCLUSIONS The experimental results showed that PL-PPF outperformed the other five systems. This is an indication of the effectiveness and practical viability of PL-PPF's combination of explicit and implicit techniques. We also evaluated two versions of PL-PPF: one adopting the complete techniques (i.e., both the implicit and explicit techniques) and the other adopting only the explicit term co-occurrence extraction techniques (i.e., without the inference rules for predicate logic). The experimental results showed that the complete version significantly outperformed the other version. This is attributed to the effectiveness of the rules of predicate logic in inferring functional terms that co-occur implicitly with proteins in biomedical texts. A demo application of PL-PPF can be accessed through the following link: http://ecesrvr.kustar.ac.ae:8080/plppf/.
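The two-stage idea described in this abstract (explicit term extraction first, logical inference second) can be sketched roughly as follows. This is not PL-PPF's implementation: whitespace tokenization and simple forward chaining over propositional rules are swapped in for the statistical and predicate-logic components, whose details the abstract does not give. The protein names, sentences, vocabulary, and rules are all hypothetical.

```python
from collections import Counter


def explicit_terms(protein, sentences, vocabulary, min_count=1):
    """Stage 1 (stand-in): functional terms sharing a sentence with the protein."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        if protein.lower() in tokens:
            # Count each vocabulary term once per co-occurring sentence.
            counts.update(tokens & vocabulary)
    return {term for term, count in counts.items() if count >= min_count}


def infer_implicit(terms, rules):
    """Stage 2 (stand-in): forward-chain rules of the form (premises, conclusion)."""
    known = set(terms)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    # Return only the terms that were inferred rather than observed.
    return known - set(terms)


# Hypothetical toy corpus, vocabulary, and rules (not from the paper).
sentences = [
    "abc1 binds atp and shows kinase activity",
    "abc1 kinase localizes to the membrane",
    "xyz2 binds dna",
]
vocabulary = {"kinase", "atp", "dna", "membrane"}
rules = [
    (frozenset({"kinase", "atp"}), "phosphorylation"),
    (frozenset({"phosphorylation", "membrane"}), "signaling"),
]

explicit = explicit_terms("abc1", sentences, vocabulary)
implicit = infer_implicit(explicit, rules)
print(sorted(explicit), sorted(implicit))
```

In this toy run the second rule fires only after the first has added its conclusion, which illustrates why an inference stage can recover terms that never appear next to the protein in the text.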
Affiliation(s)
- Kamal Taha
- Department of Electrical and Computer Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
| | - Youssef Iraqi
- Department of Electrical and Computer Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
| | - Amira Al Aamri
- Department of Electrical and Computer Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
| |
|
17
|
Hu G, Wang K, Song J, Uversky VN, Kurgan L. Taxonomic Landscape of the Dark Proteomes: Whole-Proteome Scale Interplay Between Structural Darkness, Intrinsic Disorder, and Crystallization Propensity. Proteomics 2018; 18:e1800243. [PMID: 30198635 DOI: 10.1002/pmic.201800243] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 07/04/2018] [Revised: 08/30/2018] [Indexed: 12/14/2022]
Abstract
The growth rate of the protein sequence universe dramatically exceeds the speed of expansion of the protein structure universe, generating an immense dark proteome that includes proteins with unknown structure. A whole-proteome scale analysis of 5.4 million proteins from 987 proteomes in the three domains of life and viruses is performed to systematically dissect the interplay between structural coverage, degree of putative intrinsic disorder, and predicted propensity for structure determination. It has been found that Archaean and Bacterial proteomes have relatively high structural coverage and low amounts of disorder, whereas Eukaryotic and Viral proteomes are characterized by a broad spread of structural coverage and higher disorder levels. The analysis reveals that dark proteomes (i.e., proteomes containing high fractions of proteins with unknown structure) have significantly elevated amounts of intrinsic disorder and are predicted to be difficult to solve structurally. Although the majority of dark proteomes are of viral origin, many dark viral proteomes have at least modest crystallization propensity and only a handful of them are enriched in intrinsic disorder. Disorder, structural coverage, and propensity for structure determination are mapped onto a novel proteome-level sequence similarity network to analyze the interplay of these characteristics in the taxonomic landscape.
Affiliation(s)
- Gang Hu
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, P. R. China
| | - Kui Wang
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, P. R. China
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Vladimir N Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, 33612, USA; Institute for Biological Instrumentation, Russian Academy of Sciences, Pushchino, 142290, Russia
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
| |
|
18
|
Fodeh SJ, Tiwari A. Exploiting MEDLINE for gene molecular function prediction via NMF based multi-label classification. J Biomed Inform 2018; 86:160-166. [PMID: 30130573 DOI: 10.1016/j.jbi.2018.08.009] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 08/18/2017] [Revised: 08/13/2018] [Accepted: 08/17/2018] [Indexed: 11/25/2022]
Abstract
Gene ontology (GO) provides a representation of terms and categories used to describe genes and their molecular functions, cellular components and biological processes. GO has been the standard for describing the functions of specific genes in different model organisms. GO annotation, or the tagging of genes with GO terms, has mostly been a manual and time-consuming curation process. Although many automated approaches have been proposed for annotation, few have utilized knowledge available in the literature. In this manuscript, we describe the development and evaluation of an innovative predictive system to automatically assign molecular functions (GO terms) to genes using the biomedical literature. Because genes could be associated with multiple molecular functions, we posed the GO molecular function annotation as a multi-label classification problem with several classes. We used non-negative matrix factorization (NMF) for feature reduction and then classified the genes. To address the multi-label aspect of the data, we used the binary-relevance method. Although we experimented with several classifiers, the combination of binary-relevance and K-nearest neighbor (KNN) classifier performed best. Our evaluation on UniProtKB/Swiss-Prot dataset showed the best performance of 0.84 in terms of F1-measure.
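The binary-relevance plus K-nearest-neighbor combination mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the feature vectors are assumed to be already reduced (the NMF step is omitted), Euclidean distance and simple majority voting are arbitrary choices, and both function names are hypothetical.

```python
import math


def knn_predict(train_X, train_y, x, k=3):
    """Binary KNN: majority vote among the k nearest training vectors."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = sum(train_y[i] for i in nearest)
    # Predict 1 only when strictly more than half of the neighbours are positive.
    return 1 if 2 * votes > len(nearest) else 0


def binary_relevance_predict(train_X, train_Y, x, k=3):
    """Binary relevance: one independent binary classifier per label column."""
    n_labels = len(train_Y[0])
    return [
        knn_predict(train_X, [row[j] for row in train_Y], x, k)
        for j in range(n_labels)
    ]
```

Each label (here, each GO molecular-function term) is decided on its own, so a gene can receive any number of labels, including none.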
Affiliation(s)
- Samah Jamal Fodeh
- Yale Center for Medical Informatics, Yale University, 300 George st, Suite 501, New Haven, CT 06511, United States.
| | | |
|
19
|
Zhang C, Zheng W, Freddolino PL, Zhang Y. MetaGO: Predicting Gene Ontology of Non-homologous Proteins Through Low-Resolution Protein Structure Prediction and Protein-Protein Network Mapping. J Mol Biol 2018. [PMID: 29534977 DOI: 10.1016/j.jmb.2018.03.004] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Indexed: 11/18/2022]
Abstract
Homology-based transferal remains the major approach to computational protein function annotation, but it becomes increasingly unreliable when the sequence identity between query and template decreases below 30%. We propose a novel pipeline, MetaGO, to deduce Gene Ontology attributes of proteins by combining sequence homology-based annotation with low-resolution structure prediction and comparison, and partner's homology-based protein-protein network mapping. The pipeline was tested on a large-scale set of 1000 non-redundant proteins from the CAFA3 experiment. Under the stringent benchmark conditions where templates with >30% sequence identity to the query are excluded, MetaGO achieves average F-measures of 0.487, 0.408, and 0.598 for Molecular Function, Biological Process, and Cellular Component, respectively, which are significantly higher than those achieved by other state-of-the-art function annotation methods. Detailed data analysis shows that the major advantage of MetaGO lies in the new functional homolog detections from partner's homology-based network mapping and structure-based local and global structure alignments, the confidence scores of which can be optimally combined through logistic regression. These data demonstrate the power of using a hybrid model incorporating protein structure and interaction networks to deduce new functional insights beyond traditional sequence homology-based referrals, especially for proteins that lack homologous function templates. The MetaGO pipeline is available at http://zhanglab.ccmb.med.umich.edu/MetaGO/.
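The idea of combining per-source confidence scores through logistic regression can be sketched for a single GO term as below. The function name, the weights, and the bias are hypothetical placeholders; in practice they would be fitted on training data, and the values used by MetaGO are not given in the abstract.

```python
import math


def combine_scores(scores, weights, bias):
    """Logistic combination of per-source confidence scores for one GO term.

    `scores` might hold, for example, sequence-, structure-, and network-based
    confidences; `weights` and `bias` are assumed to come from a fitted model.
    """
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))
```

With all weights positive, raising any single source's score raises the combined confidence, and the output always stays in the open interval (0, 1).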
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Peter L Freddolino
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA.
| |
|
20
|
ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network. Molecules 2017; 22:molecules22101732. [PMID: 29039790 PMCID: PMC6151571 DOI: 10.3390/molecules22101732] [Citation(s) in RCA: 116] [Impact Index Per Article: 14.5] [Received: 08/30/2017] [Revised: 10/11/2017] [Accepted: 10/11/2017] [Indexed: 11/25/2022] Open
Abstract
With the development of next generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from protein sequences because of limitations of traditional biological experimental techniques. Protein function prediction has been a long-standing challenge in filling the gap between the huge number of protein sequences and their known functions. In this paper, we propose a novel method that converts the protein function prediction problem into a language translation problem, from the newly proposed protein sequence language “ProLan” to the protein function language “GOLan”, and build a neural machine translation model based on recurrent neural networks to translate the “ProLan” language into the “GOLan” language. We blindly tested our method by participating in the third Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluated the performance of our method on selected proteins whose functions were released after the CAFA competition. The good performance on the training and testing datasets demonstrates that our newly proposed method is a promising direction for protein function prediction. In summary, we propose for the first time a method that converts the protein function prediction problem into a language translation problem and applies a neural machine translation model to protein function prediction.
|
21
|
Ur Rehman H, Azam N, Yao J, Benso A. A three-way approach for protein function classification. PLoS One 2017; 12:e0171702. [PMID: 28234929 PMCID: PMC5325230 DOI: 10.1371/journal.pone.0171702] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/01/2016] [Accepted: 01/06/2017] [Indexed: 12/04/2022] Open
Abstract
The knowledge of protein functions plays an essential role in understanding biological cells and has a significant impact on human life in areas such as personalized medicine, better crops and improved therapeutic interventions. Due to the expense and inherent difficulty of biological experiments, intelligent methods are generally relied upon for automatic assignment of functions to proteins. The technological advancements in the field of biology are improving our understanding of biological processes and are regularly resulting in new features and characteristics that better describe the role of proteins. It is inevitable to neglect and overlook these anticipated features in designing more effective classification techniques. A key issue in this context, which is not being sufficiently addressed, is how to build effective classification models and approaches for protein function prediction by incorporating and taking advantage of the ever-evolving biological information. In this article, we propose a three-way decision making approach which provides provisions for seeking and incorporating future information. We considered probabilistic rough sets based models such as Game-Theoretic Rough Sets (GTRS) and Information-Theoretic Rough Sets (ITRS) for inducing three-way decisions. An architecture of protein function classification with probabilistic rough sets based three-way decisions is proposed and explained. Experiments are carried out on a Saccharomyces cerevisiae dataset obtained from the UniProt database with the corresponding functional classes extracted from the Gene Ontology (GO) database. The results indicate that as the level of biological information increases, the number of deferred cases is reduced while maintaining a similar level of accuracy.
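A three-way decision in the probabilistic rough set sense can be sketched as a pair of thresholds on the estimated probability that a protein belongs to a functional class: accept above the upper threshold, reject below the lower one, and defer in between. The thresholds below are fixed, hypothetical values; in the article they are induced by models such as GTRS and ITRS, which this sketch does not implement.

```python
def three_way_decision(prob, alpha=0.7, beta=0.3):
    """Return 'accept', 'reject', or 'defer' for a class-membership probability.

    alpha and beta are illustrative defaults, not values from the article.
    """
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")
    if prob >= alpha:
        return "accept"
    if prob <= beta:
        return "reject"
    # Evidence is inconclusive: postpone the decision until more information arrives.
    return "defer"
```

Deferred cases are the ones expected to shrink as more biological information becomes available, since additional evidence tends to push probabilities toward one of the two thresholds.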
Affiliation(s)
- Hafeez Ur Rehman
- Department of Computer Science, National University of Computer and Emerging Sciences, Peshawar, Pakistan
| | - Nouman Azam
- Department of Computer Science, National University of Computer and Emerging Sciences, Peshawar, Pakistan
| | - JingTao Yao
- Department of Computer Science, University of Regina, Regina, SK S4S 0A2, Canada
| | - Alfredo Benso
- Department of Computer & Control Engineering, Politecnico di Torino, I-10129, Torino, Italy
| |
|
22
|
Abstract
A biological experiment is the most reliable way of assigning function to a protein. However, in the era of high-throughput sequencing, scientists are unable to carry out experiments to determine the function of every single gene product. Therefore, to gain insights into the activity of these molecules and guide experiments, we must rely on computational means to functionally annotate the majority of sequence data. To understand how well these algorithms perform, we have established a challenge involving a broad scientific community in which we evaluate different annotation methods according to their ability to predict the associations between previously unannotated protein sequences and Gene Ontology terms. Here we discuss the rationale, benefits, and issues associated with evaluating computational methods in an ongoing community-wide challenge.
|
23
|
Huang G, Chu C, Huang T, Kong X, Zhang Y, Zhang N, Cai YD. Exploring Mouse Protein Function via Multiple Approaches. PLoS One 2016; 11:e0166580. [PMID: 27846315 PMCID: PMC5112993 DOI: 10.1371/journal.pone.0166580] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 08/14/2016] [Accepted: 10/31/2016] [Indexed: 01/16/2023] Open
Abstract
Although the number of available protein sequences is growing exponentially, functional protein annotations lag far behind. Therefore, accurate identification of protein functions remains one of the major challenges in molecular biology. In this study, we presented a novel approach to predict mouse protein functions. The approach was a sequential combination of a similarity-based approach, an interaction-based approach and a pseudo amino acid composition-based approach. The method achieved an accuracy of about 0.8450 for the 1st-order predictions in the leave-one-out and ten-fold cross-validations. For the results yielded by the leave-one-out cross-validation, although the similarity-based approach alone achieved an accuracy of 0.8756, it was unable to predict the functions of proteins with no homologues. Comparatively, the pseudo amino acid composition-based approach alone reached an accuracy of 0.6786. Although the accuracy was lower than that of the previous approach, it could predict the functions of almost all proteins, even proteins with no homologues. Therefore, the combined method balanced the advantages and disadvantages of both approaches to achieve efficient performance. Furthermore, the results yielded by the ten-fold cross-validation indicate that the combined method is still effective and stable when no close homologs are available. However, the accuracy of the predicted functions can only be determined according to known protein functions based on current knowledge. Many protein functions remain unknown. By exploring the functions of proteins for which the 1st-order predicted functions are wrong but the 2nd-order predicted functions are correct, the 1st-order wrongly predicted functions were shown to be closely associated with the genes encoding the proteins. The so-called wrongly predicted functions could also potentially be correct upon future experimental verification. Therefore, the accuracy of the presented method may be much higher in reality.
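The sequential combination described in this abstract can be read as a fallback cascade: try the most accurate predictor first and move to the next one only when it returns nothing. The sketch below assumes that reading; the function name and the predictor interface (a callable returning a possibly empty list of functions) are hypothetical, and the individual predictors are left as placeholders.

```python
def cascade_predict(protein, predictors):
    """Return (predictor_name, functions) from the first predictor that answers.

    `predictors` is an ordered list of (name, callable) pairs, for example
    similarity-based, then interaction-based, then composition-based.
    """
    for name, predict in predictors:
        functions = predict(protein)
        if functions:
            return name, functions
    # No predictor produced an answer.
    return None, []
```

Ordering the predictors from most to least accurate keeps the higher accuracy for proteins with homologues while still covering proteins for which only the last, most general predictor can answer.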
Affiliation(s)
- Guohua Huang
- Department of Mathematics, Shaoyang University, Shaoyang, Hunan, 422000, China
| | - Chen Chu
- Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Tao Huang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Xiangyin Kong
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Yunhua Zhang
- College of Life Science, Anhui Agricultural University, Hefei, Anhui, 230036, China
| | - Ning Zhang
- Department of Biomedical Engineering, Tianjin Key Lab of Biomedical Engineering Measurement, Tianjin University, Tianjin, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, 99 Shangda Road, Shanghai, 200444, China
| |
|
24
|
Du Y, Wu NC, Jiang L, Zhang T, Gong D, Shu S, Wu TT, Sun R. Annotating Protein Functional Residues by Coupling High-Throughput Fitness Profile and Homologous-Structure Analysis. mBio 2016; 7:e01801-16. [PMID: 27803181 PMCID: PMC5090041 DOI: 10.1128/mbio.01801-16] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 09/27/2016] [Accepted: 10/07/2016] [Indexed: 11/28/2022] Open
Abstract
Identification and annotation of functional residues are fundamental questions in protein sequence analysis. Sequence and structure conservation provides valuable information to tackle these questions. It is, however, limited by the incomplete sampling of sequence space in natural evolution. Moreover, proteins often have multiple functions, with overlapping sequences that present challenges to accurate annotation of the exact functions of individual residues by conservation-based methods. Using the influenza A virus PB1 protein as an example, we developed a method to systematically identify and annotate functional residues. We used saturation mutagenesis and high-throughput sequencing to measure the replication capacity of single nucleotide mutations across the entire PB1 protein. After predicting protein stability upon mutations, we identified functional PB1 residues that are essential for viral replication. To further annotate the functional residues important to the canonical or noncanonical functions of viral RNA-dependent RNA polymerase (vRdRp), we performed a homologous-structure analysis with 16 different vRdRp structures. We achieved high sensitivity in annotating the known canonical polymerase functional residues. Moreover, we identified a cluster of noncanonical functional residues located in the loop region of the PB1 β-ribbon. We further demonstrated that these residues were important for PB1 protein nuclear import through the interaction with Ran-binding protein 5. In summary, we developed a systematic and sensitive method to identify and annotate functional residues that are not restrained by sequence conservation. Importantly, this method is generally applicable to other proteins about which homologous-structure information is available. IMPORTANCE To fully comprehend the diverse functions of a protein, it is essential to understand the functionality of individual residues. 
Current methods are highly dependent on evolutionary sequence conservation, which is usually limited by sampling size. Sequence conservation-based methods are further confounded by structural constraints and multifunctionality of proteins. Here we present a method that can systematically identify and annotate functional residues of a given protein. We used a high-throughput functional profiling platform to identify essential residues. Coupling it with homologous-structure comparison, we were able to annotate multiple functions of proteins. We demonstrated the method with the PB1 protein of influenza A virus and identified novel functional residues in addition to its canonical function as an RNA-dependent RNA polymerase. Not limited to virology, this method is generally applicable to other proteins that can be functionally selected and about which homologous-structure information is available.
Affiliation(s)
- Yushen Du
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, Los Angeles, California, USA
- Cancer Institute, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, ZJU-UCLA Joint Center for Medical Education and Research, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
| | - Nicholas C Wu
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, Los Angeles, California, USA
- Molecular Biology Institute, University of California Los Angeles, Los Angeles, California, USA
| | - Lin Jiang
- Department of Neurology, University of California Los Angeles, Los Angeles, California, USA
| | - Tianhao Zhang
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, Los Angeles, California, USA
- Molecular Biology Institute, University of California Los Angeles, Los Angeles, California, USA
| | - Danyang Gong
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, Los Angeles, California, USA
| | - Sara Shu
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, Los Angeles, California, USA
| | - Ting-Ting Wu
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, Los Angeles, California, USA
| | - Ren Sun
- Department of Molecular and Medical Pharmacology, University of California Los Angeles, Los Angeles, California, USA
- Cancer Institute, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, ZJU-UCLA Joint Center for Medical Education and Research, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
- Molecular Biology Institute, University of California Los Angeles, Los Angeles, California, USA
| |
|
25
|
Saldaño TE, Monzon AM, Parisi G, Fernandez-Alberti S. Evolutionary Conserved Positions Define Protein Conformational Diversity. PLoS Comput Biol 2016; 12:e1004775. [PMID: 27008419 PMCID: PMC4805271 DOI: 10.1371/journal.pcbi.1004775] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 08/12/2015] [Accepted: 01/27/2016] [Indexed: 12/18/2022] Open
Abstract
Conformational diversity of the native state plays a central role in modulating protein function. The selection paradigm holds that different ligands shift the conformational equilibrium by binding to the highest-affinity conformers. Intramolecular vibrational dynamics associated with each conformation should guarantee conformational transitions, which, owing to their importance, could be associated with evolutionarily conserved traits. Normal mode analysis, based on a coarse-grained model of the protein, can provide the required information to explore these features. Herein, we present a novel procedure to identify key positions sustaining the conformational diversity associated with ligand binding. The method is applied to an adequately refined dataset of 188 paired protein structures in their bound and unbound forms. Firstly, the normal modes most involved in the conformational change are selected according to their overlap with the structural distortions introduced by ligand binding. The subspace defined by these modes is used to analyze the effect of simulated point mutations on preserving the conformational diversity of the protein. We find a negative correlation between the effects of mutations on these normal mode subspaces associated with ligand binding and position-specific evolutionary conservations obtained from multiple sequence-structure alignments. Positions whose mutations alter these subspaces the most are defined as key positions, that is, dynamically important residues that mediate the ligand-binding conformational change. These positions are shown to be evolutionarily conserved, mostly buried aliphatic residues localized in regular structural regions of the protein such as β-sheets and α-helices.
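The selection of normal modes by their overlap with a ligand-induced distortion can be illustrated with the commonly used overlap measure, the absolute cosine between a mode vector and the displacement vector between the unbound and bound structures. This is a sketch under that assumption; the abstract does not spell out the exact formula used, and the function name and flat coordinate layout are hypothetical.

```python
import math


def mode_overlap(mode, displacement):
    """Absolute cosine between a normal mode and a structural displacement.

    Both arguments are flat sequences of equal length (x, y, z per residue).
    A value of 1 means the mode points exactly along the displacement,
    0 means it is orthogonal to it.
    """
    dot = sum(m * d for m, d in zip(mode, displacement))
    norm_mode = math.sqrt(sum(m * m for m in mode))
    norm_disp = math.sqrt(sum(d * d for d in displacement))
    return abs(dot) / (norm_mode * norm_disp)
```

Ranking modes by this value and keeping the top few gives a subspace of the kind the abstract describes, on which the effect of simulated mutations can then be assessed.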
|
26
|
Cao R, Cheng J. Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks. Methods 2016; 93:84-91. [PMID: 26370280 PMCID: PMC4894840 DOI: 10.1016/j.ymeth.2015.09.011] [Citation(s) in RCA: 66] [Impact Index Per Article: 7.3] [Received: 03/31/2015] [Revised: 09/03/2015] [Accepted: 09/10/2015] [Indexed: 11/30/2022] Open
Abstract
MOTIVATIONS Protein function prediction is an important and challenging problem in bioinformatics and computational biology. Functionally relevant biological information such as protein sequences, gene expression, and protein-protein interactions has been used mostly separately for protein function prediction. One of the major challenges is how to effectively integrate multiple sources of both traditional and new information such as spatial gene-gene interaction networks generated from chromosomal conformation data together to improve protein function prediction. RESULTS In this work, we developed three different probabilistic scores (MIS, SEQ, and NET score) to combine protein sequence, function associations, and protein-protein interaction and spatial gene-gene interaction networks for protein function prediction. The MIS score is mainly generated from homologous proteins found by PSI-BLAST search, and also association rules between Gene Ontology terms, which are learned by mining the Swiss-Prot database. The SEQ score is generated from protein sequences. The NET score is generated from protein-protein interaction and spatial gene-gene interaction networks. These three scores were combined in a new Statistical Multiple Integrative Scoring System (SMISS) to predict protein function. We tested SMISS on the data set of 2011 Critical Assessment of Function Annotation (CAFA). The method performed substantially better than three base-line methods and an advanced method based on protein profile-sequence comparison, profile-profile comparison, and domain co-occurrence networks according to the maximum F-measure.
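The maximum F-measure used for evaluation here can be sketched in its protein-centric CAFA form: for each score threshold, precision is averaged over proteins with at least one prediction at that threshold, recall is averaged over all benchmark proteins, and the best harmonic mean over thresholds is reported. The data layout (dictionaries keyed by protein) and the threshold grid are assumptions made for illustration, and every benchmark protein is assumed to have at least one true term.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric maximum F-measure.

    predictions: {protein: {go_term: score in [0, 1]}}
    truth:       {protein: non-empty set of true go_terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with a prediction at this threshold.
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best
```

Because the threshold is chosen after the fact, this measure summarizes the best achievable trade-off of a scoring method rather than its performance at any fixed cut-off.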
Affiliation(s)
- Renzhi Cao
- Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA
| | - Jianlin Cheng
- Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA.
| |
|
27
|
Chagoyen M, García-Martín JA, Pazos F. Practical analysis of specificity-determining residues in protein families. Brief Bioinform 2015; 17:255-61. [DOI: 10.1093/bib/bbv045] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 04/01/2015] [Accepted: 06/15/2015] [Indexed: 12/17/2022] Open
|
28
|
Taha K, Yoo PD, Alzaabi M. iPFPi: A System for Improving Protein Function Prediction through Cumulative Iterations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:825-836. [PMID: 26357323 DOI: 10.1109/tcbb.2014.2344681] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
We propose a classifier system called iPFPi that predicts the functions of un-annotated proteins. iPFPi assigns an un-annotated protein P the functions of GO annotation terms that are semantically similar to P. An un-annotated protein P and a GO annotation term T are represented by their characteristics. The characteristics of P are GO terms found within the abstracts of biomedical literature associated with P. The characteristics of T are GO terms found within the abstracts of biomedical literature associated with the proteins annotated with the function of T. Let F and F′ be the important (dominant) sets of characteristic terms representing T and P, respectively. iPFPi would annotate P with the function of T if F and F′ are semantically similar. We constructed a novel semantic similarity measure that takes into consideration several factors, such as the dominance degree of each characteristic term t in set F based on its score, which is a value that reflects the dominance status of t relative to other characteristic terms, computed using a pairwise beats-and-loses procedure. Every time a protein P is annotated with the function of T, iPFPi updates and optimizes the current scores of the characteristic terms for T based on the weights of the characteristic terms for P. Set F will be updated accordingly. Thus, the accuracy of predicting the function of T as the function of subsequent proteins improves. This prediction accuracy keeps improving over time iteratively through the cumulative weights of the characteristic terms representing proteins that are successively annotated with the function of T. We evaluated the quality of iPFPi by comparing it experimentally with two recent protein function prediction systems. Results showed marked improvement.
|
29
|
Sahraeian SM, Luo KR, Brenner SE. SIFTER search: a web server for accurate phylogeny-based protein function prediction. Nucleic Acids Res 2015; 43:W141-7. [PMID: 25979264 PMCID: PMC4489292 DOI: 10.1093/nar/gkv461] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 02/14/2015] [Accepted: 04/27/2015] [Indexed: 12/26/2022] Open
Abstract
We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access to precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. The SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.
Affiliation(s)
- Sayed M Sahraeian
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
| | - Kevin R Luo
- Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
| | - Steven E Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA; Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA; Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| |
|
30
|
Mills CL, Beuning PJ, Ondrechen MJ. Biochemical functional predictions for protein structures of unknown or uncertain function. Comput Struct Biotechnol J 2015; 13:182-91. [PMID: 25848497 PMCID: PMC4372640 DOI: 10.1016/j.csbj.2015.02.003] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.2] [Received: 12/03/2014] [Revised: 02/06/2015] [Accepted: 02/11/2015] [Indexed: 01/07/2023] Open
Abstract
With the exponential growth in the determination of protein sequences and structures via genome sequencing and structural genomics efforts, there is a growing need for reliable computational methods to determine the biochemical function of these proteins. This paper reviews the efforts to address the challenge of annotating the function at the molecular level of uncharacterized proteins. While sequence- and three-dimensional-structure-based methods for protein function prediction have been reviewed previously, the recent trends in local structure-based methods have received less attention. These local structure-based methods are the primary focus of this review. Computational methods have been developed to predict the residues important for catalysis and the local spatial arrangements of these residues can be used to identify protein function. In addition, the combination of different types of methods can help obtain more information and better predictions of function for proteins of unknown function. Global initiatives, including the Enzyme Function Initiative (EFI), COMputational BRidges to EXperiments (COMBREX), and the Critical Assessment of Function Annotation (CAFA), are evaluating and testing the different approaches to predicting the function of proteins of unknown function. These initiatives and global collaborations will increase the capability and reliability of methods to predict biochemical function computationally and will add substantial value to the current volume of structural genomics data by reducing the number of absent or inaccurate functional annotations.
Affiliation(s)
- Caitlyn L Mills
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, United States
- Penny J Beuning
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, United States
- Mary Jo Ondrechen
- Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, United States
|
31
|
Text as data: using text-based features for proteins representation and for computational prediction of their characteristics. Methods 2014; 74:54-64. [PMID: 25448299 DOI: 10.1016/j.ymeth.2014.10.027] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 04/12/2014] [Revised: 09/21/2014] [Accepted: 10/21/2014] [Indexed: 11/21/2022] Open
Abstract
The current era of large-scale biology is characterized by a fast-paced growth in the number of sequenced genomes and, consequently, by a multitude of identified proteins whose function has yet to be determined. Simultaneously, any known or postulated information concerning genes and proteins is part of the ever-growing published scientific literature, which is expanding at a rate of over a million new publications per year. Computational tools that attempt to automatically predict and annotate protein characteristics, such as function and localization patterns, are being developed along with systems that aim to support the process via text mining. Most work on protein characterization focuses on features derived directly from protein sequence data. Protein-related work that does aim to utilize the literature typically concentrates on extracting specific facts (e.g., protein interactions) from text. In the past few years we have taken a different route, treating the literature as a source of text-based features, which can be employed just as sequence-based protein-features were used in earlier work, for predicting protein subcellular location and possibly also function. We discuss here in detail the overall approach, along with results from work we have done in this area demonstrating the value of this method and its potential use.
|
32
|
EXIA2: web server of accurate and rapid protein catalytic residue prediction. Biomed Res Int 2014; 2014:807839. [PMID: 25295274 PMCID: PMC4177735 DOI: 10.1155/2014/807839] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/21/2014] [Revised: 05/27/2014] [Accepted: 06/11/2014] [Indexed: 11/18/2022]
Abstract
We propose a method (EXIA2) of catalytic residue prediction based on protein structure without needing homology information. The method is based on the special side chain orientation of catalytic residues. We found that the side chain of catalytic residues usually points to the center of the catalytic site. The special orientation is usually observed in catalytic residues but not in noncatalytic residues, which usually have random side chain orientation. The method is shown to be the most accurate catalytic residue prediction method currently when combined with PSI-Blast sequence conservation. It performs better than other competing methods on several benchmark datasets that include over 1,200 enzyme structures. The areas under the ROC curve (AUC) on these benchmark datasets are in the range from 0.934 to 0.968.
|
33
|
Computational prediction of protein function based on weighted mapping of domains and GO terms. Biomed Res Int 2014; 2014:641469. [PMID: 24868539 PMCID: PMC4017789 DOI: 10.1155/2014/641469] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/21/2013] [Accepted: 03/12/2014] [Indexed: 11/17/2022]
Abstract
In this paper, we propose a novel method, SeekFun, to predict protein function based on weighted mapping of domains and GO terms. Firstly, a weighted mapping of domains and GO terms is constructed according to GO annotations and domain composition of the proteins. The association strength between domain and GO term is weighted by symmetrical conditional probability. Secondly, the mapping is extended along the true paths of the terms based on GO hierarchy. Finally, the terms associated with resident domains are transferred to host protein and real annotations of the host protein are determined by association strengths. Our careful comparisons demonstrate that SeekFun outperforms the concerned methods on most occasions. SeekFun provides a flexible and effective way for protein function prediction. It benefits from the well-constructed mapping of domains and GO terms, as well as the reasonable strategy for inferring annotations of protein from those of its domains.
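The three steps named in this abstract (weighting, true-path extension, transfer) can be illustrated with a minimal sketch. It is not the SeekFun implementation. It assumes the usual definition of symmetrical conditional probability, SCP(d, t) = P(d, t)^2 / (P(d) P(t)), estimated from co-occurrence counts. The function names, the max-based propagation to ancestor terms, and the score threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product


def scp_weights(proteins):
    """Weight each (domain, GO term) pair by symmetrical conditional probability.

    proteins: iterable of (set_of_domains, set_of_go_terms) pairs.
    SCP(d, t) = P(d, t)^2 / (P(d) * P(t)); the protein count cancels out,
    so raw co-occurrence counts are enough.
    """
    dom_count, term_count, pair_count = Counter(), Counter(), Counter()
    for domains, terms in proteins:
        dom_count.update(domains)
        term_count.update(terms)
        pair_count.update(product(domains, terms))
    return {
        (d, t): c * c / (dom_count[d] * term_count[t])
        for (d, t), c in pair_count.items()
    }


def extend_true_path(weights, parents):
    """Propagate each weight to all ancestors of its term (assumed rule: keep the max).

    parents: dict mapping a GO term to its direct parent terms.
    """
    extended = dict(weights)
    for (d, t), w in weights.items():
        stack, seen = list(parents.get(t, ())), set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            extended[(d, ancestor)] = max(extended.get((d, ancestor), 0.0), w)
            stack.extend(parents.get(ancestor, ()))
    return extended


def predict_terms(domains, weights, threshold=0.5):
    """Transfer terms from resident domains to the host protein.

    Keeps the strongest association per term and drops terms scoring
    below the (hypothetical) threshold.
    """
    scores = {}
    for (d, t), w in weights.items():
        if d in domains:
            scores[t] = max(scores.get(t, 0.0), w)
    return {t: s for t, s in scores.items() if s >= threshold}
```

Given a small annotated set, `scp_weights` followed by `extend_true_path` yields the extended mapping, and `predict_terms` applies it to the domain set of an unannotated protein. How association strengths are combined across several domains is left open by the abstract; the max used here is only one option.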
|
34
|
Chen YH, Chiang YH, Ma HI. Analysis of spatial and temporal protein expression in the cerebral cortex after ischemia-reperfusion injury. J Clin Neurol 2014; 10:84-93. [PMID: 24829593 PMCID: PMC4017024 DOI: 10.3988/jcn.2014.10.2.84] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 04/03/2013] [Revised: 09/24/2013] [Accepted: 09/26/2013] [Indexed: 01/26/2023] Open
Abstract
Background and Purpose: Hypoxia, or ischemia, is a common cause of neurological deficits in the elderly. This study elucidated the mechanisms underlying ischemia-induced brain injury that results in neurological sequelae. Methods: Cerebral ischemia was induced in male Sprague-Dawley rats by transient ligation of the left carotid artery followed by 60 min of hypoxia. A two-dimensional differential proteome analysis was performed using matrix-assisted laser desorption ionization-time-of-flight mass spectrometry to compare changes in protein expression on the lesioned side of the cortex relative to that on the contralateral side at 0, 6, and 24 h after ischemia. Results: The expressions of the following five proteins were up-regulated in the ipsilateral cortex at 24 h after ischemia-reperfusion injury compared to the contralateral (i.e., control) side: aconitase 2, neurotensin-related peptide, hypothetical protein XP-212759, 60-kDa heat-shock protein, and aldolase A. The expression of one protein, dynamin-1, was up-regulated only at the 6-h time point. The level of 78-kDa glucose-regulated protein precursor on the lesioned side of the cerebral cortex was found to be high initially, but then down-regulated by 24 h after the induction of ischemia-reperfusion injury. The expressions of several metabolic enzymes and translational factors were also perturbed soon after brain ischemia. Conclusions: These findings provide insights into the mechanisms underlying the neurodegenerative events that occur following cerebral ischemia.
Affiliation(s)
- Yuan-Hao Chen
- Department of Neurological Surgery, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan, ROC
- Yung-Hsiao Chiang
- Section of Neurosurgery, Department of Surgery, Taipei Medical University Hospital, Taipei Medical University, Taipei, Taiwan, ROC
- Hsin-I Ma
- Department of Neurological Surgery, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan, ROC
|
35
|
Sandler I, Zigdon N, Levy E, Aharoni A. The functional importance of co-evolving residues in proteins. Cell Mol Life Sci 2014; 71:673-82. [PMID: 23995987 PMCID: PMC11113390 DOI: 10.1007/s00018-013-1458-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 05/12/2013] [Revised: 07/26/2013] [Accepted: 08/13/2013] [Indexed: 10/26/2022]
Abstract
Computational approaches for detecting co-evolution in proteins allow for the identification of protein-protein interaction networks in different organisms and the assignment of function to under-explored proteins. The detection of co-variation of amino acids within or between proteins, moreover, allows for the discovery of residue-residue contacts and highlights functional residues that can affect the binding affinity, catalytic activity, or substrate specificity of a protein. To explore the functional impact of co-evolutionary changes in proteins, a combined experimental and computational approach must be recruited. Here, we review recent studies that apply computational and experimental tools to obtain novel insight into the structure, function, and evolution of proteins. Specifically, we describe the application of co-evolutionary analysis for predicting high-resolution three-dimensional structures of proteins. In addition, we describe computational approaches followed by experimental analysis for identifying specificity-determining residues in proteins. Finally, we discuss studies addressing the importance of such residues in terms of the functional divergence of proteins, allowing proteins to evolve new functions while avoiding crosstalk with existing cellular pathways or forming reproductive barriers and hence promoting speciation.
Affiliation(s)
- Inga Sandler
- Department of Life Sciences, Ben-Gurion University of the Negev, 84105 Be’er Sheva, Israel
- Nitzan Zigdon
- Department of Life Sciences, Ben-Gurion University of the Negev, 84105 Be’er Sheva, Israel
- Efrat Levy
- Department of Life Sciences, Ben-Gurion University of the Negev, 84105 Be’er Sheva, Israel
- Amir Aharoni
- Department of Life Sciences, Ben-Gurion University of the Negev, 84105 Be’er Sheva, Israel
- National Institute for Biotechnology in the Negev (NIBN), Ben-Gurion University of the Negev, 84105 Be’er Sheva, Israel
|
36
|
In silico prediction of structure and functions for some proteins of male-specific region of the human Y chromosome. Interdiscip Sci 2014; 5:258-69. [PMID: 24402818 DOI: 10.1007/s12539-013-0178-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2012] [Revised: 09/03/2012] [Accepted: 11/08/2012] [Indexed: 10/25/2022]
Abstract
Male-specific region of the human Y chromosome (MSY) comprises 95% of its length that is functionally active. This portion inherits in block from father to male offspring. Most of the genes in the MSY region are involved in male-specific function, such as sex determination and spermatogenesis; also contains genes probably involved in other cellular functions. However, a detailed characterization of numerous MSY-encoded proteins still remains to be done. In this study, 12 uncharacterized proteins of MSY were analyzed through bioinformatics tools for structural and functional characterization. Within these 12 proteins, a total of 55 domains were found, with DnaJ domain signature corresponding to be the highest (11%) followed by both FAD-dependent pyridine nucleotide reductase signature and fumarate lyase superfamily signature (9%). The 3D structures of our selected proteins were built up using homology modeling and the protein threading approaches. These predicted structures confirmed in detail the stereochemistry; indicating reasonably good quality model. Furthermore the predicted functions and the proteins with whom they interact established their biological role and their mechanism of action at molecular level. The results of these structure-functional annotations provide a comprehensive view of the proteins encoded by MSY, which sheds light on their biological functions and molecular mechanisms. The data presented in this study may assist in future prognosis of several human diseases such as Turner syndrome, gonadal sex reversal, spermatogenic failure, and gonadoblastoma.
|
37
|
Shoemaker B, Wuchty S, Panchenko AR. Computational large-scale mapping of protein-protein interactions using structural complexes. Curr Protoc Protein Sci 2013; 73:3.9.1-3.9.9. [PMID: 24510594 PMCID: PMC3920302 DOI: 10.1002/0471140864.ps0309s73] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/30/2022]
Abstract
Although the identification of protein interactions by high-throughput methods progresses at a fast pace, "interactome" datasets still suffer from high rates of false positives and low coverage. To map the interactome of any organism, this unit presents a computational framework to predict protein-protein or gene-gene interactions utilizing experimentally determined evidence of structural complexes, atomic details of binding interfaces and evolutionary conservation.
Affiliation(s)
- Benjamin Shoemaker
- National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland
- Stefan Wuchty
- National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland
- Anna R Panchenko
- National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland
|
38
|
Stefl S, Nishi H, Petukh M, Panchenko AR, Alexov E. Molecular mechanisms of disease-causing missense mutations. J Mol Biol 2013; 425:3919-36. [PMID: 23871686 DOI: 10.1016/j.jmb.2013.07.014] [Citation(s) in RCA: 209] [Impact Index Per Article: 17.4] [Received: 05/06/2013] [Revised: 07/04/2013] [Accepted: 07/10/2013] [Indexed: 12/23/2022]
Abstract
Genetic variations resulting in a change of amino acid sequence can have a dramatic effect on stability, hydrogen bond network, conformational dynamics, activity and many other physiologically important properties of proteins. The substitutions of only one residue in a protein sequence, so-called missense mutations, can be related to many pathological conditions and may influence susceptibility to disease and drug treatment. The plausible effects of missense mutations range from affecting the macromolecular stability to perturbing macromolecular interactions and cellular localization. Here we review the individual cases and genome-wide studies that illustrate the association between missense mutations and diseases. In addition, we emphasize that the molecular mechanisms of effects of mutations should be revealed in order to understand the disease origin. Finally, we report the current state-of-the-art methodologies that predict the effects of mutations on protein stability, the hydrogen bond network, pH dependence, conformational dynamics and protein function.
Affiliation(s)
- Shannon Stefl
- Computational Biophysics and Bioinformatics, Department of Physics, Clemson University, Clemson, SC 29634, USA
|
39
|
Employing directed evolution for the functional analysis of multi-specific proteins. Bioorg Med Chem 2013; 21:3511-6. [DOI: 10.1016/j.bmc.2013.04.052] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 02/04/2013] [Revised: 04/11/2013] [Accepted: 04/18/2013] [Indexed: 01/17/2023]
|
40
|
Gu X, Zou Y, Su Z, Huang W, Zhou Z, Arendsee Z, Zeng Y. An update of DIVERGE software for functional divergence analysis of protein family. Mol Biol Evol 2013; 30:1713-9. [PMID: 23589455 DOI: 10.1093/molbev/mst069] [Citation(s) in RCA: 146] [Impact Index Per Article: 12.2] [Indexed: 12/19/2022] Open
Abstract
DIVERGE is a software system for phylogeny-based analyses of protein family evolution and functional divergence. It provides a suite of statistical tools for selection and prioritization of the amino acid sites that are responsible for the functional divergence of a gene family. The synergistic efforts of DIVERGE and other methods have convincingly demonstrated that the pattern of rate change at a particular amino acid site may contain insightful information about the underlying functional divergence following gene duplication. These predicted sites may be used as candidates for further experiments. We are now releasing an updated version of DIVERGE with the following improvements: 1) a feasible approach to examining functional divergence in nearly complete sequences by including deletions and insertions (indels); 2) the calculation of the false discovery rate of functionally diverging sites; 3) estimation of the effective number of functional divergence-related sites that is reliable and insensitive to cutoffs; 4) a statistical test for asymmetric functional divergence; and 5) a new method to infer functional divergence specific to a given duplicate cluster. In addition, we have made efforts to improve software design and produce a well-written software manual for the general user.
Affiliation(s)
- Xun Gu
- State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai, China.
|
41
|
Wong A, Shatkay H. Protein function prediction using text-based features extracted from the biomedical literature: the CAFA challenge. BMC Bioinformatics 2013; 14 Suppl 3:S14. [PMID: 23514326 PMCID: PMC3584852 DOI: 10.1186/1471-2105-14-s3-s14] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Advances in sequencing technology over the past decade have resulted in an abundance of sequenced proteins whose function is yet unknown. As such, computational systems that can automatically predict and annotate protein function are in demand. Most computational systems use features derived from protein sequence or protein structure to predict function. In an earlier work, we demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. We have also shown that the combination of text-based and sequence-based prediction improves the performance of location predictors. Following up on this work, for the Critical Assessment of Function Annotations (CAFA) Challenge, we developed a text-based system that aims to predict molecular function and biological process (using Gene Ontology terms) for unannotated proteins. In this paper, we present the preliminary work and evaluation that we performed for our system, as part of the CAFA challenge. RESULTS: We have developed a preliminary system that represents proteins using text-based features and predicts protein function using a k-nearest neighbour classifier (Text-KNN). We selected text features for our classifier by extracting key terms from biomedical abstracts based on their statistical properties. The system was trained and tested using 5-fold cross-validation over a dataset of 36,536 proteins. System performance was measured using the standard measures of precision, recall, F-measure and overall accuracy. The performance of our system was compared to two baseline classifiers: one that assigns function based solely on the prior distribution of protein function (Base-Prior) and one that assigns function based on sequence similarity (Base-Seq). The overall prediction accuracy of Text-KNN, Base-Prior, and Base-Seq for molecular function classes are 62%, 43%, and 58% while the overall accuracy for biological process classes are 17%, 11%, and 28% respectively. Results obtained as part of the CAFA evaluation itself on the CAFA dataset are reported as well. CONCLUSIONS: Our evaluation shows that the text-based classifier consistently outperforms the baseline classifier that is based on prior distribution, and typically has comparable performance to the baseline classifier that uses sequence similarity. Moreover, the results suggest that combining text features with other types of features can potentially lead to improved prediction performance. The preliminary results also suggest that while our text-based classifier can be used to predict both molecular function and biological process in which a protein is involved, the classifier performs significantly better for predicting molecular function than for predicting biological process. A similar trend was observed for other classifiers participating in the CAFA challenge.
Affiliation(s)
- Andrew Wong
- Computational Biology and Machine Learning Lab, School of Computing, Queen's University, Kingston, ON, K7L 3N6, Canada
|
42
|
Lopez D, Pazos F. Concomitant prediction of function and fold at the domain level with GO-based profiles. BMC Bioinformatics 2013; 14 Suppl 3:S12. [PMID: 23514233 PMCID: PMC3584904 DOI: 10.1186/1471-2105-14-s3-s12] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/02/2022] Open
Abstract
Predicting the function of newly sequenced proteins is crucial due to the pace at which these raw sequences are being obtained. Almost all resources for predicting protein function assign functional terms to whole chains, and do not distinguish which particular domain is responsible for the allocated function. This is not a limitation of the methodologies themselves but it is due to the fact that in the databases of functional annotations these methods use for transferring functional terms to new proteins, these annotations are done on a whole-chain basis. Nevertheless, domains are the basic evolutionary and often functional units of proteins. In many cases, the domains of a protein chain have distinct molecular functions, independent from each other. For that reason resources with functional annotations at the domain level, as well as methodologies for predicting function for individual domains adapted to these resources are required. We present a methodology for predicting the molecular function of individual domains, based on a previously developed database of functional annotations at the domain level. The approach, which we show outperforms a standard method based on sequence searches in assigning function, concomitantly predicts the structural fold of the domains and can give hints on the functionally important residues associated to the predicted function.
Affiliation(s)
- Daniel Lopez
- Computational Systems Biology Group, National Centre for Biotechnology (CNB-CSIC), C/ Darwin 3, 28049 Madrid, Spain
|
43
|
Skolnick J, Zhou H, Gao M. Are predicted protein structures of any value for binding site prediction and virtual ligand screening? Curr Opin Struct Biol 2013; 23:191-7. [PMID: 23415854 DOI: 10.1016/j.sbi.2013.01.009] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 12/19/2012] [Revised: 01/04/2013] [Accepted: 01/23/2013] [Indexed: 01/03/2023]
Abstract
The recently developed field of ligand homology modeling (LHM) that extends the ideas of protein homology modeling to the prediction of ligand binding sites and for use in virtual ligand screening has emerged as a powerful new approach. Unlike traditional docking methodologies, LHM can be applied to low-to-moderate resolution predicted as well as experimental structures with little if any diminution in performance; thereby enabling ≈ 75% of an average proteome to have potentially significant virtual screening predictions. In large scale benchmarking, LHM is able to predict off-target ligand binding. Thus, despite the widespread belief to the contrary, low-to-moderate resolution predicted structures have considerable utility for biochemical function prediction.
Affiliation(s)
- Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, 250 14th Street NW, Atlanta, GA 30318, USA.
|
44
|
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Törönen P, Nokso-Koivisto J, Holm L, Cozzetto D, Buchan DWA, Bryson K, Jones DT, Limaye B, Inamdar H, Datta A, Manjari SK, Joshi R, Chitale M, Kihara D, Lisewski AM, Erdin S, Venner E, Lichtarge O, Rentzsch R, Yang H, Romero AE, Bhat P, Paccanaro A, Hamp T, Kaßner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Björne J, Salakoski T, Wong A, Shatkay H, Gatzmann F, Sommer I, Wass MN, Sternberg MJE, Škunca N, Supek F, Bošnjak M, Panov P, Džeroski S, Šmuc T, Kourmpetis YAI, van Dijk ADJ, ter Braak CJF, Zhou Y, Gong Q, Dong X, Tian W, Falda M, Fontana P, Lavezzo E, Di Camillo B, Toppo S, Lan L, Djuric N, Guo Y, Vucetic S, Bairoch A, Linial M, Babbitt PC, Brenner SE, Orengo C, Rost B, Mooney SD, Friedberg I. A large-scale evaluation of computational protein function prediction. Nat Methods 2013; 10:221-7. [PMID: 23353650 PMCID: PMC3584181 DOI: 10.1038/nmeth.2340] [Citation(s) in RCA: 625] [Impact Index Per Article: 52.1] [Received: 04/02/2012] [Accepted: 12/10/2012] [Indexed: 01/03/2023]
Abstract
A report on the results of the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.
Affiliation(s)
- Predrag Radivojac
- School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
|
45
|
Sandler I, Abu-Qarn M, Aharoni A. Protein co-evolution: how do we combine bioinformatics and experimental approaches? Mol Biosyst 2012; 9:175-81. [PMID: 23151606 DOI: 10.1039/c2mb25317h] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 01/27/2023]
Abstract
Molecular co-evolution is manifested by compensatory changes in proteins designed to enable adaptation to their natural environment. In recent years, bioinformatics approaches allowed for the detection of co-evolution at the level of the whole protein or of specific residues. Such efforts enabled prediction of protein-protein interactions, functional assignments of proteins and the identification of interacting residues, thereby providing information on protein structure. Still, despite such advances, relatively little is known regarding the functional implications of sequence divergence resulting from protein co-evolution. While bioinformatics approaches usually analyze thousands of proteins to obtain a broad view of protein co-evolution, experimental evaluation of protein co-evolution serves to study only individual proteins. In this review, we describe recent advances in bioinformatics and experimental efforts aimed at examining protein co-evolution. Accordingly, we discuss possible modes of crosstalk between the bioinformatics and experimental approaches to facilitate the identification of co-evolutionary signals in proteins and to understand their implications for the structure and function of proteins.
Affiliation(s)
- Inga Sandler
- Department of Life Sciences, Ben-Gurion University of the Negev, Be'er Sheva 84105, Israel
|
46
|
Abstract
Functional characterization of genes and their protein products is essential to biological and clinical research. Yet, there is still no reliable way of assigning functional annotations to proteins in a high-throughput manner. In this article, the authors provide an introduction to the task of automated protein function prediction. They discuss the motivation for automated protein function prediction, the challenges faced in this task, and some of the approaches that are currently available. In particular, they take a closer look at methods that use protein-protein interactions for protein function prediction, elaborating on their underlying techniques and assumptions, as well as their strengths and limitations.
|
47
|
Structural analysis of hypothetical proteins from Helicobacter pylori: an approach to estimate functions of unknown or hypothetical proteins. Int J Mol Sci 2012; 13:7109-7137. [PMID: 22837682 PMCID: PMC3397514 DOI: 10.3390/ijms13067109] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 03/09/2012] [Revised: 05/29/2012] [Accepted: 06/01/2012] [Indexed: 12/12/2022] Open
Abstract
Helicobacter pylori (H. pylori) has a unique ability to survive in extremely acidic environments and to colonize the gastric mucosa. It can cause diverse gastric diseases such as peptic ulcers, chronic gastritis, mucosa-associated lymphoid tissue (MALT) lymphoma, and gastric cancer. Based on genomic research of H. pylori, over 1600 genes have been functionally identified so far. However, H. pylori possesses some genes that remain uncharacterized because: (i) the gene sequences are quite new; (ii) the function of the genes has not been characterized in any other bacterial system; and (iii) sometimes, a protein that is classified as a known protein based on sequence homology shows some functional ambiguity, which raises questions about the function of the protein produced in H. pylori. Thus, many genes still remain to be biologically or biochemically characterized to understand the whole picture of gene functions in the bacterium. In this regard, knowledge of the 3D structure of a protein, especially an unknown or hypothetical protein, is frequently useful to elucidate the structure-function relationship of the uncharacterized gene product. That is, a structural comparison with known proteins provides valuable information to help predict the cellular functions of hypothetical proteins. Here, we show the 3D structures of some hypothetical proteins determined by NMR spectroscopy and X-ray crystallography as a part of the structural genomics of H. pylori. In addition, we show some successful approaches to elucidating the function of unknown proteins based on their structural information.
|
48
|
Analyzing effects of naturally occurring missense mutations. Comput Math Methods Med 2012; 2012:805827. [PMID: 22577471 PMCID: PMC3346971 DOI: 10.1155/2012/805827] [Citation(s) in RCA: 90] [Impact Index Per Article: 6.9] [Received: 12/21/2011] [Revised: 02/01/2012] [Accepted: 02/01/2012] [Indexed: 11/17/2022]
Abstract
A single-point mutation in the genome, for example a single-nucleotide polymorphism (SNP) or a rare genetic mutation, is the change of a single nucleotide for another in the genome sequence. Some such mutations produce an amino acid substitution in the corresponding protein sequence (missense mutations); others do not. This paper focuses on genetic mutations that change the amino acid sequence of the corresponding protein and on how to assess their effects on the protein's wild-type characteristics. The existing methods and approaches for predicting the effects of mutations on protein stability, structure, and dynamics are outlined and discussed with respect to their underlying principles. Available resources, either as stand-alone applications or webservers, are pointed out as well. It is emphasized that understanding the molecular mechanisms behind the effects of these missense mutations is of critical importance for detecting disease-causing mutations. The paper also provides several examples of the application of 3D structure-based methods to model the effects of missense mutations on protein stability and protein-protein interactions.
|
49
|
Sinha R, Kundrotas PJ, Vakser IA. Protein docking by the interface structure similarity: how much structure is needed? PLoS One 2012; 7:e31349. [PMID: 22348074 PMCID: PMC3278447 DOI: 10.1371/journal.pone.0031349] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 11/15/2011] [Accepted: 01/08/2012] [Indexed: 11/19/2022] Open
Abstract
The increasing availability of co-crystallized protein-protein complexes provides an opportunity to use template-based modeling for protein-protein docking. Structure alignment techniques are useful in detecting remote target-template similarities. The size of the structure involved in the alignment is important for success in modeling. This paper describes a systematic large-scale study to find the optimal definition/size of the interfaces for structure alignment-based docking applications. The results showed that structural areas corresponding to cutoff values <12 Å across the interface inadequately represent structural details of the interfaces. When the cutoff was increased beyond 12 Å, the success rate for the benchmark set of 99 protein complexes did not increase significantly for higher-accuracy models, and decreased for lower-accuracy models. The 12 Å cutoff was optimal in our interface alignment-based docking, and is a likely best choice for large-scale (e.g., on the scale of the entire genome) applications to protein interaction networks. The results provide guidelines for docking approaches, including high-throughput applications to modeled structures.
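As an illustration of what a distance-cutoff interface definition can look like, the following is a minimal sketch, not the authors' implementation. It assumes the interface is the set of atoms in each chain lying within the cutoff of any atom in the partner chain; the function name `interface_atoms` and the toy coordinates are hypothetical, and the abstract does not state whether the cutoff is applied per atom or per residue.

```python
import math

def interface_atoms(chain_a, chain_b, cutoff=12.0):
    """Return sorted indices of atoms in each chain that lie within
    `cutoff` angstroms of at least one atom in the other chain.

    Assumption: a plain all-pairs distance test; real pipelines would
    typically use a spatial index and work on residues.
    """
    idx_a, idx_b = set(), set()
    for i, p in enumerate(chain_a):
        for j, q in enumerate(chain_b):
            if math.dist(p, q) <= cutoff:
                idx_a.add(i)
                idx_b.add(j)
    return sorted(idx_a), sorted(idx_b)

# Hypothetical toy coordinates in angstroms, not taken from any structure.
chain_a = [(0.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
chain_b = [(5.0, 0.0, 0.0), (50.0, 0.0, 0.0)]

near_a, near_b = interface_atoms(chain_a, chain_b)        # 12 angstrom cutoff
wide_a, wide_b = interface_atoms(chain_a, chain_b, 25.0)  # larger cutoff
```

Raising the cutoff enlarges the selected interface region, which is the quantity the study varies when asking how much structure the alignment needs.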
Affiliation(s)
- Rohita Sinha
- Center for Bioinformatics, The University of Kansas, Lawrence, Kansas, United States of America
| | - Petras J. Kundrotas
- Center for Bioinformatics, The University of Kansas, Lawrence, Kansas, United States of America
| | - Ilya A. Vakser
- Center for Bioinformatics, The University of Kansas, Lawrence, Kansas, United States of America
- Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas, United States of America
|
50
|
Rappoport N, Karsenty S, Stern A, Linial N, Linial M. ProtoNet 6.0: organizing 10 million protein sequences in a compact hierarchical family tree. Nucleic Acids Res 2011; 40:D313-20. [PMID: 22121228 PMCID: PMC3245180 DOI: 10.1093/nar/gkr1027] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Indexed: 11/14/2022] Open
Abstract
ProtoNet 6.0 (http://www.protonet.cs.huji.ac.il) is a data structure of protein families that cover the protein sequence space. These families are generated through an unsupervised bottom–up clustering algorithm. This algorithm organizes large sets of proteins in a hierarchical tree that yields high-quality protein families. The 2012 ProtoNet (Version 6.0) tree includes over 9 million proteins, of which 5.5% come from UniProtKB/SwissProt and the rest from UniProtKB/TrEMBL. The hierarchical tree structure is based on an all-against-all comparison of 2.5 million representatives of UniRef50. Rigorous annotation-based quality tests prune the tree to the 162 088 most informative clusters. Every high-quality cluster is assigned a ProtoName that reflects the most significant annotations of its proteins. These annotations are dominated by GO terms, UniProt/Swiss-Prot keywords and InterPro. ProtoNet 6.0 operates in a default mode. When used in the advanced mode, this data structure offers the user a view of the family tree at any desired level of resolution. Systematic comparisons with previous versions of ProtoNet are carried out. They show how our view of protein families evolves as larger parts of the sequence space become known. ProtoNet 6.0 provides numerous tools to navigate the hierarchy of clusters.
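The general idea of bottom-up (agglomerative) clustering from pairwise comparisons can be sketched as follows. This is a generic illustration under stated assumptions, not ProtoNet's algorithm: average linkage is assumed as the merging rule, and the function name `agglomerate`, the protein labels and the distances are all hypothetical.

```python
def agglomerate(names, dist):
    """Bottom-up clustering: start with singletons and repeatedly merge
    the two closest clusters until one remains.

    `dist` maps frozenset({a, b}) to the distance between items a and b.
    Returns the merges in the order they happened; read bottom to top,
    this list is the hierarchical tree.
    """
    clusters = [(name,) for name in names]
    merges = []

    def linkage(c1, c2):
        # Average linkage (an assumption made for this sketch).
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Score every pair of current clusters and pick the closest.
        pairs = [
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical toy distances between three proteins.
names = ["P1", "P2", "P3"]
dist = {
    frozenset(("P1", "P2")): 0.1,
    frozenset(("P1", "P3")): 0.8,
    frozenset(("P2", "P3")): 0.9,
}
merges = agglomerate(names, dist)
```

Each merge corresponds to an internal node of the tree, so cutting the merge list at different depths gives views of the families at different resolutions.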
Affiliation(s)
- Nadav Rappoport
- School of Computer Science and Engineering, Institute of Life Sciences, The Sudarsky Center for Computational Biology, The Hebrew University of Jerusalem, 91904 Israel
|