1
|
Oliveira LS, Reyes A, Dutilh BE, Gruber A. Rational Design of Profile HMMs for Sensitive and Specific Sequence Detection with Case Studies Applied to Viruses, Bacteriophages, and Casposons. Viruses 2023; 15:519. [PMID: 36851733 PMCID: PMC9966878 DOI: 10.3390/v15020519] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/29/2022] [Revised: 02/01/2023] [Accepted: 02/09/2023] [Indexed: 02/15/2023]
Abstract
Profile hidden Markov models (HMMs) are a powerful way of modeling biological sequence diversity and constitute a very sensitive approach to detecting divergent sequences. Here, we report the development of protocols for the rational design of profile HMMs. These methods were implemented on TABAJARA, a program that can be used to either detect all biological sequences of a group or discriminate specific groups of sequences. By calculating position-specific information scores along a multiple sequence alignment, TABAJARA automatically identifies the most informative sequence motifs and uses them to construct profile HMMs. As a proof-of-principle, we applied TABAJARA to generate profile HMMs for the detection and classification of two viral groups presenting different evolutionary rates: bacteriophages of the Microviridae family and viruses of the Flavivirus genus. We obtained conserved models for the generic detection of any Microviridae or Flavivirus sequence, and profile HMMs that can specifically discriminate Microviridae subfamilies or Flavivirus species. In another application, we constructed Cas1 endonuclease-derived profile HMMs that can discriminate CRISPRs and casposons, two evolutionarily related transposable elements. We believe that the protocols described here, and implemented on TABAJARA, constitute a generic toolbox for generating profile HMMs for the highly sensitive and specific detection of sequence classes.
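The idea of scoring each alignment column and keeping the most informative stretches can be illustrated with a minimal sketch. It assumes a Shannon-style information content (maximum entropy minus observed entropy) and a fixed-width sliding window; the function names, window size, and threshold are hypothetical, and the scoring actually used by TABAJARA may differ.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20, gap_chars="-."):
    """Return the information content (bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    residue frequencies; gap characters are ignored.
    """
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def alignment_information(sequences, alphabet_size=20):
    """Score every column of an alignment given as equal-length strings."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_information(col, alphabet_size) for col in zip(*sequences)]


def informative_windows(scores, window=3, threshold=3.0):
    """Return start indices of windows whose mean score reaches the threshold."""
    return [
        i
        for i in range(len(scores) - window + 1)
        if sum(scores[i:i + window]) / window >= threshold
    ]


# Toy protein alignment (illustrative only, not data from the paper).
toy_alignment = ["ACDE", "ACDF", "ACGH", "AC-K"]
scores = alignment_information(toy_alignment)
```

Fully conserved columns reach the maximum score of log2(20) bits, while columns with many different residues score lower, so a window filter of this kind selects conserved blocks as candidate regions for model building.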
Affiliation(s)
- Liliane S. Oliveira
  - Department of Parasitology, Instituto de Ciências Biomédicas, Universidade de São Paulo, São Paulo 05508-000, SP, Brazil
- Alejandro Reyes
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá 111711, Colombia
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, Saint Louis, MO 63108, USA
- Bas E. Dutilh
  - Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich-Schiller-University Jena, 07743 Jena, Germany
  - Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
  - European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
- Arthur Gruber
  - Department of Parasitology, Instituto de Ciências Biomédicas, Universidade de São Paulo, São Paulo 05508-000, SP, Brazil
  - European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
2
|
Jiang Y, Ran X, Yang ZJ. Data-driven enzyme engineering to identify function-enhancing enzymes. Protein Eng Des Sel 2023; 36:gzac009. [PMID: 36214500 PMCID: PMC10365845 DOI: 10.1093/protein/gzac009] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/31/2022] [Revised: 08/08/2022] [Accepted: 09/28/2022] [Indexed: 01/22/2023]
Abstract
Identifying function-enhancing enzyme variants is a 'holy grail' challenge in protein science because it will allow researchers to expand the biocatalytic toolbox for late-stage functionalization of drug-like molecules, environmental degradation of plastics and other pollutants, and medical treatment of food allergies. Data-driven strategies, including statistical modeling, machine learning, and deep learning, have greatly advanced the understanding of sequence-structure-function relationships in enzymes. They have also enhanced the capability to predict and design new enzymes and enzyme variants that catalyze new-to-nature reactions. Here, we review recent progress in data-driven models applied to identifying efficiency-enhancing mutants for catalytic reactions. We also discuss existing challenges and obstacles faced by the community. Although the review is by no means comprehensive, we hope that the discussion can inform readers about the state of the art in data-driven enzyme engineering and inspire more joint experimental-computational efforts to develop and apply data-driven modeling to innovate biocatalysts for synthetic and pharmaceutical applications.
Affiliation(s)
- Yaoyukun Jiang
  - Department of Chemistry, Vanderbilt University, Nashville, TN 37235, USA
- Xinchun Ran
  - Department of Chemistry, Vanderbilt University, Nashville, TN 37235, USA
- Zhongyue J Yang
  - Department of Chemistry, Vanderbilt University, Nashville, TN 37235, USA
  - Center for Structural Biology, Vanderbilt University, Nashville, TN 37235, USA
  - Vanderbilt Institute of Chemical Biology, Vanderbilt University, Nashville, TN 37235, USA
  - Data Science Institute, Vanderbilt University, Nashville, TN 37235, USA
  - Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, TN 37235, USA
3
|
Erickson E, Gado JE, Avilán L, Bratti F, Brizendine RK, Cox PA, Gill R, Graham R, Kim DJ, König G, Michener WE, Poudel S, Ramirez KJ, Shakespeare TJ, Zahn M, Boyd ES, Payne CM, DuBois JL, Pickford AR, Beckham GT, McGeehan JE. Sourcing thermotolerant poly(ethylene terephthalate) hydrolase scaffolds from natural diversity. Nat Commun 2022; 13:7850. [PMID: 36543766 PMCID: PMC9772341 DOI: 10.1038/s41467-022-35237-x] [Citation(s) in RCA: 64] [Impact Index Per Article: 21.3] [Received: 10/10/2022] [Accepted: 11/21/2022] [Indexed: 12/24/2022]
Abstract
Enzymatic deconstruction of poly(ethylene terephthalate) (PET) is under intense investigation, given the ability of hydrolase enzymes to depolymerize PET to its constituent monomers near the polymer glass transition temperature. To date, reported PET hydrolases have been sourced from a relatively narrow sequence space. Here, we identify additional PET-active biocatalysts from natural diversity by using bioinformatics and machine learning to mine 74 putative thermotolerant PET hydrolases. We successfully express, purify, and assay 51 enzymes from seven distinct phylogenetic groups, observing PET hydrolysis activity on amorphous PET film from 37 enzymes in reactions spanning pH 4.5 to 9.0 and temperatures from 30 to 70 °C. We conduct PET hydrolysis time-course reactions with the best-performing enzymes, where we observe differences in substrate selectivity as a function of PET morphology. We employ X-ray crystallography and AlphaFold to examine the enzyme architectures of all 74 candidates, revealing protein folds and accessory domains not previously associated with PET deconstruction. Overall, this study expands the number and diversity of thermotolerant scaffolds for enzymatic PET deconstruction.
Affiliation(s)
- Erika Erickson
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
- Japheth E Gado
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
- Luisana Avilán
  - Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
- Felicia Bratti
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
- Richard K Brizendine
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
- Paul A Cox
  - Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
- Raj Gill
  - Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
- Rosie Graham
  - Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
- Dong-Jin Kim
  - BOTTLE Consortium, Golden, CO, USA
  - Department of Biochemistry, Montana State University, Bozeman, MT, USA
- Gerhard König
  - Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
- William E Michener
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
- Saroj Poudel
  - Department of Microbiology and Cell Biology, Montana State University, Bozeman, MT, USA
- Kelsey J Ramirez
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
- Thomas J Shakespeare
  - Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
- Michael Zahn
  - Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
- Eric S Boyd
  - Department of Microbiology and Cell Biology, Montana State University, Bozeman, MT, USA
- Jennifer L DuBois
  - BOTTLE Consortium, Golden, CO, USA
  - Department of Biochemistry, Montana State University, Bozeman, MT, USA
- Andrew R Pickford
  - BOTTLE Consortium, Golden, CO, USA
  - Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
- Gregg T Beckham
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
- John E McGeehan
  - BOTTLE Consortium, Golden, CO, USA
  - Centre for Enzyme Innovation, School of Biological Sciences, University of Portsmouth, Portsmouth, UK
  - World Plastics Association, Fontvieille, Monaco
4
|
Imran M, Munir MZ, Ialhi S, Abbas F, Younus M, Ahmad S, Naeem MK, Waseem M, Iqbal A, Gul S, Widemann E, Shafiq S. Identification and Characterization of Malate Dehydrogenases in Tomato (Solanum lycopersicum L.). Int J Mol Sci 2022; 23:10028. [PMID: 36077425 PMCID: PMC9456053 DOI: 10.3390/ijms231710028] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/27/2022] [Revised: 08/20/2022] [Accepted: 08/25/2022] [Indexed: 12/02/2022]
Abstract
Malate dehydrogenase (MDH), which catalyzes the reversible conversion of malate to oxaloacetate, is essential for energy balance, plant growth, and cold and salt tolerance. However, a genome-wide study of the MDH family has not yet been carried out in tomato (Solanum lycopersicum L.). In this study, 12 MDH genes were identified in the S. lycopersicum genome and renamed according to their chromosomal location. The tomato MDH genes were split into five groups based on phylogenetic analysis, and the genes that clustered together showed similar lengths and structures, as well as similar conserved motifs in the encoded proteins. Among the 12 tomato MDH genes, three pairs of segmental duplication events involving four genes were found. Each pair of genes had a Ka/Ks ratio < 1, indicating that the tomato MDH gene family evolved under purifying selection. Gene expression analysis showed that tomato MDHs were differentially expressed in different tissues and at various stages of fruit development, and were differentially regulated in response to abiotic stresses. Molecular docking of four highly expressed MDHs revealed their substrate and co-factor specificity in the reversible conversion of malate to oxaloacetate. Further, co-localization of tomato MDH genes with quantitative trait loci (QTL) for salt stress-related phenotypes revealed their broader functions in salt stress tolerance. This study lays the foundation for functional analysis of MDH genes and genetic improvement in tomato.
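The reasoning behind the Ka/Ks statement can be made concrete with a small sketch: a ratio below 1 means nonsynonymous changes accumulate more slowly than synonymous ones, which is read as purifying selection. The function name and the tolerance band around 1 are assumptions made for illustration; they are not taken from the study, and estimating Ka and Ks themselves requires a dedicated codon-substitution method that is not shown here.

```python
def classify_selection(ka, ks, tolerance=0.05):
    """Label a duplicated gene pair by its Ka/Ks ratio.

    ka: nonsynonymous substitutions per nonsynonymous site
    ks: synonymous substitutions per synonymous site
    tolerance: hypothetical band around 1.0 treated as near-neutral
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    ratio = ka / ks
    if ratio < 1 - tolerance:
        label = "purifying"      # amino-acid changes are selected against
    elif ratio > 1 + tolerance:
        label = "positive"       # amino-acid changes are favoured
    else:
        label = "near-neutral"   # ratio is indistinguishable from 1 here
    return ratio, label


# Illustrative values only, not measurements from the study.
example_ratio, example_label = classify_selection(ka=0.1, ks=0.5)
```

In practice the call around 1 is usually backed by a statistical test rather than a fixed band, so the tolerance here is only a placeholder for that step.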
Affiliation(s)
- Muhammad Imran
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, College of Agriculture, South China Agriculture University, Guangzhou 510642, China
  - School of Life Sciences, Tsinghua University, Beijing 100084, China
- Muhammad Zeeshan Munir
  - School of Environment and Energy, Peking University Shenzhen Graduate School, 2199 Lishui Rd., Shenzhen 518055, China
- Sara Ialhi
  - Department of Economics, Lahore College for Women University, Lahore 35200, Pakistan
- Farhat Abbas
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, College of Agriculture, South China Agriculture University, Guangzhou 510642, China
- Muhammad Younus
  - Beijing Key Laboratory of Cardiometabolic Molecular Medicine, Institute of Molecular Medicine and Peking-Tsinghua Center for Life Sciences and PKU-IDG/McGovern Institute for Brain Research, Peking University, Beijing 100871, China
- Sajjad Ahmad
  - Department of Health and Biological Sciences, Abasyn University, Peshawar 25000, Pakistan
- Muhmmad Kashif Naeem
  - National Institute for Genomics and Advanced Biotechnology (NIGAB), National Agricultural Research Center (NARC), Park Road, Islamabad 45500, Pakistan
- Muhammad Waseem
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, College of Agriculture, South China Agriculture University, Guangzhou 510642, China
- Arshad Iqbal
  - Center for Biotechnology and Microbiology, University of Swat, Mingora 19200, Pakistan
- Sanober Gul
  - Department of Plant Breeding and Genetics, Ghazi University, Dera Ghazi Khan 32200, Pakistan
- Emilie Widemann
  - Institut de Biologie Moléculaire des Plantes, CNRS-Université de Strasbourg, 67084 Strasbourg, France
- Sarfraz Shafiq
  - School of Life Sciences, Tsinghua University, Beijing 100084, China
  - Department of Anatomy and Cell Biology, University of Western Ontario, 1151 Richmond St., London, ON N6A5B8, Canada
5
|
Pascarelli S, Laurino P. Inter-paralog amino acid inversion events in large phylogenies of duplicated proteins. PLoS Comput Biol 2022; 18:e1010016. [PMID: 35377869 PMCID: PMC9009777 DOI: 10.1371/journal.pcbi.1010016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/23/2022] [Revised: 04/14/2022] [Accepted: 03/12/2022] [Indexed: 11/25/2022]
Abstract
Connecting protein sequence to function is becoming increasingly relevant since high-throughput sequencing studies accumulate large amounts of genomic data. In order to go beyond the existing database annotation, it is fundamental to understand the mechanisms underlying functional inheritance and divergence. If the homology relationship between proteins is known, can we determine whether the function diverged? In this work, we analyze different possibilities of protein sequence evolution after gene duplication and identify “inter-paralog inversions”, i.e., sites where the relationship between the ancestry and the functional signal is decoupled. The amino acids in these sites are masked from being recognized by other prediction tools. Still, they play a role in functional divergence and could indicate a shift in protein function. We develop a method to specifically recognize inter-paralog amino acid inversions in a phylogeny and test it on real and simulated datasets. In a dataset built from the Epidermal Growth Factor Receptor (EGFR) sequences found in 88 fish species, we identify 19 amino acid sites that went through inversion after gene duplication, mostly located at the ligand-binding extracellular domain. Our work uncovers an outcome of protein duplications with direct implications in protein functional annotation and sequence evolution. The developed method is optimized to work with large protein datasets and can be readily included in a targeted protein analysis pipeline. Proteins are critical components of living systems because they facilitate most biological processes like protein synthesis, DNA replication, chemical catalysis, etc. Proteins are encoded in their genes. During evolution, genes accumulate mutations that get translated at the protein level. These mutations can be “neutral” if they do not affect the protein function immediately and directly; otherwise, mutations can be functional if they directly modify protein function. 
An event that provides an opportunity to study protein function is gene duplication, that is, when two copies of a gene encoding the same protein appear. One copy of the protein often retains the same function while the other is free to diverge and specialize to a different function. This work sheds light on an alternative outcome of gene duplication that might be critical for distinguishing neutral from functional mutations. By looking at 88 fish genomes, we found proteins whose sequence evolution does not follow the expected pattern of divergence after gene duplication. In this case, the protein sequence of a subgroup of species diverges in the copy expected to retain its function, while the sequence is retained in the copy expected to diverge. We called this event “inter-paralog amino acid inversion”. Our data show that this “inversion” event is correlated with function, and its detection has to be considered for assigning protein functions correctly.
Affiliation(s)
- Stefano Pascarelli
  - Protein Engineering and Evolution Unit, Okinawa Institute of Science and Technology Graduate University, Onna, Okinawa, Japan
- Paola Laurino
  - Protein Engineering and Evolution Unit, Okinawa Institute of Science and Technology Graduate University, Onna, Okinawa, Japan
6
|
Investigation and Alteration of Organic Acid Synthesis Pathways in the Mammalian Gut Symbiont Bacteroides thetaiotaomicron. Microbiol Spectr 2022; 10:e0231221. [PMID: 35196806 PMCID: PMC8865466 DOI: 10.1128/spectrum.02312-21] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/24/2022]
Abstract
Members of the gut-dwelling Bacteroides genus have remarkable abilities in degrading a diverse set of fiber polysaccharide structures, most of which are found in the mammalian diet. As part of their metabolism, they convert these fibers to organic acids that can in turn provide energy to their host. While many studies have identified and characterized the genes and corresponding proteins involved in polysaccharide degradation, relatively little is known about Bacteroides genes involved in downstream metabolic pathways. Bacteroides thetaiotaomicron is one of the most studied species from the genus and is representative of this group in producing multiple organic acids as part of its metabolism. We focused here on several organic acid synthesis pathways in B. thetaiotaomicron, including those involved in formate, lactate, propionate, and acetate production. We identified potential genes involved in each pathway and characterized these through gene deletions coupled to growth assays and organic acid quantification. In addition, we developed and employed a Golden Gate-compatible plasmid system to simplify alteration of native gene expression levels. Our work both validates and contradicts previous bioinformatic gene annotations, and we develop a model on which to base future efforts. A clearer understanding of Bacteroides metabolic pathways can inform and facilitate efforts to employ these bacteria for improved human health or other utilization strategies. IMPORTANCE Both humans and animals host a large community of bacteria and other microorganisms in their gastrointestinal tracts. This community breaks down dietary fiber and produces organic acids that are used as an energy source by the body and can also help the host resist infection by various pathogens. 
While the Bacteroides genus is one of the most common in the gut microbiota, it is only distantly related to bacteria with well-characterized metabolic pathways and it is therefore unclear whether research insights on organic acid production in those species can also be directly applied to the Bacteroides. By investigating multiple genetic pathways for organic acid production in Bacteroides thetaiotaomicron, we provide a basis for deeper understanding of these pathways. The work further enables greater understanding of Bacteroides–host relationships, as well as inter-species relationships in the microbiota, which are of importance for both human and animal gut health.
7
|
Pazos F. Computational prediction of protein functional sites-Applications in biotechnology and biomedicine. Advances in Protein Chemistry and Structural Biology 2022; 130:39-57. [PMID: 35534114 DOI: 10.1016/bs.apcsb.2021.12.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/14/2023]
Abstract
There are many computational approaches for predicting protein functional sites based on different sequence and structural features. These methods are essential to cope with the sequence deluge that is filling databases with uncharacterized protein sequences. They complement the more expensive and time-consuming experimental approaches by pointing them to possible candidate positions. In many cases they are jointly used to characterize the functional sites in proteins of biotechnological and biomedical interest and eventually modify them for different purposes. There is a clear trend towards approaches based on machine learning and those using structural information, due to the recent developments in these areas. Nevertheless, "classic" methods based on sequence and evolutionary features are still playing an important role as these features are strongly related to functionality. In this review, the main approaches for predicting general functional sites in a protein are discussed, with a focus on sequence-based approaches.
Affiliation(s)
- Florencio Pazos
  - Computational Systems Biology Group, National Center for Biotechnology (CNB-CSIC), Madrid, Spain
8
|
Mallavarpu Ambrose J, Veeraraghavan VP, Kullappan M, Velmurugan D, Vennila R, Rupert S, Dorairaj S, Surapaneni KM. Molecular modeling studies of the effects of withaferin A and its derivatives against oncoproteins associated with breast cancer stem cell activity. Process Biochem 2021. [DOI: 10.1016/j.procbio.2021.09.007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/14/2022]
9
|
Gado JE, Harrison BE, Sandgren M, Ståhlberg J, Beckham GT, Payne CM. Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases. J Biol Chem 2021; 297:100931. [PMID: 34216620 PMCID: PMC8329511 DOI: 10.1016/j.jbc.2021.100931] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 11/05/2020] [Revised: 06/18/2021] [Accepted: 06/29/2021] [Indexed: 11/28/2022]
Abstract
Family 7 glycoside hydrolases (GH7) are among the principal enzymes for cellulose degradation in nature and industrially. These enzymes are often bimodular, including a catalytic domain and carbohydrate-binding module (CBM) attached via a flexible linker, and exhibit an active site that binds cello-oligomers of up to ten glucosyl moieties. GH7 cellulases consist of two major subtypes: cellobiohydrolases (CBH) and endoglucanases (EG). Despite the critical importance of GH7 enzymes, there remain gaps in our understanding of how GH7 sequence and structure relate to function. Here, we employed machine learning to gain data-driven insights into relationships between sequence, structure, and function across the GH7 family. Machine-learning models, trained only on the number of residues in the active-site loops as features, were able to discriminate GH7 CBHs and EGs with up to 99% accuracy, demonstrating that the lengths of loops A4, B2, B3, and B4 strongly correlate with functional subtype across the GH7 family. Classification rules were derived such that specific residues at 42 different sequence positions each predicted the functional subtype with accuracies surpassing 87%. A random forest model trained on residues at 19 positions in the catalytic domain predicted the presence of a CBM with 89.5% accuracy. Our machine learning results recapitulate, as top-performing features, a substantial number of the sequence positions determined by previous experimental studies to play vital roles in GH7 activity. We surmise that the yet-to-be-explored sequence positions among the top-performing features also contribute to GH7 functional variation and may be exploited to understand and manipulate function.
Affiliation(s)
- Japheth E Gado
  - Department of Chemical and Materials Engineering, University of Kentucky, Lexington, Kentucky, USA
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, Colorado, USA
- Brent E Harrison
  - Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
- Mats Sandgren
  - Department of Molecular Sciences, Swedish University of Agricultural Sciences, Uppsala, Sweden
- Jerry Ståhlberg
  - Department of Molecular Sciences, Swedish University of Agricultural Sciences, Uppsala, Sweden
- Gregg T Beckham
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, Colorado, USA
- Christina M Payne
  - Department of Chemical and Materials Engineering, University of Kentucky, Lexington, Kentucky, USA
10
|
Littmann M, Bordin N, Heinzinger M, Schütze K, Dallago C, Orengo C, Rost B. Clustering FunFams using sequence embeddings improves EC purity. Bioinformatics 2021; 37:3449-3455. [PMID: 33978744 PMCID: PMC8545299 DOI: 10.1093/bioinformatics/btab371] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 01/25/2021] [Revised: 04/02/2021] [Accepted: 05/11/2021] [Indexed: 12/05/2022]
Abstract
Motivation: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be ‘pure’, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations.
Results: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.
Availability and implementation: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering.
Supplementary information: Supplementary data are available at Bioinformatics online.
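The clustering step described in this abstract, grouping sequences by embedding distance with DBSCAN and flagging outliers, can be sketched in plain Python. This is a minimal textbook DBSCAN written for illustration, not the authors' code (which is in the GitHub repository named above); the 2-D toy vectors stand in for real protein embeddings, and the `eps` and `min_samples` values are arbitrary assumptions.

```python
import math


def dbscan(points, eps, min_samples):
    """Cluster points with DBSCAN; return one label per point (-1 = outlier)."""
    n = len(points)
    # Precompute eps-neighbourhoods (each point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional outlier; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # outlier reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
    return labels


# Toy stand-ins for embedding vectors (hypothetical, 2-D for readability).
toy_embeddings = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
toy_labels = dbscan(toy_embeddings, eps=1.5, min_samples=3)
```

Points labelled -1 correspond to the outliers mentioned in the abstract, and each non-negative label marks one candidate sub-family. With real embeddings the distance threshold would have to be chosen for the embedding space at hand.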
Affiliation(s)
- Maria Littmann
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Michael Heinzinger
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Konstantin Schütze
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
- Christian Dallago
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Burkhard Rost
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany & TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
11
|
Whole-genome sequencing, genome mining, metabolic reconstruction and evolution of pentachlorophenol and other xenobiotic degradation pathways in Bacillus tropicus strain AOA-CPS1. Funct Integr Genomics 2021; 21:171-193. [PMID: 33547987 DOI: 10.1007/s10142-021-00768-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/30/2020] [Revised: 09/30/2020] [Accepted: 01/19/2021] [Indexed: 12/11/2022]
Abstract
A pentachlorophenol-degrading bacterium was isolated from effluent of a wastewater treatment plant in Durban, South Africa, and identified as Bacillus tropicus strain AOA-CPS1 (BtAOA). The isolate degraded 29% of pentachlorophenol (PCP) within 9 days at an initial PCP concentration of 100 mg L⁻¹ and 62% of PCP when the initial concentration was set at 350 mg L⁻¹. The whole genome of BtAOA was sequenced using a Pacific Biosciences RS II sequencer with the Single Molecule, Real-Time (SMRT) Link (version 7.0.1.66975) and analysed using the HGAP4 de novo assembly application. The contigs were annotated using the NCBI, RASTtk and PROKKA prokaryotic genome annotation pipelines. The BtAOA genome comprises a 5,246,860-bp chromosome and a 58,449-bp plasmid with a GC content of 35.4%. The metabolic reconstruction for BtAOA showed that the organism has been naturally exposed to various chlorophenolic compounds including PCP and other xenobiotics. The chromosome carries genes for core processes and stress response as well as PCP catabolic genes. Analogues of PCP catabolic gene (cpsBDCAE, and p450) sequences were identified from the NCBI annotation data, PCR-amplified from the whole genome of BtAOA, cloned into the pET15b expression vector, overexpressed in the E. coli BL21 (DE3) expression host, purified and characterized. Sequence mining and comparative analysis of the metabolic reconstruction of the BtAOA genome with closely related strains suggest that the operon encoding the first two enzymes in the PCP degradation pathway was acquired from a pre-existing pterin-carbinolamine dehydratase subsystem. The other two enzymes were recruited via horizontal gene transfer (HGT) from the pool of hypothetical proteins with no previous specific function, while the last enzyme was recruited from pre-existing enzymes from the TCA or serine-glyoxalase cycle via HGT events. This study provides a comprehensive understanding of the role of BtAOA in PCP degradation and its potential exploitation for bioremediation of other xenobiotic compounds.
|
12
|
Sinha M, Jagadeesan R, Kumar N, Saha S, Kothandan G, Kumar D. In-silico studies on Myo inositol-1-phosphate synthase of Leishmania donovani in search of anti-leishmaniasis. J Biomol Struct Dyn 2020; 40:3371-3384. [PMID: 33200690 DOI: 10.1080/07391102.2020.1847194] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/16/2022]
Abstract
Myo-inositol is one of the vital nutritional requirements for the survival and virulence of Leishmania parasites in the mammalian host. Myo-inositol-1-phosphate synthase (MIPS) is responsible for the synthesis of myo-inositol in Leishmania, which plays a vital role in Leishmania's virulence to mammalian hosts. Earlier studies suggest MIP synthase as a potential drug target against which valproate was used as a drug. MIP synthase can therefore be used as a target for anti-leishmanial drugs, and its inhibition may help in preventing leishmaniasis. The present study aims to identify potent analogs of valproate as drugs against MIP synthase of L. donovani (Ld-MIPS) with minimum side effects and toxicity to the host. In this study, the three-dimensional structure of Ld-MIPS was built, followed by active site prediction. Ligand-based virtual screening was done using hybrid similarity recognition methods. The best 123 valproate analogs were filtered based on their quantitative structure-activity relationship (QSAR) properties and were docked against Ld-MIPS using FlexX, PyRx and iGEMDOCK software. The top five ligands, ranked by docking score, were selected for molecular dynamics simulation and pharmacokinetic analysis. Simulation studies up to 30 ns revealed that all five lead molecules remained bound to Ld-MIPS throughout the MD simulation and there was no variation in their backbone. All the chosen inhibitors exhibited good pharmacokinetics/ADMET predictions with an excellent absorption profile, metabolism, oral bioavailability, solubility, excretion, and minimal toxicity, suggesting that these inhibitors may further be developed as anti-leishmaniasis drugs to prevent the spread of leishmaniasis. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Mousumi Sinha
- Department of Microbiology, Assam University, Silchar, Assam, India
| | - Rahul Jagadeesan
- CAS in Crystallography and Biophysics, Guindy Campus, University of Madras, Chennai, Tamil Nadu, India
| | - Neeraj Kumar
- Functional Genomics & Complex System Lab, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India
| | - Satabdi Saha
- Department of Microbiology, Assam University, Silchar, Assam, India
| | - Gugan Kothandan
- CAS in Crystallography and Biophysics, Guindy Campus, University of Madras, Chennai, Tamil Nadu, India
| | - Diwakar Kumar
- Department of Microbiology, Assam University, Silchar, Assam, India
|
13
|
Cloning and characterization of a L-lactate dehydrogenase gene from Ruminococcaceae bacterium CPB6. World J Microbiol Biotechnol 2020; 36:182. [PMID: 33170386 DOI: 10.1007/s11274-020-02958-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 06/26/2020] [Accepted: 11/03/2020] [Indexed: 10/23/2022]
Abstract
Lactate has proved to be an attractive electron donor for the production of n-caproic acid (CA), a high value-added fuel precursor and chemical feedstock, but little is known about the molecular mechanism of lactate transformation. In the present study, the gene for L-lactate dehydrogenase (LDH, EC 1.1.1.27) from a Ruminococcaceae strain CPB6 was cloned and expressed in Escherichia coli BL21 (DE3) with plasmid pET28a. The recombinant LDH exhibited a molecular weight of 36-38 kDa in SDS-PAGE. The purified LDH was found to have a maximal oxidation activity of 29.6 U/mg from lactate to pyruvate at pH 6.5, and a maximal reduction activity of 10.4 U/mg from pyruvate to lactate at pH 8.5. Strikingly, its oxidative activity predominates over its reductive activity, leading to a 17-fold increase in lactate utilization in E. coli/pET28a-LDH compared with E. coli/pET28a. The CPB6 LDH gene encodes a 315 amino acid protein sharing 42.19% similarity with Clostridium beijerinckii LDH, and lower similarity with LDHs of other organisms. Significant differences were observed between the CPB6 LDH and the C. beijerinckii and C. acetobutylicum LDHs in the predicted tertiary structure and active center. Further X-ray crystal structure analysis needs to be performed to verify the specific active center of the CPB6 LDH and its role in the conversion of lactate into CA.
|
14
|
Sinha S, Lynn AM, Desai DK. Implementation of homology based and non-homology based computational methods for the identification and annotation of orphan enzymes: using Mycobacterium tuberculosis H37Rv as a case study. BMC Bioinformatics 2020; 21:466. [PMID: 33076816 PMCID: PMC7574302 DOI: 10.1186/s12859-020-03794-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/23/2019] [Accepted: 10/01/2020] [Indexed: 02/06/2023] Open
Abstract
Background Homology based methods are one of the most important and widely used approaches for functional annotation of high-throughput microbial genome data. A major limitation of these methods is the absence of well-characterized sequences for certain functions. The non-homology methods based on the context and the interactions of a protein are very useful for identifying missing metabolic activities and functional annotation in the absence of significant sequence similarity. In the current work, we employ both homology and context-based methods, incrementally, to identify local holes and chokepoints whose presence in the Mycobacterium tuberculosis genome is indicated by their interactions with known proteins in a metabolic network context, but which have not been annotated. We have developed two computational procedures using network theory to identify orphan enzymes (‘Hole finding protocol’) coupled with the identification of candidate proteins for the predicted orphan enzyme (‘Hole filling protocol’). We propose an integrated interaction score based on scores from the STRING database to identify candidate protein sequences for the orphan enzymes from M. tuberculosis, as a case study, which are most likely to perform the missing function. Results The application of an automated homology-based enzyme identification protocol, ModEnzA, to the M. tuberculosis genome yielded 56 novel enzyme predictions. We further predicted 74 putative local holes, 6 chokepoints, and 3 high confidence local holes in the genome using the ‘Hole finding protocol’. The ‘Hole filling protocol’ was validated on the E. coli genome using artificial in-silico enzyme knockouts where our method showed 25% increased accuracy, compared to other methods, in assigning the correct sequence for the knocked-out enzyme amongst the top 10 ranks. The method was further validated on 8 additional genomes.
Conclusions We have developed methods that can be generalized to augment homology-based annotation to identify missing enzyme coding genes and to predict a candidate protein for them. For pathogens such as M. tuberculosis, this work holds significance in terms of increasing the protein repertoire and thereby, the potential for identifying novel drug targets.
Affiliation(s)
- Swati Sinha
- Bioinformatics Institute, Agency for Science, Technology, and Research (A*Star), 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Republic of Singapore
| | - Andrew M Lynn
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
| | - Dhwani K Desai
- Department of Biology and Department of Pharmacology, Dalhousie University, Halifax, NS, B3H4R2, Canada; School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
|
15
|
Singh G, Inoue A, Gutkind JS, Russell RB, Raimondi F. PRECOG: PREdicting COupling probabilities of G-protein coupled receptors. Nucleic Acids Res 2020; 47:W395-W401. [PMID: 31143927 PMCID: PMC6602504 DOI: 10.1093/nar/gkz392] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 02/10/2019] [Revised: 04/13/2019] [Accepted: 05/01/2019] [Indexed: 01/08/2023] Open
Abstract
G-protein coupled receptors (GPCRs) control multiple physiological states by transducing a multitude of extracellular stimuli into the cell via coupling to intra-cellular heterotrimeric G-proteins. Deciphering which G-proteins couple to each of the hundreds of GPCRs present in a typical eukaryotic organism is therefore critical to understand signalling. Here, we present PRECOG (precog.russelllab.org): a web-server for predicting GPCR coupling, which allows users to: (i) predict coupling probabilities for GPCRs to individual G-proteins instead of subfamilies; (ii) visually inspect the protein sequence and structural features that are responsible for a particular coupling; (iii) suggest mutations to rationally design artificial GPCRs with new coupling properties based on predetermined coupling features.
Affiliation(s)
- Gurdeep Singh
- CellNetworks, Bioquant, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany; Biochemie Zentrum Heidelberg (BZH), Heidelberg University, Im Neuenheimer Feld 328, 69120 Heidelberg, Germany
| | - Asuka Inoue
- Graduate School of Pharmaceutical Sciences, Tohoku University, Sendai, Miyagi 980-8578, Japan
| | - J Silvio Gutkind
- Department of Pharmacology and Moores Cancer Center, University of California, San Diego, La Jolla, CA 92093, USA
| | - Robert B Russell
- CellNetworks, Bioquant, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany; Biochemie Zentrum Heidelberg (BZH), Heidelberg University, Im Neuenheimer Feld 328, 69120 Heidelberg, Germany
| | - Francesco Raimondi
- CellNetworks, Bioquant, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany; Biochemie Zentrum Heidelberg (BZH), Heidelberg University, Im Neuenheimer Feld 328, 69120 Heidelberg, Germany
|
16
|
Powell CD, Kirchoff DC, DeRouchey JE, Moseley HNB. Entropy based analysis of vertebrate sperm protamines sequences: evidence of potential dityrosine and cysteine-tyrosine cross-linking in sperm protamines. BMC Genomics 2020; 21:277. [PMID: 32245406 PMCID: PMC7126135 DOI: 10.1186/s12864-020-6681-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/03/2019] [Accepted: 03/17/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Spermatogenesis is the process by which germ cells develop into spermatozoa in the testis. Sperm protamines are small, arginine-rich nuclear proteins which replace somatic histones during spermatogenesis, allowing a hypercondensed DNA state that leads to a smaller nucleus and facilitates sperm head formation. In eutherian mammals, the protamine-DNA complex is achieved through a combination of intra- and intermolecular cysteine cross-linking and possibly histidine-cysteine zinc ion binding. Most metatherian sperm protamines lack cysteine but perform the same function. This lack of dicysteine cross-linking has made the mechanism behind metatherian protamine folding unclear. RESULTS Protamine sequences were retrieved from UniProt's databases and sorted into homologous groups. Multiple sequence alignments were then generated and a gap-weighted relative entropy score was calculated for each position. For the eutherian alignments, the cysteine-containing positions were the most highly conserved. For the metatherian alignment, the tyrosine-containing positions were the most highly conserved and corresponded to the cysteine positions in the eutherian alignment. CONCLUSIONS High conservation indicates likely functionally/structurally important residues at these positions in the metatherian protamines, and the correspondence with cysteine positions within the eutherian alignment implies a similarity in function. One possible explanation is that the metatherian protamine structure relies upon dityrosine cross-linking between these highly conserved tyrosines. Also, the human protamine P1 sequence has a tyrosine substitution in a position expecting eutherian dicysteine cross-linking. Similarly, some members of the metatherian Planigales genus contain cysteine substitutions in positions expecting plausible metatherian dityrosine cross-linking. Rare cysteine-tyrosine cross-linking could explain both observations.
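The per-position scoring described in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula, so everything here is an assumption for illustration: the score is taken to be the Kullback-Leibler divergence of a column's residue frequencies from a uniform background, scaled by the fraction of non-gap characters. The function names and the toy alignment are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background; a real analysis would likely use
# background frequencies estimated from a sequence database.
UNIFORM_BACKGROUND = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def gap_weighted_relative_entropy(column, background=UNIFORM_BACKGROUND, gap="-"):
    """Relative entropy (bits) of one alignment column, scaled by its non-gap fraction."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0  # an all-gap column carries no information
    n = len(residues)
    counts = Counter(residues)
    # Kullback-Leibler divergence of observed frequencies from the background
    kl = sum((k / n) * math.log2((k / n) / background[aa]) for aa, k in counts.items())
    # Down-weight columns in which many sequences have a gap
    return kl * (n / len(column))


def score_alignment(sequences):
    """Score every column of an alignment given as equal-length strings."""
    return [gap_weighted_relative_entropy(col) for col in zip(*sequences)]


# Hypothetical toy alignment: columns 2 and 4 are invariant, column 3 has a gap.
toy_alignment = ["RCRY", "RCKY", "RC-Y", "KCRY"]
scores = score_alignment(toy_alignment)
for position, score in enumerate(scores, start=1):
    print(f"position {position}: {score:.3f} bits")
```

In this toy case the invariant cysteine and tyrosine columns score highest and the gapped, variable column scores lowest, which is the kind of ranking the abstract uses to single out conserved positions.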
Affiliation(s)
- Christian D. Powell
- Department of Chemistry, University of Kentucky, 161 Jacobs Science Building, Lexington, 40506 USA
- Markey Cancer Center, University of Kentucky, 800 Rose Street, Pavilion CC, Lexington, 40536 USA
| | - Daniel C. Kirchoff
- Department of Chemistry, University of Kentucky, 161 Jacobs Science Building, Lexington, 40506 USA
| | - Jason E. DeRouchey
- Department of Chemistry, University of Kentucky, 161 Jacobs Science Building, Lexington, 40506 USA
| | - Hunter N. B. Moseley
- Markey Cancer Center, University of Kentucky, 800 Rose Street, Pavilion CC, Lexington, 40536 USA
- Department of Molecular & Cellular Biochemistry, University of Kentucky, Lexington, 40508 USA
- Institute for Biomedical Informatics, University of Kentucky, Lexington, 40536 USA
|
17
|
Deep Analysis of Residue Constraints (DARC): identifying determinants of protein functional specificity. Sci Rep 2020; 10:1691. [PMID: 32015389 PMCID: PMC6997377 DOI: 10.1038/s41598-019-55118-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 07/26/2019] [Accepted: 11/23/2019] [Indexed: 01/03/2023] Open
Abstract
Protein functional constraints are manifest as superfamily and functional-subgroup conserved residues, and as pairwise correlations. Deep Analysis of Residue Constraints (DARC) aids the visualization of these constraints, characterizes how they correlate with each other and with structure, and estimates statistical significance. This can identify determinants of protein functional specificity, as we illustrate for bacterial DNA clamp loader ATPases. These load ring-shaped sliding clamps onto DNA to keep polymerase attached during replication and contain one δ, three γ, and one δ’ AAA+ subunits semi-circularly arranged in the order δ-γ1-γ2-γ3-δ’. Only γ is active, though both γ and δ’ functionally influence an adjacent γ subunit. DARC identifies, as functionally-congruent features linking allosterically the ATP, DNA, and clamp binding sites: residues distinctive of γ and of γ/δ’ that mutually interact in trans, centered on the catalytic base; several γ/δ’-residues and six γ/δ’-covariant residue pairs within the DNA binding N-termini of helices α2 and α3; and γ/δ’-residues associated with the α2 C-terminus and the clamp-binding loop. Most notable is a trans-acting γ/δ’ hydroxyl group that 99% of other AAA+ proteins lack. Mutation of this hydroxyl to a methyl group impedes clamp binding and opening, DNA binding, and ATP hydrolysis—implying a remarkably clamp-loader-specific function.
|
18
|
Karasev D, Sobolev B, Lagunin A, Filimonov D, Poroikov V. Prediction of Protein-Ligand Interaction Based on the Positional Similarity Scores Derived from Amino Acid Sequences. Int J Mol Sci 2019; 21:ijms21010024. [PMID: 31861473 PMCID: PMC6981593 DOI: 10.3390/ijms21010024] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/31/2019] [Revised: 12/13/2019] [Accepted: 12/16/2019] [Indexed: 12/14/2022] Open
Abstract
The affinity of different drug-like ligands to multiple protein targets reflects general chemical–biological interactions. Computational methods estimating such interactions analyze the available information about the structure of the targets, ligands, or both. Prediction of protein–ligand interactions based on pairwise sequence alignment provides reasonable accuracy if the ligands’ specificity coincides well with the phylogenetic taxonomy of the proteins. Methods using multiple alignment require an accurate match of functionally significant residues. Such conditions may not be met in the case of diverged protein families. To overcome these limitations, we propose an approach based on the analysis of local sequence similarity within the set of analyzed proteins. The positional scores, calculated by sequence fragment comparisons, are used as input data for the Bayesian classifier. Our approach provides a prediction accuracy comparable to or exceeding that of other methods. This was demonstrated on the popular Gold Standard test sets, which present different levels of sequence heterogeneity and range from groups spanning different protein families to more specific groups. A reasonable prediction accuracy was also found for protein kinases, which display weak relationships between sequence phylogeny and inhibitor specificity. Thus, our method can be applied to the broad area of protein–ligand interactions.
Affiliation(s)
- Dmitry Karasev
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow 119121, Russia
| | - Boris Sobolev
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow 119121, Russia
| | - Alexey Lagunin
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow 119121, Russia
- Department of Bioinformatics, Russian National Research Medical University, Moscow 117997, Russia
| | - Dmitry Filimonov
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow 119121, Russia
| | - Vladimir Poroikov
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow 119121, Russia
|
19
|
Kim D, Han SK, Lee K, Kim I, Kong J, Kim S. Evolutionary coupling analysis identifies the impact of disease-associated variants at less-conserved sites. Nucleic Acids Res 2019; 47:e94. [PMID: 31199866 PMCID: PMC6895274 DOI: 10.1093/nar/gkz536] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 11/15/2018] [Revised: 05/03/2019] [Accepted: 06/05/2019] [Indexed: 12/20/2022] Open
Abstract
Genome-wide association studies have discovered a large number of genetic variants in human patients with the disease. Thus, predicting the impact of these variants is important for sorting disease-associated variants (DVs) from neutral variants. Current methods to predict the mutational impacts depend on evolutionary conservation at the mutation site, which is determined using homologous sequences and based on the assumption that variants at well-conserved sites have high impacts. However, many DVs at less-conserved but functionally important sites cannot be predicted by the current methods. Here, we present a method to find DVs at less-conserved sites by predicting the mutational impacts using evolutionary coupling analysis. Functionally important and evolutionarily coupled sites often have compensatory variants on cooperative sites to avoid loss of function. We found that our method identified known intolerant variants in a diverse group of proteins. Furthermore, at less-conserved sites, we identified DVs that were not identified using conservation-based methods. These newly identified DVs were frequently found at protein interaction interfaces, where species-specific mutations often alter interaction specificity. This work presents a means to identify less-conserved DVs and provides insight into the relationship between evolutionarily coupled sites and human DVs.
Affiliation(s)
- Donghyo Kim
- Department of Life Sciences, Pohang University of Science and Technology, Pohang 790-784, Korea
| | - Seong Kyu Han
- Department of Life Sciences, Pohang University of Science and Technology, Pohang 790-784, Korea
| | - Kwanghwan Lee
- Department of Life Sciences, Pohang University of Science and Technology, Pohang 790-784, Korea
| | - Inhae Kim
- Department of Life Sciences, Pohang University of Science and Technology, Pohang 790-784, Korea
| | - JungHo Kong
- Department of Life Sciences, Pohang University of Science and Technology, Pohang 790-784, Korea
| | - Sanguk Kim
- Department of Life Sciences, Pohang University of Science and Technology, Pohang 790-784, Korea
|
20
|
Inoue A, Raimondi F, Kadji FMN, Singh G, Kishi T, Uwamizu A, Ono Y, Shinjo Y, Ishida S, Arang N, Kawakami K, Gutkind JS, Aoki J, Russell RB. Illuminating G-Protein-Coupling Selectivity of GPCRs. Cell 2019; 177:1933-1947.e25. [PMID: 31160049 DOI: 10.1016/j.cell.2019.04.044] [Citation(s) in RCA: 410] [Impact Index Per Article: 68.3] [Received: 10/25/2018] [Revised: 01/28/2019] [Accepted: 04/25/2019] [Indexed: 12/20/2022]
Abstract
Heterotrimeric G proteins consist of four subfamilies (Gs, Gi/o, Gq/11, and G12/13) that mediate signaling via G-protein-coupled receptors (GPCRs), principally by receptors binding Gα C termini. G-protein-coupling profiles govern GPCR-induced cellular responses, yet receptor sequence selectivity determinants remain elusive. Here, we systematically quantified ligand-induced interactions between 148 GPCRs and all 11 unique Gα subunit C termini. For each receptor, we probed chimeric Gα subunit activation via a transforming growth factor-α (TGF-α) shedding response in HEK293 cells lacking endogenous Gq/11 and G12/13 proteins, and complemented G-protein-coupling profiles through a NanoBiT-G-protein dissociation assay. Interrogation of the dataset identified sequence-based coupling specificity features, inside and outside the transmembrane domain, which we used to develop a coupling predictor that outperforms previous methods. We used the predictor to engineer designer GPCRs selectively coupled to G12. This dataset of fine-tuned signaling mechanisms for diverse GPCRs is a valuable resource for research in GPCR signaling.
Affiliation(s)
- Asuka Inoue
- Graduate School of Pharmaceutical Sciences, Tohoku University, Sendai, Miyagi 980-8578, Japan; Advanced Research & Development Programs for Medical Innovation (PRIME), Japan Agency for Medical Research and Development (AMED), Chiyoda-ku, Tokyo 100-0004, Japan; Advanced Research & Development Programs for Medical Innovation (LEAP), AMED, Chiyoda-ku, Tokyo 100-0004, Japan.
| | - Francesco Raimondi
- CellNetworks, Bioquant, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany; Biochemie Zentrum Heidelberg (BZH), Heidelberg University, Im Neuenheimer Feld 328, 69120 Heidelberg, Germany.
| | | | - Gurdeep Singh
- CellNetworks, Bioquant, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany; Biochemie Zentrum Heidelberg (BZH), Heidelberg University, Im Neuenheimer Feld 328, 69120 Heidelberg, Germany
| | - Takayuki Kishi
- Graduate School of Pharmaceutical Sciences, Tohoku University, Sendai, Miyagi 980-8578, Japan
| | - Akiharu Uwamizu
- Graduate School of Pharmaceutical Sciences, Tohoku University, Sendai, Miyagi 980-8578, Japan
| | - Yuki Ono
- Graduate School of Pharmaceutical Sciences, Tohoku University, Sendai, Miyagi 980-8578, Japan
| | - Yuji Shinjo
- Graduate School of Pharmaceutical Sciences, Tohoku University, Sendai, Miyagi 980-8578, Japan
| | - Satoru Ishida
- Graduate School of Pharmaceutical Sciences, Tohoku University, Sendai, Miyagi 980-8578, Japan
| | - Nadia Arang
- Department of Pharmacology and Moores Cancer Center, University of California, San Diego, La Jolla, CA 92093, USA
| | - Kouki Kawakami
- Graduate School of Pharmaceutical Sciences, Tohoku University, Sendai, Miyagi 980-8578, Japan
| | - J Silvio Gutkind
- Department of Pharmacology and Moores Cancer Center, University of California, San Diego, La Jolla, CA 92093, USA
| | - Junken Aoki
- Graduate School of Pharmaceutical Sciences, Tohoku University, Sendai, Miyagi 980-8578, Japan; Advanced Research & Development Programs for Medical Innovation (LEAP), AMED, Chiyoda-ku, Tokyo 100-0004, Japan
| | - Robert B Russell
- CellNetworks, Bioquant, Heidelberg University, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany; Biochemie Zentrum Heidelberg (BZH), Heidelberg University, Im Neuenheimer Feld 328, 69120 Heidelberg, Germany.
|
21
|
Banerjee A, Vishwakarma P, Kumar A, Lynn AM, Prasad R. Information theoretic measures and mutagenesis identify a novel linchpin residue involved in substrate selection within the nucleotide-binding domain of an ABCG family exporter Cdr1p. Arch Biochem Biophys 2019; 663:143-150. [DOI: 10.1016/j.abb.2019.01.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/24/2018] [Revised: 12/23/2018] [Accepted: 01/12/2019] [Indexed: 10/27/2022]
|
22
|
Gil N, Fiser A. The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis. Bioinformatics 2019; 35:12-19. [PMID: 29947739 PMCID: PMC6298051 DOI: 10.1093/bioinformatics/bty523] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 02/23/2018] [Revised: 04/20/2018] [Accepted: 06/26/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation The analysis of sequence conservation patterns has been widely utilized to identify functionally important (catalytic and ligand-binding) protein residues for over a half-century. Despite decades of development, on average state-of-the-art non-template-based functional residue prediction methods must predict ∼25% of a protein's total residues to correctly identify half of the protein's functional site residues. The overwhelming proportion of false positives results in reported 'F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs). Results The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis. Supplementary information Supplementary data are available at Bioinformatics online.
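The relationship between the two figures quoted in this abstract (predicting about 25% of residues to recover half of the functional sites, and an F-Score of about 0.3) can be checked with a short Python sketch. The protein length and the share of residues that are functional are not stated in the abstract; the values below (200 residues, 10% functional) are hypothetical and chosen only to make the arithmetic concrete.

```python
def f_score(true_pos, false_pos, false_neg):
    """F1 score: harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0  # no correct predictions, so precision and recall are both zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical protein: 200 residues, of which 20 (10%) are functional-site residues.
protein_length = 200
functional_sites = 20

predicted = protein_length // 4            # predictor flags 25% of residues (50)
true_pos = functional_sites // 2           # half of the functional sites are found (10)
false_pos = predicted - true_pos           # flagged residues that are not functional (40)
false_neg = functional_sites - true_pos    # functional residues that were missed (10)

score = f_score(true_pos, false_pos, false_neg)
print(f"precision = {true_pos / predicted:.2f}, "
      f"recall = {true_pos / functional_sites:.2f}, F = {score:.2f}")
```

Under these assumed proportions precision is 0.2 and recall is 0.5, so the F-Score comes out close to 0.29, in line with the value of about 0.3 reported in the abstract. A different assumed share of functional residues would shift the result.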
Affiliation(s)
- Nelson Gil
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA
| | - Andras Fiser
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA
|
23
|
Płuciennik A, Stolarczyk M, Bzówka M, Raczyńska A, Magdziarz T, Góra A. BALCONY: an R package for MSA and functional compartments of protein variability analysis. BMC Bioinformatics 2018; 19:300. [PMID: 30107777 PMCID: PMC6092823 DOI: 10.1186/s12859-018-2294-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 01/03/2018] [Accepted: 07/23/2018] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Here, we present an R package for entropy/variability analysis that facilitates prompt and convenient data extraction, manipulation and visualization of protein features from multiple sequence alignments. BALCONY can work with residues dispersed across a protein sequence and map them on the corresponding alignment of homologous protein sequences. Additionally, it provides several entropy and variability scores that indicate the conservation of each residue. RESULTS Our package allows the user to visualize evolutionary variability by locating the positions most likely to vary and to assess mutation candidates in protein engineering. CONCLUSION In comparison to other R packages BALCONY allows conservation/variability analysis in context of protein structure with linkage of the appropriate metrics with physicochemical features of user choice. AVAILABILITY CRAN project page: https://cran.r-project.org/package=BALCONY and our website: http://www.tunnelinggroup.pl/software/ for major platforms: Linux/Unix, Windows and Mac OS X.
Affiliation(s)
- Alicja Płuciennik
- Tunneling Group, Biotechnology Centre, Silesian University of Technology, ul. Krzywoustego 8, 44-100, Gliwice, Poland; Institute of Automatic Control, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland
| | - Michał Stolarczyk
- Tunneling Group, Biotechnology Centre, Silesian University of Technology, ul. Krzywoustego 8, 44-100, Gliwice, Poland; Institute of Automatic Control, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland
| | - Maria Bzówka
- Tunneling Group, Biotechnology Centre, Silesian University of Technology, ul. Krzywoustego 8, 44-100, Gliwice, Poland; Faculty of Chemistry, Silesian University of Technology, ks. Marcina Strzody 9, 44-100, Gliwice, Poland
| | - Agata Raczyńska
- Tunneling Group, Biotechnology Centre, Silesian University of Technology, ul. Krzywoustego 8, 44-100, Gliwice, Poland; Institute of Automatic Control, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland
| | - Tomasz Magdziarz
- Tunneling Group, Biotechnology Centre, Silesian University of Technology, ul. Krzywoustego 8, 44-100, Gliwice, Poland
| | - Artur Góra
- Tunneling Group, Biotechnology Centre, Silesian University of Technology, ul. Krzywoustego 8, 44-100, Gliwice, Poland.
|
24
|
Harnessing the evolutionary information on oxygen binding proteins through Support Vector Machines based modules. BMC Res Notes 2018; 11:290. [PMID: 29751818 PMCID: PMC5948687 DOI: 10.1186/s13104-018-3383-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/16/2018] [Accepted: 04/30/2018] [Indexed: 02/06/2023] Open
Abstract
Objectives With the arrival of free oxygen on the globe, aerobic life became possible. Oxygen-binding proteins are widespread in the biosphere and are found in all groups of organisms, including prokaryotes and eukaryotes (fungi, plants, and animals). The exponential growth and availability of freshly annotated protein sequences in the databases motivated us to develop an improved version of “Oxypred” for identifying oxygen-binding proteins. Results In this study, we propose a method for identifying oxy-proteins with two different sequence similarity cutoffs, 50 and 90%. Different amino acid composition-based Support Vector Machine models were developed, including evolutionary profiles in the form of position-specific scoring matrices (PSSM). Fivefold cross-validation was applied to evaluate the prediction performance. We also compared our models with existing methods, which show nearly 97% recognition, whereas our newly developed models were able to recognize almost 99.99 and 100% in the oxy-50 and oxy-90% similarity models, respectively. Our results show that our approaches are faster and achieve better prediction performance than the existing methods. The web server Oxypred2 was developed as an alternative method for identifying oxy-proteins, with additional modules including PSSM, and is available at http://bioinfo.imtech.res.in/servers/muthu/oxypred2/home.html. Electronic supplementary material The online version of this article (10.1186/s13104-018-3383-9) contains supplementary material, which is available to authorized users.
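As a rough illustration of the two ingredients named in this abstract, the sketch below computes a 20-dimensional amino acid composition vector and generates fivefold train/test index splits. It is not the Oxypred2 implementation: the function names, the interleaved fold assignment, and the handling of non-standard residues are assumptions, and the SVM training step itself is left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard characters (X, gaps, etc.) are skipped, which is an
    assumption of this sketch rather than a documented choice.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    Sample i is assigned to fold i % k, so every sample is tested exactly once.
    """
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


# Each protein becomes one feature vector that a classifier could be trained on.
features = amino_acid_composition("MKVLAAGIVGLLLA")
for train, test in k_fold_indices(10):
    print(len(train), len(test))
```

In practice the folds would usually be shuffled and stratified by class; the deterministic assignment here only keeps the example easy to follow.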
|
25
|
Karasev DA, Veselovsky AV, Lagunin AA, Filimonov DA, Sobolev BN. Determination of Amino Acid Residues Responsible for Specific Interaction of Protein Kinases with Small Molecule Inhibitors. Mol Biol 2018. [DOI: 10.1134/s002689331802005x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
|
26
|
Garrido-Martín D, Pazos F. Effect of the sequence data deluge on the performance of methods for detecting protein functional residues. BMC Bioinformatics 2018; 19:67. [PMID: 29482506 PMCID: PMC5827975 DOI: 10.1186/s12859-018-2084-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2017] [Accepted: 02/21/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The exponential accumulation of new sequences in public databases is expected to improve the performance of all the approaches for predicting protein structural and functional features. Nevertheless, this was never assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely. RESULTS In this work we evaluate how the growth and change with time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do that by recreating historical versions of the multiple sequence alignments that would have been obtained in the past based on the database contents at different time points, covering a period of 20 years. Applying the methods to these historical alignments allows quantifying the temporal variation in their performance. Our results show that the number of families to which these methods can be applied sharply increases with time, while their ability to detect potentially functional residues remains almost constant. CONCLUSIONS These results are informative for the methods' developers and final users, and may have implications in the design of new sequencing initiatives.
Affiliation(s)
- Diego Garrido-Martín
- Present address: Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, c/ Dr. Aiguader, 88, 08003, Barcelona, Spain.,Present address: Universitat Pompeu Fabra (UPF), Plaça de la Mercè, 10-12, 08002, Barcelona, Spain
| | - Florencio Pazos
- Computational Systems Biology Group, Systems Biology Program, National Centre for Biotechnology (CNB-CSIC), c/ Darwin, 3, 28049, Madrid, Spain.
|
27
|
Kalaivani R, Reema R, Srinivasan N. Recognition of sites of functional specialisation in all known eukaryotic protein kinase families. PLoS Comput Biol 2018; 14:e1005975. [PMID: 29438395 PMCID: PMC5826538 DOI: 10.1371/journal.pcbi.1005975] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2017] [Revised: 02/26/2018] [Accepted: 01/13/2018] [Indexed: 11/25/2022] Open
Abstract
The conserved function of protein phosphorylation, catalysed by members of protein kinase superfamily, is regulated in different ways in different kinase families. Further, differences in activating triggers, cellular localisation, domain architecture and substrate specificity between kinase families are also well known. While the transfer of γ-phosphate from ATP to the hydroxyl group of Ser/Thr/Tyr is mediated by a conserved Asp, the characteristic functional and regulatory sites are specialized at the level of families or sub-families. Such family-specific sites of functional specialization are unknown for most families of kinases. In this work, we systematically identify the family-specific residue features by comparing the extent of conservation of physicochemical properties, Shannon entropy and statistical probability of residue distributions between families of kinases. An integrated discriminatory score, which combines these three features, is developed to demarcate the functionally specialized sites in a kinase family from other sites. We achieved an area under ROC curve of 0.992 for the discrimination of kinase families. Our approach was extensively tested on well-studied families CDK and MAPK, wherein specific protein interaction sites and substrate recognition sites were successfully detected (p-value < 0.05). We also find that the known family-specific oncogenic driver mutation sites were scored high by our method. The method was applied to all known kinases encompassing 107 families from diverse eukaryotic organisms leading to a comprehensive list of family-specific functional sites. Apart from other uses, our method facilitates identification of specific protein interaction sites and drug target sites in a kinase family.
Affiliation(s)
- Raju Kalaivani
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Karnataka, India
| | - Raju Reema
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Karnataka, India
|
28
|
Molano EPL, Cabrera OG, Jose J, do Nascimento LC, Carazzolle MF, Teixeira PJPL, Alvarez JC, Tiburcio RA, Tokimatu Filho PM, de Lima GMA, Guido RVC, Corrêa TLR, Leme AFP, Mieczkowski P, Pereira GAG. Ceratocystis cacaofunesta genome analysis reveals a large expansion of extracellular phosphatidylinositol-specific phospholipase-C genes (PI-PLC). BMC Genomics 2018; 19:58. [PMID: 29343217 PMCID: PMC5773145 DOI: 10.1186/s12864-018-4440-4] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2017] [Accepted: 01/08/2018] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND The Ceratocystis genus harbors a large number of phytopathogenic fungi that cause xylem parenchyma degradation and vascular destruction on a broad range of economically important plants. Ceratocystis cacaofunesta is a necrotrophic fungus responsible for lethal wilt disease in cacao. The aim of this work is to analyze the genome of C. cacaofunesta through a comparative approach with genomes of other Sordariomycetes in order to better understand the molecular basis of pathogenicity in the Ceratocystis genus. RESULTS We present an analysis of the C. cacaofunesta genome focusing on secreted proteins that might constitute pathogenicity factors. Comparative genome analyses among five Ceratocystidaceae species and 23 other Sordariomycetes fungi showed a strong reduction in gene content of the Ceratocystis genus. However, some gene families displayed a remarkable expansion, in particular, the Phosphatidylinositol specific phospholipases-C (PI-PLC) family. Also, evolutionary rate calculations suggest that the evolution process of this family was guided by positive selection. Interestingly, among the 82 PI-PLCs genes identified in the C. cacaofunesta genome, 70 genes encoding extracellular PI-PLCs are grouped in eight small scaffolds surrounded by transposon fragments and scars that could be involved in the rapid evolution of the PI-PLC family. Experimental secretome using LC-MS/MS validated 24% (86 proteins) of the total predicted secretome (342 proteins), including four PI-PLCs and other important pathogenicity factors. CONCLUSION Analysis of the Ceratocystis cacaofunesta genome provides evidence that PI-PLCs may play a role in pathogenicity. Subsequent functional studies will be aimed at evaluating this hypothesis. The observed genetic arsenals, together with the analysis of the PI-PLC family shown in this work, reveal significant differences in the Ceratocystis genome compared to the classical vascular fungi, Verticillium and Fusarium. 
Altogether, our analyses provide new insights into the evolution and the molecular basis of plant pathogenicity.
Affiliation(s)
- Eddy Patricia Lopez Molano
- Genomic and Expression Laboratory, Department of Genetics, Evolution and Bioagents, Institute of Biology, University of Campinas, Campinas, SP, 13083-970, Brazil
| | - Odalys García Cabrera
- Genomic and Expression Laboratory, Department of Genetics, Evolution and Bioagents, Institute of Biology, University of Campinas, Campinas, SP, 13083-970, Brazil
| | - Juliana Jose
- Genomic and Expression Laboratory, Department of Genetics, Evolution and Bioagents, Institute of Biology, University of Campinas, Campinas, SP, 13083-970, Brazil
| | | | - Marcelo Falsarella Carazzolle
- Genomic and Expression Laboratory, Department of Genetics, Evolution and Bioagents, Institute of Biology, University of Campinas, Campinas, SP, 13083-970, Brazil.,Centro Nacional de Processamento de Alto Desempenho, Universidade Estadual de Campinas, Campinas, Brazil
| | - Paulo José Pereira Lima Teixeira
- Genomic and Expression Laboratory, Department of Genetics, Evolution and Bioagents, Institute of Biology, University of Campinas, Campinas, SP, 13083-970, Brazil.,Present Address: Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
| | - Javier Correa Alvarez
- Departamento de Ciencias Biológicas, Escuela de Ciencias, Universidad EAFIT, Medellín, Colombia
| | - Ricardo Augusto Tiburcio
- Genomic and Expression Laboratory, Department of Genetics, Evolution and Bioagents, Institute of Biology, University of Campinas, Campinas, SP, 13083-970, Brazil
| | - Paulo Massanari Tokimatu Filho
- Genomic and Expression Laboratory, Department of Genetics, Evolution and Bioagents, Institute of Biology, University of Campinas, Campinas, SP, 13083-970, Brazil
| | - Gustavo Machado Alvares de Lima
- Centro de Biotecnologia Molecular Estrutural, Instituto de Física de São Carlos, Universidade de São Paulo, São Paulo, Brazil
| | - Rafael Victório Carvalho Guido
- Centro de Biotecnologia Molecular Estrutural, Instituto de Física de São Carlos, Universidade de São Paulo, São Paulo, Brazil
| | - Thamy Lívia Ribeiro Corrêa
- Genomic and Expression Laboratory, Department of Genetics, Evolution and Bioagents, Institute of Biology, University of Campinas, Campinas, SP, 13083-970, Brazil
| | | | - Piotr Mieczkowski
- High-Throughput Sequencing Facility, University of North Carolina, Chapel Hill, NC, USA
| | - Gonçalo Amarante Guimarães Pereira
- Genomic and Expression Laboratory, Department of Genetics, Evolution and Bioagents, Institute of Biology, University of Campinas, Campinas, SP, 13083-970, Brazil.
|
29
|
Neuwald AF, Aravind L, Altschul SF. Inferring joint sequence-structural determinants of protein functional specificity. eLife 2018; 7. [PMID: 29336305 PMCID: PMC5770160 DOI: 10.7554/elife.29880] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2017] [Accepted: 12/22/2017] [Indexed: 01/05/2023] Open
Abstract
Residues responsible for allostery, cooperativity, and other subtle but functionally important interactions remain difficult to detect. To aid such detection, we employ statistical inference based on the assumption that residues distinguishing a protein subgroup from evolutionarily divergent subgroups often constitute an interacting functional network. We identify such networks with the aid of two measures of statistical significance. One measure aids identification of divergent subgroups based on distinguishing residue patterns. For each subgroup, a second measure identifies structural interactions involving pattern residues. Such interactions are derived either from atomic coordinates or from Direct Coupling Analysis scores, used as surrogates for structural distances. Applying this approach to N-acetyltransferases, P-loop GTPases, RNA helicases, synaptojanin-superfamily phosphatases and nucleases, and thymine/uracil DNA glycosylases yielded results congruent with biochemical understanding of these proteins, and also revealed striking sequence-structural features overlooked by other methods. These and similar analyses can aid the design of drugs targeting allosteric sites.
Affiliation(s)
- Andrew F Neuwald
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, United States.,Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, United States
| | - L Aravind
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, United States
| | - Stephen F Altschul
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, United States
|
30
|
Slama P. Two-domain analysis of JmjN-JmjC and PHD-JmjC lysine demethylases: Detecting an inter-domain evolutionary stress. Proteins 2017; 86:3-12. [PMID: 28975662 DOI: 10.1002/prot.25394] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2017] [Revised: 09/26/2017] [Accepted: 10/03/2017] [Indexed: 11/09/2022]
Abstract
Residues at different positions of a multiple sequence alignment sometimes evolve together, due to a correlated structural or functional stress at these positions. Co-evolution has thus been evidenced computationally in multiple proteins or protein domains. Here, we wish to study whether an evolutionary stress is exerted on a sequence alignment across protein domains, i.e., on longer sequence separations than within a single protein domain. JmjC-containing lysine demethylases were chosen for analysis, as a follow-up to previous studies; these proteins are important multidomain epigenetic regulators. In these proteins, the JmjC domain is responsible for the demethylase activity, and surrounding domains interact with histones, DNA or partner proteins. This family of enzymes was analyzed at the sequence level, in order to determine whether the sequence of JmjC-domains was affected by the presence of a neighboring JmjN domain or PHD finger in the protein. Multiple positions within JmjC sequences were shown to have their residue distributions significantly altered by the presence of the second domain. Structural considerations confirmed the relevance of the analysis for JmjN-JmjC proteins, while among PHD-JmjC proteins, the length of the linker region could be correlated to the residues observed at the most affected positions. The correlation of domain architecture with residue types at certain positions, as well as that of overall architecture with protein function, is discussed. The present results thus evidence the existence of an across-domain evolutionary stress in JmjC-containing demethylases, and provide further insights into the overall domain architecture of JmjC domain-containing proteins.
Affiliation(s)
- Patrick Slama
- Independent researcher, Paris, France; Center for Imaging Science, the Johns Hopkins University, Clark Hall, 3400 N Charles Street, Baltimore, Maryland, 21218
|
31
|
Nagata S, Imai J, Makino G, Tomita M, Kanai A. Evolutionary Analysis of HIV-1 Pol Proteins Reveals Representative Residues for Viral Subtype Differentiation. Front Microbiol 2017; 8:2151. [PMID: 29163435 PMCID: PMC5666293 DOI: 10.3389/fmicb.2017.02151] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2017] [Accepted: 10/20/2017] [Indexed: 11/15/2022] Open
Abstract
RNA viruses have been used as model systems to understand the patterns and processes of molecular evolution because they have high mutation rates and are genetically diverse. Human immunodeficiency virus 1 (HIV-1), the etiological agent of acquired immune deficiency syndrome, is highly genetically diverse, and is classified into several groups and subtypes. However, it has been difficult to use its diverse sequences to establish the overall phylogenetic relationships of different strains or the trends in sequence conservation with the construction of phylogenetic trees. Our aims were to systematically characterize HIV-1 subtype evolution and to identify the regions responsible for HIV-1 subtype differentiation at the amino acid level in the Pol protein, which is often used to classify the HIV-1 subtypes. In this study, we systematically characterized the mutation sites in 2,052 Pol proteins from HIV-1 group M (144 subtype A; 1,528 subtype B; 380 subtype C), using sequence similarity networks. We also used spectral clustering to group the sequences based on the network graph structures. A stepwise analysis of the cluster hierarchies allowed us to estimate a possible evolutionary pathway for the Pol proteins. The subtype A sequences also clustered according to when and where the viruses were isolated, whereas both the subtype B and C sequences remained as single clusters. Because the Pol protein has several functional domains, we identified the regions that are discriminative by comparing the structures of the domain-based networks. Our results suggest that sequence changes in the RNase H domain and the reverse transcriptase (RT) connection domain are responsible for the subtype classification. By analyzing the different amino acid compositions at each site in both domain sequences, we found that a few specific amino acid residues (i.e., M357 in the RT connection domain and Q480, Y483, and L491 in the RNase H domain) represent the differences among the subtypes. 
These residues were located on the surface of the RT structure and in the vicinity of the amino acid sites responsible for RT enzymatic activity or function.
Affiliation(s)
- Shohei Nagata
- Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan.,Faculty of Environment and Information Studies, Keio University, Fujisawa, Japan
| | - Junnosuke Imai
- Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan.,Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa, Japan
| | - Gakuto Makino
- Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan
| | - Masaru Tomita
- Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan.,Faculty of Environment and Information Studies, Keio University, Fujisawa, Japan.,Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa, Japan
| | - Akio Kanai
- Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan.,Faculty of Environment and Information Studies, Keio University, Fujisawa, Japan.,Systems Biology Program, Graduate School of Media and Governance, Keio University, Fujisawa, Japan
|
32
|
Effective estimation of the minimum number of amino acid residues required for functional divergence between duplicate genes. Mol Phylogenet Evol 2017; 113:126-138. [DOI: 10.1016/j.ympev.2017.05.010] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2017] [Revised: 03/19/2017] [Accepted: 05/10/2017] [Indexed: 01/10/2023]
|
33
|
Neuwald AF, Altschul SF. Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations. PLoS Comput Biol 2016; 12:e1005294. [PMID: 28002465 PMCID: PMC5225019 DOI: 10.1371/journal.pcbi.1005294] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2016] [Revised: 01/10/2017] [Accepted: 12/08/2016] [Indexed: 11/25/2022] Open
Abstract
Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups filling various functional niches. At the sequence level, such divergence appears as correlations that arise from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations one may hope to obtain information regarding functionally-relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes’ theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this reveals sequence and structural features indicative of functionally important, yet generally unknown biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, this can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available (psed.igs.umaryland.edu). Protein sequence data, when gathered in great quantity, contain important but implicit biological information manifest as statistical correlations. Here we describe an approach to access this information by comprehensively modeling and characterizing the distribution of sequences belonging to a major protein superfamily. 
This approach takes as input a large set of unaligned sequences belonging to the superfamily. By applying the minimum description length principle, it seeks the statistical model that best explains the sequences while avoiding over-fitting the data. It concurrently aligns the sequences and, to model evolutionary divergence, partitions them into subgroups that are hierarchically-arranged based upon correlated residue patterns. Auxiliary routines create PyMOL scripts to visualize the locations of correlated residues within available structures. Because these correlations likely arise from structural and biochemical constraints, they can help elucidate protein properties important for functional specificity. Comparing and contrasting sequence and structural features in this way may therefore suggest, in the light of published studies, plausible biological hypotheses for experimental investigation. We illustrate this approach with N-acetyltransferases.
Affiliation(s)
- Andrew F. Neuwald
- Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, BioPark II, Room 617, Baltimore, MD, United States of America
| | - Stephen F. Altschul
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States of America
|
34
|
Sloutsky R, Naegle KM. High-Resolution Identification of Specificity Determining Positions in the LacI Protein Family Using Ensembles of Sub-Sampled Alignments. PLoS One 2016; 11:e0162579. [PMID: 27681038 PMCID: PMC5040260 DOI: 10.1371/journal.pone.0162579] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2016] [Accepted: 08/08/2016] [Indexed: 01/24/2023] Open
Abstract
Since the advent of large-scale genomic sequencing, and the consequent availability of large numbers of homologous protein sequences, there has been burgeoning development of methods for extracting functional information from multiple sequence alignments (MSAs). One type of analysis seeks to identify specificity determining positions (SDPs) based on the assumption that such positions are highly conserved within groups of sequences sharing functional specificity, but conserved to different amino acids in different specificity groups. This unsupervised approach to utilizing evolutionary information may elucidate mechanisms of specificity in protein-protein interactions, catalytic activity of enzymes, sensitivity to allosteric regulation, and other types of protein functionality. We present an analysis of SDPs in the LacI family of transcriptional regulators in which we 1) relax the constraint that all specificity groups must contribute to SDP signal, and 2) use a novel approach to robust treatment of sequence alignment uncertainty based on sub-sampling. We find that the vast majority of SDP signal occurs at positions with a conservation pattern that significantly complicates detection by previously described methods. This pattern, which we term “partial SDP”, consists of the commonly accepted SDP conservation pattern among a subset of specificity groups and strong degeneracy among the rest. An upshot of this fact is that the SDP complement of every specificity group appears to be unique. Additionally, sub-sampling gives us the ability to assign a confidence interval to the SDP score, as well as increase fidelity, as compared to analysis of a single, comprehensive alignment—the current standard in multiple sequence alignment methodologies.
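The SDP conservation pattern described here (conserved within each specificity group, but to different residues across groups) can be sketched with a toy score. This is not the scoring function used by the authors: `sdp_score`, `subsampled_scores`, the group labels, and the sequences are all hypothetical. The sub-sampling step also only draws rows from one fixed alignment, a simplification of the ensembles of sub-sampled alignments named in the title.

```python
import random
from collections import Counter


def sdp_score(groups, col):
    """Toy SDP score in [0, 1] for one alignment column.

    `groups` maps a specificity-group label to its aligned sequences.
    The score is the mean within-group conservation multiplied by the
    fraction of groups whose majority residue is distinct.
    """
    if len(groups) < 2:
        return 0.0
    majority, conservation = [], []
    for seqs in groups.values():
        residue, count = Counter(s[col] for s in seqs).most_common(1)[0]
        majority.append(residue)
        conservation.append(count / len(seqs))
    within = sum(conservation) / len(conservation)
    between = (len(set(majority)) - 1) / (len(groups) - 1)
    return within * between


def subsampled_scores(groups, col, fraction=0.8, n_rounds=100, seed=0):
    """Recompute the score on random subsets of each group.

    The spread of the returned scores gives a rough idea of how stable
    the signal at this column is.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        sample = {
            label: rng.sample(seqs, max(1, round(fraction * len(seqs))))
            for label, seqs in groups.items()
        }
        scores.append(sdp_score(sample, col))
    return scores


toy_groups = {
    "group_1": ["AKL", "AKL", "AKI"],
    "group_2": ["ARL", "ARL", "ARV"],
}
for col in range(3):
    print(col, sdp_score(toy_groups, col))
```

In the toy data only the middle column follows the SDP pattern (K in one group, R in the other); the first column is conserved across the whole family and the third shares a majority residue between groups, so both score zero.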
Affiliation(s)
- Roman Sloutsky
- Biomedical Engineering Department, Washington University in St. Louis, St. Louis, Missouri, 63130, United States of America
- Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, Missouri, 63130, United States of America
| | - Kristen M. Naegle
- Biomedical Engineering Department, Washington University in St. Louis, St. Louis, Missouri, 63130, United States of America
- Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, Missouri, 63130, United States of America
|
35
|
Boari de Lima E, Meira W, de Melo-Minardi RC. Isofunctional Protein Subfamily Detection Using Data Integration and Spectral Clustering. PLoS Comput Biol 2016; 12:e1005001. [PMID: 27348631 PMCID: PMC4922564 DOI: 10.1371/journal.pcbi.1005001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2015] [Accepted: 05/22/2016] [Indexed: 01/14/2023] Open
Abstract
As increasingly more genomes are sequenced, the vast majority of proteins may only be annotated computationally, given that experimental investigation is extremely costly. This highlights the need for computational methods to determine protein functions quickly and reliably. We believe dividing a protein family into subtypes which share specific functions uncommon to the whole family reduces the function annotation problem's complexity. Hence, this work's purpose is to detect isofunctional subfamilies inside a family of unknown function, while identifying differentiating residues. Similarity between protein pairs according to various properties is interpreted as functional similarity evidence. Data are integrated using genetic programming and provided to a spectral clustering algorithm, which creates clusters of similar proteins. The proposed framework was applied to well-known protein families and to a family of unknown function, then compared to ASMC. Results showed our fully automated technique obtained better clusters than ASMC for two families, and equivalent results for the other two, including one whose clusters were manually defined. Clusters produced by our framework showed great correspondence with the known subfamilies, and were more contrasting than those produced by ASMC. Additionally, for the families whose specificity determining positions are known, such residues were among those our technique considered most important to differentiate a given group. When run with the crotonase and enolase SFLD superfamilies, the results showed great agreement with this gold standard. Best results consistently involved multiple data types, thus confirming our hypothesis that similarities according to different knowledge domains may be used as functional similarity evidence. 
Our main contributions are the proposed strategy for selecting and integrating data types, along with the ability to work with noisy and incomplete data; domain knowledge usage for detecting subfamilies in a family with different specificities, thus reducing the complexity of the experimental function characterization problem; and the identification of residues responsible for specificity.
Affiliation(s)
- Elisa Boari de Lima
- Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Wagner Meira
- Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
|
36
|
Schwarz RF, Tamuri AU, Kultys M, King J, Godwin J, Florescu AM, Schultz J, Goldman N. ALVIS: interactive non-aggregative visualization and explorative analysis of multiple sequence alignments. Nucleic Acids Res 2016; 44:e77. [PMID: 26819408 PMCID: PMC4856975 DOI: 10.1093/nar/gkw022] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2015] [Accepted: 01/08/2016] [Indexed: 12/19/2022] Open
Abstract
Sequence Logos and its variants are the most commonly used method for visualization of multiple sequence alignments (MSAs) and sequence motifs. They provide consensus-based summaries of the sequences in the alignment. Consequently, individual sequences cannot be identified in the visualization and covariant sites are not easily discernible. We recently proposed Sequence Bundles, a motif visualization technique that maintains a one-to-one relationship between sequences and their graphical representation and visualizes covariant sites. We here present Alvis, an open-source platform for the joint explorative analysis of MSAs and phylogenetic trees, employing Sequence Bundles as its main visualization method. Alvis combines the power of the visualization method with an interactive toolkit allowing detection of covariant sites, annotation of trees with synapomorphies and homoplasies, and motif detection. It also offers numerical analysis functionality, such as dimension reduction and classification. Alvis is user-friendly, highly customizable and can export results in publication-quality figures. It is available as a full-featured standalone version (http://www.bitbucket.org/rfs/alvis) and its Sequence Bundles visualization module is further available as a web application (http://science-practice.com/projects/sequence-bundles).
Affiliation(s)
- Roland F Schwarz
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK
- Asif U Tamuri
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK
- Marek Kultys
  - Science Practice, 83-85 Paul Street, London, EC2A 4NQ, UK
- James King
  - Science Practice, 83-85 Paul Street, London, EC2A 4NQ, UK
- James Godwin
  - Science Practice, 83-85 Paul Street, London, EC2A 4NQ, UK
- Ana M Florescu
  - Science Practice, 83-85 Paul Street, London, EC2A 4NQ, UK
- Jörg Schultz
  - Center for Computational and Theoretical Biology and Department of Bioinformatics, University of Würzburg, Biocenter, Am Hubland, 97074 Würzburg, Germany
- Nick Goldman
  - European Molecular Biology Laboratory-European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK
|
37
|
Das S, Orengo CA. Protein function annotation using protein domain family resources. Methods 2016; 93:24-34. [DOI: 10.1016/j.ymeth.2015.09.029] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 07/02/2015] [Revised: 09/28/2015] [Accepted: 09/29/2015] [Indexed: 01/25/2023] Open
|
38
|
Karasev DA, Veselovsky AV, Oparina NY, Filimonov DA, Sobolev BN. Prediction of amino acid positions specific for functional groups in a protein family based on local sequence similarity. J Mol Recognit 2015; 29:159-69. [DOI: 10.1002/jmr.2515] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 08/31/2015] [Revised: 09/28/2015] [Accepted: 09/30/2015] [Indexed: 01/24/2023]
Affiliation(s)
- Dmitry A. Karasev
  - Russian National Research Medical University, Moscow, Russia
  - Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry (IBMC), Moscow, Russia
- Alexander V. Veselovsky
  - Laboratory of Structure Bioinformatics, Institute of Biomedical Chemistry (IBMC), Moscow, Russia
- Nina Yu. Oparina
  - Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
  - Engelhardt Institute of Molecular Biology, Moscow, Russia
- Dmitry A. Filimonov
  - Laboratory of Structure Bioinformatics, Institute of Biomedical Chemistry (IBMC), Moscow, Russia
- Boris N. Sobolev
  - Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry (IBMC), Moscow, Russia
|
39
|
Chagoyen M, García-Martín JA, Pazos F. Practical analysis of specificity-determining residues in protein families. Brief Bioinform 2015; 17:255-61. [DOI: 10.1093/bib/bbv045] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 04/01/2015] [Accepted: 06/15/2015] [Indexed: 12/17/2022] Open
|
40
|
Das S, Lee D, Sillitoe I, Dawson NL, Lees JG, Orengo CA. Functional classification of CATH superfamilies: a domain-based approach for protein function annotation. Bioinformatics 2015; 31:3460-7. [PMID: 26139634 PMCID: PMC4612221 DOI: 10.1093/bioinformatics/btv398] [Citation(s) in RCA: 75] [Impact Index Per Article: 7.5] [Received: 05/02/2015] [Accepted: 06/24/2015] [Indexed: 11/18/2022] Open
Abstract
Motivation: Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap, especially since <1.0% of all proteins in UniProtKB have been experimentally characterized. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional sub-classification of CATH superfamilies. The superfamilies are sub-classified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer. Results: FunFHMMer generates more functionally coherent groupings of protein sequences than other domain-based protein classifications. This has been validated using known functional information. The conserved positions predicted by the FunFams are also found to be enriched in known functional residues. Moreover, the functional annotations provided by the FunFams are found to be more precise than other domain-based resources. FunFHMMer currently identifies 110 439 FunFams in 2735 superfamilies which can be used to functionally annotate >16 million domain sequences. Availability and implementation: All FunFam annotation data are made available through the CATH webpages (http://www.cathdb.info). The FunFHMMer webserver (http://www.cathdb.info/search/by_funfhmmer) allows users to submit query sequences for assignment to a CATH FunFam. Contact: sayoni.das.12@ucl.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sayoni Das
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- David Lee
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Natalie L Dawson
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Jonathan G Lees
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Christine A Orengo
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
|
41
|
Sinha S, Lynn AM. HMM-ModE: implementation, benchmarking and validation with HMMER3. BMC Res Notes 2014; 7:483. [PMID: 25073805 PMCID: PMC4236727 DOI: 10.1186/1756-0500-7-483] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 12/12/2013] [Accepted: 07/21/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: HMM-ModE is a computational method that generates family-specific profile HMMs using negative training sequences. The method optimizes the discrimination threshold using 10-fold cross-validation and modifies the emission probabilities of profiles to reduce common fold-based signals shared with other sub-families. The protocol depends on the program HMMER for HMM profile building and sequence database searching. The recent release of HMMER3 has improved database search speed by several orders of magnitude, allowing for the large-scale deployment of the method in sequence annotation projects. We have rewritten our existing scripts, both at the level of parsing the HMM profiles and of modifying emission probabilities, to upgrade HMM-ModE to use HMMER3 and take advantage of its probabilistic inference and high computational speed. The method is benchmarked and tested on a GPCR dataset as an accurate and fast method for functional annotation. RESULTS: The implementation of this method, which now works with HMMER3, is benchmarked against the earlier version of HMMER to show that the effect of local-local alignments is marked only in the case of profiles containing a large number of discontinuous match states. The method is tested on a gold-standard set of families, and we report a significant reduction in the number of false positive hits compared with the default HMM profiles. When applied to GPCR sequences, the results showed an improvement in classification accuracy compared with other methods used to classify the family at different levels of its classification hierarchy. CONCLUSIONS: The present findings show that the new version of HMM-ModE is a highly specific method for differentiating between fold (superfamily) and function (family) specific signals, which helps in the functional annotation of protein sequences. The use of modified profile HMMs of GPCR sequences provides a simple yet highly specific method for classification of the family, being able to predict sub-family-specific sequences with high accuracy even though sequences share common physicochemical characteristics between sub-families.
Affiliation(s)
- Andrew Michael Lynn
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
|
42
|
Sandler I, Zigdon N, Levy E, Aharoni A. The functional importance of co-evolving residues in proteins. Cell Mol Life Sci 2014; 71:673-82. [PMID: 23995987 PMCID: PMC11113390 DOI: 10.1007/s00018-013-1458-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 05/12/2013] [Revised: 07/26/2013] [Accepted: 08/13/2013] [Indexed: 10/26/2022]
Abstract
Computational approaches for detecting co-evolution in proteins allow for the identification of protein-protein interaction networks in different organisms and the assignment of function to under-explored proteins. The detection of co-variation of amino acids within or between proteins, moreover, allows for the discovery of residue-residue contacts and highlights functional residues that can affect the binding affinity, catalytic activity, or substrate specificity of a protein. To explore the functional impact of co-evolutionary changes in proteins, a combined experimental and computational approach must be recruited. Here, we review recent studies that apply computational and experimental tools to obtain novel insight into the structure, function, and evolution of proteins. Specifically, we describe the application of co-evolutionary analysis for predicting high-resolution three-dimensional structures of proteins. In addition, we describe computational approaches followed by experimental analysis for identifying specificity-determining residues in proteins. Finally, we discuss studies addressing the importance of such residues in terms of the functional divergence of proteins, allowing proteins to evolve new functions while avoiding crosstalk with existing cellular pathways or forming reproductive barriers and hence promoting speciation.
Affiliation(s)
- Inga Sandler
  - Department of Life Sciences, Ben-Gurion University of the Negev, 84105 Be’er Sheva, Israel
- Nitzan Zigdon
  - Department of Life Sciences, Ben-Gurion University of the Negev, 84105 Be’er Sheva, Israel
- Efrat Levy
  - Department of Life Sciences, Ben-Gurion University of the Negev, 84105 Be’er Sheva, Israel
- Amir Aharoni
  - Department of Life Sciences, Ben-Gurion University of the Negev, 84105 Be’er Sheva, Israel
  - National Institute for Biotechnology in the Negev (NIBN), Ben-Gurion University of the Negev, 84105 Be’er Sheva, Israel
|
43
|
Baranašić D, Zucko J, Diminic J, Gacesa R, Long PF, Cullum J, Hranueli D, Starcevic A. Predicting substrate specificity of adenylation domains of nonribosomal peptide synthetases and other protein properties by latent semantic indexing. J Ind Microbiol Biotechnol 2014; 41:461-7. [DOI: 10.1007/s10295-013-1322-2] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Received: 05/17/2013] [Accepted: 08/03/2013] [Indexed: 11/24/2022]
Abstract
Successful genome mining is dependent on accurate prediction of protein function from sequence. This often involves dividing protein families into functional subtypes (e.g., with different substrates). In many cases, there are only a small number of known functional subtypes, but in the case of the adenylation domains of nonribosomal peptide synthetases (NRPS), there are >500 known substrates. Latent semantic indexing (LSI) was originally developed for text processing but has also been used to assign proteins to families. Proteins are treated as “documents” and it is necessary to encode properties of the amino acid sequence as “terms” in order to construct a term-document matrix, which counts the terms in each document. This matrix is then processed to produce a document-concept matrix, where each protein is represented as a row vector. A standard measure of the closeness of vectors to each other (cosines of the angle between them) provides a measure of protein similarity. Previous work encoded proteins as oligopeptide terms, i.e. counted oligopeptides, but used no information regarding location of oligopeptides in the proteins. A novel tokenization method was developed to analyze information from multiple alignments. LSI successfully distinguished between two functional subtypes in five well-characterized families. Visualization of different “concept” dimensions allows exploration of the structure of protein families. LSI was also used to predict the amino acid substrate of adenylation domains of NRPS. Better results were obtained when selected residues from multiple alignments were used rather than the total sequence of the adenylation domains. Using ten residues from the substrate binding pocket performed better than using 34 residues within 8 Å of the active site. Prediction efficiency was somewhat better than that of the best published method using a support vector machine.
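As an illustration of the term-document idea described in this abstract, the following is a minimal sketch, not the authors' implementation. It covers only the oligopeptide-counting and cosine-comparison steps; the dimensionality reduction that turns the term-document matrix into a document-concept matrix is deliberately omitted, so this is not full LSI. The function names, the choice of dipeptide terms (`k=2`), and the toy sequences are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def oligopeptide_terms(sequence, k=2):
    """Count overlapping k-residue oligopeptides ("terms") in one protein."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def term_document_matrix(sequences, k=2):
    """Return (sorted term list, one count vector per protein "document")."""
    counts = [oligopeptide_terms(s, k) for s in sequences]
    terms = sorted(set().union(*counts))
    # Counter returns 0 for terms absent from a given protein.
    return terms, [[c[t] for t in terms] for c in counts]


def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy sequences, not taken from any real protein family.
proteins = ["ACDA", "ACDA", "WWWW"]
terms, matrix = term_document_matrix(proteins)
same = cosine(matrix[0], matrix[1])       # identical proteins
different = cosine(matrix[0], matrix[2])  # no shared dipeptides
```

In a full LSI pipeline, the count vectors would first be projected onto a small number of "concept" dimensions (typically via a truncated singular value decomposition) before the cosine comparison is made.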
Affiliation(s)
- Damir Baranašić
  - Faculty of Food Technology and Biotechnology, University of Zagreb, Pierottijeva 6, 10000 Zagreb, Croatia
  - Department of Genetics, University of Kaiserslautern, Postfach 3049, 67653 Kaiserslautern, Germany
- Jurica Zucko
  - Faculty of Food Technology and Biotechnology, University of Zagreb, Pierottijeva 6, 10000 Zagreb, Croatia
- Janko Diminic
  - Faculty of Food Technology and Biotechnology, University of Zagreb, Pierottijeva 6, 10000 Zagreb, Croatia
- Ranko Gacesa
  - Faculty of Food Technology and Biotechnology, University of Zagreb, Pierottijeva 6, 10000 Zagreb, Croatia
- Paul F Long
  - Institute of Pharmaceutical Science, King’s College London, Franklin–Wilkins Building, 150 Stamford Street, London SE1 9NH, UK
  - Department of Chemistry, King’s College London, Franklin–Wilkins Building, 150 Stamford Street, London SE1 9NH, UK
- John Cullum
  - Department of Genetics, University of Kaiserslautern, Postfach 3049, 67653 Kaiserslautern, Germany
- Daslav Hranueli
  - Faculty of Food Technology and Biotechnology, University of Zagreb, Pierottijeva 6, 10000 Zagreb, Croatia
- Antonio Starcevic
  - Faculty of Food Technology and Biotechnology, University of Zagreb, Pierottijeva 6, 10000 Zagreb, Croatia
|
44
|
Chakraborty A, Chakrabarti S. A survey on prediction of specificity-determining sites in proteins. Brief Bioinform 2014; 16:71-88. [DOI: 10.1093/bib/bbt092] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 11/13/2022] Open
|
45
|
Employing directed evolution for the functional analysis of multi-specific proteins. Bioorg Med Chem 2013; 21:3511-6. [DOI: 10.1016/j.bmc.2013.04.052] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 02/04/2013] [Revised: 04/11/2013] [Accepted: 04/18/2013] [Indexed: 01/17/2023]
|
46
|
Gu X, Zou Y, Su Z, Huang W, Zhou Z, Arendsee Z, Zeng Y. An update of DIVERGE software for functional divergence analysis of protein family. Mol Biol Evol 2013; 30:1713-9. [PMID: 23589455 DOI: 10.1093/molbev/mst069] [Citation(s) in RCA: 146] [Impact Index Per Article: 12.2] [Indexed: 12/19/2022] Open
Abstract
DIVERGE is a software system for phylogeny-based analyses of protein family evolution and functional divergence. It provides a suite of statistical tools for selection and prioritization of the amino acid sites that are responsible for the functional divergence of a gene family. The synergistic efforts of DIVERGE and other methods have convincingly demonstrated that the pattern of rate change at a particular amino acid site may contain insightful information about the underlying functional divergence following gene duplication. These predicted sites may be used as candidates for further experiments. We are now releasing an updated version of DIVERGE with the following improvements: 1) a feasible approach to examining functional divergence in nearly complete sequences by including deletions and insertions (indels); 2) the calculation of the false discovery rate of functionally diverging sites; 3) estimation of the effective number of functional divergence-related sites that is reliable and insensitive to cutoffs; 4) a statistical test for asymmetric functional divergence; and 5) a new method to infer functional divergence specific to a given duplicate cluster. In addition, we have made efforts to improve software design and produce a well-written software manual for the general user.
Affiliation(s)
- Xun Gu
  - State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai, China.
|
47
|
Kitsche A, Kalesse M. Configurational Assignment of Secondary Hydroxyl Groups and Methyl Branches in Polyketide Natural Products through Bioinformatic Analysis of the Ketoreductase Domain. Chembiochem 2013; 14:851-61. [DOI: 10.1002/cbic.201300063] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 02/05/2013] [Indexed: 12/17/2022]
|
48
|
Abstract
Co-evolution is a fundamental component of the theory of evolution and is essential for understanding the relationships between species in complex ecological networks. A wide range of co-evolution-inspired computational methods has been designed to predict molecular interactions, but it is only recently that important advances have been made. Breakthroughs in the handling of phylogenetic information and in disentangling indirect relationships have resulted in an improved capacity to predict interactions between proteins and contacts between different protein residues. Here, we review the main co-evolution-based computational approaches, their theoretical basis, potential applications and foreseeable developments.
Affiliation(s)
- David de Juan
  - Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
|
49
|
Suplatov D, Shalaeva D, Kirilin E, Arzhanik V, Švedas V. Bioinformatic analysis of protein families for identification of variable amino acid residues responsible for functional diversity. J Biomol Struct Dyn 2013; 32:75-87. [DOI: 10.1080/07391102.2012.750249] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 10/27/2022]
|
50
|
Sandler I, Abu-Qarn M, Aharoni A. Protein co-evolution: how do we combine bioinformatics and experimental approaches? Mol Biosyst 2012; 9:175-81. [PMID: 23151606 DOI: 10.1039/c2mb25317h] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 01/27/2023]
Abstract
Molecular co-evolution is manifested by compensatory changes in proteins designed to enable adaptation to their natural environment. In recent years, bioinformatics approaches allowed for the detection of co-evolution at the level of the whole protein or of specific residues. Such efforts enabled prediction of protein-protein interactions, functional assignments of proteins and the identification of interacting residues, thereby providing information on protein structure. Still, despite such advances, relatively little is known regarding the functional implications of sequence divergence resulting from protein co-evolution. While bioinformatics approaches usually analyze thousands of proteins to obtain a broad view of protein co-evolution, experimental evaluation of protein co-evolution serves to study only individual proteins. In this review, we describe recent advances in bioinformatics and experimental efforts aimed at examining protein co-evolution. Accordingly, we discuss possible modes of crosstalk between the bioinformatics and experimental approaches to facilitate the identification of co-evolutionary signals in proteins and to understand their implications for the structure and function of proteins.
Affiliation(s)
- Inga Sandler
  - Department of Life Sciences, Ben-Gurion University of the Negev, Be'er Sheva 84105, Israel
|