1
Carbone A, Decelle A, Rosset L, Seoane B. Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics. IEEE Transactions on Pattern Analysis and Machine Intelligence 2025; 47:1309-1316. [PMID: 39527442 DOI: 10.1109/tpami.2024.3495999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024]
Abstract
In this study, we address the challenge of using energy-based models to produce high-quality, label-specific data in complex structured datasets, such as population genetics, RNA or protein sequences data. Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing, which affects the diversity of synthetic data and increases generation times. To address these issues, we use a novel training algorithm that exploits non-equilibrium effects. This approach, applied to the Restricted Boltzmann Machine, improves the model's ability to correctly classify samples and generate high-quality synthetic data in only a few sampling steps. The effectiveness of this method is demonstrated by its successful application to five different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, homologous RNA sequences from specific taxonomies and real classical piano pieces classified by their composer.
2
Hernández Berthet AS, Aptekmann AA, Tejero J, Sánchez IE, Noguera ME, Roman EA. Associating protein sequence positions with the modulation of quantitative phenotypes. Arch Biochem Biophys 2024; 755:109979. [PMID: 38583654 DOI: 10.1016/j.abb.2024.109979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 03/11/2024] [Accepted: 03/27/2024] [Indexed: 04/09/2024]
Abstract
Although protein sequences encode the information for folding and function, understanding their link is not an easy task. Unfortunately, predicting how specific amino acids contribute to these features remains considerably impaired. Here, we developed a simple algorithm that finds positions in a protein sequence with the potential to modulate the studied quantitative phenotypes. From a few hundred protein sequences, we perform multiple sequence alignments, obtain the per-position pairwise differences for both the sequence and the observed phenotypes, and calculate the correlation between these two quantities. We tested our methodology with four cases: archaeal adenylate kinases and the organisms' optimal growth temperatures, microbial rhodopsins and their maximal absorption wavelengths, mammalian myoglobins and their muscular concentration, and inhibition of HIV protease clinical isolates by two different molecules. We found from 3 to 10 positions tightly associated with those phenotypes, depending on the studied case. We showed that these correlations appear using individual positions, but an improvement is achieved when the most correlated positions are jointly analyzed. Notably, we performed phenotype predictions using a simple linear model that links per-position divergences and differences in the observed phenotypes. Predictions are comparable to state-of-the-art methodologies which, in most cases, are far more complex. All of the calculations are obtained at a very low information cost since the only input needed is a multiple sequence alignment of protein sequences with their associated quantitative phenotypes. The diversity of the explored systems makes our work a valuable tool to find sequence determinants of biological activity modulation and to predict various functional features for uncharacterized members of a protein family.
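The core step described in this abstract (per-position pairwise sequence differences correlated with pairwise phenotype differences) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not say how sequence differences are measured or which correlation is used, so this sketch uses a 0/1 identity mismatch and Pearson correlation, and the names `pearson` and `position_scores` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def position_scores(alignment, phenotypes):
    """Score each alignment column by how well residue mismatches between
    sequence pairs track the absolute phenotype difference of those pairs.

    Assumption: mismatch is 0/1 identity, not a substitution-matrix distance.
    """
    pairs = list(combinations(range(len(alignment)), 2))
    pheno_diff = [abs(phenotypes[i] - phenotypes[j]) for i, j in pairs]
    scores = []
    for col in range(len(alignment[0])):
        seq_diff = [0.0 if alignment[i][col] == alignment[j][col] else 1.0
                    for i, j in pairs]
        scores.append(pearson(seq_diff, pheno_diff))
    return scores


# Toy data (invented): the middle column (K/R) separates low and high phenotypes.
toy_alignment = ["AKL", "AKV", "ARL", "ARV"]
toy_phenotypes = [10.0, 11.0, 30.0, 31.0]
scores = position_scores(toy_alignment, toy_phenotypes)
```

In the toy data the fully conserved first column scores zero and the middle column scores highest, which is the qualitative behaviour the abstract describes for phenotype-associated positions.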
Affiliation(s)
- Ayelén S Hernández Berthet
- Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Intendente Güiraldes 2160 - Ciudad Universitaria, 1428EGA, C.A.B.A., Argentina.
- Ariel A Aptekmann
- Universidad de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas. Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Buenos Aires, Argentina; Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, 08873, USA; Institute of Marine and Coastal Sciences, Rutgers University, New Brunswick, NJ, 08901, USA.
- Jesús Tejero
- Heart, Lung, Blood and Vascular Medicine Institute, University of Pittsburgh, Pittsburgh, PA, 15261, USA; Division of Pulmonary, Allergy and Critical Care Medicine, University of Pittsburgh, Pittsburgh, PA, 15261, USA; Department of Bioengineering, Swanson School of Engineering, University of Pittsburgh, Pittsburgh, PA, 15260, USA; Department of Pharmacology and Chemical Biology, University of Pittsburgh, Pittsburgh, PA, 15261, USA.
- Ignacio E Sánchez
- Universidad de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas. Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Buenos Aires, Argentina.
- Martín E Noguera
- Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química y Fisicoquímica Biológicas Dr. Alejandro Paladini, Junín 956, 1113AAD, C.A.B.A., Argentina; Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Roque Saenz Peña 352, B1876BXD, Bernal, Argentina.
- Ernesto A Roman
- Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Intendente Güiraldes 2160 - Ciudad Universitaria, 1428EGA, C.A.B.A., Argentina; Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química y Fisicoquímica Biológicas Dr. Alejandro Paladini, Junín 956, 1113AAD, C.A.B.A., Argentina.
3
Pazos F. Computational prediction of protein functional sites-Applications in biotechnology and biomedicine. Advances in Protein Chemistry and Structural Biology 2022; 130:39-57. [PMID: 35534114 DOI: 10.1016/bs.apcsb.2021.12.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/14/2023]
Abstract
There are many computational approaches for predicting protein functional sites based on different sequence and structural features. These methods are essential to cope with the sequence deluge that is filling databases with uncharacterized protein sequences. They complement the more expensive and time-consuming experimental approaches by pointing them to possible candidate positions. In many cases they are jointly used to characterize the functional sites in proteins of biotechnological and biomedical interest and eventually modify them for different purposes. There is a clear trend towards approaches based on machine learning and those using structural information, due to the recent developments in these areas. Nevertheless, "classic" methods based on sequence and evolutionary features are still playing an important role as these features are strongly related to functionality. In this review, the main approaches for predicting general functional sites in a protein are discussed, with a focus on sequence-based approaches.
Affiliation(s)
- Florencio Pazos
- Computational Systems Biology Group, National Center for Biotechnology (CNB-CSIC), Madrid, Spain.
4
Pazos F. Prediction of Protein Sites and Physicochemical Properties Related to Functional Specificity. Bioengineering (Basel) 2021; 8:bioengineering8120201. [PMID: 34940354 PMCID: PMC8698372 DOI: 10.3390/bioengineering8120201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2021] [Revised: 11/25/2021] [Accepted: 11/29/2021] [Indexed: 11/16/2022]
Abstract
Specificity Determining Positions (SDPs) are protein sites responsible for functional specificity within a family of homologous proteins. These positions are extracted from a family’s multiple sequence alignment and complement the fully conserved positions as predictors of functional sites. SDP analysis is now routinely used for locating these specificity-related sites in families of proteins of biomedical or biotechnological interest with the aim of mutating them to switch specificities or design new ones. There are many different approaches for detecting these positions in multiple sequence alignments. Nevertheless, existing methods report the potential SDPs but provide no clue to the physicochemical basis behind the functional specificity, which has to be inferred a posteriori by manually inspecting these positions in the alignment. In this work, a new methodology is presented that, concomitantly with the detection of the SDPs, automatically provides information on the amino-acid physicochemical properties most related to the change in specificity. This new method is applied to two different multiple sequence alignments of homologs of the well-studied RasH protein representing different cases of functional specificity, and the results are discussed in detail.
Affiliation(s)
- Florencio Pazos
- Computational Systems Biology Group, Systems Biology Department, National Centre for Biotechnology (CNB-CSIC), c/Darwin, 3, 28049 Madrid, Spain
5
Karakulak T, Rifaioglu AS, Rodrigues JPGLM, Karaca E. Predicting the Specificity-Determining Positions of Receptor Tyrosine Kinase Axl. Front Mol Biosci 2021; 8:658906. [PMID: 34195226 PMCID: PMC8236827 DOI: 10.3389/fmolb.2021.658906] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/26/2021] [Accepted: 04/20/2021] [Indexed: 11/22/2022]
Abstract
Owing to its clinical significance, modulation of functionally relevant amino acids in protein-protein complexes has attracted a great deal of attention. To this end, many approaches have been proposed to predict the partner-selecting amino acid positions in evolutionarily close complexes. These approaches can be grouped into sequence-based machine learning and structure-based energy-driven methods. In this work, we assessed these methods’ ability to map the specificity-determining positions of Axl, a receptor tyrosine kinase involved in cancer progression and immune system diseases. For sequence-based predictions, we used SDPpred, Multi-RELIEF, and Sequence Harmony. For structure-based predictions, we utilized HADDOCK refinement and molecular dynamics simulations. As a result, we observed that (i) sequence-based methods overpredict partner-selecting residues of Axl and that (ii) combining Multi-RELIEF with HADDOCK-based predictions provides the key Axl residues, covered by the extensive molecular dynamics simulations. Expanding on these results, we propose that a sequence-structure-based approach is necessary to determine specificity-determining positions of Axl, which can guide the development of therapeutic molecules to combat Axl misregulation.
Affiliation(s)
- Tülay Karakulak
- Izmir Biomedicine and Genome Center, Izmir, Turkey; Izmir International Biomedicine and Genome Institute, Dokuz Eylul University, Izmir, Turkey; Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland; Department of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Ahmet Sureyya Rifaioglu
- Department of Electrical - Electronics Engineering, İskenderun Technical University, Hatay, Turkey
- João P G L M Rodrigues
- Department of Structural Biology, Stanford University School of Medicine, Stanford, CA, United States
- Ezgi Karaca
- Izmir Biomedicine and Genome Center, Izmir, Turkey; Izmir International Biomedicine and Genome Institute, Dokuz Eylul University, Izmir, Turkey
6
Pitarch B, Ranea JAG, Pazos F. Protein residues determining interaction specificity in paralogous families. Bioinformatics 2021; 37:1076-1082. [PMID: 33135068 DOI: 10.1093/bioinformatics/btaa934] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/12/2020] [Revised: 10/06/2020] [Accepted: 10/22/2020] [Indexed: 02/06/2023]
Abstract
MOTIVATION: Predicting the residues controlling a protein's interaction specificity is important not only to better understand its interactions but also to design mutations aimed at fine-tuning or swapping them. RESULTS: In this work, we present a methodology that combines sequence information (in the form of multiple sequence alignments) with interactome information to detect such residues in paralogous families of proteins. The interactome is used to define pairwise similarities of interaction contexts for the proteins in the alignment. The method looks for alignment positions with patterns of amino-acid changes reflecting the similarities/differences in the interaction neighborhoods of the corresponding proteins. We tested this new methodology in a large set of human paralogous families with structurally characterized interactions, and discuss in detail the results for the RasH family. We show that this approach is a better predictor of interfacial residues than both sequence conservation and an equivalent 'unsupervised' method that does not use interactome information. AVAILABILITY AND IMPLEMENTATION: http://csbg.cnb.csic.es/pazos/Xdet/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Borja Pitarch
- Computational Systems Biology Group, Systems Biology Department, National Centre for Biotechnology (CNB-CSIC), 28049 Madrid, Spain
- Juan A G Ranea
- Department of Molecular Biology and Biochemistry, University of Malaga, Malaga 29071, Spain; CIBER de Enfermedades Raras, Instituto de Salud Carlos III, Madrid, Spain; Institute of Biomedical Research in Malaga (IBIMA), Malaga, Spain
- Florencio Pazos
- Computational Systems Biology Group, Systems Biology Department, National Centre for Biotechnology (CNB-CSIC), 28049 Madrid, Spain
7
Buhrman G, Enríquez P, Dillard L, Baer H, Truong V, Grunden AM, Rose RB. Structure, Function, and Thermal Adaptation of the Biotin Carboxylase Domain Dimer from Hydrogenobacter thermophilus 2-Oxoglutarate Carboxylase. Biochemistry 2021; 60:324-345. [PMID: 33464881 DOI: 10.1021/acs.biochem.0c00815] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
2-Oxoglutarate carboxylase (OGC), a unique member of the biotin-dependent carboxylase family from the order Aquificales, captures dissolved CO2 via the reductive tricarboxylic acid (rTCA) cycle. Structure and function studies of OGC may facilitate adaptation of the rTCA cycle to increase the level of carbon fixation for biofuel production. Here we compare the biotin carboxylase (BC) domain of Hydrogenobacter thermophilus OGC with the well-studied mesophilic homologues to identify features that may contribute to thermal stability and activity. We report three OGC BC X-ray structures, each bound to bicarbonate, ADP, or ADP-Mg2+, and propose that substrate binding at high temperatures is facilitated by interactions that stabilize the flexible subdomain B in a partially closed conformation. Kinetic measurements with varying ATP and biotin concentrations distinguish two temperature-dependent steps, consistent with biotin's rate-limiting role in organizing the active site. Transition state thermodynamic values derived from the Eyring equation indicate a larger positive ΔH⧧ and a less negative ΔS⧧ compared to those of a previously reported mesophilic homologue. These thermodynamic values are explained by partially rate limiting product release. Phylogenetic analysis of BC domains suggests that OGC diverged prior to Aquificales evolution. The phylogenetic tree identifies mis-annotations of the Aquificales BC sequences, including the Aquifex aeolicus pyruvate carboxylase structure. Notably, our structural data reveal that the OGC BC dimer comprises a "wet" dimerization interface that is dominated by hydrophilic interactions and structural water molecules common to all BC domains and likely facilitates the conformational changes associated with the catalytic cycle. Mutations in the dimerization domain demonstrate that dimerization contributes to thermal stability.
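For reference, transition-state values of this kind are obtained by fitting rate constants measured at several temperatures to the Eyring equation. The standard form is given here under the assumption of a transmission coefficient of 1; the abstract does not state which variant the authors used.

```latex
k = \frac{k_B T}{h}\,
    \exp\!\left(\frac{\Delta S^{\ddagger}}{R}\right)
    \exp\!\left(-\frac{\Delta H^{\ddagger}}{R T}\right)
\qquad\Longleftrightarrow\qquad
\ln\frac{k}{T} = -\frac{\Delta H^{\ddagger}}{R}\cdot\frac{1}{T}
               + \ln\frac{k_B}{h} + \frac{\Delta S^{\ddagger}}{R}
```

In the linear form, a plot of ln(k/T) against 1/T has slope -ΔH⧧/R, and ΔS⧧ follows from the intercept.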
Affiliation(s)
- Greg Buhrman
- Department of Molecular & Structural Biochemistry, North Carolina State University, Raleigh, North Carolina 27695-7622, United States
- Paul Enríquez
- Department of Molecular & Structural Biochemistry, North Carolina State University, Raleigh, North Carolina 27695-7622, United States
- Lucas Dillard
- Department of Molecular & Structural Biochemistry, North Carolina State University, Raleigh, North Carolina 27695-7622, United States
- Hayden Baer
- Department of Molecular & Structural Biochemistry, North Carolina State University, Raleigh, North Carolina 27695-7622, United States
- Vivian Truong
- Department of Molecular & Structural Biochemistry, North Carolina State University, Raleigh, North Carolina 27695-7622, United States
- Amy M Grunden
- Department of Plant & Microbial Biology, North Carolina State University, Raleigh, North Carolina 27695-7612, United States
- Robert B Rose
- Department of Molecular & Structural Biochemistry, North Carolina State University, Raleigh, North Carolina 27695-7622, United States
8
Garrido-Martín D, Pazos F. Effect of the sequence data deluge on the performance of methods for detecting protein functional residues. BMC Bioinformatics 2018; 19:67. [PMID: 29482506 PMCID: PMC5827975 DOI: 10.1186/s12859-018-2084-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/07/2017] [Accepted: 02/21/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The exponential accumulation of new sequences in public databases is expected to improve the performance of all the approaches for predicting protein structural and functional features. Nevertheless, this was never assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as the only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely. RESULTS: In this work we evaluate how the growth and change with time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do that by recreating historical versions of the multiple sequence alignments that would have been obtained in the past based on the database contents at different time points, covering a period of 20 years. Applying the methods to these historical alignments allows quantifying the temporal variation in their performance. Our results show that the number of families to which these methods can be applied sharply increases with time, while their ability to detect potentially functional residues remains almost constant. CONCLUSIONS: These results are informative for the methods' developers and final users, and may have implications in the design of new sequencing initiatives.
Affiliation(s)
- Diego Garrido-Martín
- Present address: Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, c/ Dr. Aiguader, 88, 08003, Barcelona, Spain; Present address: Universitat Pompeu Fabra (UPF), Plaça de la Mercè, 10-12, 08002, Barcelona, Spain
- Florencio Pazos
- Computational Systems Biology Group, Systems Biology Program, National Centre for Biotechnology (CNB-CSIC), c/ Darwin, 3, 28049, Madrid, Spain.
9
Kalaivani R, Reema R, Srinivasan N. Recognition of sites of functional specialisation in all known eukaryotic protein kinase families. PLoS Comput Biol 2018; 14:e1005975. [PMID: 29438395 PMCID: PMC5826538 DOI: 10.1371/journal.pcbi.1005975] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/05/2017] [Revised: 02/26/2018] [Accepted: 01/13/2018] [Indexed: 11/25/2022]
Abstract
The conserved function of protein phosphorylation, catalysed by members of the protein kinase superfamily, is regulated in different ways in different kinase families. Further, differences in activating triggers, cellular localisation, domain architecture and substrate specificity between kinase families are also well known. While the transfer of γ-phosphate from ATP to the hydroxyl group of Ser/Thr/Tyr is mediated by a conserved Asp, the characteristic functional and regulatory sites are specialized at the level of families or sub-families. Such family-specific sites of functional specialization are unknown for most families of kinases. In this work, we systematically identify the family-specific residue features by comparing the extent of conservation of physicochemical properties, Shannon entropy and statistical probability of residue distributions between families of kinases. An integrated discriminatory score, which combines these three features, is developed to demarcate the functionally specialized sites in a kinase family from other sites. We achieved an area under the ROC curve of 0.992 for the discrimination of kinase families. Our approach was extensively tested on the well-studied CDK and MAPK families, wherein specific protein interaction sites and substrate recognition sites were successfully detected (p-value < 0.05). We also find that the known family-specific oncogenic driver mutation sites were scored high by our method. The method was applied to all known kinases, encompassing 107 families from diverse eukaryotic organisms, leading to a comprehensive list of family-specific functional sites. Apart from other uses, our method facilitates identification of specific protein interaction sites and drug target sites in a kinase family.
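Of the three features named in this abstract, Shannon entropy is the simplest to illustrate. The sketch below computes the entropy of an alignment column and a naive contrast between one family and the whole superfamily. The contrast function `family_contrast` is a hypothetical illustration only; it is not the integrated discriminatory score of the paper, whose definition the abstract does not give.

```python
from collections import Counter
from math import log2


def shannon_entropy(column):
    """Shannon entropy (in bits) of the residue distribution in one column."""
    total = len(column)
    counts = Counter(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def family_contrast(family_column, all_column):
    """Hypothetical contrast: high when a site varies across the superfamily
    but is conserved inside the family of interest."""
    return shannon_entropy(all_column) - shannon_entropy(family_column)


# Toy columns (invented): the family keeps D while other families use E or K.
contrast = family_contrast("DDDD", "DDDDEEKK")
```

A site that is invariant everywhere gives a contrast of zero, so this quantity highlights only positions whose conservation is specific to the family.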
Affiliation(s)
- Raju Kalaivani
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Karnataka, India
- Raju Reema
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Karnataka, India
10
Indrischek H, Prohaska SJ, Gurevich VV, Gurevich EV, Stadler PF. Uncovering missing pieces: duplication and deletion history of arrestins in deuterostomes. BMC Evol Biol 2017; 17:163. [PMID: 28683816 PMCID: PMC5501109 DOI: 10.1186/s12862-017-1001-4] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 11/14/2016] [Accepted: 06/19/2017] [Indexed: 12/13/2022]
Abstract
BACKGROUND: The cytosolic arrestin proteins mediate desensitization of activated G protein-coupled receptors (GPCRs) via competition with G proteins for the active phosphorylated receptors. Arrestins in active, including receptor-bound, conformation are also transducers of signaling. Therefore, this protein family is an attractive therapeutic target. The signaling outcome is believed to be a result of structural and sequence-dependent interactions of arrestins with GPCRs and other protein partners. Here we elucidated the detailed evolution of arrestins in deuterostomes. RESULTS: Identity and number of arrestin paralogs were determined by searching deuterostome genomes and gene expression data. In contrast to standard gene prediction methods, our strategy first detects exons situated on different scaffolds and then solves the problem of assigning them to the correct gene. This increases both the completeness and the accuracy of the annotation in comparison to conventional database search strategies applied by the community. The employed strategy enabled us to map in detail the duplication and deletion history of arrestin paralogs, including tandem duplications, pseudogenizations and the formation of retrogenes. The two rounds of whole genome duplications in the vertebrate stem lineage gave rise to four arrestin paralogs. Surprisingly, visual arrestin ARR3 was lost in the mammalian clades Afrotheria and Xenarthra. Duplications in specific clades, on the other hand, must have given rise to new paralogs that show signatures of diversification in functional elements important for receptor binding and phosphate sensing. CONCLUSION: The current study traces the functional evolution of deuterostome arrestins in unprecedented detail. Based on a precise re-annotation of the exon-intron structure at nucleotide resolution, we infer the gain and loss of paralogs and patterns of conservation, co-variation and selection.
Affiliation(s)
- Henrike Indrischek
- Computational EvoDevo Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16-18, Leipzig, D-04107, Germany.
- Bioinformatics Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16-18, Leipzig, D-04107, Germany.
- Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, Leipzig, D-04107, Germany.
- Sonja J Prohaska
- Computational EvoDevo Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16-18, Leipzig, D-04107, Germany
- Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, Leipzig, D-04107, Germany
- Vsevolod V Gurevich
- Department of Pharmacology, Vanderbilt University, 2200 Pierce Ave, Nashville, TN 37232, USA
- Eugenia V Gurevich
- Department of Pharmacology, Vanderbilt University, 2200 Pierce Ave, Nashville, TN 37232, USA
- Peter F Stadler
- Bioinformatics Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16-18, Leipzig, D-04107, Germany
- Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, Leipzig, D-04107, Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, D-04103, Germany
- Fraunhofer Institute for Cell Therapy and Immunology, Perlickstraße 1, Leipzig, D-04103, Germany
- Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, Vienna, A-1090, Austria
- Center for non-coding RNA in Technology and Health, Grønegårdsvej 3, Frederiksberg C, DK-1870, Denmark
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
11
Sloutsky R, Naegle KM. High-Resolution Identification of Specificity Determining Positions in the LacI Protein Family Using Ensembles of Sub-Sampled Alignments. PLoS One 2016; 11:e0162579. [PMID: 27681038 PMCID: PMC5040260 DOI: 10.1371/journal.pone.0162579] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 07/22/2016] [Accepted: 08/08/2016] [Indexed: 01/24/2023]
Abstract
Since the advent of large-scale genomic sequencing, and the consequent availability of large numbers of homologous protein sequences, there has been burgeoning development of methods for extracting functional information from multiple sequence alignments (MSAs). One type of analysis seeks to identify specificity determining positions (SDPs) based on the assumption that such positions are highly conserved within groups of sequences sharing functional specificity, but conserved to different amino acids in different specificity groups. This unsupervised approach to utilizing evolutionary information may elucidate mechanisms of specificity in protein-protein interactions, catalytic activity of enzymes, sensitivity to allosteric regulation, and other types of protein functionality. We present an analysis of SDPs in the LacI family of transcriptional regulators in which we 1) relax the constraint that all specificity groups must contribute to SDP signal, and 2) use a novel approach to robust treatment of sequence alignment uncertainty based on sub-sampling. We find that the vast majority of SDP signal occurs at positions with a conservation pattern that significantly complicates detection by previously described methods. This pattern, which we term “partial SDP”, consists of the commonly accepted SDP conservation pattern among a subset of specificity groups and strong degeneracy among the rest. An upshot of this fact is that the SDP complement of every specificity group appears to be unique. Additionally, sub-sampling gives us the ability to assign a confidence interval to the SDP score, as well as increase fidelity, as compared to analysis of a single, comprehensive alignment—the current standard in multiple sequence alignment methodologies.
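The SDP pattern described in this abstract (conserved within a specificity group, different between groups) and the idea of attaching an interval to a score by sub-sampling can be sketched as follows. This is a simplified stand-in, not the authors' method: it scores a column by the mutual information between residue and group label, and it sub-samples rows of one fixed alignment rather than building an ensemble of re-computed alignments. The names `mutual_information` and `subsampled_interval` and all parameter values are assumptions.

```python
import random
from collections import Counter
from math import log2


def mutual_information(column, groups):
    """Mutual information (bits) between residues in a column and group labels."""
    n = len(column)
    joint = Counter(zip(column, groups))
    residues = Counter(column)
    labels = Counter(groups)
    return sum(
        (c / n) * log2((c / n) / ((residues[a] / n) * (labels[g] / n)))
        for (a, g), c in joint.items()
    )


def subsampled_interval(column, groups, fraction=0.75, rounds=200, seed=0):
    """Empirical 95% interval of the score over random subsets of sequences.

    Assumption: rows are drawn without replacement from a fixed alignment.
    """
    rng = random.Random(seed)
    n = len(column)
    k = max(2, int(fraction * n))
    scores = []
    for _ in range(rounds):
        idx = rng.sample(range(n), k)
        scores.append(mutual_information([column[i] for i in idx],
                                         [groups[i] for i in idx]))
    scores.sort()
    low = scores[int(0.025 * rounds)]
    high = scores[min(rounds - 1, int(0.975 * rounds))]
    return low, high


# Toy column (invented): group 0 always has A, group 1 always has C.
toy_groups = [0, 0, 0, 0, 1, 1, 1, 1]
sdp_score = mutual_information("AAAACCCC", toy_groups)
conserved_score = mutual_information("AAAAAAAA", toy_groups)
low, high = subsampled_interval("AAAACCCC", toy_groups)
```

A "partial SDP" in the sense of the abstract would show this pattern for only some groups, which a single family-wide score like this one tends to dilute.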
Affiliation(s)
- Roman Sloutsky
- Biomedical Engineering Department, Washington University in St. Louis, St. Louis, Missouri, 63130, United States of America
- Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, Missouri, 63130, United States of America
- Kristen M. Naegle
- Biomedical Engineering Department, Washington University in St. Louis, St. Louis, Missouri, 63130, United States of America
- Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, Missouri, 63130, United States of America
12
Moll M, Finn PW, Kavraki LE. Structure-guided selection of specificity determining positions in the human Kinome. BMC Genomics 2016; 17 Suppl 4:431. [PMID: 27556159 PMCID: PMC5001202 DOI: 10.1186/s12864-016-2790-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/25/2022]
Abstract
BACKGROUND: The human kinome contains many important drug targets. It is well-known that inhibitors of protein kinases bind with very different selectivity profiles. This is also the case for inhibitors of many other protein families. The increased availability of protein 3D structures has provided much information on the structural variation within a given protein family. However, the relationship between structural variations and binding specificity is complex and incompletely understood. We have developed a structural bioinformatics approach which provides an analysis of key determinants of binding selectivity as a tool to enhance the rational design of drugs with a specific selectivity profile. RESULTS: We propose a greedy algorithm that computes a subset of residue positions in a multiple sequence alignment such that structural and chemical variation in those positions helps explain known binding affinities. By providing this information, the main purpose of the algorithm is to provide experimentalists with possible insights into how the selectivity profile of certain inhibitors is achieved, which is useful for lead optimization. In addition, the algorithm can also be used to predict binding affinities for structures whose affinity for a given inhibitor is unknown. The algorithm’s performance is demonstrated using an extensive dataset for the human kinome. CONCLUSION: We show that the binding affinity of 38 different kinase inhibitors can be explained with consistently high precision and accuracy using the variation of at most six residue positions in the kinome binding site. We show for several inhibitors that we are able to identify residues that are known to be functionally important.
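The greedy selection of a small set of positions that explains known affinities can be illustrated with a generic forward-selection sketch. This is not the authors' algorithm, which uses structural and chemical variation; as a labeled assumption, the sketch scores a candidate set by the leave-one-out error of a 1-nearest-neighbour predictor using Hamming distance on the selected columns. The names `loo_error` and `greedy_select` are hypothetical, and the default limit of six positions simply mirrors the number quoted in the abstract.

```python
def loo_error(alignment, affinities, positions):
    """Mean absolute leave-one-out error of a 1-nearest-neighbour predictor
    that compares sequences only at the given column indices."""
    n = len(alignment)
    total = 0.0
    for i in range(n):
        best_j, best_d = None, None
        for j in range(n):
            if j == i:
                continue
            d = sum(alignment[i][p] != alignment[j][p] for p in positions)
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        total += abs(affinities[i] - affinities[best_j])
    return total / n


def greedy_select(alignment, affinities, max_positions=6):
    """Add one column at a time, keeping the one that lowers the error most;
    stop when no column improves the error or the limit is reached."""
    selected = []
    best_err = loo_error(alignment, affinities, selected)
    remaining = set(range(len(alignment[0])))
    while remaining and len(selected) < max_positions:
        cand_err, cand = min(
            (loo_error(alignment, affinities, selected + [p]), p)
            for p in sorted(remaining)
        )
        if cand_err >= best_err:
            break
        selected.append(cand)
        remaining.remove(cand)
        best_err = cand_err
    return selected, best_err


# Toy data (invented): only the middle column tracks the affinity values.
toy_alignment = ["AKL", "AKV", "ARL", "ARV"]
toy_affinities = [1.0, 1.0, 9.0, 9.0]
selected, err = greedy_select(toy_alignment, toy_affinities)
```

Greedy selection is not guaranteed to find the best subset, but it keeps the search linear in the number of columns per step, which matters for alignments with many binding-site positions.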
Affiliation(s)
- Mark Moll
- Department of Computer Science, Rice University, PO Box 1892, Houston, 77251, TX, USA.
- Paul W Finn
- University of Buckingham, Hunter St, Buckingham, UK
- Lydia E Kavraki
- Department of Computer Science, Rice University, PO Box 1892, Houston, 77251, TX, USA
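The greedy selection idea mentioned in the abstract above can be illustrated with a generic forward-selection sketch. This is not the authors' algorithm: the `score` callable, the candidate position numbers, and the toy weights are all hypothetical, and only the cap of six positions echoes the abstract. The sketch adds, at each step, the position that most improves the score and stops when nothing improves it.

```python
def greedy_select(candidates, score, max_size=6):
    """Generic greedy forward selection (illustrative only).

    candidates: iterable of alignment positions (hypothetical identifiers).
    score: callable taking a list of positions and returning a number,
           higher meaning the subset explains the affinities better.
    max_size: upper bound on the subset size (six, as in the abstract).
    """
    selected = []
    best = score(selected)
    while len(selected) < max_size:
        # Score every subset obtained by adding one unused position.
        trials = [(score(selected + [c]), c) for c in candidates if c not in selected]
        if not trials:
            break
        top_score, top_pos = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # no remaining position improves the score
        selected.append(top_pos)
        best = top_score
    return selected, best


# Toy scoring function (an assumption): each position has a fixed, additive weight.
toy_weights = {3: 0.5, 7: 0.3, 9: 0.0}

def toy_score(subset):
    return sum(toy_weights.get(p, 0.0) for p in subset)

chosen, value = greedy_select([3, 7, 9], toy_score)
```

In practice the scoring function would measure how well variation at the chosen positions explains measured affinities; the additive toy score here only demonstrates the control flow.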
|
13
|
A Bioinformatics Analysis Reveals a Group of MocR Bacterial Transcriptional Regulators Linked to a Family of Genes Coding for Membrane Proteins. Biochem Res Int 2016; 2016:4360285. [PMID: 27446613 PMCID: PMC4944035 DOI: 10.1155/2016/4360285] [Received: 11/27/2015] [Accepted: 05/26/2016]
Abstract
The MocR bacterial transcriptional regulators are characterized by an N-terminal domain, 60 residues long on average, possessing the winged-helix-turn-helix (wHTH) architecture responsible for DNA recognition and binding, linked to a large C-terminal domain (350 residues on average) that is homologous to fold type-I pyridoxal 5′-phosphate (PLP) dependent enzymes like aspartate aminotransferase (AAT). These regulators are involved in the expression of genes taking part in several metabolic pathways directly or indirectly connected to PLP chemistry, many of which are still uncharacterized. A bioinformatics analysis is here reported that studied the features of a distinct group of MocR regulators predicted to be functionally linked to a family of homologous genes coding for integral membrane proteins of unknown function. This group occurs mainly in the Actinobacteria and Gammaproteobacteria phyla. An analysis of the multiple sequence alignments of their wHTH and AAT domains suggested the presence of specificity-determining positions (SDPs). Mapping of SDPs onto a homology model of the AAT domain hinted at possible structural/functional roles in effector recognition. Likewise, SDPs in wHTH domain suggested the basis of specificity of Transcription Factor Binding Site recognition. The results reported represent a framework for rational design of experiments and for bioinformatics analysis of other MocR subgroups.
|
14
|
Boari de Lima E, Meira W, de Melo-Minardi RC. Isofunctional Protein Subfamily Detection Using Data Integration and Spectral Clustering. PLoS Comput Biol 2016; 12:e1005001. [PMID: 27348631 PMCID: PMC4922564 DOI: 10.1371/journal.pcbi.1005001] [Received: 11/30/2015] [Accepted: 05/22/2016]
Abstract
As increasingly more genomes are sequenced, the vast majority of proteins may only be annotated computationally, given experimental investigation is extremely costly. This highlights the need for computational methods to determine protein functions quickly and reliably. We believe dividing a protein family into subtypes which share specific functions uncommon to the whole family reduces the function annotation problem's complexity. Hence, this work's purpose is to detect isofunctional subfamilies inside a family of unknown function, while identifying differentiating residues. Similarity between protein pairs according to various properties is interpreted as functional similarity evidence. Data are integrated using genetic programming and provided to a spectral clustering algorithm, which creates clusters of similar proteins. The proposed framework was applied to well-known protein families and to a family of unknown function, then compared to ASMC. Results showed our fully automated technique obtained better clusters than ASMC for two families, besides equivalent results for other two, including one whose clusters were manually defined. Clusters produced by our framework showed great correspondence with the known subfamilies, besides being more contrasting than those produced by ASMC. Additionally, for the families whose specificity determining positions are known, such residues were among those our technique considered most important to differentiate a given group. When run with the crotonase and enolase SFLD superfamilies, the results showed great agreement with this gold-standard. Best results consistently involved multiple data types, thus confirming our hypothesis that similarities according to different knowledge domains may be used as functional similarity evidence. 
Our main contributions are the proposed strategy for selecting and integrating data types, along with the ability to work with noisy and incomplete data; domain knowledge usage for detecting subfamilies in a family with different specificities, thus reducing the complexity of the experimental function characterization problem; and the identification of residues responsible for specificity.
Affiliation(s)
- Elisa Boari de Lima
- Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Wagner Meira
- Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
|
15
|
Chagoyen M, García-Martín JA, Pazos F. Practical analysis of specificity-determining residues in protein families. Brief Bioinform 2015; 17:255-61. [DOI: 10.1093/bib/bbv045] [Received: 04/01/2015] [Accepted: 06/15/2015]
|
16
|
Chakraborty A, Chakrabarti S. A survey on prediction of specificity-determining sites in proteins. Brief Bioinform 2014; 16:71-88. [DOI: 10.1093/bib/bbt092]
|
17
|
Wilkins AD, Venner E, Marciano DC, Erdin S, Atri B, Lua RC, Lichtarge O. Accounting for epistatic interactions improves the functional analysis of protein structures. Bioinformatics 2013; 29:2714-21. [PMID: 24021383 PMCID: PMC3799481 DOI: 10.1093/bioinformatics/btt489]
Abstract
Motivation: The constraints under which sequence, structure and function coevolve are not fully understood. Bringing this mutual relationship to light can reveal the molecular basis of binding, catalysis and allostery, thereby identifying function and rationally guiding protein redesign. Underlying these relationships are the epistatic interactions that occur when the consequences of a mutation to a protein are determined by the genetic background in which it occurs. Based on prior data, we hypothesize that epistatic forces operate most strongly between residues nearby in the structure, resulting in smooth evolutionary importance across the structure. Methods and Results: We find that when residue scores of evolutionary importance are distributed smoothly between nearby residues, functional site prediction accuracy improves. Accordingly, we designed a novel measure of evolutionary importance that focuses on the interaction between pairs of structurally neighboring residues. This measure, which we term pair-interaction Evolutionary Trace, yields greater functional site overlap and better structure-based proteome-wide functional predictions. Conclusions: Our data show that the structural smoothness of evolutionary importance is a fundamental feature of the coevolution of sequence, structure and function. Mutations operate on individual residues, but selective pressure depends in part on the extent to which a mutation perturbs interactions with neighboring residues. In practice, this principle led us to redefine the importance of a residue in terms of the importance of its epistatic interactions with neighbors, yielding better annotation of functional residues, motivating experimental validation of a novel functional site in LexA and refining protein function prediction. Contact: lichtarge@bcm.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Angela D Wilkins
- Department of Molecular and Human Genetics, CIBR Center for Computational and Integrative Biomedical Research and Program in Structural and Computational Biology & Molecular Biophysics, Baylor College of Medicine, Houston, TX 77030 and Center for Human Genetic Research, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
|
18
|
Gu X, Zou Y, Su Z, Huang W, Zhou Z, Arendsee Z, Zeng Y. An update of DIVERGE software for functional divergence analysis of protein family. Mol Biol Evol 2013; 30:1713-9. [PMID: 23589455 DOI: 10.1093/molbev/mst069]
Abstract
DIVERGE is a software system for phylogeny-based analyses of protein family evolution and functional divergence. It provides a suite of statistical tools for selection and prioritization of the amino acid sites that are responsible for the functional divergence of a gene family. The synergistic efforts of DIVERGE and other methods have convincingly demonstrated that the pattern of rate change at a particular amino acid site may contain insightful information about the underlying functional divergence following gene duplication. These predicted sites may be used as candidates for further experiments. We are now releasing an updated version of DIVERGE with the following improvements: 1) a feasible approach to examining functional divergence in nearly complete sequences by including deletions and insertions (indels); 2) the calculation of the false discovery rate of functionally diverging sites; 3) estimation of the effective number of functional divergence-related sites that is reliable and insensitive to cutoffs; 4) a statistical test for asymmetric functional divergence; and 5) a new method to infer functional divergence specific to a given duplicate cluster. In addition, we have made efforts to improve software design and produce a well-written software manual for the general user.
Affiliation(s)
- Xun Gu
- State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai, China.
|
19
|
Suplatov D, Shalaeva D, Kirilin E, Arzhanik V, Švedas V. Bioinformatic analysis of protein families for identification of variable amino acid residues responsible for functional diversity. J Biomol Struct Dyn 2013; 32:75-87. [DOI: 10.1080/07391102.2012.750249]
|
20
|
Residue mutations and their impact on protein structure and function: detecting beneficial and pathogenic changes. Biochem J 2013; 449:581-94. [DOI: 10.1042/bj20121221]
Abstract
The present review focuses on the evolution of proteins and the impact of amino acid mutations on function from a structural perspective. Proteins evolve under the law of natural selection and undergo alternating periods of conservative evolution and of relatively rapid change. The likelihood of mutations being fixed in the genome depends on various factors, such as the fitness of the phenotype or the position of the residues in the three-dimensional structure. For example, co-evolution of residues located close together in three-dimensional space can occur to preserve global stability. Whereas point mutations can fine-tune the protein function, residue insertions and deletions (‘decorations’ at the structural level) can sometimes modify functional sites and protein interactions more dramatically. We discuss recent developments and tools to identify such episodic mutations, and examine their applications in medical research. Such tools have been tested on simulated data and applied to real data such as viruses or animal sequences. Traditionally, there has been little if any cross-talk between the fields of protein biophysics, protein structure–function and molecular evolution. However, the last several years have seen some exciting developments in combining these approaches to obtain an in-depth understanding of how proteins evolve. For example, a better understanding of how structural constraints affect protein evolution will greatly help us to optimize our models of sequence evolution. The present review explores this new synthesis of perspectives.
|
21
|
Teppa E, Wilkins AD, Nielsen M, Buslje CM. Disentangling evolutionary signals: conservation, specificity determining positions and coevolution. Implication for catalytic residue prediction. BMC Bioinformatics 2012; 13:235. [PMID: 22978315 PMCID: PMC3515339 DOI: 10.1186/1471-2105-13-235] [Received: 05/08/2012] [Accepted: 09/05/2012]
Abstract
Background: A large panel of methods exists that aim to identify residues with critical impact on protein function based on evolutionary signals, sequence and structure information. However, it is not clear to what extent these different methods overlap, and if any of the methods have higher predictive potential compared to others when it comes to, in particular, the identification of catalytic residues (CR) in proteins. Using a large set of enzymatic protein families and measures based on different evolutionary signals, we sought to break up the different components of the information content within a multiple sequence alignment to investigate their predictive potential and degree of overlap. Results: Our results demonstrate that the different methods included in the benchmark in general can be divided into three groups with a limited mutual overlap. One group containing real-value Evolutionary Trace (rvET) methods and conservation, another containing mutual information (MI) methods, and the last containing methods designed explicitly for the identification of specificity determining positions (SDPs): integer-value Evolutionary Trace (ivET), SDPfox, and XDET. In terms of prediction of CR, we find using a proximity score integrating structural information (as the sum of the scores of residues located within a given distance of the residue in question) that only the methods from the first two groups displayed a reliable performance. Next, we investigated to what degree proximity scores for conservation, rvET and cumulative MI (cMI) provide complementary information capable of improving the performance for CR identification. We found that integrating conservation with proximity scores for rvET and cMI achieved the highest performance. The proximity conservation score contained no complementary information when integrated with proximity rvET. Moreover, the signal from rvET provided only a limited gain in predictive performance when integrated with mutual information and conservation proximity scores. Combined, these observations demonstrate that the rvET and cMI scores add complementary information to the prediction system. Conclusions: This work contributes to the understanding of the different signals of evolution and also shows that it is possible to improve the detection of catalytic residues by integrating structural and higher order sequence evolutionary information with sequence conservation.
Affiliation(s)
- Elin Teppa
- Fundación Instituto Leloir, Avda, Patricias Argentinas 435, CABA, C1405BWE, Argentina
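The proximity score described in the abstract above (the sum of the scores of residues located within a given distance of the residue in question) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the use of one 3D coordinate per residue, the inclusion of the residue's own score in the sum, and the toy coordinates and cutoff are all assumptions.

```python
import math

def proximity_scores(coords, scores, cutoff):
    """For each residue, sum the per-residue scores of all residues whose
    representative atom lies within `cutoff` of it.

    coords: list of (x, y, z) tuples, one per residue (assumed representation).
    scores: list of per-residue scores (e.g. conservation), same order.
    cutoff: distance threshold, in the same units as the coordinates.
    Assumption: the residue itself (distance 0) is included in its own sum.
    """
    result = []
    for ci in coords:
        total = 0.0
        for cj, sj in zip(coords, scores):
            if math.dist(ci, cj) <= cutoff:
                total += sj
        result.append(total)
    return result


# Toy example: two residues close together, one far away.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_scores = [1.0, 2.0, 4.0]
prox = proximity_scores(toy_coords, toy_scores, cutoff=2.0)
# prox == [3.0, 3.0, 4.0]
```

The double loop is quadratic in the number of residues, which is usually acceptable for single proteins; a spatial index would be a natural replacement for larger structures.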
|
22
|
Aiello D, Caffrey DR. Evolution of specific protein-protein interaction sites following gene duplication. J Mol Biol 2012; 423:257-72. [PMID: 22789570 DOI: 10.1016/j.jmb.2012.06.039] [Received: 12/22/2011] [Revised: 05/16/2012] [Accepted: 06/29/2012]
Abstract
Gene duplication is a common evolutionary process that leads to the expansion and functional diversification of protein subfamilies. The evolutionary events that cause paralogous proteins to bind different protein ligands (functionally diverged interfaces) are investigated and compared to paralogous proteins that bind the same protein ligand (functionally preserved interfaces). We find that functionally diverged interfaces possess more subfamily-specific residues than functionally preserved interfaces. These subfamily-specific residues are usually partially buried at the interface rim and achieve specific binding through optimized hydrogen bond geometries. In addition to optimized hydrogen bond geometries, side-chain modeling experiments suggest that steric effects are also important for binding specificity. Residues that are completely buried at the interface hub are also less conserved in functionally diverged interfaces than in functionally preserved interfaces. Consistent with this finding, hub residues contribute less to free energy of binding in functionally diverged interfaces than in functionally preserved interfaces. Therefore, we propose that protein binding is a delicate balance between binding affinity that primarily occurs at the interface hub and binding specificity that primarily occurs at the interface rim.
Affiliation(s)
- Daniel Aiello
- Department of Medicine, University of Massachusetts Medical School, Worcester, MA 01605, USA
|
23
|
González AJ, Liao L, Wu CH. Predicting ligand binding residues and functional sites using multipositional correlations with graph theoretic clustering and kernel CCA. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:992-1001. [PMID: 22025754 DOI: 10.1109/tcbb.2011.136]
Abstract
We present a new computational method for predicting ligand binding residues and functional sites in protein sequences. These residues and sites tend to be not only conserved, but also exhibit strong correlation due to the selection pressure during evolution in order to maintain the required structure and/or function. To explore the effect of correlations among multiple positions in the sequences, the method uses graph theoretic clustering and kernel-based canonical correlation analysis (kCCA) to identify binding and functional sites in protein sequences as the residues that exhibit strong correlation between the residues’ evolutionary characterization at the sites and the structure-based functional classification of the proteins in the context of a functional family. The results of testing the method on two well-curated data sets show that the prediction accuracy as measured by Receiver Operating Characteristic (ROC) scores improves significantly when multipositional correlations are accounted for.
Affiliation(s)
- Alvaro J González
- Computer and Information Sciences Department, University of Delaware, 101 Smith Hall, Newark, DE 19716, USA.
|
24
|
Sorn S, Sok T, Ly S, Rith S, Tung N, Viari A, Gavotte L, Holl D, Seng H, Asgari N, Richner B, Laurent D, Chea N, Duong V, Toyoda T, Yasuda CY, Kitsutani P, Zhou P, Bing S, Deubel V, Donis R, Frutos R, Buchy P. Dynamic of H5N1 virus in Cambodia and emergence of a novel endemic sub-clade. INFECTION GENETICS AND EVOLUTION 2012; 15:87-94. [PMID: 22683363 DOI: 10.1016/j.meegid.2012.05.013] [Received: 01/21/2012] [Revised: 05/13/2012] [Accepted: 05/28/2012]
Abstract
In Cambodia, the first detection of HPAI H5N1 virus in birds occurred in January 2004 and since then there have been 33 outbreaks in poultry while 21 human cases were reported. The origin and dynamics of these epizootics in Cambodia remain unclear. In this work we used a range of bioinformatics methods to analyze the Cambodian virus sequences together with those from neighboring countries. Six HA lineages belonging to clades 1 and 1.1 were identified since 2004. Lineage 1 shares an ancestor with viruses from Thailand and disappeared after 2005, to be replaced by lineage 2 originating from Vietnam and then by lineage 3. The highly adapted lineage 4 was seen only in Cambodia. Lineage 5 is circulating both in Vietnam and Cambodia since 2008 and was probably introduced in Cambodia through unregistered transboundary poultry trade. Lineage 6 is endemic to Cambodia since 2010 and could be classified as a new clade according to WHO/OIE/FAO criteria for H5N1 virus nomenclature. We propose to name it clade 1.1A. There is a direct filiation of lineages 2 to 6 with a temporal evolution and geographic differentiation for lineages 4 and 6. By the end of 2011, two lineages, i.e. lineages 5 and 6, with different transmission paths cocirculate in Cambodia. The presence of lineage 6 only in Cambodia suggests the existence of a transmission specific to this country whereas the presence of lineage 5 in both Cambodia and Vietnam indicates a distinct way of circulation of infected poultry.
Affiliation(s)
- San Sorn
- Virology Unit/National Influenza Centre, Institut Pasteur in Cambodia, 5 Monivong Blvd, PO Box 983, Phnom Penh, Cambodia
|
25
|
Complex dynamic of dengue virus serotypes 2 and 3 in Cambodia following series of climate disasters. INFECTION GENETICS AND EVOLUTION 2012; 15:77-86. [PMID: 22677620 DOI: 10.1016/j.meegid.2012.05.012] [Received: 03/07/2012] [Revised: 05/13/2012] [Accepted: 05/16/2012]
Abstract
The Dengue National Control Program was established in Cambodia in 2000 and has reported between 10,000 and 40,000 dengue cases per year with a case fatality rate ranging from 0.7 to 1.7. In this study 39 DENV-2 and 57 DENV-3 viruses isolated from patients between 2000 and 2008 were fully sequenced. Five DENV-2 and four DENV-3 distinct lineages with different dynamics were identified. Each lineage was characterized by the presence of specific mutations with no evidence of recombination. In both DENV-2 and DENV-3 the lineages present prior to 2003 were replaced after that date by unrelated lineages. After 2003, DENV-2 lineages D2-3 and D2-4 cocirculated until 2007, when they were almost completely replaced by a lineage D2-5 which emerged from D2-3. Conversely, all DENV-3 lineages remained, diversified and cocirculated with novel lineages emerging. Years 2006 and 2007 were marked by a high prevalence of DENV-3 and 2007 with a large dengue outbreak and a high proportion of patients with severe disease. Selective sweeps in DENV-1 and DENV-2 were linked to immunological escape to a predominately DENV-3-driven immunological response. The complex dynamic of dengue in Cambodia in the last ten years has been associated with a combination of stochastic climatic events, cocirculation, coevolution, adaptation to different vector populations, and with the human population immunological landscape.
|
26
|
Muth T, Garcia-Martin JA, Rausell A, Juan D, Valencia A, Pazos F. JDet: interactive calculation and visualization of function-related conservation patterns in multiple sequence alignments and structures. Bioinformatics 2011; 28:584-6. [DOI: 10.1093/bioinformatics/btr688]
|
27
|
Schmidt T, Haas J, Gallo Cassarino T, Schwede T. Assessment of ligand-binding residue predictions in CASP9. Proteins 2011; 79 Suppl 10:126-36. [PMID: 21987472 DOI: 10.1002/prot.23174] [Received: 05/23/2011] [Revised: 07/29/2011] [Accepted: 08/04/2011]
Abstract
Interactions between proteins and their ligands play central roles in many physiological processes. The structural details for most of these interactions, however, have not yet been characterized experimentally. Therefore, various computational tools have been developed to predict the location of binding sites and the amino acid residues interacting with ligands. In this manuscript, we assess the performance of 33 methods participating in the ligand-binding site prediction category in CASP9. The overall accuracy of ligand-binding site predictions in CASP9 appears rather high (average Matthews correlation coefficient of 0.62 for the 10 top performing groups) and, compared to previous experiments, more groups performed equally well. However, this should be seen in the context of a strong bias in the test data toward easy template-based models. Overall, the top performing methods have converged to a similar approach using ligand-binding site inference from related homologous structures, which limits their applicability for difficult de novo prediction targets. Here, we present the results of the CASP9 assessment of the ligand-binding site category, discuss examples for successful and challenging prediction targets in CASP9, and finally suggest changes in the format of the experiment to overcome the current limitations of the assessment.
Affiliation(s)
- Tobias Schmidt
- Biozentrum, University of Basel, SIB Swiss Institute of Bioinformatics, Klingelbergstrasse 50-70, Basel, Switzerland
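The Matthews correlation coefficient quoted in the abstract above is the standard measure MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch of computing it from per-residue binary labels follows; the function name `mcc`, the 0/1 label encoding, and the convention of returning 0.0 when the denominator is zero are assumptions for illustration, not details taken from the CASP9 assessment.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.

    y_true, y_pred: equal-length sequences of 0/1 flags, one per residue
    (1 = ligand-binding, 0 = not binding; this encoding is an assumption).
    Returns 0.0 when any marginal count is zero (a common convention).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example: one of two binding residues is recovered, no false positives.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 0]
toy_mcc = mcc(toy_true, toy_pred)  # 2 / sqrt(12), about 0.577
```

Unlike plain accuracy, this measure stays informative when binding residues are a small minority of all residues, which is the usual situation in binding-site prediction.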
|
28
|
Benítez-Páez A, Cárdenas-Brito S, Gutiérrez AJ. A practical guide for the computational selection of residues to be experimentally characterized in protein families. Brief Bioinform 2011; 13:329-36. [PMID: 21930656 DOI: 10.1093/bib/bbr052]
Abstract
In recent years, numerous biocomputational tools have been designed to extract functional and evolutionary information from multiple sequence alignments (MSAs) of proteins and genes. Most biologists working actively on the characterization of proteins from a single or family perspective use the MSA analysis to retrieve valuable information about amino acid conservation and the functional role of residues in query protein(s). In MSAs, adjustment of alignment parameters is a key point to improve the quality of MSA output. However, this issue is frequently underestimated and/or misunderstood by scientists and there is no in-depth knowledge available in this field. This brief review focuses on biocomputational approaches complementary to MSA to help distinguish functional residues in protein families. These additional analyses involve issues ranging from phylogenetic to statistical, which address the detection of amino acids pivotal for protein function at any level. In recent years, a large number of tools has been designed for this very purpose. Using some of these relevant, useful tools, we have designed a practical pipeline to perform in silico studies with a view to improving the characterization of family proteins and their functional residues. This review-guide aims to present biologists a set of specially designed tools to study proteins. These tools are user-friendly as they use web servers or easy-to-handle applications. Such criteria are essential for this review as most of the biologists (experimentalists) working in this field are unfamiliar with these biocomputational analysis approaches.
Affiliation(s)
- Alfonso Benítez-Páez
- Bioinformatic Analysis Group - GABi, Centro de Investigación y Desarrollo en Biotecnología, Bogotá DC, Colombia.
|
29
|
Duong V, Simmons C, Gavotte L, Viari A, Ong S, Chantha N, Lennon NJ, Birren BW, Vong S, Farrar JJ, Henn MR, Deubel V, Frutos R, Buchy P. Genetic diversity and lineage dynamic of dengue virus serotype 1 (DENV-1) in Cambodia. INFECTION GENETICS AND EVOLUTION 2011; 15:59-68. [PMID: 21757030 DOI: 10.1016/j.meegid.2011.06.019] [Received: 03/20/2011] [Revised: 06/12/2011] [Accepted: 06/27/2011]
Abstract
In Cambodia, dengue virus (DENV) was first isolated in 1963 and has become endemic, with epidemic peaks during the rainy season. Since 2000, the Dengue National Control Program has reported from 10,000 to 40,000 cases per year with fatality rates ranging from 0.7 to 1.7. All four dengue serotypes are found circulating in Cambodia with alternating predominance of serotypes DENV-2 and DENV-3. DENV-1 represents from 5% to 20% of all circulating viruses, depending upon the year. In this work, 79 clinical strains of DENV-1 were isolated between 2000 and 2009 and their genomes fully sequenced. Four distinct lineages with different dynamics were identified. The main evolutionary drive was negative selective pressure, but each lineage was characterized by the presence of specific mutations acquired through evolution. Coexistence, extinction and replacement of lineages occurred over the 10-year period. Lineages 1, 2 and 3 were all detected in 2000-2002 and disappeared in 2003, 2004-2005 and 2007, respectively. Lineages 1 and 2 displayed different dynamics. Lineage 1 was very diverse whereas lineage 2 was very homogeneous. Lineage 4, which derived from lineage 3 in 2003, remained the only one at the end of the sampling period in 2008-2009 owing to a selective sweep. The lineage dynamics of DENV-1 viruses and consequences for molecular epidemiology are discussed.
Affiliation(s)
- Veasna Duong
- Institut Pasteur in Cambodia, Réseau International des Instituts Pasteur, 5 Monivong Boulevard, PO Box 983, Phnom Penh, Cambodia
|
30
|
Wass MN, David A, Sternberg MJE. Challenges for the prediction of macromolecular interactions. Curr Opin Struct Biol 2011; 21:382-90. [DOI: 10.1016/j.sbi.2011.03.013] [Received: 12/20/2010] [Revised: 03/04/2011] [Accepted: 03/24/2011]
|
31
|
Cleveland SB, Davies J, McClure MA. A bioinformatics approach to the structure, function, and evolution of the nucleoprotein of the order mononegavirales. PLoS One 2011; 6:e19275. [PMID: 21559282 PMCID: PMC3086907 DOI: 10.1371/journal.pone.0019275] [Received: 09/03/2010] [Accepted: 04/01/2011]
Abstract
The goal of this bioinformatic study is to investigate sequence conservation in relation to evolutionary function/structure of the nucleoprotein of the order Mononegavirales. In the combined analysis of 63 representative nucleoprotein (N) sequences from four viral families (Bornaviridae, Filoviridae, Rhabdoviridae, and Paramyxoviridae) we predict the regions of protein disorder, intra-residue contact and co-evolving residues. Correlations between location and conservation of predicted regions illustrate a strong division between families while highlighting conservation within individual families. These results suggest that the conserved regions among the nucleoproteins, specifically within Rhabdoviridae and Paramyxoviridae, but also generally among all members of the order, reflect an evolutionary advantage in maintaining these sites for the viral nucleoprotein as part of the transcription/replication machinery. Results indicate conservation of disorder in the C-terminal region of the representative proteins, which is important for interacting with the phosphoprotein and the large subunit polymerase during transcription and replication. Additionally, the C-terminal region of the protein preceding the disordered region is predicted to be important for interacting with the encapsidated genome. Portions of the N-terminus are responsible for N:N stability and interactions identified by the presence or lack of co-evolving intra-protein contact predictions. The validation of these prediction results by current structural information illustrates the benefits of the Disorder, Intra-residue contact and Compensatory mutation Correlator (DisICC) pipeline as a method for quickly characterizing proteins and providing the most likely residues and regions necessary to target for disruption in viruses that have little structural information available.
Affiliation(s)
- Sean B Cleveland
- Department of Microbiology and the Center for Computational Biology, Montana State University, Bozeman, Montana, USA.
|
32
|
Tungtur S, Parente DJ, Swint-Kruse L. Functionally important positions can comprise the majority of a protein's architecture. Proteins 2011; 79:1589-608. [PMID: 21374721 PMCID: PMC3076786 DOI: 10.1002/prot.22985] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Received: 06/14/2010] [Revised: 12/08/2010] [Accepted: 12/15/2010] [Indexed: 01/13/2023]
Abstract
Concomitant with the genomic era, many bioinformatics programs have been developed to identify functionally important positions from sequence alignments of protein families. To evaluate these analyses, many have used the LacI/GalR family and determined whether positions predicted to be "important" are validated by published experiments. However, we previously noted that predictions do not identify all of the experimentally important positions present in the linker regions of these homologs. In an attempt to reconcile these differences, we corrected and expanded the LacI/GalR sequence set commonly used in sequence/function analyses. Next, a variety of analyses were carried out (1) for the entire LacI/GalR sequence set and (2) for a subset of homologs with functionally important "YxPxxxAxxL" motifs in their linkers. This strategy was devised to determine whether predictions could be improved by knowledge-based sequence sorting and, for some analyses, did increase the number of linker positions identified. However, two functionally important linker positions were not reliably identified by any analysis. Finally, we compared the new predictions to all known experimental data for E. coli LacI and three homologous linkers. From these, we estimate that >50% of positions are important to the functions of the LacI/GalR homologs. In corollary, neutral positions might occur less frequently and might be easier to detect in sequence analyses. Although analyses have successfully guided mutations that partially exchange protein functions, a better experimental understanding of the sequence/function relationships in protein families would be helpful for uncovering the remaining rules used by nature to evolve new protein functions.
Affiliation(s)
- Liskin Swint-Kruse
- Department of Biochemistry and Molecular Biology, The University of Kansas Medical Center, 3901 Rainbow Blvd., MSN 3030, Kansas City, KS 66160
|
33
|
Kc DB, Livesay DR. Topology improves phylogenetic motif functional site predictions. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:226-233. [PMID: 21071810 DOI: 10.1109/tcbb.2009.60] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 05/30/2023]
Abstract
Prediction of protein functional sites from sequence-derived data remains an open bioinformatics problem. We have developed a phylogenetic motif (PM) functional site prediction approach that identifies functional sites from alignment fragments that parallel the evolutionary patterns of the family. In our approach, PMs are identified by comparing tree topologies of each alignment fragment to that of the complete phylogeny. Herein, we bypass the phylogenetic reconstruction step and identify PMs directly from distance matrix comparisons. In order to optimize the new algorithm, we consider three different distance matrices and 13 different matrix similarity scores. We assess the performance of the various approaches on a structurally nonredundant data set that includes three types of functional site definitions. Without exception, the predictive power of the original approach outperforms the distance matrix variants. While the distance matrix methods fail to improve upon the original approach, our results are important because they clearly demonstrate that the improved predictive power is based on the topological comparisons. This means that phylogenetic trees are a straightforward, yet powerful, way to improve functional site prediction accuracy. While complementary studies have shown that topology improves predictions of protein-protein interactions, this report represents the first demonstration that trees improve functional site predictions as well.
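The distance-matrix variant described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the three distance matrices or the 13 similarity scores, so the p-distance and the Pearson correlation used here are assumptions chosen for simplicity, and all function names (`p_distance`, `distance_vector`, `pearson`, `window_scores`) are hypothetical.

```python
from itertools import combinations
import math

def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_vector(seqs):
    """Upper triangle of the pairwise distance matrix, flattened to a list."""
    return [p_distance(a, b) for a, b in combinations(seqs, 2)]

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def window_scores(alignment, width):
    """Score each alignment window by how well its distances track the full alignment."""
    full = distance_vector(alignment)
    scores = []
    for start in range(len(alignment[0]) - width + 1):
        fragment = [seq[start:start + width] for seq in alignment]
        scores.append(pearson(distance_vector(fragment), full))
    return scores

# Toy alignment of three sequences; higher scores mark windows whose
# pairwise distances resemble those of the whole alignment.
alignment = ["AAGG", "AAGC", "TTCC"]
print(window_scores(alignment, 2))
```

Windows with high scores would be the candidates for phylogenetic motifs under this simplified scheme; the abstract reports that such distance-matrix variants did not outperform the original tree-topology comparison.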
Affiliation(s)
- Dukka B Kc
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA.
|
34
|
de Melo-Minardi RC, Bastard K, Artiguenave F. Identification of subfamily-specific sites based on active sites modeling and clustering. Bioinformatics 2010; 26:3075-82. [PMID: 20980272 DOI: 10.1093/bioinformatics/btq595] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Current computational approaches to function prediction are mostly based on protein sequence classification and transfer of annotation from known proteins to their closest homologous sequences, relying on the orthology concept of function conservation. This approach suffers from a major weakness: annotation reliability depends on global sequence similarity to known proteins and is poor for enzyme superfamilies that catalyze different reactions. Structural biology offers a different strategy to overcome the problem of annotation by adding information about protein 3D structures. This information can be used to identify amino acids located in active sites, focusing on detection of functionally polymorphic residues in an enzyme superfamily. Structural genomics programs are providing more and more novel protein structures at a high-throughput rate. However, there is still a huge gap between the number of sequences and available structures. Computational methods such as homology modeling provide reliable approaches to bridge this gap and could be a new precise tool to annotate protein functions. RESULTS: Here, we present the Active Sites Modeling and Clustering (ASMC) method, a novel unsupervised method to classify sequences using structural information of protein pockets. ASMC combines homology modeling of family members, structural alignment of modeled active sites and a subsequent hierarchical conceptual classification. Comparison of profiles obtained from computed clusters allows the identification of residues correlated with subfamily function divergence, called specificity determining positions. The ASMC method has been validated on a benchmark of 42 Pfam families for which previously resolved holo-structures were available. ASMC was also applied to several families containing known protein structures and comprehensive functional annotations. We will discuss how ASMC improves annotation and understanding of protein family functions by giving some specific illustrative examples on nucleotidyl cyclases, protein kinases and serine proteases. AVAILABILITY: http://www.genoscope.fr/ASMC/.
|
35
|
Brandt BW, Feenstra KA, Heringa J. Multi-Harmony: detecting functional specificity from sequence alignment. Nucleic Acids Res 2010; 38:W35-40. [PMID: 20525785 PMCID: PMC2896201 DOI: 10.1093/nar/gkq415] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.0] [Indexed: 11/12/2022]
Abstract
Many protein families contain sub-families with functional specialization, such as binding different ligands or being involved in different protein–protein interactions. A small number of amino acids generally determine functional specificity. The identification of these residues can aid the understanding of protein function and help find targets for experimental analysis. Here, we present multi-Harmony, an interactive web server for detecting sub-type-specific sites in proteins starting from a multiple sequence alignment. Combining our Sequence Harmony (SH) and multi-Relief (mR) methods in one web server allows simultaneous analysis and comparison of specificity residues; furthermore, both methods have been significantly improved and extended. SH has been extended to cope with more than two sub-groups. mR has been changed from a sampling implementation to a deterministic one, making it more consistent and user friendly. For both methods Z-scores are reported. The multi-Harmony web server produces a dynamic output page, which includes interactive connections to the Jalview and Jmol applets, thereby allowing interactive analysis of the results. Multi-Harmony is available at http://www.ibi.vu.nl/programs/shmrwww.
Affiliation(s)
- Bernd W Brandt
- Centre for Integrative Bioinformatics, VU University Amsterdam, De Boelelaan 1081A, 1081HV Amsterdam, The Netherlands
|
36
|
Hung SS, Wasmuth J, Sanford C, Parkinson J. DETECT--a density estimation tool for enzyme classification and its application to Plasmodium falciparum. Bioinformatics 2010; 26:1690-8. [PMID: 20513663 DOI: 10.1093/bioinformatics/btq266] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Indexed: 11/13/2022]
Abstract
MOTIVATION: A major challenge in genomics is the accurate annotation of component genes. Enzymes are typically predicted using homology-based search methods, where the membership of a protein in an enzyme family is based on single-sequence comparisons. As such, these methods are often error-prone and lack useful measures of reliability for the prediction. RESULTS: Here, we present DETECT, a probabilistic method for enzyme prediction that accounts for the sequence diversity across enzyme families. By comparing the global alignment scores of an unknown protein to those of all known enzymes, an integrated likelihood score can be readily calculated, ranking the reaction classes relevant for that protein. Comparisons to BLAST reveal significant improvements in enzyme annotation accuracy. Applied to Plasmodium falciparum, we identify potential annotation errors and predict novel enzymes of therapeutic interest. AVAILABILITY: A standalone application is available from the website: http://www.compsysbio.org/projects/DETECT/
Affiliation(s)
- Stacy S Hung
- Program in Molecular Structure and Function, Hospital for Sick Children, 15-704 MaRS TMDT East, 101 College Street, Toronto, ON M5G 1L7, Canada
|
37
|
Izarzugaza JMG, Redfern OC, Orengo CA, Valencia A. Cancer-associated mutations are preferentially distributed in protein kinase functional sites. Proteins 2010; 77:892-903. [PMID: 19626714 DOI: 10.1002/prot.22512] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Indexed: 02/01/2023]
Abstract
Protein kinases are a superfamily involved in many crucial cellular processes, including signal transmission and regulation of the cell cycle. As a consequence of this role, kinases have been reported to be associated with many types of cancer and are considered potential therapeutic targets. We analyzed the distribution of pathogenic somatic point mutations (drivers) in the protein kinase superfamily with respect to their location in the protein, such as in structurally, evolutionarily, and functionally relevant regions. We find these driver mutations are more clearly associated with key protein features than other somatic mutations (passengers) that have not been directly linked to tumor progression. This observation fits well with the expected implication of alterations in protein kinase function in cancer pathogenicity. To explain the relevance of the detected association of cancer driver mutations at the molecular level in the human kinome, we compare these with genetically inherited mutations (SNPs). We find that the subset of nonsynonymous SNPs that are associated with disease, but sufficiently mild to be widespread in the population, tend to avoid those key protein regions, where they could be more detrimental to protein function. This tendency contrasts with the one detected for cancer-associated driver mutations, which seem to be more directly implicated in the alteration of protein function. The detailed analysis of protein kinase groups and a number of relevant examples confirm the relation between cancer-associated driver mutations and key regions for protein kinase structure and function.
Affiliation(s)
- Jose M G Izarzugaza
- Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), C/Melchor Fernández Almagro 3, Madrid E28029, Spain
|
38
|
Protein interactions and ligand binding: from protein subfamilies to functional specificity. Proc Natl Acad Sci U S A 2010; 107:1995-2000. [PMID: 20133844 DOI: 10.1073/pnas.0908044107] [Citation(s) in RCA: 113] [Impact Index Per Article: 7.5] [Indexed: 11/18/2022]
Abstract
The divergence accumulated during the evolution of protein families translates into their internal organization as subfamilies, and it is directly reflected in the characteristic patterns of differentially conserved residues. These specifically conserved positions in protein subfamilies are known as "specificity determining positions" (SDPs). Previous studies have limited their analysis to the study of the relationship between these positions and ligand-binding specificity, demonstrating significant yet limited predictive capacity. We have systematically extended this observation to include the role of differential protein interactions in the segregation of protein subfamilies and explored in detail the structural distribution of SDPs at protein interfaces. Our results show the extensive influence of protein interactions in the evolution of protein families and the widespread association of SDPs with protein interfaces. The combined analysis of SDPs in interfaces and ligand-binding sites provides a more complete picture of the organization of protein families, constituting the necessary framework for a large scale analysis of the evolution of protein function.
|
39
|
Goldstein P, Zucko J, Vujaklija D, Krisko A, Hranueli D, Long PF, Etchebest C, Basrak B, Cullum J. Clustering of protein domains for functional and evolutionary studies. BMC Bioinformatics 2009; 10:335. [PMID: 19832975 PMCID: PMC2770074 DOI: 10.1186/1471-2105-10-335] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 02/14/2009] [Accepted: 10/15/2009] [Indexed: 11/16/2022]
Abstract
Background: The number of protein family members defined by DNA sequencing is usually much larger than those characterised experimentally. This paper describes a method to divide protein families into subtypes purely on sequence criteria. Comparison with experimental data allows an independent test of the quality of the clustering. Results: An evolutionary split statistic is calculated for each column in a protein multiple sequence alignment; the statistic has a larger value when a column is better described by an evolutionary model that assumes clustering around two or more amino acids rather than a single amino acid. The user selects columns (typically the top ranked columns) to construct a motif. The motif is used to divide the family into subtypes using a stochastic optimization procedure related to the deterministic annealing EM algorithm (DAEM), which yields a specificity score showing how well each family member is assigned to a subtype. The clustering obtained is not strongly dependent on the number of amino acids chosen for the motif. The robustness of this method was demonstrated using six well characterized protein families: nucleotidyl cyclase, protein kinase, dehydrogenase, two polyketide synthase domains and small heat shock proteins. Phylogenetic trees did not allow accurate clustering for three of the six families. Conclusion: The method clustered the families into functional subtypes with an accuracy of 90 to 100%. False assignments usually had a low specificity score.
Affiliation(s)
- Pavle Goldstein
- Department of Genetics, University of Kaiserslautern, Postfach 3049, 67653 Kaiserslautern, Germany
|
40
|
Comparing the functional roles of nonconserved sequence positions in homologous transcription repressors: implications for sequence/function analyses. J Mol Biol 2009; 395:785-802. [PMID: 19818797 DOI: 10.1016/j.jmb.2009.10.001] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 08/03/2009] [Revised: 10/01/2009] [Accepted: 10/02/2009] [Indexed: 11/21/2022]
Abstract
The explosion of protein sequences deduced from genetic code has led to both a problem and a potential resource: Efficient data use requires interpreting the functional impact of sequence change without experimentally characterizing each protein variant. Several groups have hypothesized that interpretation could be aided by analyzing the sequences of naturally occurring homologues. To that end, myriad sequence/function analyses have been developed to predict which conserved, semi-conserved, and nonconserved positions are functionally important. These positions must be discriminated from the nonconserved positions that are functionally silent. However, the assumptions that underlie sequence analyses are based on experimental results that are sparse and usually designed to address different questions. Here, we use three homologues from a test family common to bioinformatics (the LacI/GalR transcription repressors) to test a common assumption: If a position is functionally important for one family member, it has similar importance in all homologues. We generated experimental sequence/function information for each nonconserved position in the 18 amino acids that link the DNA-binding and regulatory domains of three LacI/GalR homologues. We find that the functional importance of each position is preserved among the three linkers, albeit to different degrees. We also find that every linker position contributes to function, which has twofold implications. (1) Since the linker positions range from highly conserved to semi-conserved to nonconserved and contribute to affinity, selectivity, and allosteric response, we assert that sequence/function analyses must identify positions in the LacI/GalR linkers to be qualified as "successful". Many analyses overlook this region since most of the residues do not directly contact ligand. (2) No position in the LacI/GalR linker is functionally silent. This finding is inconsistent with another underlying principle of many analyses: Using sequence sets to discriminate important from non-contributing positions obligates silent positions, which denotes that most homologues tolerate a variety of amino acid substitutions at the position without functional change. Instead, additional combinatorial mutants in the LacI/GalR linkers show that particular substitutions can be silent in a context-dependent manner. Thus, specific permutations of sequence change (rather than change at silent positions) would facilitate neutral drift during evolution. Finally, the combinatorial mutants also reveal functional synergy between semi- and nonconserved positions. Such functional relationships would be missed by analyses that rely primarily upon co-evolution.
|
41
|
Frenkel-Morgenstern M, Tworowski D, Klipcan L, Safro M. Intra-protein compensatory mutations analysis highlights the tRNA recognition regions in aminoacyl-tRNA synthetases. J Biomol Struct Dyn 2009; 27:115-26. [PMID: 19583438 DOI: 10.1080/07391102.2009.10507302] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/28/2022]
Abstract
The aminoacyl-tRNA synthetases (aaRSs) covalently attach amino acids to their corresponding nucleic acid adapter molecules, tRNAs. The interactions in the tRNA-aaRS complexes are mostly non-specific and largely electrostatic. Tracing the mutual adaptation of aaRSs and tRNAs throughout evolution offers a clearer view of how aaRS-tRNA systems preserve patterns of tRNA recognition and binding. In this study, we used compensatory mutation analysis to explore the adaptation of aaRSs in response to random mutations that can occur in the tRNA-recognition area. We showed that the frequency of compensatory mutations among residues that belong to the recognition region is 1.75-fold higher than that of the exposed residues. The highest frequencies of compensatory mutations are observed for pairs of charged residues, wherein one residue is located within the tRNA-recognition area, while the second is placed outside of the area and contributes to the formation of the aaRS electrostatic landscape. Given charged residues are compensated by buried charged residues in more than 60% of the analyzed mutations. The cytoplasmic and mitochondrial aaRSs preserve similar patterns of compensatory mutations in the tRNA recognition areas. Moreover, we found that mitochondrial aaRSs demonstrate a significant increase in the frequency of compensatory mutations in the area. Our findings shed light on the physical nature of compensatory mutations in aaRSs, which keep tRNA-recognition patterns unchanged.
|
42
|
Abstract
Here we detail the assessment process for the binding site prediction category of the eighth Critical Assessment of Protein Structure Prediction experiment (CASP8). Predictions were only evaluated for those targets that bound biologically relevant ligands and were assessed using the Matthews Correlation Coefficient. The results of the analysis clearly demonstrate that three predictors from two groups (Lee and Sternberg) stand out from the rest. A further two groups perform well over subsets of metal binding or nonmetal ligand binding targets. The best methods were able to make consistently reliable predictions based on model structures, though it was noticeable that the two targets that were not well predicted were also the hardest targets. The number of predictors that submitted new methods in this category was highly encouraging and suggests that current technology is at the level that experimental biochemists and structural biologists could benefit from what is clearly a growing field.
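For readers unfamiliar with the metric named in this abstract, the Matthews Correlation Coefficient for a per-residue binding-site prediction can be computed as in the short Python sketch below. The representation of predictions as boolean lists and the function name `mcc` are assumptions made for illustration; the abstract does not describe how the assessors encoded predictions.

```python
import math

def mcc(predicted, observed):
    """Matthews Correlation Coefficient for two equal-length boolean sequences.

    True marks a residue labelled as binding. Returns 0.0 when the
    denominator vanishes (a common convention, assumed here).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    tp = sum(1 for p, o in zip(predicted, observed) if p and o)
    tn = sum(1 for p, o in zip(predicted, observed) if not p and not o)
    fp = sum(1 for p, o in zip(predicted, observed) if p and not o)
    fn = sum(1 for p, o in zip(predicted, observed) if not p and o)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Four residues: one true positive, one false positive, two true negatives.
print(mcc([True, True, False, False], [True, False, False, False]))
```

The coefficient ranges from -1 (every residue misclassified) through 0 (no better than chance) to 1 (perfect agreement), which makes it suitable when binding residues are a small minority of the protein.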
Affiliation(s)
- Tobias Schmidt
- Biozentrum, University of Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Jürgen Haas
- Biozentrum, University of Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Tiziano Gallo Cassarino
- Biozentrum, University of Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Torsten Schwede
- Biozentrum, University of Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
|
43
|
Meinhardt S, Swint-Kruse L. Experimental identification of specificity determinants in the domain linker of a LacI/GalR protein: bioinformatics-based predictions generate true positives and false negatives. Proteins 2008; 73:941-57. [PMID: 18536016 PMCID: PMC2585155 DOI: 10.1002/prot.22121] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Indexed: 11/07/2022]
Abstract
In protein families, conserved residues often contribute to a common general function, such as DNA-binding. However, unique attributes for each homolog (e.g. recognition of alternative DNA sequences) must arise from variation in other functionally-important positions. The locations of these "specificity determinant" positions are obscured amongst the background of varied residues that do not make significant contributions to either structure or function. To isolate specificity determinants, a number of bioinformatics algorithms have been developed. When applied to the LacI/GalR family of transcription regulators, several specificity determinants are predicted in the 18 amino acids that link the DNA-binding and regulatory domains. However, results from alternative algorithms are only in partial agreement with each other. Here, we experimentally evaluate these predictions using an engineered repressor comprising the LacI DNA-binding domain, the LacI linker, and the GalR regulatory domain (LLhG). "Wild-type" LLhG has altered DNA specificity and weaker lacO(1) repression compared to LacI or a similar LacI:PurR chimera. Next, predictions of linker specificity determinants were tested, using amino acid substitution and in vivo repression assays to assess functional change. In LLhG, all predicted sites are specificity determinants, as well as three sites not predicted by any algorithm. Strategies are suggested for diminishing the number of false negative predictions. Finally, individual substitutions at LLhG specificity determinants exhibited a broad range of functional changes that are not predicted by bioinformatics algorithms. Results suggest that some variants have altered affinity for DNA, some have altered allosteric response, and some appear to have changed specificity for alternative DNA ligands.
Affiliation(s)
- Sarah Meinhardt
- Department of Biochemistry and Molecular Biology, MSN 3030, 3901 Rainbow Blvd., The University of Kansas Medical Center, Kansas City, KS 66160
- Liskin Swint-Kruse
- Department of Biochemistry and Molecular Biology, MSN 3030, 3901 Rainbow Blvd., The University of Kansas Medical Center, Kansas City, KS 66160
|
44
|
Dukka BKC, Livesay DR. Improving position-specific predictions of protein functional sites using phylogenetic motifs. Bioinformatics 2008; 24:2308-16. [PMID: 18723520 DOI: 10.1093/bioinformatics/btn454] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Accurate computational prediction of protein functional sites is critical to maximizing the utility of recent high-throughput sequencing efforts. Among the available approaches, position-specific conservation scores remain among the most popular due to their accuracy and ease of computation. Unfortunately, high false positive rates remain a limiting factor. Using phylogenetic motifs (PMs), we have developed two combined (conservation + PMs) prediction schemes that significantly improve prediction accuracy. RESULTS: Our first approach, called position-specific MINER (psMINER), rank orders alignment columns by conservation. Subsequently, positions that are not also identified as PMs are excluded from the prediction set. This approach improves prediction accuracy, in a statistically significant way, compared to the underlying conservation scores. Increased accuracy is a general result, meaning improvement is observed over several different conservation scores that span a continuum of complexity. In addition, a hybrid MINER (hMINER) that quantitatively considers both scoring regimes provides further improvement. More importantly, it provides critical insight into the relative importance of phylogeny versus alignment conservation. Both methods outperform other common prediction algorithms that also utilize phylogenetic concepts. Finally, we demonstrate that the presented results are critically sensitive to functional site definition, thus highlighting the need for more complete benchmarks within the prediction community.
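The two-step psMINER idea stated in this abstract (rank columns by conservation, then drop columns that are not also in a phylogenetic motif) can be sketched in a few lines of Python. This is only an illustration of the filtering logic: the function name `psminer_like_filter`, the dictionary input, and the tie-breaking by position are assumptions, and the sketch does not compute conservation scores or phylogenetic motifs itself.

```python
def psminer_like_filter(conservation, pm_positions):
    """Rank alignment columns by conservation, keeping only PM columns.

    conservation: dict mapping column index -> conservation score
                  (higher means more conserved; the score type is assumed).
    pm_positions: set of column indices that fall inside phylogenetic motifs.
    Ties are broken by column index, which is an arbitrary choice here.
    """
    ranked = sorted(conservation, key=lambda pos: (-conservation[pos], pos))
    return [pos for pos in ranked if pos in pm_positions]

# Column 1 is well conserved but lies outside every motif, so it is dropped.
scores = {1: 0.90, 2: 0.80, 3: 0.95, 4: 0.20}
motif_columns = {2, 3, 4}
print(psminer_like_filter(scores, motif_columns))  # [3, 2, 4]
```

The filter can only remove predictions, which is consistent with the abstract's framing of the method as a way to reduce false positives from conservation scores alone.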
Affiliation(s)
- Bahadur K C Dukka
- Department of Computer Science and Bioinformatics Research Center, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
|
45
|
Redfern OC, Dessailly B, Orengo CA. Exploring the structure and function paradigm. Curr Opin Struct Biol 2008; 18:394-402. [PMID: 18554899 PMCID: PMC2561214 DOI: 10.1016/j.sbi.2008.05.007] [Citation(s) in RCA: 95] [Impact Index Per Article: 5.6] [Received: 03/18/2008] [Revised: 04/16/2008] [Accepted: 05/07/2008] [Indexed: 11/29/2022]
Abstract
Advances in protein structure determination, led by the structural genomics initiatives, have increased the proportion of novel folds deposited in the Protein Data Bank. However, these structures are often not accompanied by functional annotations with experimental confirmation. In this review, we reassess the meaning of structural novelty and examine its relevance to the complexity of the structure-function paradigm. Recent advances in the prediction of protein function from structure are discussed, as well as new sequence-based methods for partitioning large, diverse superfamilies into biologically meaningful clusters. Obtaining structural data for these functionally coherent groups of proteins will allow us to better understand the relationship between structure and function.
Affiliation(s)
- Oliver C Redfern
- Department of Structural and Molecular Biology, University College London, London WC1E 6BT, United Kingdom
|
46
|
Capra JA, Singh M. Characterization and prediction of residues determining protein functional specificity. Bioinformatics 2008; 24:1473-80. [PMID: 18450811 PMCID: PMC2718669 DOI: 10.1093/bioinformatics/btn214] [Citation(s) in RCA: 96] [Impact Index Per Article: 5.6] [Indexed: 11/13/2022]
Abstract
MOTIVATION Within a homologous protein family, proteins may be grouped into subtypes that share specific functions that are not common to the entire family. Often, the amino acids present in a small number of sequence positions determine each protein's particular functional specificity. Knowledge of these specificity determining positions (SDPs) aids in protein function prediction, drug design and experimental analysis. A number of sequence-based computational methods have been introduced for identifying SDPs; however, their further development and evaluation have been hindered by the limited number of known experimentally determined SDPs. RESULTS We combine several bioinformatics resources to automate a process, typically undertaken manually, to build a dataset of SDPs. The resulting large dataset, which consists of SDPs in enzymes, enables us to characterize SDPs in terms of their physicochemical and evolutionary properties. It also facilitates the large-scale evaluation of sequence-based SDP prediction methods. We present a simple sequence-based SDP prediction method, GroupSim, and show that, surprisingly, it is competitive with a representative set of current methods. We also describe ConsWin, a heuristic that considers sequence conservation of neighboring amino acids, and demonstrate that it improves the performance of all methods tested on our large dataset of enzyme SDPs. AVAILABILITY Datasets and GroupSim code are available online at http://compbio.cs.princeton.edu/specificity/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- John A Capra
- Department of Computer Science, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
|
47
|
Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 2007; 8:995-1005. [PMID: 18037900 DOI: 10.1038/nrm2281] [Citation(s) in RCA: 371] [Impact Index Per Article: 20.6] [Indexed: 11/09/2022]
|
48
|
How accurate and statistically robust are catalytic site predictions based on closeness centrality? BMC Bioinformatics 2007; 8:153. [PMID: 17498304 PMCID: PMC1876251 DOI: 10.1186/1471-2105-8-153] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.9] [Received: 12/11/2006] [Accepted: 05/11/2007] [Indexed: 11/25/2022] Open
Abstract
Background We examine the accuracy of enzyme catalytic residue predictions from a network representation of protein structure. In this model, amino acid α-carbons specify vertices within a graph and edges connect vertices that are proximal in structure. Closeness centrality, which has shown promise in previous investigations, is used to identify important positions within the network. Closeness centrality, a global measure of network centrality, is calculated as the reciprocal of the average distance between vertex i and all other vertices. Results We benchmark the approach against 283 structurally unique proteins within the Catalytic Site Atlas. Our results, which are in line with previous investigations of smaller datasets, indicate that closeness centrality predictions are statistically significant. However, unlike previous approaches, we specifically focus on residues with the very best scores. Over the top five closeness centrality scores, we observe an average true to false positive rate ratio of 6.8 to 1. As demonstrated previously, adding a solvent accessibility filter significantly improves predictive power; the average ratio is increased to 15.3 to 1. We also demonstrate (for the first time) that filtering the predictions by residue identity improves the results even more than accessibility filtering. Here, we simply eliminate from consideration residues with physicochemical properties unlikely to be compatible with catalytic requirements. Residue identity filtering improves the average true to false positive rate ratio to 26.3 to 1. Combining the two filters has little effect on the results. Calculated p-values for the three prediction schemes range from 2.7E-9 to less than 8.8E-134. Finally, the sensitivity of the predictions to structure choice and slight perturbations is examined. Conclusion Our results resolutely confirm that closeness centrality is a viable prediction scheme whose predictions are statistically significant. Simple filtering schemes substantially improve the method's predictive power. Moreover, no clear effect on performance is observed when comparing ligated and unligated structures. Similarly, the CC prediction results are robust to slight structural perturbations from molecular dynamics simulation.
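The closeness centrality definition given in this abstract (the reciprocal of the average distance between vertex i and all other vertices) can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the 8.0 Å contact cutoff, the use of unweighted shortest paths, the function names, and the choice to return 0 for disconnected graphs are all assumptions made here.

```python
from collections import deque
import math


def contact_graph(ca_coords, cutoff=8.0):
    """Build a residue graph: vertices are alpha-carbons, edges join
    pairs closer than `cutoff` angstroms (the cutoff value is assumed)."""
    n = len(ca_coords)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def closeness_centrality(adj, i):
    """Reciprocal of the average shortest-path distance from vertex i
    to every other vertex, using breadth-first search on unweighted edges."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = len(adj) - 1
    if others == 0 or len(dist) - 1 < others:
        return 0.0  # trivial or disconnected graph: treated as 0 in this sketch
    return others / sum(dist.values())


# Toy example: five alpha-carbons on a line, 5 angstroms apart,
# so only sequence neighbours are in contact.
coords = [(5.0 * k, 0.0, 0.0) for k in range(5)]
graph = contact_graph(coords)
scores = {i: closeness_centrality(graph, i) for i in graph}
best = max(scores, key=scores.get)
```

In this toy chain the middle residue has the highest score, which matches the intuition that centrally located positions are closest, on average, to the rest of the structure.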
|
49
|
Wallace IM, Higgins DG. Supervised multivariate analysis of sequence groups to identify specificity determining residues. BMC Bioinformatics 2007; 8:135. [PMID: 17451607 PMCID: PMC1878507 DOI: 10.1186/1471-2105-8-135] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Received: 12/14/2006] [Accepted: 04/23/2007] [Indexed: 11/29/2022] Open
Abstract
Background Proteins that evolve from a common ancestor can change functionality over time, and it is important to be able to identify the residues that cause this change. In this paper we show how a supervised multivariate statistical method, Between Group Analysis (BGA), can be used to identify these residues from families of proteins with different substrate specificities using multiple sequence alignments. Results We demonstrate the usefulness of this method on three different test cases. Two of these test cases, the Lactate/Malate dehydrogenase family and Nucleotidyl Cyclases, consist of two functional groups. The other family, Serine Proteases, consists of three groups. BGA was used to analyse and visualise these three families using two different encoding schemes for the amino acids. Conclusion The overall combination of methods in this paper is powerful and flexible while being computationally very fast and simple. BGA is especially useful because it can be used to analyse any number of functional classes. In the examples in this paper, we have used only 2 or 3 classes for demonstration purposes, but any number can be used and visualised.
Affiliation(s)
- Iain M Wallace
- The Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin 4, Ireland
- Desmond G Higgins
- The Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin 4, Ireland
|