1
Zhang J, He W, Liang L, Sun B, Zhang Y. Study on the saltiness-enhancing mechanism of chicken-derived umami peptides by sensory evaluation and molecular docking to transmembrane channel-like protein 4 (TMC4). Food Res Int 2024; 182:114139. [PMID: 38519171 DOI: 10.1016/j.foodres.2024.114139] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Received: 12/27/2023] [Revised: 02/08/2024] [Accepted: 02/17/2024] [Indexed: 03/24/2024]
Abstract
Chicken-derived umami peptides previously obtained in the laboratory were evaluated for their saltiness-enhancing effect by sensory evaluation and S-curve analysis. The results showed that the peptides TPPKID, PKESEKPN, TEDWGR, LPLQDAH, NEFGYSNR, and LPLQD had significant saltiness-enhancing effects. In a binary solution system with salt, the ratio of the experimental detection threshold (129.17 mg/L) to the theoretical detection threshold (274.43 mg/L) of NEFGYSNR was 0.47, indicating a synergistic saltiness-enhancing effect with salt. A model of the transmembrane channel-like protein 4 (TMC4) channel was constructed by homology modeling; it had a 10-fold transmembrane structure and was well evaluated. Molecular docking and frontier molecular orbitals showed that the main active sites of TMC4 were Lys 471, Met 379, Cys 475, Gln 377, and Pro 380, and the main active sites of NEFGYSNR were Tyr, Ser and Asn. This study may provide a theoretical reference for low-sodium diets.
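The threshold ratio reported for NEFGYSNR can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the cutoffs used to label a ratio (below 0.5 as synergistic, 0.5 to 1 as additive, above 1 as masking) are an assumed convention that the abstract does not state.

```python
def threshold_ratio(experimental_mg_per_l, theoretical_mg_per_l):
    """Ratio of the experimental to the theoretical detection threshold."""
    return experimental_mg_per_l / theoretical_mg_per_l


def classify_interaction(ratio, synergy_cutoff=0.5):
    """Label a threshold ratio.

    The cutoffs are an assumption made for illustration; they are not
    taken from the abstract.
    """
    if ratio < synergy_cutoff:
        return "synergistic"
    if ratio <= 1.0:
        return "additive"
    return "masking"


# Values for NEFGYSNR as given in the abstract.
ratio = threshold_ratio(129.17, 274.43)
print(round(ratio, 2))              # 0.47
print(classify_interaction(ratio))  # synergistic
```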
Affiliation(s)
- Jingcheng Zhang
- China Key Laboratory of Geriatric Nutrition and Health (Beijing Technology and Business University), Ministry of Education, 100048, China; Key Laboratory of Flavor Science of China General Chamber of Commerce, Beijing Technology and Business University, 100048, China; Food Laboratory of Zhongyuan, Beijing Technology and Business University, 100048, China
- Wei He
- China Key Laboratory of Geriatric Nutrition and Health (Beijing Technology and Business University), Ministry of Education, 100048, China; Key Laboratory of Flavor Science of China General Chamber of Commerce, Beijing Technology and Business University, 100048, China; Food Laboratory of Zhongyuan, Beijing Technology and Business University, 100048, China
- Li Liang
- China Key Laboratory of Geriatric Nutrition and Health (Beijing Technology and Business University), Ministry of Education, 100048, China; Key Laboratory of Flavor Science of China General Chamber of Commerce, Beijing Technology and Business University, 100048, China; Food Laboratory of Zhongyuan, Beijing Technology and Business University, 100048, China
- Baoguo Sun
- China Key Laboratory of Geriatric Nutrition and Health (Beijing Technology and Business University), Ministry of Education, 100048, China; Key Laboratory of Flavor Science of China General Chamber of Commerce, Beijing Technology and Business University, 100048, China; Food Laboratory of Zhongyuan, Beijing Technology and Business University, 100048, China
- Yuyu Zhang
- China Key Laboratory of Geriatric Nutrition and Health (Beijing Technology and Business University), Ministry of Education, 100048, China; Key Laboratory of Flavor Science of China General Chamber of Commerce, Beijing Technology and Business University, 100048, China; Food Laboratory of Zhongyuan, Beijing Technology and Business University, 100048, China.
2
Hao C, Elias JE, Lee PKH, Lam H. metaSpectraST: an unsupervised and database-independent analysis workflow for metaproteomic MS/MS data using spectrum clustering. Microbiome 2023; 11:176. [PMID: 37550758 PMCID: PMC10405559 DOI: 10.1186/s40168-023-01602-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2022] [Accepted: 06/18/2023] [Indexed: 08/09/2023]
Abstract
BACKGROUND The high diversity and complexity of the microbial community make it a formidable challenge to identify and quantify the large number of proteins expressed in the community. Conventional metaproteomics approaches largely rely on accurate identification of the MS/MS spectra to their corresponding short peptides in the digested samples, followed by protein inference and subsequent taxonomic and functional analysis of the detected proteins. These approaches are dependent on the availability of protein sequence databases derived either from sample-specific metagenomic data or from public repositories. Due to the incompleteness and imperfections of these protein sequence databases, and the preponderance of homologous proteins expressed by different bacterial species in the community, this computational process of peptide identification and protein inference is challenging and error-prone, which hinders the comparison of metaproteomes across multiple samples. RESULTS We developed metaSpectraST, an unsupervised and database-independent metaproteomics workflow, which quantitatively profiles and compares metaproteomics samples by clustering experimentally observed MS/MS spectra based on their spectral similarity. We applied metaSpectraST to fecal samples collected from littermates of two different mother mice right after weaning. Quantitative proteome profiles of the microbial communities of different mice were obtained without any peptide-spectrum identification and used to evaluate the overall similarity between samples and highlight any differentiating markers. Compared to the conventional database-dependent metaproteomics analysis, metaSpectraST is more successful in classifying the samples and detecting the subtle microbiome changes of mouse gut microbiomes post-weaning. metaSpectraST could also be used as a tool to select the suitable biological replicates from samples with wide inter-individual variation. 
CONCLUSIONS metaSpectraST enables rapid profiling of metaproteomic samples quantitatively, without the need for constructing the protein sequence database or identification of the MS/MS spectra. It maximally preserves information contained in the experimental MS/MS spectra by clustering all of them first and thus is able to better profile the complex microbial communities and highlight their functional changes, as compared with conventional approaches.
Affiliation(s)
- Chunlin Hao
- Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China
- School of Energy and Environment, City University of Hong Kong, Hong Kong SAR, China
- Patrick K. H. Lee
- School of Energy and Environment, City University of Hong Kong, Hong Kong SAR, China
- State Key Laboratory of Marine Pollution, City University of Hong Kong, Hong Kong SAR, China
- Henry Lam
- Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China
3
Grahame DSA, Dupuis JH, Bryksa BC, Tanaka T, Yada RY. Comparative bioinformatic and structural analyses of pepsin and renin. Enzyme Microb Technol 2020; 141:109632. [PMID: 33051007 DOI: 10.1016/j.enzmictec.2020.109632] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/28/2020] [Revised: 06/25/2020] [Accepted: 07/08/2020] [Indexed: 11/16/2022]
Abstract
Pepsin, the archetypal pepsin-like aspartic protease, is irreversibly denatured when exposed to neutral pH conditions whereas renin, a structural homologue of pepsin, is fully stable and optimally active in the same conditions despite sharing highly similar enzyme architecture. To gain insight into the structural determinants of differential aspartic protease pH stability, the present study used comparative bioinformatic and structural analyses. In pepsin, an abundance of polar and aspartic acid residues were identified, a common trait with other acid-stable enzymes. Conversely, renin was shown to have increased levels of basic amino acids. In both pepsin and renin, the solvent exposure of these charged groups was high. Having similar overall acidic residue content, the solvent-exposed basic residues may allow for extensive salt bridge formation in renin, whereas in pepsin, these residues are protonated and serve to form stabilizing hydrogen bonds at low pH. Relative differences in structure and sequence in the turn and joint regions of the β-barrel and ψ-loop in both the N- and C-terminal lobes were identified as regions of interest in defining divergent pH stability. Compared to the structural rigidity of renin, pepsin has more instability associated with the N-terminus, specifically the B/C connector. By contrast, renin exhibits greater C-terminal instability in turn and connector regions. Overall, flexibility differences in connector regions, and amino acid composition, particularly in turn and joint regions of the β-barrel and ψ-loops, likely play defining roles in determining pH stability for renin and pepsin.
Affiliation(s)
- Douglas S A Grahame
- Department of Food Science, Ontario Agricultural College, University of Guelph, Guelph, ON, N1G 2W1, Canada
- John H Dupuis
- Food, Nutrition, and Health Program, Faculty of Land and Food Systems, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Brian C Bryksa
- Department of Food Science, Ontario Agricultural College, University of Guelph, Guelph, ON, N1G 2W1, Canada
- Takuji Tanaka
- Department of Food and Bioproduct Sciences, College of Agriculture and Bioresources, University of Saskatchewan, Saskatoon, SK, S7N 5A8, Canada
- Rickey Y Yada
- Department of Food Science, Ontario Agricultural College, University of Guelph, Guelph, ON, N1G 2W1, Canada; Food, Nutrition, and Health Program, Faculty of Land and Food Systems, University of British Columbia, Vancouver, BC, V6T 1Z4 Canada.
4
Burroughs AM, Glasner ME, Barry KP, Taylor EA, Aravind L. Oxidative opening of the aromatic ring: Tracing the natural history of a large superfamily of dioxygenase domains and their relatives. J Biol Chem 2019; 294:10211-10235. [PMID: 31092555 PMCID: PMC6664185 DOI: 10.1074/jbc.ra119.007595] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 01/17/2019] [Revised: 05/09/2019] [Indexed: 12/20/2022] Open
Abstract
A diverse collection of enzymes comprising the protocatechuate dioxygenases (PCADs) has been characterized in several extradiol aromatic compound degradation pathways. Structural studies have shown a relationship between PCADs and the more broadly-distributed, functionally enigmatic Memo domain linked to several human diseases. To better understand the evolution of this PCAD-Memo protein superfamily, we explored their structural and functional determinants to establish a unified evolutionary framework, identifying 15 clearly-delineable families, including a previously-underappreciated diversity in five Memo clade families. We place the superfamily's origin within the greater radiation of the nucleoside phosphorylase/hydrolase-peptide/amidohydrolase fold prior to the last universal common ancestor of all extant organisms. In addition to identifying active-site residues across the superfamily, we describe three distinct, structurally-variable regions emanating from the core scaffold often housing conserved residues specific to individual families. These were predicted to contribute to the active-site pocket, potentially in substrate specificity and allosteric regulation. We also identified several previously-undescribed conserved genome contexts, providing insight into potentially novel substrates in PCAD clade families. We extend known conserved contextual associations for the Memo clade beyond previously-described associations with the AMMECR1 domain and a radical S-adenosylmethionine family domain. These observations point to two distinct yet potentially overlapping contexts wherein the elusive molecular function of the Memo domain could be finally resolved, thereby linking it to nucleotide base and aliphatic isoprenoid modification. In total, this report throws light on the functions of large swaths of the experimentally-uncharacterized PCAD-Memo families.
Affiliation(s)
- A Maxwell Burroughs
- Computational Biology Branch, NCBI, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
- Margaret E Glasner
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, Texas 77843
- Kevin P Barry
- Department of Chemistry, Wesleyan University, Middletown, Connecticut 06459
- Erika A Taylor
- Department of Chemistry, Wesleyan University, Middletown, Connecticut 06459
- L Aravind
- Computational Biology Branch, NCBI, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
5
Watson AK, Lannes R, Pathmanathan JS, Méheust R, Karkar S, Colson P, Corel E, Lopez P, Bapteste E. The Methodology Behind Network Thinking: Graphs to Analyze Microbial Complexity and Evolution. Methods Mol Biol 2019; 1910:271-308. [PMID: 31278668 DOI: 10.1007/978-1-4939-9074-0_9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 06/09/2023]
Abstract
In the post genomic era, large and complex molecular datasets from genome and metagenome sequencing projects expand the limits of what is possible for bioinformatic analyses. Network-based methods are increasingly used to complement phylogenetic analysis in studies in molecular evolution, including comparative genomics, classification, and ecological studies. Using network methods, the vertical and horizontal relationships between all genes or genomes, whether they are from cellular chromosomes or mobile genetic elements, can be explored in a single expandable graph. In recent years, development of new methods for the construction and analysis of networks has helped to broaden the availability of these approaches from programmers to a diversity of users. This chapter introduces the different kinds of networks based on sequence similarity that are already available to tackle a wide range of biological questions, including sequence similarity networks, gene-sharing networks and bipartite graphs, and a guide for their construction and analyses.
Affiliation(s)
- Andrew K Watson
- Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Romain Lannes
- Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Jananan S Pathmanathan
- Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Raphaël Méheust
- Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Slim Karkar
- Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Department of Ecology, Evolution, and Natural Resources, School of Environmental and Biological Sciences, Rutgers, The State University of NJ, New Brunswick, NJ, USA
- Philippe Colson
- Fondation Institut Hospitalo-Universitaire Méditerranée Infection, Pôle des Maladies Infectieuses et Tropicales Clinique et Biologique, Fédération de Bactériologie-Hygiène-Virologie, Centre Hospitalo-Universitaire Timone, Assistance Publique-Hôpitaux de Marseille, Marseille, France
- Unité de Recherche sur les Maladies Infectieuses et Tropicales Emergentes (URMITE) UM63, CNRS 7278, IRD 198, INSERM U1095, Aix-Marseille University, Marseille, France
- Eduardo Corel
- Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Philippe Lopez
- Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France
- Eric Bapteste
- Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université Paris 6, Paris, France.
6
Krishnaraju RK, Hart TC, Schleyer TK. Comparative Genomics and Structure Prediction of Dental Matrix Proteins. Adv Dent Res 2016; 17:100-3. [PMID: 15126218 DOI: 10.1177/154407370301700123] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
Non-collagenous matrix proteins secreted by the ameloblasts (amelogenin) and odontoblasts (osteocalcin) play important roles in the mineralization of enamel and dentin. In this study, comparative genomics approaches were used to identify the functional domains and model the three-dimensional structure of amelogenin and osteocalcin, respectively. Multiple sequence analysis of amelogenin in different species showed a high degree of sequence conservation at the nucleotide and protein levels. At the protein level, motifs (a sequence pattern that occurs repeatedly in a group of related proteins or genes), conserved domains, secondary structural characteristics, and functional sites of amelogenin from lower phyla were similar to those of the higher-level mammals, reflecting the high degree of sequence conservation during vertebrate evolution. Osteocalcin, produced by both odontoblasts and osteoblasts, also showed sequence similarity between species. Three-dimensional structure predictions developed by modeling of conserved domains of osteocalcin supported a role for glutamic acid residues in the calcium mineralization process.
Affiliation(s)
- R K Krishnaraju
- Center for Biomedical Informatics, University of Pittsburgh, PA 15261, USA.
7
Szilágyi SM, Szilágyi L. A fast hierarchical clustering algorithm for large-scale protein sequence data sets. Comput Biol Med 2014; 48:94-101. [PMID: 24657908 DOI: 10.1016/j.compbiomed.2014.02.016] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 11/13/2013] [Revised: 02/10/2014] [Accepted: 02/25/2014] [Indexed: 10/25/2022]
Abstract
TRIBE-MCL is a Markov clustering algorithm that operates on a graph built from pairwise similarity information of the input data. Edge weights stored in the stochastic similarity matrix are alternately fed to the two main operations, inflation and expansion, and are normalized in each main loop to maintain the probabilistic constraint. In this paper we propose an efficient implementation of the TRIBE-MCL clustering algorithm, suitable for fast and accurate grouping of protein sequences. A modified sparse matrix structure is introduced that can efficiently handle most operations of the main loop. Taking advantage of the symmetry of the similarity matrix, a fast matrix squaring formula is also introduced to facilitate the time-consuming expansion. The proposed algorithm was tested on protein sequence databases like SCOP95. In terms of efficiency, the proposed solution improves execution speed by two orders of magnitude, compared to recently published efficient solutions, reducing the total runtime well below 1 min in the case of the 11,944 proteins of SCOP95. This improvement in computation time is reached without losing anything from the partition quality. Convergence is generally reached in approximately 50 iterations. The efficient execution enabled us to perform a thorough evaluation of classification results and to formulate recommendations regarding the choice of the algorithm's parameter values.
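The expansion, inflation and normalization loop described in this abstract can be illustrated with a small dense-matrix sketch of generic Markov clustering. This is not the authors' sparse implementation or their fast squaring formula. The function names, the added self-loops, the default inflation value of 2.0 and the way clusters are read from the rows of the converged matrix are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Rescale every column to sum to 1 (the probabilistic constraint)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the stochastic matrix."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise each entry to the power r, then renormalize."""
    return normalize_columns([[x ** r for x in row] for row in m])


def markov_cluster(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster items from a symmetric, non-negative similarity matrix."""
    n = len(similarity)
    # Self-loops are a common MCL convention (an assumption here).
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        diff = max(abs(nxt[i][j] - m[i][j])
                   for i in range(n) for j in range(n))
        m = nxt
        if diff < tol:
            break
    # Each row that still carries weight lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two disconnected triangles of mutually similar sequences (toy data).
sim = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
print(markov_cluster(sim))  # [[0, 1, 2], [3, 4, 5]]
```

The dense triple loop in `expand` is the cubic-cost step that the paper targets with its sparse structure and symmetry-based squaring, so this sketch is only practical for very small inputs.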
Affiliation(s)
- Sándor M Szilágyi
- Petru Maior University, Department of Informatics, Str. Nicolae Iorga Nr. 1, 540088 Tîrgu Mureş, Romania.
- László Szilágyi
- Budapest University of Technology and Economics, Department of Control Engineering and Information Technology, Magyar tudósok krt. 2, H-1117 Budapest, Hungary; Sapientia University of Transylvania, Faculty of Technical and Human Sciences, Şoseaua Sighişoarei 1/C, 540485 Tîrgu Mureş, Romania.
8
Seo JH, Park J, Kim EM, Kim J, Joo K, Lee J, Kim BG. Subgrouping Automata: automatic sequence subgrouping using phylogenetic tree-based optimum subgrouping algorithm. Comput Biol Chem 2014; 48:64-70. [PMID: 24378653 DOI: 10.1016/j.compbiolchem.2013.11.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/10/2013] [Revised: 10/12/2013] [Accepted: 11/23/2013] [Indexed: 11/28/2022]
Abstract
Sequence subgrouping for a given sequence set enables informative tasks such as the functional discrimination of sequence subsets and the functional inference of unknown sequences. Because the identity threshold for sequence subgrouping may vary with the sequence set, a robust subgrouping algorithm that automatically identifies an optimal identity threshold and generates subgroups for a given sequence set is highly desirable. To this end, an automatic sequence subgrouping method named 'Subgrouping Automata' (SA) was constructed. First, a tree analysis module analyzes the structure of the tree and calculates all possible subgroups at each node. A sequence similarity analysis module then calculates the average sequence similarity for all subgroups at each node. A representative sequence generation module finds a representative sequence for each subgroup using profile analysis and self-scoring. Average sequence similarities are calculated for all nodes, and 'Subgrouping Automata' searches for the node showing the statistically maximum increase in sequence similarity using Student's t-value. The node with the maximum t-value, which gives the most significant difference in average sequence similarity between two adjacent nodes, is taken as the optimum subgrouping node in the phylogenetic tree. Further analysis showed that the optimum subgrouping node from SA prevents under-subgrouping and over-subgrouping.
Affiliation(s)
- Joo-Hyun Seo
- School of Chemical and Biological Engineering, Seoul National University, Seoul 151-742, Republic of Korea; School of Computational Sciences, Korea Institute of Advanced Study, Seoul 130-722, Republic of Korea
- Jihyang Park
- School of Chemical and Biological Engineering, Seoul National University, Seoul 151-742, Republic of Korea
- Eun-Mi Kim
- School of Chemical and Biological Engineering, Seoul National University, Seoul 151-742, Republic of Korea
- Juhan Kim
- School of Chemical and Biological Engineering, Seoul National University, Seoul 151-742, Republic of Korea
- Keehyoung Joo
- Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 130-722, Republic of Korea
- Jooyoung Lee
- School of Computational Sciences, Korea Institute of Advanced Study, Seoul 130-722, Republic of Korea
- Byung-Gee Kim
- School of Chemical and Biological Engineering, Seoul National University, Seoul 151-742, Republic of Korea.
9
Devi PP, Adhikari S. Homology modeling and functional sites prediction of azoreductase enzyme from the cyanobacterium Nostoc sp. PCC7120. Interdiscip Sci 2013; 4:310-8. [DOI: 10.1007/s12539-012-0140-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 03/11/2012] [Revised: 05/02/2012] [Accepted: 07/30/2012] [Indexed: 10/27/2022]
10
Szilágyi L, Szilágyi SM. Efficient Markov clustering algorithm for protein sequence grouping. Annu Int Conf IEEE Eng Med Biol Soc 2013; 2013:639-642. [PMID: 24109768 DOI: 10.1109/embc.2013.6609581] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/02/2023]
Abstract
In this paper we propose an efficient reformulation of a Markov clustering algorithm, suitable for fast and accurate grouping of protein sequences, based on pairwise similarity information. The proposed modification consists of optimal reordering of rows and columns in the similarity matrix after every iteration, transforming it into a matrix with several compact blocks along the diagonal, and zero similarities outside the blocks. These blocks are treated separately in later iterations, thus reducing the computational burden of the algorithm. The proposed algorithm was tested on protein sequence databases like SCOP95. In terms of efficiency, the proposed solution achieves a speed-up factor in the range 15-50 compared to the conventional Markov clustering, depending on input data size and parameter settings. This improvement in computation time is reached without losing anything from the partition accuracy. The convergence is usually reached in 40-50 iterations. Combining the proposed method with sparse matrix representation and parallel execution will certainly lead to a significantly more efficient solution in future.
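The idea of reordering the similarity matrix into compact diagonal blocks that can be processed separately can be sketched as follows. The abstract does not describe the authors' reordering procedure, so this example substitutes a plain connected-components search over the nonzero entries. The function names and the `eps` threshold are hypothetical.

```python
def find_blocks(matrix, eps=0.0):
    """Group indices that are linked by similarities greater than eps.

    Each group is one block of the block-diagonal form. A connected
    components search stands in for the paper's reordering step.
    """
    n = len(matrix)
    seen = [False] * n
    blocks = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, block = [start], []
        while stack:
            i = stack.pop()
            block.append(i)
            for j in range(n):
                if not seen[j] and (matrix[i][j] > eps or matrix[j][i] > eps):
                    seen[j] = True
                    stack.append(j)
        blocks.append(sorted(block))
    return blocks


def reorder(matrix, blocks):
    """Permute rows and columns so each block is contiguous on the diagonal."""
    order = [i for block in blocks for i in block]
    return [[matrix[i][j] for j in order] for i in order]


# Toy similarity matrix in which items 0 and 2, and items 1 and 3, are related.
m = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 0.2],
     [0.5, 0.0, 1.0, 0.0],
     [0.0, 0.2, 0.0, 1.0]]
blocks = find_blocks(m)
print(blocks)  # [[0, 2], [1, 3]]
for row in reorder(m, blocks):
    print(row)
```

After reordering, every entry outside the two 2x2 diagonal blocks is zero, so each block could be iterated on independently, which is the source of the saving the abstract describes.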
11
Abstract
Applications of clustering algorithms in biomedical research are ubiquitous, with typical examples including gene expression data analysis, genomic sequence analysis, biomedical document mining, and MRI image analysis. However, due to the diversity of cluster analysis, the differing terminologies, goals, and assumptions underlying different clustering algorithms can be daunting. Thus, determining the right match between clustering algorithms and biomedical applications has become particularly important. This paper is presented to provide biomedical researchers with an overview of the status quo of clustering algorithms, to illustrate examples of biomedical applications based on cluster analysis, and to help biomedical researchers select the most suitable clustering algorithms for their own applications.
Affiliation(s)
- Rui Xu
- Industrial Artificial Intelligence Laboratory, GE Global Research Center, Niskayuna, NY 12309, USA.
12
Szilágyi L, Medvés L, Szilágyi SM. A modified Markov clustering approach to unsupervised classification of protein sequences. Neurocomputing 2010. [DOI: 10.1016/j.neucom.2010.02.023] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/27/2022]
13
Povolotskaya IS, Kondrashov FA. Sequence space and the ongoing expansion of the protein universe. Nature 2010; 465:922-6. [PMID: 20485343 DOI: 10.1038/nature09105] [Citation(s) in RCA: 149] [Impact Index Per Article: 9.9] [Received: 11/23/2009] [Accepted: 04/19/2010] [Indexed: 11/09/2022]
Abstract
The need to maintain the structural and functional integrity of an evolving protein severely restricts the repertoire of acceptable amino-acid substitutions. However, it is not known whether these restrictions impose a global limit on how far homologous protein sequences can diverge from each other. Here we explore the limits of protein evolution using sequence divergence data. We formulate a computational approach to study the rate of divergence of distant protein sequences and measure this rate for ancient proteins, those that were present in the last universal common ancestor. We show that ancient proteins are still diverging from each other, indicating an ongoing expansion of the protein sequence universe. The slow rate of this divergence is imposed by the sparseness of functional protein sequences in sequence space and the ruggedness of the protein fitness landscape: approximately 98 per cent of sites cannot accept an amino-acid substitution at any given moment but a vast majority of all sites may eventually be permitted to evolve when other, compensatory, changes occur. Thus, approximately 3.5 × 10^9 yr has not been enough to reach the limit of divergent evolution of proteins, and for most proteins the limit of sequence similarity imposed by common function may not exceed that of random sequences.
Affiliation(s)
- Inna S Povolotskaya
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation, Calle Dr Aiguader 88, Barcelona Biomedical Research Park Building, 08003 Barcelona, Spain
14
Abstract
Motivation: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Accurate classification of single domain families is now within reach due to major algorithmic advances in remote homology detection and graph clustering. However, classification of multidomain families remains a significant challenge. The presence of the same domain in sequences that do not share common ancestry introduces false edges in the homology network that link unrelated families and stymy clustering algorithms. Results: Here, we investigate a network-rewiring strategy designed to eliminate edges due to promiscuous domains. We show that this strategy can reduce noise in and restore structure to artificial networks with simulated noise, as well as to the yeast genome homology network. We further evaluate this approach on a hand-curated set of multidomain sequences in mouse and human, and demonstrate that classification using the rewired network delivers dramatic improvement in Precision and Recall, compared with current methods. Families in our test set exhibit a broad range of domain architectures and sequence conservation, demonstrating that our method is flexible, robust and suitable for high-throughput, automated processing of heterogeneous, genome-scale data. Contact: jacobmj@cmu.edu
Affiliation(s)
- Jacob M Joseph
- Department of Biological and Computer Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
15
Han Y, Burnette JM, Wessler SR. TARGeT: a web-based pipeline for retrieving and characterizing gene and transposable element families from genomic sequences. Nucleic Acids Res 2009; 37:e78. [PMID: 19429695 PMCID: PMC2699529 DOI: 10.1093/nar/gkp295] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Received: 03/07/2009] [Revised: 04/15/2009] [Accepted: 04/15/2009] [Indexed: 11/23/2022] Open
Abstract
Gene families compose a large proportion of eukaryotic genomes. The rapidly expanding genomic sequence database provides a good opportunity to study gene family evolution and function. However, most gene family identification programs are restricted to searching protein databases where data are often lagging behind the genomic sequence data. Here, we report a user-friendly web-based pipeline, named TARGeT (Tree Analysis of Related Genes and Transposons), which uses either a DNA or amino acid 'seed' query to: (i) automatically identify and retrieve gene family homologs from a genomic database, (ii) characterize gene structure and (iii) perform phylogenetic analysis. Due to its high speed, TARGeT is also able to characterize very large gene families, including transposable elements (TEs). We evaluated TARGeT using well-annotated datasets, including the ascorbate peroxidase gene family of rice, maize and sorghum and several TE families in rice. In all cases, TARGeT rapidly recapitulated the known homologs and predicted new ones. We also demonstrated that TARGeT outperforms similar pipelines and has functionality that is not offered elsewhere.
Affiliation(s)
- Susan R. Wessler
- Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
16
Nakamura T, Kotani M, Tonozuka T, Ide A, Oguma K, Nishikawa A. Crystal Structure of the HA3 Subcomponent of Clostridium botulinum Type C Progenitor Toxin. J Mol Biol 2009; 385:1193-206. [DOI: 10.1016/j.jmb.2008.11.039] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 08/31/2008] [Revised: 11/15/2008] [Accepted: 11/19/2008] [Indexed: 11/30/2022]
17
Linial M. Fishing with (Proto)Net-a principled approach to protein target selection. Comp Funct Genomics 2008; 4:542-8. [PMID: 18629007 PMCID: PMC2447289 DOI: 10.1002/cfg.328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2003] [Revised: 08/05/2003] [Accepted: 08/05/2003] [Indexed: 12/02/2022]
Abstract
Structural genomics strives to represent the entire protein space. The first step towards achieving this goal is by rationally selecting proteins whose structures have not been determined, but that represent an as yet unknown structural superfamily or fold. Once such a structure is solved, it can be used as a template for modelling homologous proteins. This will aid in unveiling the structural diversity of the protein space. Currently, no reliable method for accurate 3D structural prediction is available when a sequence or a structure homologue is not available. Here we present a systematic methodology for selecting target proteins whose structure is likely to adopt a new, as yet unknown superfamily or fold. Our method takes advantage of a global classification of the sequence space as presented by ProtoNet-3D, which is a hierarchical agglomerative clustering of the proteins of interest (the proteins in Swiss-Prot) along with all solved structures (taken from the PDB). By navigating in the scaffold of ProtoNet-3D, we yield a prioritized list of proteins that are not yet structurally solved, along with the probability of each of the proteins belonging to a new superfamily or fold. The sorted list has been self-validated against real structural data that was not available when the predictions were made. The practical application of using our computational–statistical method to determine novel superfamilies for structural genomics projects is also discussed.
Affiliation(s)
- Michal Linial
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Jerusalem 91904, Israel.
18
Heger A, Korpelainen E, Hupponen T, Mattila K, Ollikainen V, Holm L. PairsDB atlas of protein sequence space. Nucleic Acids Res 2008; 36:D276-80. [PMID: 17986464 PMCID: PMC2238971 DOI: 10.1093/nar/gkm879] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 08/17/2007] [Revised: 09/28/2007] [Accepted: 10/01/2007] [Indexed: 11/12/2022]
Abstract
Sequence similarity/database searching is a cornerstone of molecular biology. PairsDB is a database intended to make exploring protein sequences and their similarity relationships quick and easy. Behind PairsDB is a comprehensive collection of protein sequences and BLAST and PSI-BLAST alignments between them. Instead of running BLAST or PSI-BLAST individually on each request, results are retrieved instantaneously from a database of pre-computed alignments. Filtering options allow you to find a set of sequences satisfying a set of criteria-for example, all human proteins with solved structure and without transmembrane segments. PairsDB is continually updated and covers all sequences in Uniprot. The data is stored in a MySQL relational database. Data files will be made available for download at ftp://nic.funet.fi/pub/sci/molbio. PairsDB can also be accessed interactively at http://pairsdb.csc.fi. PairsDB data is a valuable platform to build various downstream automated analysis pipelines. For example, the graph of all-against-all similarity relationships is the starting point for clustering protein families, delineating domains, improving alignment accuracy by consistency measures, and defining orthologous genes. Moreover, query-anchored stacked sequence alignments, profiles and consensus sequences are useful in studies of sequence conservation patterns for clues about possible functional sites.
Affiliation(s)
- Andreas Heger
- MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
- Eija Korpelainen
- MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
- Taavi Hupponen
- MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
- Kimmo Mattila
- MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
- Vesa Ollikainen
- MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
- Liisa Holm
- MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
19
Heger A, Mallick S, Wilton C, Holm L. The global trace graph, a novel paradigm for searching protein sequence databases. Bioinformatics 2007; 23:2361-7. [PMID: 17823134 DOI: 10.1093/bioinformatics/btm358] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Propagating functional annotations to sequence-similar, presumably homologous proteins lies at the heart of the bioinformatics industry. Correct propagation is crucially dependent on the accurate identification of subtle sequence motifs that are conserved in evolution. The evolutionary signal can be difficult to detect because functional sites may consist of non-contiguous residues while segments in-between may be mutated without affecting fold or function. RESULTS: Here, we report a novel graph clustering algorithm in which all known protein sequences simultaneously self-organize into hypothetical multiple sequence alignments. This eliminates noise so that non-contiguous sequence motifs can be tracked down between extremely distant homologues. The novel data structure enables fast sequence database searching methods which are superior to profile-profile comparison at recognizing distant homologues. This study will boost the leverage of structural and functional genomics and opens up new avenues for data mining a complete set of functional signature motifs. AVAILABILITY: http://www.bioinfo.biocenter.helsinki.fi/gtg. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Andreas Heger
- Institute of Biotechnology, P.O. Box 56 (Viikinkaari 5), FI-00014 University of Helsinki, Finland
20
Langille MGI, Clark DV. Parent genes of retrotransposition-generated gene duplicates in Drosophila melanogaster have distinct expression profiles. Genomics 2007; 90:334-43. [PMID: 17628393 DOI: 10.1016/j.ygeno.2007.06.001] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 11/20/2006] [Revised: 05/26/2007] [Accepted: 06/05/2007] [Indexed: 01/12/2023]
Abstract
Genes arising by retrotransposition are always different from their parent genes from the outset. In addition, the cDNA must insert into a region that allows expression or it will become a processed pseudogene. We sought to determine whether this class of gene duplication differs from other gene duplications based on functional criteria. Using amino acid sequences from Drosophila melanogaster, we identified retroduplicated gene pairs at various levels of sequence identity. Analysis of gene ontology annotations showed some enrichment of retroduplications in the cellular physiological processes class. Retroduplications show a higher level of nucleotide substitution than other gene duplications, suggesting a higher rate of divergence. Remarkably, analysis of microarray data for gene expression during embryogenesis showed that parent genes are more highly expressed relative to their retroduplicated copies, tandem duplications, and all genes. Furthermore, an expressed sequence tag library representation shows a broader distribution for parent genes than for all other genes and, as found previously by others, retroduplicated gene transcripts are found most abundantly in testes. Therefore, in examining retroduplicated gene pairs, we have found that parent genes of retroduplications are also a distinctive class in terms of transcript expression levels and distribution.
Affiliation(s)
- Morgan G I Langille
- Department of Biology, University of New Brunswick, Fredericton, Canada NB E3B 6E1
21
Frenkel ZM, Trifonov EN. Walking through protein sequence space. J Theor Biol 2007; 244:77-80. [DOI: 10.1016/j.jtbi.2006.07.027] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 02/14/2006] [Revised: 07/03/2006] [Accepted: 07/26/2006] [Indexed: 11/30/2022]
22
Oberai A, Ihm Y, Kim S, Bowie JU. A limited universe of membrane protein families and folds. Protein Sci 2006; 15:1723-34. [PMID: 16815920 PMCID: PMC2242558 DOI: 10.1110/ps.062109706] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.2] [Received: 01/23/2006] [Revised: 03/27/2006] [Accepted: 03/27/2006] [Indexed: 10/24/2022]
Abstract
One of the goals of structural genomics is to obtain a structural representative of almost every fold in nature. A recent estimate suggests that 70%-80% of soluble protein domains identified in the first 1000 genome sequences should be covered by about 25,000 structures-a reasonably achievable goal. As no current estimates exist for the number of membrane protein families, however, it is not possible to know whether family coverage is a realistic goal for membrane proteins. Here we find that virtually all polytopic helical membrane protein families are present in the already known sequences so we can make an estimate of the total number of families. We find that only approximately 700 polytopic membrane protein families account for 80% of structured residues and approximately 1700 cover 90% of structured residues. While apparently a finite and reachable goal, we estimate that it will likely take more than three decades to obtain the structures needed for 90% residue coverage, if current trends continue.
Affiliation(s)
- Amit Oberai
- Department of Chemistry and Biochemistry, UCLA-DOE Institute for Genomics and Proteomics, Los Angeles, CA 90095-1570, USA
23
Abstract
Classification of proteins into families of homologous sequences constitutes the basis of functional analysis or of evolutionary studies. Here we present INVertebrate HOmologous GENes (INVHOGEN), a database combining the available invertebrate protein genes from UniProt (consisting of Swiss-Prot and TrEMBL) into gene families. For each family INVHOGEN provides a multiple protein alignment, a maximum likelihood based phylogenetic tree and taxonomic information about the sequences. It is possible to download the corresponding GenBank flatfiles, the alignment and the tree in Newick format. Sequences and related information have been structured in an ACNUC database under a client/server architecture. Thus, complex selections can be performed. An external graphical tool (FamFetch) allows access to the data to evaluate homology relationships between genes and distinguish orthologous from paralogous sequences. Thus, INVHOGEN complements the well-known HOVERGEN database. The databank is available at .
Affiliation(s)
- Ingo Paulsen
- Department of Bioinformatics, Institute for Computer Sciences, Heinrich-Heine-University Duesseldorf, Universitaetsstrasse 1, 40225 Duesseldorf, Germany.
24
Uchiyama I. Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes. Nucleic Acids Res 2006; 34:647-58. [PMID: 16436801 PMCID: PMC1351371 DOI: 10.1093/nar/gkj448] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.2] [Indexed: 11/18/2022]
Abstract
Ortholog identification is a crucial first step in comparative genomics. Here, we present a rapid method of ortholog grouping which is effective enough to allow the comparison of many genomes simultaneously. The method takes as input all-against-all similarity data and classifies genes based on the traditional hierarchical clustering algorithm UPGMA. In the course of clustering, the method detects domain fusion or fission events, and splits clusters into domains if required. The subsequent procedure splits the resulting trees such that intra-species paralogous genes are divided into different groups so as to create plausible orthologous groups. As a result, the procedure can split genes into the domains minimally required for ortholog grouping. The procedure, named DomClust, was tested using the COG database as a reference. When comparing several clustering algorithms combined with the conventional bidirectional best-hit (BBH) criterion, we found that our method generally showed better agreement with the COG classification. By comparing the clustering results generated from datasets of different releases, we also found that our method showed relatively good stability in comparison to the BBH-based methods.
Affiliation(s)
- Ikuo Uchiyama
- National Institute for Basic Biology, National Institutes of Natural Sciences, Nishigonaka 38, Myodaiji, Okazaki, Aichi 444-8585 Japan.
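The abstract above names UPGMA as the clustering core of DomClust. A minimal sketch of plain UPGMA (average linkage) follows, assuming the all-against-all similarity scores have already been converted to dissimilarities. The labels, the distance values and the `upgma` function name are hypothetical, and the domain-splitting and paralog-splitting steps of DomClust are not reproduced.

```python
from itertools import combinations


def upgma(labels, distance):
    """Agglomerate `labels` by average linkage.

    `distance` maps frozenset({a, b}) to the dissimilarity of labels a and b.
    Returns the merges as (cluster_a, cluster_b, merge_distance) tuples, where
    each cluster is a tuple of the original labels it contains.
    """
    clusters = [(label,) for label in labels]
    dist = {frozenset({(a,), (b,)}): distance[frozenset({a, b})]
            for a, b in combinations(labels, 2)}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merges.append((a, b, dist[frozenset({a, b})]))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # Size-weighted mean equals the mean over all member pairs.
            dist[frozenset({merged, c})] = (
                len(a) * dist[frozenset({a, c})] + len(b) * dist[frozenset({b, c})]
            ) / (len(a) + len(b))
        clusters.append(merged)
    return merges


# Hypothetical pairwise dissimilarities between four genes.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 8,
         ("A", "D"): 10, ("B", "D"): 12, ("C", "D"): 11}
distance = {frozenset(k): v for k, v in pairs.items()}
merges = upgma(["A", "B", "C", "D"], distance)
for a, b, d in merges:
    print(a, b, d)
```

The returned merge list is a flat encoding of the tree; cutting it at a chosen merge distance yields flat groups.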
25
Krause A, Stoye J, Vingron M. Large scale hierarchical clustering of protein sequences. BMC Bioinformatics 2005; 6:15. [PMID: 15663796 PMCID: PMC547898 DOI: 10.1186/1471-2105-6-15] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.8] [Received: 08/02/2004] [Accepted: 01/22/2005] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Searching a biological sequence database with a query sequence looking for homologues has become a routine operation in computational biology. In spite of the high degree of sophistication of currently available search routines it is still virtually impossible to identify quickly and clearly a group of sequences that a given query sequence belongs to. RESULTS: We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences resulting in a hierarchical clustering which is being made available for querying and browsing at http://systers.molgen.mpg.de/. CONCLUSIONS: Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.
Affiliation(s)
- Antje Krause
- Max Planck Institute for Molecular Genetics, Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany
- TFH Wildau, Bahnhofstrasse 1, 15745 Wildau, Germany
- Jens Stoye
- Universität Bielefeld, Technische Fakultät, AG Genominformatik, Postfach 100131, 33501 Bielefeld, Germany
- Martin Vingron
- Max Planck Institute for Molecular Genetics, Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany
26
Stevens FJ. Efficient recognition of protein fold at low sequence identity by conservative application of Psi-BLAST: validation. J Mol Recognit 2005; 18:139-49. [PMID: 15558595 DOI: 10.1002/jmr.721] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/07/2022]
Abstract
A substantial fraction of protein sequences derived from genomic analyses is currently classified as representing 'hypothetical proteins of unknown function'. In part, this reflects the limitations of methods for comparison of sequences with very low identity. We evaluated the effectiveness of a Psi-BLAST search strategy to identify proteins of similar fold at low sequence identity. Psi-BLAST searches for structurally characterized low-sequence-identity matches were carried out on a set of over 300 proteins of known structure. Searches were conducted in NCBI's non-redundant database and were limited to three rounds. Some 614 potential homologs with 25% or lower sequence identity to 166 members of the search set were obtained. Disregarding the expect value, level of sequence identity and span of alignment, correspondence of fold between the target and potential homolog was found in more than 95% of the Psi-BLAST matches. Restrictions on expect value or span of alignment improved the false positive rate at the expense of eliminating many true homologs. Approximately three-quarters of the putative homologs obtained by three rounds of Psi-BLAST revealed no significant sequence similarity to the target protein upon direct sequence comparison by BLAST, and therefore could not be found by a conventional search. Although three rounds of Psi-BLAST identified many more homologs than a standard BLAST search, most homologs were undetected. It appears that more than 80% of all homologs to a target protein may be characterized by a lack of significant sequence similarity. We suggest that conservative use of Psi-BLAST has the potential to propose experimentally testable functions for the majority of proteins currently annotated as 'hypothetical proteins of unknown function'.
Affiliation(s)
- F J Stevens
- Biosciences Division, Argonne National Laboratory, Argonne, IL 60439, USA.
27
Reichelt J, Dieterich G, Kvesic M, Schomburg D, Heinz DW. BRAGI: linking and visualization of database information in a 3D viewer and modeling tool. Bioinformatics 2004; 21:1291-3. [PMID: 15546941 DOI: 10.1093/bioinformatics/bti138] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
BRAGI is a well-established package for viewing and modeling of three-dimensional (3D) structures of biological macromolecules. A new version of BRAGI has been developed that is supported on Windows, Linux and SGI. The user interface has been rewritten to give the standard 'look and feel' of the chosen operating system and to provide a more intuitive, easier usage. A large number of new features have been added. Information from public databases such as SWISS-PROT, InterPro, DALI and OMIM can be displayed in the 3D viewer. Structures can be searched for homologous sequences using the NCBI BLAST server.
Affiliation(s)
- Joachim Reichelt
- Division of Structural Biology, German Research Centre for Biotechnology (GBF) Mascheroder Weg 1, D-38124, Braunschweig, Germany.
28
29
Abstract
Classification of proteins into families is one of the main goals of functional analysis. Proteins are usually assigned to a family on the basis of the presence of family-specific patterns, domains, or structural elements. Whereas proteins belonging to the same family are generally similar to each other, the extent of similarity varies widely across families. Some families are characterized by short, well-defined motifs, whereas others contain longer, less-specific motifs. We present a simple method for visualizing such differences. We applied our method to the Arabidopsis thaliana families listed at The Arabidopsis Information Resource (TAIR) Web site and for 76% of the nontrivial families (families with more than one member), our method identifies simple similarity measures that are necessary and sufficient to cluster members of the family together. Our visualization method can be used as part of an annotation pipeline to identify potentially incorrectly defined families. We also describe how our method can be extended to identify novel families and to assign unclassified proteins into known families.
Affiliation(s)
- Vamsi Veeramachaneni
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
30
Ouzounis CA, Coulson RMR, Enright AJ, Kunin V, Pereira-Leal JB. Classification schemes for protein structure and function. Nat Rev Genet 2003; 4:508-19. [PMID: 12838343 DOI: 10.1038/nrg1113] [Citation(s) in RCA: 74] [Impact Index Per Article: 3.4] [Indexed: 11/09/2022]
Abstract
We examine the structural and functional classifications of the protein universe, providing an overview of the existing classification schemes, their features and inter-relationships. We argue that a unified scheme should be based on a natural classification approach and that more comparative analyses of the present schemes are required both to understand their limitations and to help delimit the number of known protein folds and their corresponding functional roles in cells.
Affiliation(s)
- Christos A Ouzounis
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
31
Abstract
Domains are considered as the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences were decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda.
Affiliation(s)
- Andreas Heger
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
32
Copley RR, Ponting CP, Schultz J, Bork P. Sequence analysis of multidomain proteins: past perspectives and future directions. Adv Protein Chem 2003; 61:75-98. [PMID: 12461821 DOI: 10.1016/s0065-3233(02)61002-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 01/25/2023]
33
Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002; 30:1575-84. [PMID: 11917018 PMCID: PMC101833 DOI: 10.1093/nar/30.7.1575] [Citation(s) in RCA: 2415] [Impact Index Per Article: 105.0] [Indexed: 01/18/2023]
Abstract
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
Affiliation(s)
- A J Enright
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
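The abstract above states that TRIBE-MCL relies on the Markov cluster (MCL) algorithm applied to precomputed similarity scores. A minimal pure-Python sketch of the generic MCL loop (alternating expansion and inflation of a column-stochastic matrix) is given below. The toy similarity matrix, the parameter defaults and the function names are assumptions for illustration; this is not the TRIBE-MCL implementation, and it omits the score preprocessing and sparse-matrix handling a real run would need.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / total if total else 0.0
    return out


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: raise entries to `power`, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def mcl(similarity, inflation=2.0, max_iterations=50, tol=1e-9):
    n = len(similarity)
    # Add self-loops, then turn weights into column-wise transition probabilities.
    m = normalize_columns([[similarity[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iterations):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Each row that retains mass is an attractor; its non-zero columns form a cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Hypothetical similarity graph: two triangles joined by one weak edge (2-3).
S = [[0.0] * 6 for _ in range(6)]
for a, b, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    S[a][b] = S[b][a] = w

families = mcl(S)
print(families)
```

Raising the inflation parameter generally yields finer clusters, so in practice it would be tuned against a reference classification.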
34
Getz G, Vendruscolo M, Sachs D, Domany E. Automated assignment of SCOP and CATH protein structure classifications from FSSP scores. Proteins 2002; 46:405-15. [PMID: 11835515 DOI: 10.1002/prot.1176] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.1] [Indexed: 11/07/2022]
Abstract
We present an automated procedure to assign CATH and SCOP classifications to proteins whose FSSP score is available. CATH classification is assigned down to the topology level, and SCOP classification is assigned to the fold level. Because the FSSP database is updated weekly, this method makes it possible to update also CATH and SCOP with the same frequency. Our predictions have a nearly perfect success rate when ambiguous cases are discarded. These ambiguous cases are intrinsic in any protein structure classification that relies on structural information alone. Hence, we introduce the "twilight zone for structure classification." We further suggest that to resolve these ambiguous cases, other criteria of classification, based also on information about sequence and function, must be used.
Affiliation(s)
- Gad Getz
- Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot, Israel
35
Blundell TL, Mizuguchi K. Structural genomics: an overview. Prog Biophys Mol Biol 2001; 73:289-95. [PMID: 11063776 DOI: 10.1016/s0079-6107(00)00008-0] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.5] [Indexed: 11/28/2022]
Affiliation(s)
- T L Blundell
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK.
36
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447210 DOI: 10.1002/cfg.57] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]