1
Cantó-Pastor A, Mason GA, Brady SM, Provart NJ. Arabidopsis bioinformatics: tools and strategies. The Plant Journal: For Cell and Molecular Biology 2021; 108:1585-1596. [PMID: 34695270 DOI: 10.1111/tpj.15547] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/21/2021] [Revised: 10/01/2021] [Accepted: 10/19/2021] [Indexed: 06/13/2023]
Abstract
The sequencing of the Arabidopsis thaliana genome 21 years ago ushered in the genomics era for plant research. Since then, an incredible variety of bioinformatic tools permit easy access to large repositories of genomic, transcriptomic, proteomic, epigenomic and other '-omic' data. In this review, we cover some more recent tools (and highlight the 'classics') for exploring such data in order to help formulate quality, testable hypotheses, often without having to generate new experimental data. We cover tools for examining gene expression and co-expression patterns, undertaking promoter analyses and gene set enrichment analyses, and exploring protein-protein and protein-DNA interactions. We will touch on tools that integrate different data sets at the end of the article.
Affiliation(s)
- Alex Cantó-Pastor
- Department of Plant Biology and Genome Center, University of California Davis, 1 Shields Avenue, Davis, CA, 95616, USA
- G Alex Mason
- Department of Plant Biology and Genome Center, University of California Davis, 1 Shields Avenue, Davis, CA, 95616, USA
- Siobhan M Brady
- Department of Plant Biology and Genome Center, University of California Davis, 1 Shields Avenue, Davis, CA, 95616, USA
- Nicholas J Provart
- Department of Cell and Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, 25 Willcocks Street, Toronto, ON, M5S 3B2, Canada
2
Xavier D, Crespo B, Fuentes-Fernández R. A rule-based expert system for inferring functional annotation. Appl Soft Comput 2015. [DOI: 10.1016/j.asoc.2015.05.055] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
3
How to learn about gene function: text-mining or ontologies? Methods 2014; 74:3-15. [PMID: 25088781 DOI: 10.1016/j.ymeth.2014.07.004] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 04/03/2014] [Revised: 07/01/2014] [Accepted: 07/09/2014] [Indexed: 12/31/2022] Open
Abstract
As the amount of genome information increases rapidly, there is a correspondingly greater need for methods that provide accurate and automated annotation of gene function. For example, many high-throughput technologies (e.g., next-generation sequencing) are being used today to generate lists of genes associated with specific conditions. However, their functional interpretation remains a challenge, and many tools exist that try to characterize the function of gene lists. Such systems typically rely on enrichment analysis and aim to give quick insight into the underlying biology by presenting it in the form of a summary report. While the load of annotation may be alleviated by such computational approaches, the main challenge in modern annotation remains to develop a systems form of analysis in which a pipeline can effectively analyze gene lists quickly and identify aggregated annotations through computerized resources. In this article we survey some of the many such tools and methods that have been developed to automatically interpret the biological functions underlying gene lists. We overview current functional annotation aspects from the perspective of their epistemology (i.e., the underlying theories used to organize information about gene function into a body of verified and documented knowledge) and find that most of the currently used functional annotation methods fall broadly into one of two categories: they are based either on 'known' formally-structured ontology annotations created by 'experts' (e.g., the GO terms used to describe the function of Entrez Gene entries), or, perhaps more adventurously, on annotations inferred from literature (e.g., many text-mining methods use computer-aided reasoning to acquire knowledge represented in natural languages). Overall, however, deriving detailed and accurate insight from such gene lists remains a challenging task, and improved methods are called for.
In particular, future methods need to (1) provide more holistic insight into the underlying molecular systems; (2) provide better follow-up experimental testing and treatment options, and (3) better manage gene lists derived from organisms that are not well-studied. We discuss some promising approaches that may help achieve these advances, especially the use of extended dictionaries of biomedical concepts and molecular mechanisms, as well as greater use of annotation benchmarks.
4
Bazzan ALC, Duarte R, Pitinga AN, Schroeder LF, de A. Souto F, da Silva SC. ATUCG: an agent-based environment for automatic annotation of genomes. Int J Coop Inf Syst 2012. [DOI: 10.1142/s0218843003000735] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
This work reports on the ATUCG environment (Agent-based environmenT for aUtomatiC annotation of Genomes). It consists of three layers, each having several agents in charge of performing repetitive and time-consuming tasks. Layer I aims at automating the tasks behind the process of finding ORFs (Open Reading Frames). Layer II (the core of our approach) is associated with three main tasks: extraction and formatting of data, automatic annotation of data regarding profiles or families of proteins, and generation and validation of rules to automatically annotate the Keywords field in the SWISS-PROT database. Layer III permits the user to check the correctness of the automatic annotation. This environment is being designed with the sequencing of Mycoplasma hyopneumoniae in mind. Thus examples are presented using data from organisms of the Mycoplasmataceae family. We have concentrated the developments in layer II because this is the most general one and because it focusses on machine learning algorithms, a characteristic which is not usual in annotation systems. Results regarding this layer show that with learning (individual or collaborative), agents are able to generate rules for annotation which achieve better results than those reported in the literature.
Affiliation(s)
- Ana L. C. Bazzan
- Instituto de Informática, Univ. Fed. do Rio Grande do Sul, Caixa Postal 15064, 91501–970, Porto Alegre, RS, Brazil
- Rogério Duarte
- Instituto de Informática, Univ. Fed. do Rio Grande do Sul, Caixa Postal 15064, 91501–970, Porto Alegre, RS, Brazil
- Abner N. Pitinga
- Instituto de Informática, Univ. Fed. do Rio Grande do Sul, Caixa Postal 15064, 91501–970, Porto Alegre, RS, Brazil
- Luciana F. Schroeder
- Instituto de Informática, Univ. Fed. do Rio Grande do Sul, Caixa Postal 15064, 91501–970, Porto Alegre, RS, Brazil
- Farlon de A. Souto
- Instituto de Informática, Univ. Fed. do Rio Grande do Sul, Caixa Postal 15064, 91501–970, Porto Alegre, RS, Brazil
- Sérgio Ceroni da Silva
- Centro de Biotecnologia and Fac. de Veterinária, Univ. Fed. do Rio Grande do Sul, 91501–970, Porto Alegre, RS, Brazil
5
Nguyen C, Mannino M, Gardiner K, Cios KJ. ClusFCM: an algorithm for predicting protein functions using homologies and protein interactions. J Bioinform Comput Biol 2011; 6:203-22. [DOI: 10.1142/s0219720008003333] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 04/24/2007] [Revised: 09/26/2007] [Accepted: 10/24/2007] [Indexed: 11/18/2022]
Abstract
We introduce a new algorithm, called ClusFCM, which combines techniques of clustering and fuzzy cognitive maps (FCM) for prediction of protein functions. ClusFCM takes advantage of protein homologies and protein interaction network topology to improve low recall predictions associated with existing prediction methods. ClusFCM exploits the fact that proteins of known function tend to cluster together and deduce functions not only through their direct interaction with other proteins, but also from other proteins in the network. We use ClusFCM to annotate protein functions for Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), and Drosophila melanogaster (fly) using protein–protein interaction data from the General Repository for Interaction Datasets (GRID) database and functional labels from Gene Ontology (GO) terms. The algorithm's performance is compared with four state-of-the-art methods for function prediction — Majority, χ2 statistics, Markov random field (MRF), and FunctionalFlow — using measures of Matthews correlation coefficient, harmonic mean, and area under the receiver operating characteristic (ROC) curves. The results indicate that ClusFCM predicts protein functions with high recall while not lowering precision. Supplementary information is available at .
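Two of the evaluation measures named in this abstract can be illustrated with a short sketch. The Python below computes the Matthews correlation coefficient from confusion-matrix counts, and a harmonic mean that is assumed here to combine precision and recall (the abstract does not say which quantities are averaged). The function names and the idea of feeding in raw counts are illustrative choices, not taken from the ClusFCM paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when any marginal total is zero,
    # because the coefficient is undefined in that case.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def harmonic_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed pairing)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), which makes it a useful single number when positive and negative classes are unbalanced.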
Affiliation(s)
- Cao Nguyen
- Virginia Commonwealth University, VA 23238, USA
- Krzysztof J. Cios
- Virginia Commonwealth University, VA 23238, USA
- University of Colorado Boulder, Boulder, CO 80309, USA
- Polish Academy of Sciences, Poland
6
Hawkins T, Kihara D. Function prediction of uncharacterized proteins. J Bioinform Comput Biol 2011; 5:1-30. [PMID: 17477489 DOI: 10.1142/s0219720007002503] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.8] [Received: 06/20/2006] [Revised: 09/23/2006] [Accepted: 10/10/2006] [Indexed: 11/18/2022]
Abstract
Function prediction of uncharacterized protein sequences generated by genome projects has emerged as an important focus for computational biology. We have categorized several approaches beyond traditional sequence similarity that utilize the overwhelmingly large amounts of available data for computational function prediction, including structure-, association (genomic context)-, interaction (cellular context)-, process (metabolic context)-, and proteomics-experiment-based methods. Because they incorporate structural and experimental data that is not used in sequence-based methods, they can provide additional accuracy and reliability to protein function prediction. Here, first we review the definition of protein function. Then the recent developments of these methods are introduced with special focus on the type of predictions that can be made. The need for further development of comprehensive systems biology techniques that can utilize the ever-increasing data presented by the genomics and proteomics communities is emphasized. For the readers' convenience, tables of useful online resources in each category are included. The role of computational scientists in the near future of biological research and the interplay between computational and experimental biology are also addressed.
Affiliation(s)
- Troy Hawkins
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.
7
Yu GX. RuleMiner: a knowledge system for supporting high-throughput protein function annotations. J Bioinform Comput Biol 2011; 2:615-37. [PMID: 15617156 DOI: 10.1142/s0219720004000752] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 11/04/2003] [Revised: 03/23/2004] [Accepted: 03/24/2004] [Indexed: 11/18/2022]
Abstract
In this paper, we present RuleMiner, a knowledge system that facilitates seamless integration of multi-sequence analysis tools and defines profile-based rules for supporting high-throughput protein function annotations. This system consists of three essential components: Protein Function Groups (PFGs), PFG profiles and rules. The PFGs, established from an integrated analysis of current knowledge of protein functions from the Swiss-Prot database and protein family-based sequence classifications, cover all possible cellular functions available in the database. The PFG profiles describe detailed protein features in the PFGs, such as sequence conservation, the occurrence of sequence-based motifs, domains and species distributions. The rules, extracted from the PFG profiles, describe the relationships between these PFGs and all possible features. As a result, RuleMiner provides an enhanced capability for protein function analysis: for example, results from the integrated sequence analysis tools for given proteins can be compared directly because the feature-PFG relationships are explicit. Much-needed guidance is also readily available for such analysis. If the rules describe one-to-one (unique) relationships between the protein features and the PFGs, then these features can be used as unique functional identifiers and the cellular functions of unknown proteins can be reliably determined. Otherwise, additional information has to be provided.
Affiliation(s)
- Gong-Xin Yu
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA.
8
Abstract
As the genomics era matures, the availability of complete microbial genome sequences is facilitating computational approaches to understand bacterial genomes and DNA structure/function relationships. From the genome of pathogens, we can derive invaluable information on potential targets for new antimicrobial agents. Advancements in high-throughput 'omics' technologies and the availability of multiple isolates of the same species have significantly changed the time frame and scope for identifying novel therapeutic targets. This article aims to discuss selected aspects of the bacterial genome, and advocates 'omics'-based techniques to advance the discovery of new therapeutic targets against extracellular bacterial pathogens.
Affiliation(s)
- Nagathihalli S Nagaraj
- Department of Surgery, Vanderbilt University School of Medicine, Nashville, TN 37232, USA.
9
Schröder A, Eichner J, Supper J, Eichner J, Wanke D, Henneges C, Zell A. Predicting DNA-binding specificities of eukaryotic transcription factors. PLoS One 2010; 5:e13876. [PMID: 21152420 PMCID: PMC2994704 DOI: 10.1371/journal.pone.0013876] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 06/11/2010] [Accepted: 10/14/2010] [Indexed: 11/18/2022] Open
Abstract
Today, annotated amino acid sequences of more and more transcription factors (TFs) are readily available. Quantitative information about their DNA-binding specificities, however, is hard to obtain. Position frequency matrices (PFMs), the most widely used models to represent binding specificities, are experimentally characterized only for a small fraction of all TFs. Even for some of the most intensively studied eukaryotic organisms (i.e., human, rat and mouse), roughly one-sixth of all proteins with an annotated DNA-binding domain have been characterized experimentally. Here, we present a new method based on support vector regression for predicting quantitative DNA-binding specificities of TFs in different eukaryotic species. This approach estimates a quantitative measure for the PFM similarity of two proteins, based on various features derived from their protein sequences. The method is trained and tested on a dataset containing 1,239 TFs with known DNA-binding specificity, and used to predict specific DNA target motifs for 645 TFs with high accuracy.
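For readers unfamiliar with the position frequency matrix mentioned in this abstract, the sketch below shows how one is typically built: by counting how often each nucleotide occurs at each position of a set of aligned binding sites. The function name and the four example sites are hypothetical, and the code illustrates only the data structure, not the support vector regression method of the paper.

```python
def position_frequency_matrix(sites, alphabet="ACGT"):
    """Count each nucleotide at each position of equal-length aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    # One row of per-position counts for every symbol in the alphabet.
    pfm = {base: [0] * length for base in alphabet}
    for site in sites:
        for pos, base in enumerate(site.upper()):
            if base not in pfm:
                raise ValueError(f"unexpected symbol {base!r} in site {site!r}")
            pfm[base][pos] += 1
    return pfm


# Hypothetical aligned binding sites, for illustration only.
example_sites = ["TGACGT", "TGACGC", "TAACGT", "TGACGA"]
example_pfm = position_frequency_matrix(example_sites)
for base, counts in example_pfm.items():
    print(base, counts)
```

Each column of the matrix sums to the number of sites, so dividing by that number turns the counts into per-position frequencies.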
Affiliation(s)
- Adrian Schröder
- Center for Bioinformatics Tübingen (ZBIT), University of Tübingen, Tübingen, Germany.
10
Henderson-MacLennan NK, Papp JC, Talbot CC, McCabe ER, Presson AP. Pathway analysis software: annotation errors and solutions. Mol Genet Metab 2010; 101:134-40. [PMID: 20663702 PMCID: PMC2950253 DOI: 10.1016/j.ymgme.2010.06.005] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 04/14/2010] [Revised: 06/04/2010] [Accepted: 06/05/2010] [Indexed: 11/20/2022]
Abstract
Genetic databases contain a variety of annotation errors that often go unnoticed due to the large size of modern genetic data sets. Interpretation of these data sets requires bioinformatics tools that may contribute to this problem. While providing gene symbol annotations for identifiers (IDs) such as microarray probe set, RefSeq, GenBank, and Entrez Gene is seemingly trivial, the accuracy is fundamental to any subsequent conclusions. We examine gene symbol annotations and results from three commercial pathway analysis software (PAS) packages: Ingenuity Pathways Analysis, GeneGO, and Pathway Studio. We compare gene symbol annotations and canonical pathway results over time and among different input ID types. We find that PAS results can be affected by variation in gene symbol annotations across software releases and the input ID type analyzed. As a result, we offer suggestions for using commercial PAS and reporting microarray results to improve research quality. We propose a wiki type website to facilitate communication of bioinformatics software problems within the scientific community.
Affiliation(s)
- Jeanette C. Papp
- Department of Biostatistics, David Geffen School of Medicine at UCLA
- C. Conover Talbot
- Institute for Basic Biomedical Sciences, The Johns Hopkins School of Medicine, Baltimore, MD 21205
- Edward R.B. McCabe
- Department of Pediatrics, David Geffen School of Medicine at UCLA
- Department of Human Genetics, David Geffen School of Medicine at UCLA
- Department of Bioengineering, Henry Samueli School of Engineering and Applied Science
- California NanoSystems Institute, University of California, Los Angeles, Los Angeles, California 90095
- Angela P. Presson
- Department of Pediatrics, David Geffen School of Medicine at UCLA
- Department of Biostatistics, David Geffen School of Medicine at UCLA
- Correspondence: Angela P. Presson, Adjunct Assistant Professor, Departments of Biostatistics and Pediatrics, 51-236A CHS, UCLA School of Public Health, Los Angeles, CA 90095-7088, phone 310-825-5916,
11
Pérez AJ, Rodríguez A, Trelles O, Thode G. A computational strategy for protein function assignment which addresses the multidomain problem. Comp Funct Genomics 2010; 3:423-40. [PMID: 18629055 PMCID: PMC2447339 DOI: 10.1002/cfg.208] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 03/15/2002] [Accepted: 08/12/2002] [Indexed: 11/25/2022] Open
Abstract
A method for assigning functions to unknown sequences based on finding correlations between short signals and functional annotations in a protein database is presented. This approach is based on keyword (KW) and feature (FT) information stored in the SWISS-PROT database. The former refers to particular protein characteristics and the latter locates these characteristics at a specific sequence position. In this way, a certain keyword is only assigned to a sequence if sequence similarity is found in the position described by the FT field. Exhaustive tests performed over sequences with homologues (cluster set) and without homologues (singleton set) in the database show that assigning functions is much 'cleaner' when information about domains (FT field) is used than when only the keywords are used.
Affiliation(s)
- A J Pérez
- Genetics Department, University of Málaga, Málaga 29071, Spain.
12
Analysis of Transcripts Expressed in One-Day-Old Larvae and Fifth Instar Silk Glands of Tasar Silkworm, Antheraea mylitta. Comp Funct Genomics 2010:246738. [PMID: 20454581 PMCID: PMC2864506 DOI: 10.1155/2010/246738] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 09/03/2009] [Accepted: 02/03/2010] [Indexed: 11/17/2022] Open
Abstract
Antheraea mylitta is one of the wild nonmulberry silkworms, which produces tasar silk. An EST project has been undertaken to understand the gene expression profile of the A. mylitta silk gland. Two cDNA libraries, one from the whole bodies of one-day-old larvae and the other from the silk glands of fifth instar larvae, were constructed and sequenced. A total of 2476 good-quality ESTs (1239 clones) were obtained and grouped into 648 clusters containing 390 contigs and 258 singletons to represent 467 potential unigenes. Forty-five sequences contained putative coding regions and represented potentially novel genes. Among the 648 clusters, 241 were categorized according to the Gene Ontology hierarchy and showed the presence of several silk and immune-related genes. The A. mylitta ESTs have been organized into a freely available online database, "AmyBASE". These data provide an initial insight into the A. mylitta transcriptome and help to understand the molecular mechanism of silk protein production in a Lepidopteran species.
13
Hsiao TL, Revelles O, Chen L, Sauer U, Vitkup D. Automatic policing of biochemical annotations using genomic correlations. Nat Chem Biol 2009; 6:34-40. [PMID: 19935659 PMCID: PMC2935526 DOI: 10.1038/nchembio.266] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Received: 06/09/2009] [Accepted: 09/10/2009] [Indexed: 11/09/2022]
Abstract
With the increasing role of computational tools in the analysis of sequenced genomes, there is an urgent need to maintain high accuracy of functional annotations. Misannotations can be easily generated and propagated through databases by functional transfer based on sequence homology. We developed and optimized an automatic policing method to detect biochemical misannotations using context genomic correlations. The method works by finding genes with unusually weak genomic correlations in their assigned network positions. We demonstrate the accuracy of the method using a cross-validated approach. In addition, we show that the method identifies a significant number of potential misannotations in Bacillus subtilis, including metabolic assignments already shown to be incorrect experimentally. The experimental analysis of the mispredicted genes forming the leucine degradation pathway in B. subtilis demonstrates that computational policing tools can generate important biological hypotheses.
Affiliation(s)
- Tzu-Lin Hsiao
- Center for Computational Biology and Bioinformatics and Department of Biomedical Informatics, Columbia University, Irving Cancer Research Center, New York, New York, USA
14
Tsafnat G, Coiera E, Partridge SR, Schaeffer J, Iredell JR. Context-driven discovery of gene cassettes in mobile integrons using a computational grammar. BMC Bioinformatics 2009; 10:281. [PMID: 19735578 PMCID: PMC3087341 DOI: 10.1186/1471-2105-10-281] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 01/12/2009] [Accepted: 09/08/2009] [Indexed: 01/13/2023] Open
Abstract
Background: Gene discovery algorithms typically examine sequence data for low level patterns. A novel method to computationally discover higher order DNA structures is presented, using a context sensitive grammar. The algorithm was applied to the discovery of gene cassettes associated with integrons. The discovery and annotation of antibiotic resistance genes in such cassettes is essential for effective monitoring of antibiotic resistance patterns and formulation of public health antibiotic prescription policies. Results: We discovered two new putative gene cassettes using the method, from 276 integron features and 978 GenBank sequences. The system achieved κ = 0.972 annotation agreement with an expert gold standard of 300 sequences. In rediscovery experiments, we deleted 789,196 cassette instances over 2030 experiments and correctly relabelled 85.6% (α ≥ 95%, E ≤ 1%, mean sensitivity = 0.86, specificity = 1, F-score = 0.93), with no false positives. Error analysis demonstrated that for 72,338 missed deletions, two adjacent deleted cassettes were labeled as a single cassette, increasing performance to 94.8% (mean sensitivity = 0.92, specificity = 1, F-score = 0.96). Conclusion: Using grammars we were able to represent heuristic background knowledge about large and complex structures in DNA. Importantly, we were also able to use the context embedded in the model to discover new putative antibiotic resistance gene cassettes. The method is complementary to existing automatic annotation systems which operate at the sequence level.
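The κ agreement figure reported in this abstract can be made concrete with a small sketch. The abstract does not state which kappa statistic was used; the Python below assumes Cohen's kappa for two annotators, which compares observed agreement with the agreement expected by chance. The function name and any labels passed to it are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    if expected == 1:
        # Both annotators used one identical label throughout; kappa is
        # undefined, so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value near 1, such as the 0.972 quoted above, means the two sets of annotations agree far more often than chance alone would produce.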
Affiliation(s)
- Guy Tsafnat
- Centre for Health Informatics, Univ. of New South Wales, Sydney, NSW 2052, Australia.
15
Díaz-Mejía JJ, Babu M, Emili A. Computational and experimental approaches to chart the Escherichia coli cell-envelope-associated proteome and interactome. FEMS Microbiol Rev 2008; 33:66-97. [PMID: 19054114 PMCID: PMC2704936 DOI: 10.1111/j.1574-6976.2008.00141.x] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Indexed: 01/13/2023] Open
Abstract
The bacterial cell-envelope consists of a complex arrangement of lipids, proteins and carbohydrates that serves as the interface between a microorganism and its environment or, with pathogens, a human host. Escherichia coli has long been investigated as a leading model system to elucidate the fundamental mechanisms underlying microbial cell-envelope biology. This includes extensive descriptions of the molecular identities, biochemical activities and evolutionary trajectories of integral transmembrane proteins, many of which play critical roles in infectious disease and antibiotic resistance. Strikingly, however, only half of the c. 1200 putative cell-envelope-related proteins of E. coli currently have experimentally attributed functions, indicating an opportunity for discovery. In this review, we summarize the state of the art of computational and proteomic approaches for determining the components of the E. coli cell-envelope proteome, as well as exploring the physical and functional interactions that underlie its biogenesis and functionality. We also provide a comprehensive comparative benchmarking analysis on the performance of different bioinformatic and proteomic methods commonly used to determine the subcellular localization of bacterial proteins.
Affiliation(s)
- Juan Javier Díaz-Mejía
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada
16
Lee EY, Choi DS, Kim KP, Gho YS. Proteomics in gram-negative bacterial outer membrane vesicles. Mass Spectrometry Reviews 2008; 27:535-555. [PMID: 18421767 DOI: 10.1002/mas.20175] [Citation(s) in RCA: 223] [Impact Index Per Article: 13.9] [Indexed: 05/26/2023]
Abstract
Gram-negative bacteria constitutively secrete outer membrane vesicles (OMVs) into the extracellular milieu. Recent research in this area has revealed that OMVs may act as intercellular communicasomes in polyspecies communities by enhancing bacterial survival and pathogenesis in hosts. However, the mechanisms of vesicle formation and the pathophysiological roles of OMVs have not been clearly defined. While it is obvious that mass spectrometry-based proteomics offers great opportunities for improving our knowledge of bacterial OMVs, limited proteomic data are available for OMVs. The present review aims to give an overview of the previous biochemical, biological, and proteomic studies in the emerging field of bacterial OMVs, and to give future directions for high-throughput and comparative proteomic studies of OMVs that originate from diverse Gram-negative bacteria under various environmental conditions. This article will hopefully stimulate further efforts to construct a comprehensive proteome database of bacterial OMVs that will help us not only to elucidate the biogenesis and functions of OMVs but also to develop diagnostic tools, vaccines, and antibiotics effective against pathogenic bacteria.
Affiliation(s)
- Eun-Young Lee
- Department of Life Science and Division of Molecular and Life Sciences, Pohang University of Science and Technology, Pohang, Republic of Korea
17
Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res 2008; 36:6688-719. [PMID: 18948295 PMCID: PMC2588523 DOI: 10.1093/nar/gkn668] [Citation(s) in RCA: 534] [Impact Index Per Article: 33.4] [Indexed: 01/04/2023] Open
Abstract
The first bacterial genome was sequenced in 1995, and the first archaeal genome in 1996. Soon after these breakthroughs, an exponential rate of genome sequencing was established, with a doubling time of approximately 20 months for bacteria and approximately 34 months for archaea. Comparative analysis of the hundreds of sequenced bacterial and dozens of archaeal genomes leads to several generalizations on the principles of genome organization and evolution. A crucial finding that enables functional characterization of the sequenced genomes and evolutionary reconstruction is that the majority of archaeal and bacterial genes have conserved orthologs in other, often distant, organisms. However, comparative genomics also shows that horizontal gene transfer (HGT) is a dominant force of prokaryotic evolution, along with the loss of genetic material resulting in genome contraction. A crucial component of the prokaryotic world is the mobilome, the enormous collection of viruses, plasmids and other selfish elements, which are in constant exchange with more stable chromosomes and serve as HGT vehicles. Thus, the prokaryotic genome space is a tightly connected, although compartmentalized, network, a novel notion that undermines the ‘Tree of Life’ model of evolution and requires a new conceptual framework and tools for the study of prokaryotic evolution.
Affiliation(s)
- Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
18
Leontovich AM, Tokmachev KY, van Houwelingen HC. The comparative analysis of statistics, based on the likelihood ratio criterion, in the automated annotation problem. BMC Bioinformatics 2008; 9:31. [PMID: 18211675 PMCID: PMC2267706 DOI: 10.1186/1471-2105-9-31] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 02/27/2007] [Accepted: 01/22/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND This paper discusses the problem of automated annotation. It is a continuation of the previous work on the A4-algorithm (Adaptive algorithm of automated annotation) developed by Leontovich and others. RESULTS A number of new statistics for the automated annotation of biological sequences are introduced. All these statistics are based on the likelihood ratio criterion. CONCLUSION Some of the statistics yield a prediction quality that is significantly higher (up to 1.5 times higher) in comparison with the results obtained with the A4-procedure.
Affiliation(s)
- Andrey M Leontovich
- Belozersky Institute of Physico-Chemical Biology, Moscow State University, Moscow 119899, Russia.
|
19
|
Tetko IV, Rodchenkov IV, Walter MC, Rattei T, Mewes HW. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics 2008; 24:621-8. [PMID: 18174184 DOI: 10.1093/bioinformatics/btm633] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION Accurate automatic assignment of protein functions remains a challenge for genome annotation. We have developed and compared the automatic annotation of four bacterial genomes employing a 5-fold cross-validation procedure and several machine learning methods. RESULTS The analyzed genomes were manually annotated with FunCat categories in MIPS providing a gold standard. Features describing a pair of sequences rather than each sequence alone were used. The descriptors were derived from sequence alignment scores, InterPro domains, synteny information, sequence length and calculated protein properties. Following training we scored all pairs from the validation sets, selected a pair with the highest predicted score and annotated the target protein with functional categories of the prototype protein. The data integration using machine-learning methods provided significantly higher annotation accuracy compared to the use of individual descriptors alone. The neural network approach showed the best performance. The descriptors derived from the InterPro domains and sequence similarity provided the highest contribution to the method performance. The predicted annotation scores allow differentiation of reliable versus non-reliable annotations. The developed approach was applied to annotate the protein sequences from 180 complete bacterial genomes. AVAILABILITY The FUNcat Annotation Tool (FUNAT) is available on-line as Web Services at http://mips.gsf.de/proj/funat.
Affiliation(s)
- Igor V Tetko
- Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Institute of Bioinformatics and Systems Biology, Neuherberg, Germany.
|
20
|
Sarac OS, Gürsoy-Yüzügüllü O, Cetin-Atalay R, Atalay V. Subsequence-based feature map for protein function classification. Comput Biol Chem 2007; 32:122-30. [PMID: 18243801 DOI: 10.1016/j.compbiolchem.2007.11.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 08/09/2007] [Accepted: 11/30/2007] [Indexed: 11/19/2022]
Abstract
Automated classification of proteins is indispensable for further in vivo investigation of the excessive number of unknown sequences generated by large-scale molecular biology techniques. This study describes a discriminative system based on feature space mapping, called subsequence profile map (SPMap), for functional classification of protein sequences. SPMap takes into account the information coming from the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences, and these are clustered to obtain a representative feature space mapping. Mapping is defined as the distribution of the subsequences of a protein sequence over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information coming from important subregions that are conserved over a family of proteins while avoiding the difficult task of explicit motif identification. The performance of the method was assessed through tests on various protein classification tasks. Our results showed that SPMap is capable of high-accuracy classification in most of these tasks. Furthermore, SPMap is fast and scalable enough to handle large datasets.
Affiliation(s)
- Omer Sinan Sarac
- Department of Computer Engineering, Middle East Technical University, 06531 Ankara, Turkey
|
21
|
Tamaki S, Arakawa K, Kono N, Tomita M. Restauro-G: a rapid genome re-annotation system for comparative genomics. GENOMICS PROTEOMICS & BIOINFORMATICS 2007; 5:53-8. [PMID: 17572364 PMCID: PMC5054091 DOI: 10.1016/s1672-0229(07)60014-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 10/31/2022]
Abstract
Annotations of complete genome sequences submitted directly from sequencing projects are diverse in terms of annotation strategies and update frequencies. These inconsistencies make comparative studies difficult. To allow rapid data preparation of a large number of complete genomes, automation and speed are important for genome re-annotation. Here we introduce an open-source rapid genome re-annotation software system, Restauro-G, specialized for bacterial genomes. Restauro-G re-annotates a genome by similarity searches utilizing the BLAST-Like Alignment Tool, referring to protein databases such as UniProt KB, NCBI nr, NCBI COGs, Pfam, and PSORTb. Re-annotation by Restauro-G achieved over 98% accuracy for most bacterial chromosomes in comparison with the original manually curated annotation of EMBL releases. Restauro-G was developed in the generic bioinformatics workbench G-language Genome Analysis Environment and is distributed at http://restauro-g.iab.keio.ac.jp/ under the GNU General Public License.
Affiliation(s)
- Satoshi Tamaki
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
- Department of Environmental Information, Keio University, Fujisawa 252-8520, Japan
- Kazuharu Arakawa
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
- Corresponding author.
- Nobuaki Kono
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
- Department of Environmental Information, Keio University, Fujisawa 252-8520, Japan
- Masaru Tomita
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
- Department of Environmental Information, Keio University, Fujisawa 252-8520, Japan
|
22
|
Fuhrer T, Chen L, Sauer U, Vitkup D. Computational prediction and experimental verification of the gene encoding the NAD+/NADP+-dependent succinate semialdehyde dehydrogenase in Escherichia coli. J Bacteriol 2007; 189:8073-8. [PMID: 17873044 PMCID: PMC2168661 DOI: 10.1128/jb.01027-07] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.9] [Indexed: 12/18/2022]
Abstract
Although NAD(+)-dependent succinate semialdehyde dehydrogenase activity was first described in Escherichia coli more than 25 years ago, the responsible gene has remained elusive so far. As an experimental proof of concept for a gap-filling algorithm for metabolic networks developed earlier, we demonstrate here that the E. coli gene yneI is responsible for this activity. Our biochemical results demonstrate that the yneI-encoded succinate semialdehyde dehydrogenase can use either NAD(+) or NADP(+) to oxidize succinate semialdehyde to succinate. The gene is induced by succinate semialdehyde, and expression data indicate that yneI plays a unique physiological role in the general nitrogen metabolism of E. coli. In particular, we demonstrate using mutant growth experiments that the yneI gene has an important, but not essential, role during growth on arginine and probably has an essential function during growth on putrescine as the nitrogen source. The NADP(+)-dependent succinate semialdehyde dehydrogenase activity encoded by the functional homolog gabD appears to be important for nitrogen metabolism under N limitation conditions. The yneI-encoded activity, in contrast, functions primarily as a valve to prevent toxic accumulation of succinate semialdehyde. Analysis of available genome sequences demonstrated that orthologs of both yneI and gabD are broadly distributed across phylogenetic space.
Affiliation(s)
- Tobias Fuhrer
- Institute of Molecular Systems Biology, ETH Zurich, CH-8093 Zurich, Switzerland
|
23
|
Tsoka S. Computational methodologies for genome evolution and functional association. Comput Chem Eng 2007. [DOI: 10.1016/j.compchemeng.2006.11.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
24
|
Brown DP, Krishnamurthy N, Sjölander K. Automated protein subfamily identification and classification. PLoS Comput Biol 2007; 3:e160. [PMID: 17708678 PMCID: PMC1950344 DOI: 10.1371/journal.pcbi.0030160] [Citation(s) in RCA: 96] [Impact Index Per Article: 5.6] [Received: 02/17/2006] [Accepted: 06/25/2007] [Indexed: 11/22/2022]
Abstract
Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at http://phylogenomics.berkeley.edu/SCI-PHY/ allows users to upload a multiple sequence alignment for subfamily identification and subfamily HMM construction. 
Biologists wishing to provide their own subfamily definitions can do so. Source code is available on the Web page. The Berkeley Phylogenomics Group PhyloFacts resource contains pre-calculated subfamily predictions and subfamily HMMs for more than 40,000 protein families and domains at http://phylogenomics.berkeley.edu/phylofacts/.
Affiliation(s)
- Duncan P Brown
- Department of Bioengineering, University of California, Berkeley, California, United States of America
- Nandini Krishnamurthy
- Department of Bioengineering, University of California, Berkeley, California, United States of America
- Kimmen Sjölander
- Department of Bioengineering, University of California, Berkeley, California, United States of America
|
25
|
Affiliation(s)
- Dmitrij Frishman
- Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany
|
26
|
Quantitative assessment of relationship between sequence similarity and function similarity. BMC Genomics 2007; 8:222. [PMID: 17620139 PMCID: PMC1949826 DOI: 10.1186/1471-2164-8-222] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.4] [Received: 07/05/2006] [Accepted: 07/09/2007] [Indexed: 11/16/2022]
Abstract
Background Comparative sequence analysis is considered the first step towards annotating new proteins in genome annotation. However, sequence comparison may lead to creation and propagation of function assignment errors. Thus, it is important to perform a thorough analysis of the quality of sequence-based function assignment using large-scale data in a systematic way. Results We present an analysis of the relationship between sequence similarity and function similarity for the proteins in four model organisms, i.e., Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster. Using a measure of functional similarity based on the three categories of Gene Ontology (GO) classifications (biological process, molecular function, and cellular component), we quantified the correlation between functional similarity and sequence similarity measured by sequence identity or statistical significance of the alignment and compared such a correlation against randomly chosen protein pairs. Conclusion Various sequence-function relationships were identified from BLAST versus PSI-BLAST, sequence identity versus Expectation Value, GO indices versus semantic similarity approaches, and within genome versus between genome comparisons, for the three GO categories. Our study provides a benchmark to estimate the confidence in assignment of functions purely based on sequence similarity.
|
27
|
Kalia VC, Rani A, Lal S, Cheema S, Raut CP. Combing databases reveals potential antibiotic producers. Expert Opin Drug Discov 2007; 2:211-24. [DOI: 10.1517/17460441.2.2.211] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Indexed: 11/05/2022]
|
28
|
Wishart DS. Discovering drug targets through the web. COMPARATIVE BIOCHEMISTRY AND PHYSIOLOGY D-GENOMICS & PROTEOMICS 2006; 2:9-17. [PMID: 20483274 DOI: 10.1016/j.cbd.2006.01.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 11/22/2005] [Revised: 01/28/2006] [Accepted: 01/30/2006] [Indexed: 11/25/2022]
Abstract
Traditionally, drug-target discovery is a "wet-bench" experimental process, depending on carefully designed genetic screens, biochemical tests and cellular assays to identify proteins and genes that are associated with a particular disease or condition. However, recent advances in DNA sequencing, transcript profiling, protein identification and protein quantification are leading to a flood of genomic and proteomic data that is, or potentially could be, linked to disease data. The quantity of data generated by these high-throughput methods is forcing scientists to re-think the way they do traditional drug-target discovery. In particular, it is leading them more and more towards identifying potential drug targets using computers. In fact, drug-target identification is now being done as much on the desk-top as on the bench-top. This review focuses on describing how drug-target discovery can be done in silico (i.e. via computer) using a variety of bioinformatic resources that are freely available on the web. Specifically, it highlights a number of web-accessible sequence databases, automated genome annotation tools, text mining tools, and integrated drug/sequence databases that can be used to identify drug targets for both endogenous (genetic and epigenetic) diseases as well as exogenous (infectious) diseases.
Affiliation(s)
- David S Wishart
- Departments of Computing Science and Biological Sciences, University of Alberta, Edmonton, AB, Canada T6G 2E8
|
29
|
Zhou CLE, Lam MW, Smith JR, Zemla AT, Dyer MD, Kuczmarski TA, Vitalis EA, Slezak TR. MannDB - a microbial database of automated protein sequence analyses and evidence integration for protein characterization. BMC Bioinformatics 2006; 7:459. [PMID: 17044936 PMCID: PMC1622758 DOI: 10.1186/1471-2105-7-459] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 06/06/2006] [Accepted: 10/17/2006] [Indexed: 11/16/2022]
Abstract
Background MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. Description MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins) are downloaded and parsed from Genbank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the Genbank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. 
Conclusion MannDB comprises a large number of genomes and comprehensive protein sequence analyses representing organisms listed as high-priority agents on the websites of several governmental organizations concerned with bio-terrorism. MannDB provides the user with a BLAST interface for comparison of native and non-native sequences and a query tool for conveniently selecting proteins of interest. In addition, the user has access to a web-based browser that compiles comprehensive and extensive reports. Access to MannDB is freely available at .
Affiliation(s)
- Carol L Ecale Zhou
- Lawrence Livermore National Laboratory, Pathogen Bio-informatics, Livermore, CA, USA
- Marisa W Lam
- Lawrence Livermore National Laboratory, Pathogen Bio-informatics, Livermore, CA, USA
- Jason R Smith
- Lawrence Livermore National Laboratory, Pathogen Bio-informatics, Livermore, CA, USA
- Adam T Zemla
- Lawrence Livermore National Laboratory, Pathogen Bio-informatics, Livermore, CA, USA
- Matthew D Dyer
- Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
- Thomas A Kuczmarski
- Lawrence Livermore National Laboratory, Pathogen Bio-informatics, Livermore, CA, USA
- Elizabeth A Vitalis
- Lawrence Livermore National Laboratory, Pathogen Bio-informatics, Livermore, CA, USA
- Thomas R Slezak
- Lawrence Livermore National Laboratory, Pathogen Bio-informatics, Livermore, CA, USA
|
30
|
Ouyang Z, Isaacson R. Identification and characterization of a novel ABC iron transport system, fit, in Escherichia coli. Infect Immun 2006; 74:6949-56. [PMID: 16982838 PMCID: PMC1698097 DOI: 10.1128/iai.00866-06] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Indexed: 11/20/2022]
Abstract
A putative ABC transporter, fit, with significant homology to several bacterial iron transporters was identified in Escherichia coli. The E. coli fit system consists of six genes designated fitA, -B, -C, -D, -E, and -R. Based on DNA sequence analysis, fit encodes an outer membrane protein (FitA), a periplasmic binding protein (FitE), two permease proteins (FitC and -D), an ATPase (FitB), and a hypothetical protein (FitR). Introduction of the E. coli fit system into E. coli strain K-12 increased intracellular iron content and transformed bacteria were more sensitive to streptonigrin, which suggested that fit transports iron in E. coli. Expression of fit was studied using a lacZ reporter assay. A functional, bidirectional promoter was identified in the intergenic region between genes fitA and fitB. The expression of the E. coli fit system was found to be induced by iron limitation and repressed when Fe(2+) was added to minimal medium. Several fit mutants were created in E. coli using an in vitro transposon mutagenesis strategy. Mutations in fit did not affect bacterial growth in iron-restricted media. Using a growth promotion test, it was found that fit was not able to transport enterobactin, ferrichrome, transferrin, and lactoferrin in E. coli.
Affiliation(s)
- Zhiming Ouyang
- Department of Veterinary and Biomedical Sciences, University of Minnesota, 1971 Commonwealth Avenue, St. Paul, MN 55108, USA
|
31
|
Leontovich AM, Tokmachev KY. Ways to improve the prediction quality in the adaptive algorithm of automated annotation (A4). Biophysics (Nagoya-shi) 2006. [DOI: 10.1134/s0006350906040038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
32
|
Bryson K, Loux V, Bossy R, Nicolas P, Chaillou S, van de Guchte M, Penaud S, Maguin E, Hoebeke M, Bessières P, Gibrat JF. AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system. Nucleic Acids Res 2006; 34:3533-45. [PMID: 16855290 PMCID: PMC1524909 DOI: 10.1093/nar/gkl471] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.4] [Indexed: 11/23/2022]
Abstract
We have implemented a genome annotation system for prokaryotes called AGMIAL. Our approach embodies a number of key principles. First, expert manual annotators are seen as a critical component of the overall system; user interfaces were cyclically refined to satisfy their needs. Second, the overall process should be orchestrated in terms of a global annotation strategy; this facilitates coordination between a team of annotators and automatic data analysis. Third, the annotation strategy should allow progressive and incremental annotation from a time when only a few draft contigs are available, to when a final finished assembly is produced. The overall architecture employed is modular and extensible, being based on the W3 standard Web services framework. Specialized modules interact with two independent core modules that are used to annotate, respectively, genomic and protein sequences. AGMIAL is currently being used by several INRA laboratories to analyze genomes of bacteria relevant to the food-processing industry, and is distributed under an open source license.
Affiliation(s)
- S. Chaillou
- Flore Lactique et Environnement Carné, INRA, 78352 Jouy-en-Josas Cedex, France
- S. Penaud
- Génétique Microbienne, INRA, 78352 Jouy-en-Josas Cedex, France
- E. Maguin
- Génétique Microbienne, INRA, 78352 Jouy-en-Josas Cedex, France
- J-F Gibrat
- To whom correspondence should be addressed. Tel: +33 1 34 65 28 97; Fax: +33 1 34 65 29 01; E-mail:
|
33
|
Chiu SH, Chen CC, Yuan GF, Lin TH. Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences. BMC Bioinformatics 2006; 7:304. [PMID: 16776838 PMCID: PMC1552092 DOI: 10.1186/1471-2105-7-304] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 12/06/2005] [Accepted: 06/15/2006] [Indexed: 11/16/2022]
Abstract
Background The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying fungus genes. The approach involves elucidating the enzyme classifying rule that is hidden in the UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. Results Five datasets were collected from Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset, which lacks an enzyme class description, was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. Conclusion The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may also be employed by protein annotators in manual annotation or implemented in an automatic annotation flowchart.
Affiliation(s)
- Shih-Hau Chiu
- Bioresource Collection and Research Center, Food Industry Research and Development Institute, HsinChu, Taiwan
- Institute of Molecular Medicine/Department of Life Science, National Tsing Hua University, HsinChu, Taiwan
- Chien-Chi Chen
- Bioresource Collection and Research Center, Food Industry Research and Development Institute, HsinChu, Taiwan
- Gwo-Fang Yuan
- Bioresource Collection and Research Center, Food Industry Research and Development Institute, HsinChu, Taiwan
- Thy-Hou Lin
- Institute of Molecular Medicine/Department of Life Science, National Tsing Hua University, HsinChu, Taiwan
|
34
|
Arakawa K, Yamada Y, Shinoda K, Nakayama Y, Tomita M. GEM System: automatic prototyping of cell-wide metabolic pathway models from genomes. BMC Bioinformatics 2006; 7:168. [PMID: 16553966 PMCID: PMC1435936 DOI: 10.1186/1471-2105-7-168] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Received: 10/09/2005] [Accepted: 03/23/2006] [Indexed: 11/18/2022]
Abstract
Background Successful realization of a "systems biology" approach to analyzing cells is a grand challenge for our understanding of life. However, current modeling approaches to cell simulation are labor-intensive, manual affairs, and therefore constitute a major bottleneck in the evolution of computational cell biology. Results We developed the Genome-based Modeling (GEM) System for the purpose of automatically prototyping simulation models of cell-wide metabolic pathways from genome sequences and other public biological information. Models generated by the GEM System include an entire Escherichia coli metabolism model comprising 968 reactions of 1195 metabolites, achieving 100% coverage when compared with the KEGG database, 92.38% with the EcoCyc database, and 95.06% with iJR904 genome-scale model. Conclusion The GEM System prototypes qualitative models to reduce the labor-intensive tasks required for systems biology research. Models of over 90 bacterial genomes are available at our web site.
Affiliation(s)
- Kazuharu Arakawa
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
- Yohei Yamada
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
- Kosaku Shinoda
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
- Yoichi Nakayama
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
- Masaru Tomita
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
|
35
|
Vinayagam A, del Val C, Schubert F, Eils R, Glatting KH, Suhai S, König R. GOPET: a tool for automated predictions of Gene Ontology terms. BMC Bioinformatics 2006; 7:161. [PMID: 16549020 PMCID: PMC1434778 DOI: 10.1186/1471-2105-7-161] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.3] [Received: 10/07/2005] [Accepted: 03/20/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Vast progress in sequencing projects has called for annotation on a large scale. A number of methods have been developed to address this challenging task. These methods, however, either apply to specific subsets, or their predictions are not formalised, or they do not provide precise confidence values for their predictions. DESCRIPTION We recently established a learning system for automated annotation, trained with a broad variety of different organisms to predict the standardised annotation terms from Gene Ontology (GO). Now, this method has been made available to the public via our web-service GOPET (Gene Ontology term Prediction and Evaluation Tool). It supplies annotation for sequences of any organism. For each predicted term an appropriate confidence value is provided. The basic method had been developed for predicting molecular function GO-terms. It is now expanded to predict biological process terms. This web service is available via http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar CONCLUSION Our web service gives experimental researchers as well as the bioinformatics community a valuable sequence annotation device. Additionally, GOPET also provides less significant annotation data which may serve as an extended discovery platform for the user.
Affiliation(s)
- Arunachalam Vinayagam
- Department of Molecular Biophysics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
| | - Coral del Val
- Department of Molecular Biophysics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
| | - Falk Schubert
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
| | - Roland Eils
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
- Department of Bioinformatics and Functional Genomics, Institute for Pharmacy and Molecular Biotechnology, University of Heidelberg, 69120 Heidelberg, Germany
| | - Karl-Heinz Glatting
- Department of Molecular Biophysics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
| | - Sándor Suhai
- Department of Molecular Biophysics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
| | - Rainer König
- Division of Theoretical Bioinformatics, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69121 Heidelberg, Germany
- Department of Bioinformatics and Functional Genomics, Institute for Pharmacy and Molecular Biotechnology, University of Heidelberg, 69120 Heidelberg, Germany
| |
|
36
|
Pócsi I, Miskei M, Karányi Z, Emri T, Ayoubi P, Pusztahelyi T, Balla G, Prade RA. Comparison of gene expression signatures of diamide, H2O2 and menadione exposed Aspergillus nidulans cultures--linking genome-wide transcriptional changes to cellular physiology. BMC Genomics 2005; 6:182. [PMID: 16368011 PMCID: PMC1352360 DOI: 10.1186/1471-2164-6-182] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.5] [Received: 09/23/2005] [Accepted: 12/20/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In addition to their cytotoxic nature, reactive oxygen species (ROS) are also signal molecules in diverse cellular processes in eukaryotic organisms. Linking genome-wide transcriptional changes to cellular physiology in oxidative stress-exposed Aspergillus nidulans cultures provides the opportunity to estimate the sizes of peroxide (O2(2-)), superoxide (O2*-) and glutathione/glutathione disulphide (GSH/GSSG) redox imbalance responses. RESULTS Genome-wide transcriptional changes triggered by diamide, H2O2 and menadione in A. nidulans vegetative tissues were recorded using DNA microarrays containing 3533 unique PCR-amplified probes. Evaluation of LOESS-normalized data indicated that 2499 gene probes were affected by at least one stress-inducing agent. The stresses induced by diamide and H2O2 were pulse-like, with recovery after 1 h exposure time, while no recovery was observed with menadione. The distribution of stress-responsive gene probes among major physiological functional categories was approximately the same for each agent. The gene group sizes solely responsive to changes in intracellular O2(2-), O2*- concentrations or to GSH/GSSG redox imbalance were estimated at 7.7, 32.6 and 13.0%, respectively. Gene groups responsive to diamide, H2O2 and menadione treatments and gene groups influenced by GSH/GSSG, O2(2-) and O2*- were only partly overlapping, with distinct enrichment profiles within functional categories. Changes in the GSH/GSSG redox state influenced expression of genes coding for PBS2-like MAPK kinase homologue, PSK2 kinase homologue, AtfA transcription factor, and many elements of ubiquitin tagging, cell division cycle regulators, translation machinery proteins, defense and stress proteins, transport proteins as well as many enzymes of the primary and secondary metabolisms. Meanwhile, a separate set of genes encoding transport proteins, CpcA and JlbA amino acid starvation-responsive transcription factors, and some elements of sexual development and sporulation was ROS responsive. CONCLUSION The existence of separate O2(2-), O2*- and GSH/GSSG responsive gene groups in a eukaryotic genome has been demonstrated. Oxidant-triggered, genome-wide transcriptional changes should be analyzed considering changes in oxidative stress-responsive physiological conditions and not correlating them directly to the chemistry and concentrations of the oxidative stress-inducing agent.
Affiliation(s)
- István Pócsi
- Department of Microbiology and Biotechnology, Faculty of Science, University of Debrecen, P.O.Box 63, H-4010 Debrecen, Hungary
| | - Márton Miskei
- Department of Microbiology and Biotechnology, Faculty of Science, University of Debrecen, P.O.Box 63, H-4010 Debrecen, Hungary
| | - Zsolt Karányi
- Department of Medicine, Faculty of Medicine, University of Debrecen, P.O. Box 19, H-4012 Debrecen, Hungary
| | - Tamás Emri
- Department of Microbiology and Biotechnology, Faculty of Science, University of Debrecen, P.O.Box 63, H-4010 Debrecen, Hungary
| | - Patricia Ayoubi
- Department of Biochemistry and Molecular Biology, Oklahoma State University, 348E Noble Research Center, Stillwater, OK 74078, USA
| | - Tünde Pusztahelyi
- Department of Microbiology and Biotechnology, Faculty of Science, University of Debrecen, P.O.Box 63, H-4010 Debrecen, Hungary
| | - György Balla
- Department of Neonatology, Faculty of Medicine, University of Debrecen, P.O.Box 37; H-4012 Debrecen, Hungary
| | - Rolf A Prade
- Department of Microbiology and Molecular Genetics, Oklahoma State University, 307 LSE, Stillwater, OK 74078, USA
| |
|
37
|
Levy ED, Ouzounis CA, Gilks WR, Audit B. Probabilistic annotation of protein sequences based on functional classifications. BMC Bioinformatics 2005; 6:302. [PMID: 16354297 PMCID: PMC1361783 DOI: 10.1186/1471-2105-6-302] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 05/20/2005] [Accepted: 12/14/2005] [Indexed: 11/17/2022] Open
Abstract
Background One of the most evident achievements of bioinformatics is the development of methods that transfer biological knowledge from characterised proteins to uncharacterised sequences. This mode of protein function assignment is mostly based on the detection of sequence similarity and the premise that functional properties are conserved during evolution. Most automatic approaches developed to date rely on the identification of clusters of homologous proteins and the mapping of new proteins onto these clusters, which are expected to share functional characteristics. Results Here, we invert the logic of this process, by considering the mapping of sequences directly to a functional classification instead of mapping functions to a sequence clustering. In this mode, the starting point is a database of proteins labelled according to a functional classification scheme, and the subsequent use of sequence similarity allows defining the membership of new proteins to these functional classes. In this framework, we define the Correspondence Indicators as measures of relationship between sequence and function and further formulate two Bayesian approaches to estimate the probability for a sequence of unknown function to belong to a functional class. This approach allows the parametrisation of different sequence search strategies and provides a direct measure of annotation error rates. We validate this approach with a database of enzymes labelled by their corresponding four-digit EC numbers and analyse specific cases. Conclusion The performance of this method is significantly higher than the simple strategy of transferring the annotation from the highest scoring BLAST match and is expected to find applications in automated functional annotation pipelines.
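The Bayesian step described in this abstract can be illustrated with a minimal sketch. It applies plain Bayes' rule to class priors and to the likelihood of some observed similarity evidence under each class. It does not reproduce the paper's Correspondence Indicators or either of its two formulations; the function name, the EC labels and all numbers are hypothetical.

```python
def posterior_class_probs(priors, likelihoods):
    """Bayes' rule: P(class | evidence) is proportional to P(evidence | class) * P(class).

    priors:      {class_label: prior probability of the class}
    likelihoods: {class_label: probability of the observed evidence given the class}
    """
    joint = {c: priors[c] * likelihoods.get(c, 0.0) for c in priors}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every class")
    return {c: p / total for c, p in joint.items()}


# Hypothetical numbers: the rarer class explains the observed hit much better.
priors = {"EC 1.1.1.1": 0.2, "EC 1.1.1.2": 0.8}
likelihoods = {"EC 1.1.1.1": 0.9, "EC 1.1.1.2": 0.1}
post = posterior_class_probs(priors, likelihoods)
```

Unlike a top-BLAST-hit transfer, the posterior gives an explicit probability for each candidate class, which is what allows an annotation error rate to be read off directly.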
Affiliation(s)
- Emmanuel D Levy
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK
- Computational Genomics Group, MRC Laboratory of Molecular Biology, Hills Rd, Cambridge CB2 2QH, UK
| | - Christos A Ouzounis
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK
| | - Walter R Gilks
- Medical Research Council Biostatistics Unit, Institute of Public Health, Cambridge CB2 2SR, UK
| | - Benjamin Audit
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK
- Laboratoire Joliot-Curie and Laboratoire de Physique, CNRS UMR5672, Ecole Normale Supérieure, 46 Allée d'Italie, 69364 Lyon Cedex 07, France
| |
|
38
|
Goldovsky L, Janssen P, Ahrén D, Audit B, Cases I, Darzentas N, Enright AJ, López-Bigas N, Peregrin-Alvarez JM, Smith M, Tsoka S, Kunin V, Ouzounis CA. CoGenT++: an extensive and extensible data environment for computational genomics. Bioinformatics 2005; 21:3806-10. [PMID: 16216832 DOI: 10.1093/bioinformatics/bti579] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION CoGenT++ is a data environment for computational research in comparative and functional genomics, designed to address issues of consistency, reproducibility, scalability and accessibility. DESCRIPTION CoGenT++ facilitates the re-distribution of all fully sequenced and published genomes, storing information about species, gene names and protein sequences. We describe our scalable implementation of ProXSim, a continually updated all-against-all similarity database, which stores pairwise relationships between all genome sequences. Based on these similarities, derived databases are generated for gene fusions--AllFuse, putative orthologs--OFAM, protein families--TRIBES, phylogenetic profiles--ProfUse and phylogenetic trees. Extensions based on the CoGenT++ environment include disease gene prediction, pattern discovery, automated domain detection, genome annotation and ancestral reconstruction. CONCLUSION CoGenT++ provides a comprehensive environment for computational genomics, accessible primarily for large-scale analyses as well as manual browsing.
Affiliation(s)
- Leon Goldovsky
- Computational Genomics Group, The European Bioinformatics Institute EMBL, Cambridge Outstation, Cambridge CB10 1SD, UK
|
39
|
Fujita A, Massirer KB, Durham AM, Ferreira CE, Sogayar MC. The GATO gene annotation tool for research laboratories. Braz J Med Biol Res 2005; 38:1571-4. [PMID: 16258624 DOI: 10.1590/s0100-879x2005001100002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/22/2022] Open
Abstract
Large-scale genome projects have generated a rapidly increasing number of DNA sequences. Therefore, development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO) is a bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from anywhere through the internet or may be run locally if a large number of sequences are going to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Usually, installation and application of annotation systems require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to access it without any special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. The minimum free disk space required is 2 MB.
Affiliation(s)
- A Fujita
- Instituto de Química, Universidade de São Paulo, São Paulo, SP, Brazil
|
40
|
Yu GX, Glass EM, Karonis NT, Maltsev N. Knowledge-based voting algorithm for automated protein functional annotation. Proteins 2005; 61:907-17. [PMID: 16252283 DOI: 10.1002/prot.20652] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
Automated annotation of high-throughput genome sequences is one of the earliest steps toward a comprehensive understanding of the dynamic behavior of living organisms. However, the step is often error-prone because of its underlying algorithms, which rely mainly on a simple similarity analysis, and lack of guidance from biological rules. We present herein a knowledge-based protein annotation algorithm. Our objectives are to reduce errors and to improve annotation confidences. This algorithm consists of two major components: a knowledge system, called "RuleMiner," and a voting procedure. The knowledge system, which includes biological rules and functional profiles for each function, provides a platform for seamless integration of multiple sequence analysis tools and guidance for function annotation. The voting procedure, which relies on the knowledge system, is designed to make (possibly) unbiased judgments in functional assignments among complicated, sometimes conflicting, information. We have applied this algorithm to 10 prokaryotic bacterial genomes and observed a significant improvement in annotation confidences. We also discuss the current limitations of the algorithm and the potential for future improvement.
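The idea of voting among several, possibly conflicting, analysis tools can be sketched as a weighted tally. This is an illustration only: it does not model RuleMiner's biological rules or functional profiles, and the tool names, weights and labels below are hypothetical.

```python
from collections import defaultdict


def vote(predictions, weights):
    """Pick the function label with the largest total weight.

    predictions: {tool_name: predicted function label}
    weights:     {tool_name: reliability weight}; tools without a weight count as 1.0
    Returns (winning label, share of the total weight that backed it).
    """
    tally = defaultdict(float)
    for tool, label in predictions.items():
        tally[label] += weights.get(tool, 1.0)
    winner = max(tally, key=tally.get)
    confidence = tally[winner] / sum(tally.values())
    return winner, confidence


# Hypothetical tools: two agree on "kinase", one disagrees.
predictions = {"blast": "kinase", "pfam": "kinase", "blocks": "phosphatase"}
weights = {"blast": 1.0, "pfam": 2.0, "blocks": 1.5}
label, confidence = vote(predictions, weights)
```

In a knowledge-based setting the weights would come from the rule system rather than being fixed by hand; here they are constants so the arithmetic is easy to follow.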
Affiliation(s)
- G X Yu
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois, USA.
|
41
|
Engelhardt BE, Jordan MI, Muratore KE, Brenner SE. Protein molecular function prediction by Bayesian phylogenomics. PLoS Comput Biol 2005; 1:e45. [PMID: 16217548 PMCID: PMC1246806 DOI: 10.1371/journal.pcbi.0010045] [Citation(s) in RCA: 146] [Impact Index Per Article: 7.7] [Received: 05/04/2005] [Accepted: 08/29/2005] [Indexed: 11/19/2022] Open
Abstract
We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5'-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.
Affiliation(s)
- Barbara E Engelhardt
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, United States of America.
|
42
|
Valencia A. Automatic annotation of protein function. Curr Opin Struct Biol 2005; 15:267-74. [PMID: 15922590 DOI: 10.1016/j.sbi.2005.05.010] [Citation(s) in RCA: 85] [Impact Index Per Article: 4.5] [Received: 04/03/2005] [Revised: 04/29/2005] [Accepted: 05/10/2005] [Indexed: 11/22/2022]
Abstract
The annotation of protein function at genomic scale is essential for day-to-day work in biology and for any systematic approach to the modeling of biological systems. Currently, functional annotation is essentially based on the expansion of the relatively small number of experimentally determined functions to large collections of proteins. The task of systematic annotation faces formidable practical problems related to the accuracy of the input experimental information, the reliability of current systems for transferring information between related sequences, and the reproducibility of the links between database information and the original experiments reported in publications. These technical difficulties merely lie on the surface of the deeper problem of the evolution of protein function in the context of protein sequences and structures. Given the mixture of technical and scientific challenges, it is not surprising that errors are introduced, and expanded, in database annotations. In this situation, a more realistic option is the development of a reliability index for database annotations, instead of depending exclusively on efforts to correct databases. Several groups have attempted to compare the database annotations of similar proteins, which constitutes the first steps toward the calibration of the relationship between sequence and annotation space.
Affiliation(s)
- Alfonso Valencia
- Protein Design Group, National Center for Biotechnology, CNB-CSIC, Darwin 3, Cantoblanco, 28049 Madrid, Spain.
| |
|
43
|
Van Domselaar GH, Stothard P, Shrivastava S, Cruz JA, Guo A, Dong X, Lu P, Szafron D, Greiner R, Wishart DS. BASys: a web server for automated bacterial genome annotation. Nucleic Acids Res 2005; 33:W455-9. [PMID: 15980511 PMCID: PMC1160269 DOI: 10.1093/nar/gki593] [Citation(s) in RCA: 258] [Impact Index Per Article: 13.6] [Indexed: 11/14/2022] Open
Abstract
BASys (Bacterial Annotation System) is a web server that supports automated, in-depth annotation of bacterial genomic (chromosomal and plasmid) sequences. It accepts raw DNA sequence data and an optional list of gene identification information and provides extensive textual annotation and hyperlinked image output. BASys uses >30 programs to determine approximately 60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways. The depth and detail of a BASys annotation matches or exceeds that found in a standard SwissProt entry. BASys also generates colorful, clickable and fully zoomable maps of each query chromosome to permit rapid navigation and detailed visual analysis of all resulting gene annotations. The textual annotations and images that are provided by BASys can be generated in approximately 24 h for an average bacterial chromosome (5 Mb). BASys annotations may be viewed and downloaded anonymously or through a password protected access system. The BASys server and databases can also be downloaded and run locally. BASys is accessible at http://wishart.biology.ualberta.ca/basys.
Affiliation(s)
- David S. Wishart
- To whom correspondence should be addressed. Tel: +1 780 492 0383; Fax: +1 780 492 1071
| |
|
44
|
Oliveira AP, Nielsen J, Förster J. Modeling Lactococcus lactis using a genome-scale flux model. BMC Microbiol 2005; 5:39. [PMID: 15982422 PMCID: PMC1185544 DOI: 10.1186/1471-2180-5-39] [Citation(s) in RCA: 166] [Impact Index Per Article: 8.7] [Received: 04/01/2005] [Accepted: 06/27/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Genome-scale flux models are useful tools to represent and analyze microbial metabolism. In this work we reconstructed the metabolic network of the lactic acid bacterium Lactococcus lactis and developed a genome-scale flux model able to simulate and analyze network capabilities and whole-cell function under aerobic and anaerobic continuous cultures. Flux balance analysis (FBA) and minimization of metabolic adjustment (MOMA) were used as modeling frameworks. RESULTS The metabolic network was reconstructed using the annotated genome sequence from L. lactis ssp. lactis IL1403 together with physiological and biochemical information. The established network comprised a total of 621 reactions and 509 metabolites, representing the overall metabolism of L. lactis. Experimental data reported in the literature were used to fit the model to phenotypic observations. Regulatory constraints had to be included to simulate certain metabolic features, such as the shift from homo to heterolactic fermentation. A minimal medium for in silico growth was identified, indicating the requirement of four amino acids in addition to a sugar. Remarkably, de novo biosynthesis of four other amino acids was observed even when all amino acids were supplied, which is in good agreement with experimental observations. Additionally, enhanced metabolic engineering strategies for improved diacetyl producing strains were designed. CONCLUSION The L. lactis metabolic network can now be used for a better understanding of lactococcal metabolic capabilities and potential, for the design of enhanced metabolic engineering strategies and for integration with other types of 'omic' data, to assist in finding new information on cellular organization and function.
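For orientation, the two modeling frameworks named in this abstract are usually written in the following standard textbook form. The symbols are assumptions introduced here, not notation from the paper: S is the stoichiometric matrix (509 metabolites by 621 reactions for the network described), v the flux vector, c the objective weights (typically selecting the biomass reaction), and v_wt a reference wild-type flux distribution.

```latex
% Flux balance analysis: a linear program over steady-state fluxes
\max_{v}\; c^{\top} v
\quad \text{subject to} \quad
S\,v = 0, \qquad v^{\min} \le v \le v^{\max}

% MOMA: a quadratic program that keeps a perturbed strain close to the wild type
\min_{v}\; \lVert v - v^{\mathrm{wt}} \rVert_2^{2}
\quad \text{subject to} \quad
S\,v = 0, \qquad v^{\min} \le v \le v^{\max}
```

Regulatory constraints of the kind the abstract mentions would typically enter as tightened bounds on individual fluxes under specific conditions.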
Affiliation(s)
- Ana Paula Oliveira
- Fluxome Sciences A/S, Søltofts Plads, Building 223, DK-2800 Kgs. Lyngby, Denmark
| | - Jens Nielsen
- Fluxome Sciences A/S, Søltofts Plads, Building 223, DK-2800 Kgs. Lyngby, Denmark
- Center for Microbial Biotechnology, BioCentrum-DTU, Technical University of Denmark, Building 223, DK-2800 Kgs. Lyngby, Denmark
| | - Jochen Förster
- Fluxome Sciences A/S, Søltofts Plads, Building 223, DK-2800 Kgs. Lyngby, Denmark
| |
|
45
|
Koski LB, Gray MW, Lang BF, Burger G. AutoFACT: an automatic functional annotation and classification tool. BMC Bioinformatics 2005; 6:151. [PMID: 15960857 PMCID: PMC1182349 DOI: 10.1186/1471-2105-6-151] [Citation(s) in RCA: 173] [Impact Index Per Article: 9.1] [Received: 03/02/2005] [Accepted: 06/16/2005] [Indexed: 11/22/2022] Open
Abstract
Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at .
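Feature (2) above, choosing the most informative description from several BLAST reports, can be sketched as follows. This is a simplified illustration, not AutoFACT's actual procedure: the list of uninformative terms, the E-value cutoff, the database names and the fallback label are all assumptions made for the example.

```python
# Hypothetical list of terms that mark a description as uninformative.
UNINFORMATIVE = ("hypothetical", "unknown", "unnamed", "uncharacterized", "predicted protein")


def is_informative(description):
    """True if the description contains none of the uninformative terms."""
    text = description.lower()
    return not any(term in text for term in UNINFORMATIVE)


def best_description(hits_by_db, db_priority, max_evalue=1e-5):
    """Return (database, description) for the first informative, significant hit.

    hits_by_db:  {database: [(description, evalue), ...]} with hits in rank order
    db_priority: databases in the order the user wants them consulted
    """
    for db in db_priority:
        for description, evalue in hits_by_db.get(db, []):
            if evalue <= max_evalue and is_informative(description):
                return db, description
    return None, "unassigned protein"


# Hypothetical BLAST results for one query sequence.
hits_by_db = {
    "nr": [("hypothetical protein", 1e-50), ("ATP synthase subunit alpha", 1e-40)],
    "uniref": [("ATP synthase alpha chain", 1e-45)],
}
choice = best_description(hits_by_db, ["uniref", "nr"])
fallback = best_description(hits_by_db, ["nr"])
```

With only the `nr` report, a top-hit strategy would return "hypothetical protein"; skipping uninformative hits is what raises the share of functionally useful annotations.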
Affiliation(s)
- Liisa B Koski
- Robert-Cedergren Center for Bioinformatics and Genomics, Université de Montréal, Montréal, Quebec, Canada
| | - Michael W Gray
- Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
| | - B Franz Lang
- Robert-Cedergren Center for Bioinformatics and Genomics, Université de Montréal, Montréal, Quebec, Canada
| | - Gertraud Burger
- Robert-Cedergren Center for Bioinformatics and Genomics, Université de Montréal, Montréal, Quebec, Canada
| |
|
46
|
Ferrer E, González LM, Foster-Cuevas M, Cortéz MM, Dávila I, Rodríguez M, Sciutto E, Harrison LJS, Parkhouse RME, Gárate T. Taenia solium: characterization of a small heat shock protein (Tsol-sHSP35.6) and its possible relevance to the diagnosis and pathogenesis of neurocysticercosis. Exp Parasitol 2005; 110:1-11. [PMID: 15884156 DOI: 10.1016/j.exppara.2004.11.014] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.7] [Indexed: 11/26/2022]
Abstract
A cDNA encoding a predicted small heat shock protein (sHSP), Tsol-sHSP35.6, has been isolated by antibody screening of a Taenia solium cDNA library. The clone was a full-length sequence (1172 bp) with an open reading frame of 945 bp and encoded a 314 amino acid protein with a deduced molecular mass of 35.6 kDa, an isoelectric point of 5.6 and the characteristic HSP20/alpha-crystallin domain in duplicate. It was highly conserved, with a high sequence similarity with other platyhelminth sHSPs. Western blot analysis, using serum from neurocysticercosis (NCC) patients, indicated that the purified Tsol-sHSP35.6 expression product was immunogenic, while in indirect ELISA, using the purified Tsol-sHSP35.6 expression product as antigen and serum samples from pigs and humans, 80% of T. solium infected pigs and 84% of patients with active, or 71% of patients with inactive NCC were sero-positive. The possible relevance of Tsol-sHSP35.6 in the diagnosis and pathogenesis of NCC is discussed.
MESH Headings
- Amino Acid Sequence
- Animals
- Antibodies, Helminth/blood
- Antibodies, Helminth/immunology
- Antigens, Helminth/chemistry
- Antigens, Helminth/genetics
- Antigens, Helminth/immunology
- Base Sequence
- Blotting, Western
- DNA, Complementary/chemistry
- DNA, Complementary/isolation & purification
- Electrophoresis, Polyacrylamide Gel
- Enzyme-Linked Immunosorbent Assay
- Heat-Shock Proteins/chemistry
- Heat-Shock Proteins/genetics
- Heat-Shock Proteins/immunology
- Humans
- Immune Sera/immunology
- Isoelectric Point
- Molecular Sequence Data
- Molecular Weight
- Neurocysticercosis/diagnosis
- Neurocysticercosis/parasitology
- Open Reading Frames/genetics
- Rabbits
- Recombinant Proteins/chemistry
- Recombinant Proteins/genetics
- Recombinant Proteins/immunology
- Sensitivity and Specificity
- Sequence Alignment
- Sequence Homology, Amino Acid
- Swine
- Taenia solium/chemistry
- Taenia solium/genetics
- Taenia solium/immunology
Affiliation(s)
- Elizabeth Ferrer
- Instituto de Salud Carlos III, Centro Nacional de Microbiología, Majadahonda, Madrid, Spain
|
47
|
Lu P, Szafron D, Greiner R, Wishart DS, Fyshe A, Pearcy B, Poulin B, Eisner R, Ngo D, Lamb N. PA-GOSUB: a searchable database of model organism protein sequences with their predicted Gene Ontology molecular function and subcellular localization. Nucleic Acids Res 2005; 33:D147-53. [PMID: 15608166 PMCID: PMC540074 DOI: 10.1093/nar/gki120] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022] Open
Abstract
PA-GOSUB (Proteome Analyst: Gene Ontology Molecular Function and Subcellular Localization) is a publicly available, web-based, searchable and downloadable database that contains the sequences, predicted GO molecular functions and predicted subcellular localizations of more than 107,000 proteins from 10 model organisms (and growing), covering the major kingdoms and phyla for which annotated proteomes exist (http://www.cs.ualberta.ca/~bioinfo/PA/GOSUB). The PA-GOSUB database effectively expands the coverage of subcellular localization and GO function annotations by a significant factor (already over five for subcellular localization, compared with Swiss-Prot v42.7), and more model organisms are being added to PA-GOSUB as their sequenced proteomes become available. PA-GOSUB can be used in three main ways. First, a researcher can browse the pre-computed PA-GOSUB annotations on a per-organism and per-protein basis using annotation-based and text-based filters. Second, a user can perform BLAST searches against the PA-GOSUB database and use the annotations from the homologs as simple predictors for the new sequences. Third, the whole of PA-GOSUB can be downloaded in either FASTA or comma-separated values (CSV) formats.
Affiliation(s)
- Paul Lu
- Department of Computing Science, University of Alberta, Edmonton, AB, Canada T6G 2E8.
|
48
|
Kaplan N, Linial M. Automatic detection of false annotations via binary property clustering. BMC Bioinformatics 2005; 6:46. [PMID: 15755318 PMCID: PMC555558 DOI: 10.1186/1471-2105-6-46] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 09/05/2004] [Accepted: 03/08/2005] [Indexed: 11/10/2022] Open
Abstract
Background Computational protein annotation methods occasionally introduce errors. False-positive (FP) errors are annotations that are mistakenly associated with a protein. Such false annotations introduce errors that may spread into databases through similarity with other proteins. Generally, methods used to minimize the chance for FPs result in decreased sensitivity or low throughput. We present a novel protein-clustering method that enables automatic separation of FP from true hits. The method quantifies the biological similarity between pairs of proteins by examining each protein's annotations, and then proceeds by clustering sets of proteins that received similar annotation into biological groups. Results Using a test set of all PROSITE signatures that are marked as FPs, we show that the method successfully separates FPs in 69% of the 327 test cases supplied by PROSITE. Furthermore, we constructed an extensive random FP simulation test and show a high degree of success in detecting FP, indicating that the method is not specifically tuned for PROSITE and performs well on larger scales. We also suggest some means of predicting in which cases this approach would be successful. Conclusion Automatic detection of FPs may greatly facilitate the manual validation process and increase annotation sensitivity. With the increasing number of automatic annotations, the tendency of biological properties to be clustered, once a biological similarity measure is introduced, may become exceedingly helpful in the development of such automatic methods.
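The underlying intuition, that a false hit tends to share few annotations with the true hits of the same signature, can be sketched with a much simpler stand-in than the paper's method. The example below uses Jaccard similarity between annotation keyword sets and a fixed threshold on mean similarity; it is not the binary property clustering described in the abstract, and the protein identifiers, keywords and threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_outliers(annotations, threshold=0.3):
    """Flag proteins whose mean similarity to the other hits falls below threshold.

    annotations: {protein_id: set of annotation keywords}
    """
    flagged = []
    for protein, keywords in annotations.items():
        others = [jaccard(keywords, kw) for other, kw in annotations.items() if other != protein]
        if others and sum(others) / len(others) < threshold:
            flagged.append(protein)
    return flagged


# Hypothetical hits of one signature: three kinases and one unrelated transporter.
hits = {
    "P1": {"kinase", "ATP-binding", "transferase"},
    "P2": {"kinase", "ATP-binding"},
    "P3": {"kinase", "transferase", "ATP-binding"},
    "P4": {"membrane", "transport"},
}
suspects = flag_outliers(hits)
```

A clustering approach generalises this by grouping hits into biological groups rather than comparing each hit against a single average, which matters when a signature legitimately covers more than one family.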
Affiliation(s)
- Noam Kaplan
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
- Michal Linial
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
- Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA
|
49
|
Chalmel F, Lardenois A, Thompson JD, Muller J, Sahel JA, Léveillard T, Poch O. GOAnno: GO annotation based on multiple alignment. Bioinformatics 2005; 21:2095-6. [PMID: 15647299 DOI: 10.1093/bioinformatics/bti252] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022]
Abstract
GOAnno is a web tool that automatically annotates proteins according to the Gene Ontology (GO) using evolutionary information available in hierarchized multiple alignments. GO terms present in the aligned functional subfamily can be cross-validated and propagated to obtain highly reliable predicted GO annotation based on the GOAnno algorithm.
Availability: The web tool and a reduced version for local installation are freely available at http://igbmc.u-strasbg.fr/GOAnno/GOAnno.html
Supplementary information: The website supplies a detailed explanation and illustration of the algorithm at http://igbmc.u-strasbg.fr/GOAnno/GOAnnoHelp.html.
Affiliation(s)
- F Chalmel
- Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS/INSERM/ULP BP 163, Illkirch, France.
|
50
|
Toldo L, Rippmann F. Integrated bioinformatics application for automated target discovery. Journal of the American Society for Information Science and Technology 2005. [DOI: 10.1002/asi.20137] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
|