1
Venturini E, Maaß S, Bischler T, Becher D, Vogel J, Westermann AJ. Functional characterization of the DUF1127-containing small protein YjiS of Salmonella Typhimurium. MICROLIFE 2025; 6:uqae026. [PMID: 39790481 PMCID: PMC11707872 DOI: 10.1093/femsml/uqae026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Revised: 11/19/2024] [Accepted: 12/30/2024] [Indexed: 01/12/2025]
Abstract
Bacterial small proteins impact diverse physiological processes; however, technical challenges posed by their small size have hampered their systematic identification and biochemical characterization. In our quest to uncover small proteins relevant for Salmonella pathogenicity, we previously identified YjiS, a 54 amino acid protein, which is strongly induced during this pathogen's intracellular infection stage. Here, we set out to further characterize the role of YjiS. Cell culture infection assays with Salmonella mutants lacking or overexpressing YjiS suggested that this small protein delays bacterial escape from macrophages. Mutant scanning of the protein's conserved, arginine-rich DUF1127 domain excluded a major effect of single amino acid substitutions on the infection phenotype. A comparative dual RNA-seq assay uncovered the molecular footprint of YjiS in the macrophage response to infection, with host effects related to oxidative stress and the cell cortex. Bacterial cell fractionation experiments demonstrated that YjiS associates with the inner membrane, and proteins interacting with YjiS in pull-down experiments were enriched for inner membrane processes. Among the YjiS interactors was the two-component system SsrA/B, the master transcriptional activator of intracellular virulence genes and a suppressor of flagellar genes. Indeed, in the absence of YjiS, we observed elevated expression of motility genes and an increased number of flagella per bacterium. Together, our study points to a role for Salmonella YjiS as a membrane-associated timer of pathogen dissemination.
Affiliation(s)
- Elisa Venturini
  - Institute of Molecular Infection Biology (IMIB), University of Würzburg, D-97080 Würzburg, Germany
- Sandra Maaß
  - Institute of Microbiology, Department of Microbial Proteomics, University of Greifswald, D-17489 Greifswald, Germany
- Thorsten Bischler
  - Core Unit Systems Medicine, University of Würzburg, D-97080 Würzburg, Germany
- Dörte Becher
  - Institute of Microbiology, Department of Microbial Proteomics, University of Greifswald, D-17489 Greifswald, Germany
- Jörg Vogel
  - Institute of Molecular Infection Biology (IMIB), University of Würzburg, D-97080 Würzburg, Germany
  - Helmholtz Institute for RNA-based Infection Research (HIRI), Helmholtz Centre for Infection Research (HZI), D-97080 Würzburg, Germany
- Alexander J Westermann
  - Helmholtz Institute for RNA-based Infection Research (HIRI), Helmholtz Centre for Infection Research (HZI), D-97080 Würzburg, Germany
  - Department of Microbiology, Biocenter, University of Würzburg, D-97074 Würzburg, Germany
2
Danchin A. Artificial intelligence-based prediction of pathogen emergence and evolution in the world of synthetic biology. Microb Biotechnol 2024; 17:e70014. [PMID: 39364593 PMCID: PMC11450380 DOI: 10.1111/1751-7915.70014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Accepted: 08/29/2024] [Indexed: 10/05/2024] Open
Abstract
The emergence of new techniques in both microbial biotechnology and artificial intelligence (AI) is opening up a completely new field for monitoring and sometimes even controlling the evolution of pathogens. However, the now famous generative AI extracts and reorganizes prior knowledge from large datasets, making it poorly suited to making predictions in an unreliable future. In contrast, an unfamiliar perspective can help us identify key issues related to the emergence of new technologies, such as those arising from synthetic biology, whilst revisiting old views of AI or including generative AI as a generator of abduction as a resource. This could enable us to identify dangerous situations that are bound to emerge in the not-too-distant future, and prepare ourselves to anticipate when and where they will occur. Here, we emphasize the fact that amongst the many causes of pathogen outbreaks, often driven by the explosion of the human population, laboratory accidents are a major cause of epidemics. This review, limited to animal pathogens, concludes with a discussion of potential epidemic origins based on unusual organisms or associations of organisms that have rarely been highlighted or studied.
Affiliation(s)
- Antoine Danchin
  - School of Biomedical Sciences, Li KaShing Faculty of Medicine, Hong Kong University, Pokfulam, SAR Hong Kong, China
3
Fijalkowski I, Snauwaert V, Van Damme P. Proteins à la carte: riboproteogenomic exploration of bacterial N-terminal proteoform expression. mBio 2024; 15:e0033324. [PMID: 38511928 PMCID: PMC11005335 DOI: 10.1128/mbio.00333-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Accepted: 02/28/2024] [Indexed: 03/22/2024] Open
Abstract
In recent years, it has become evident that the true complexity of bacterial proteomes remains underestimated. Gene annotation tools are known to propagate biases and overlook certain classes of truly expressed proteins, particularly proteoforms, that is, protein isoforms arising from a single gene. Recent (re-)annotation efforts rely heavily on ribosome profiling, which provides a direct readout of translation, to fully describe bacterial proteomes. In this study, we employ a robust riboproteogenomic pipeline to conduct a systematic census of expressed N-terminal proteoform pairs, representing two isoforms encoded by a single gene and arising from annotated and alternative translation initiation, in Salmonella. Intriguingly, condition-dependent changes in relative utilization of annotated and alternative translation initiation sites (TIS) were observed in several cases. This suggests that TIS selection is subject to regulatory control, adding yet another layer of complexity to our understanding of bacterial proteomes. IMPORTANCE With the emerging theme of genes within genes comprising the existence of alternative open reading frames (ORFs) generated by translation initiation at in-frame start codons, mechanisms that control the relative utilization of annotated and alternative TIS need to be unraveled and our molecular understanding of resulting proteoforms broadened. Utilizing complementary ribosome profiling strategies to map ORF boundaries, we uncovered dual-encoding ORFs generated by in-frame TIS usage in Salmonella. Besides demonstrating that alternative TIS usage may generate proteoforms with different characteristics, such as differential localization and specialized function, quantitative aspects of conditional retapamulin-assisted ribosome profiling (Ribo-RET) translation initiation maps offer unprecedented insights into the relative utilization of annotated and alternative TIS, enabling the exploration of gene regulatory mechanisms that control TIS usage and, consequently, the translation of N-terminal proteoform pairs.
Affiliation(s)
- Igor Fijalkowski
  - iRIP Unit, Laboratory of Microbiology, Department of Biochemistry and Microbiology, Ghent University, Ghent, Belgium
- Valdes Snauwaert
  - iRIP Unit, Laboratory of Microbiology, Department of Biochemistry and Microbiology, Ghent University, Ghent, Belgium
- Petra Van Damme
  - iRIP Unit, Laboratory of Microbiology, Department of Biochemistry and Microbiology, Ghent University, Ghent, Belgium
4
García MD, Ruiz MJ, Medina LM, Vidal R, Padola NL, Etcheverria AI. Molecular and Genetic Characterization of Colicinogenic Escherichia coli Strains Active against Shiga Toxin-Producing Escherichia coli O157:H7. Foods 2023; 12:2676. [PMID: 37509768 PMCID: PMC10378606 DOI: 10.3390/foods12142676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Revised: 06/21/2023] [Accepted: 07/05/2023] [Indexed: 07/30/2023] Open
Abstract
The objective of this work was to molecularly and genotypically characterize and test the inhibitory activity of six colicinogenic Escherichia coli strains (ColEc) and their partially purified colicins against STEC O157:H7 isolated from clinical human cases. Inhibition tests demonstrated the activity of these strains and their colicins against STEC O157:H7. By PCR it was possible to detect colicins Ia, E7, and B and microcins M, H47, C7, and J25. By genome sequencing of two selected ColEc strains, it was possible to identify additional colicins such as E1 and Ib. No genes coding for stx1 and stx2 were detected after analyzing the genome sequence. The inhibitory activity of ColEc against STEC O157:H7 used as an indicator showed that colicins are potent growth inhibitors of E. coli O157:H7, being a potential alternative to reduce the presence of pathogens of public health relevance.
Affiliation(s)
- Mauro D García
  - Laboratorio de Inmunoquímica y Biotecnología, Centro de Investigación Veterinaria de Tandil (CIVETAN), CONICET, CICPBA, Facultad de Ciencias Veterinarias, UNICEN-Campus Universitario, Tandil B7000, Argentina
- María J Ruiz
  - Laboratorio de Inmunoquímica y Biotecnología, Centro de Investigación Veterinaria de Tandil (CIVETAN), CONICET, CICPBA, Facultad de Ciencias Veterinarias, UNICEN-Campus Universitario, Tandil B7000, Argentina
- Luis M Medina
  - Food Science and Technology Department, Faculty of Veterinary Medicine, Universidad de Cordoba, 14071 Córdoba, Spain
- Roberto Vidal
  - Instituto de Ciencias biomédicas, Facultad de Medicina, Universidad de Chile, Santiago 8380453, Chile
- Nora L Padola
  - Laboratorio de Inmunoquímica y Biotecnología, Centro de Investigación Veterinaria de Tandil (CIVETAN), CONICET, CICPBA, Facultad de Ciencias Veterinarias, UNICEN-Campus Universitario, Tandil B7000, Argentina
- Analía I Etcheverria
  - Laboratorio de Inmunoquímica y Biotecnología, Centro de Investigación Veterinaria de Tandil (CIVETAN), CONICET, CICPBA, Facultad de Ciencias Veterinarias, UNICEN-Campus Universitario, Tandil B7000, Argentina
5
Abrahim M, Machado E, Alvarez-Valín F, de Miranda AB, Catanho M. Uncovering Pseudogenes and Intergenic Protein-coding Sequences in TriTryps' Genomes. Genome Biol Evol 2022; 14:6754225. [PMID: 36208292 PMCID: PMC9576210 DOI: 10.1093/gbe/evac142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Revised: 09/14/2022] [Accepted: 09/20/2022] [Indexed: 01/24/2023] Open
Abstract
Trypanosomatids belong to a remarkable group of unicellular, parasitic organisms of the order Kinetoplastida, an early diverging branch of the phylogenetic tree of eukaryotes, exhibiting intriguing biological characteristics affecting gene expression (intronless polycistronic transcription, trans-splicing, and RNA editing), metabolism, surface molecules, and organelles (compartmentalization of glycolysis, variation of the surface molecules, and unique mitochondrial DNA), cell biology and life cycle (phagocytic vacuoles evasion and intricate patterns of cell morphogenesis). With numerous genomic-scale data of several trypanosomatids becoming available since 2005 (genomes, transcriptomes, and proteomes), the scientific community can further investigate the mechanisms underlying these unusual features and address other unexplored phenomena possibly revealing biological aspects of the early evolution of eukaryotes. One fundamental aspect comprises the processes and mechanisms involved in the acquisition and loss of genes throughout the evolutionary history of these primitive microorganisms. Here, we present a comprehensive in silico analysis of pseudogenes in three major representatives of this group: Leishmania major, Trypanosoma brucei, and Trypanosoma cruzi. Pseudogenes, DNA segments originating from altered genes that lost their original function, are genomic relics that can offer an essential record of the evolutionary history of functional genes, as well as clues about the dynamics and evolution of hosting genomes. 
Scanning these genomes with functional proteins as proxies to reveal intergenic regions with protein-coding features, relying on a customized threshold to distinguish statistically and biologically significant sequence similarities, and reassembling remnant sequences from their debris, we found thousands of pseudogenes and hundreds of open reading frames, with particular characteristics in each trypanosomatid: mutation profile, number, content, density, codon bias, average size, single- or multi-copy gene origin, number and type of mutations, putative primitive function, and transcriptional activity. These features suggest a common process of pseudogene formation, different patterns of pseudogene evolution and extant biological functions, and/or distinct genome organization undertaken by those parasites during evolution, as well as different evolutionary and/or selective pressures acting on distinct lineages.
Affiliation(s)
- Mayla Abrahim
  - Laboratório de Tecnologia Imunológica, Instituto de Tecnologia em Imunobiológicos, Vice-Diretoria de Desenvolvimento Tecnológico, Bio-Manguinhos, Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, RJ, Brazil
- Edson Machado
  - Laboratório de Biologia Molecular Aplicada a Micobactérias, Instituto Oswaldo Cruz, Fiocruz, Brazil
- Fernando Alvarez-Valín
  - Unidad de Genómica Evolutiva, Sección Biomatemática, Universidad de la República del Uruguay, Montevideo, Uruguay
6
Rodriguez AM, Urrea DA, Prada CF. Helicobacter pylori virulence factors: relationship between genetic variability and phylogeographic origin. PeerJ 2021; 9:e12272. [PMID: 34900406 PMCID: PMC8628625 DOI: 10.7717/peerj.12272] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/01/2021] [Accepted: 09/17/2021] [Indexed: 01/18/2023] Open
Abstract
Background: Helicobacter pylori is a pathogenic bacterium that colonizes the human stomach and causes diseases including gastritis, peptic ulcers, gastric lymphoma (MALT), and gastric cancer, with a higher prevalence in developing countries. The high genetic diversity among its strains is caused by a high mutation rate, and variations in virulence factors (VFs) are observed in different geographic lineages. This study aimed to postulate the genetic variability associated with virulence factors present in Helicobacter pylori strains and to identify the relationship of these genes with their phylogeographic origin. Methods: The complete genomes of 135 strains available in NCBI, from different population origins, were analyzed using bioinformatics tools, identifying a high rate; as well as reorganization events in 87 virulence factor genes, divided into seven functional groups, to determine changes in position, number of copies, nucleotide identity and size, contrasting them with their geographical lineage and pathogenic phenotype. Results: Bioinformatics analyses show a high rate of gene annotation errors in VFs. Analysis of the genetic variability of VFs showed that there is no direct relationship between reorganization and geographic lineage. However, with regard to the pathogenic phenotype, the analysis of copy number, size, and similarity, when dividing the strains into those that possess the cag pathogenicity island (cagPAI) and those that do not, evidenced a higher risk of developing gastritis and peptic ulcer. Our data show that the analysis of the overall genetic variability of all VFs present in each strain of H. pylori is key information in understanding its pathogenic behavior.
Affiliation(s)
- Aura M Rodriguez
  - Grupo de Investigación de Biología y Ecología de Artrópodos. Facultad de Ciencias, Universidad del Tolima, Ibague, Tolima, Colombia
- Daniel A Urrea
  - Laboratorio de Investigaciones en Parasitología Tropical. Facultad de Ciencias, Universidad del Tolima, Ibague, Tolima, Colombia
- Carlos F Prada
  - Grupo de Investigación de Biología y Ecología de Artrópodos. Facultad de Ciencias, Universidad del Tolima, Ibague, Tolima, Colombia
7
Proteogenomic Analysis Provides Novel Insight into Genome Annotation and Nitrogen Metabolism in Nostoc sp. PCC 7120. Microbiol Spectr 2021; 9:e0049021. [PMID: 34523988 PMCID: PMC8557916 DOI: 10.1128/spectrum.00490-21] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 02/08/2023] Open
Abstract
Cyanobacteria, capable of oxygenic photosynthesis, play a vital role in nitrogen and carbon cycles. Nostoc sp. PCC 7120 (Nostoc 7120) is a model cyanobacterium commonly used to study cell differentiation and nitrogen metabolism. Although its genome was released in 2002, a high-quality genome annotation remains unavailable for this model cyanobacterium. Therefore, in this study, we performed an in-depth proteogenomic analysis based on high-resolution mass spectrometry (MS) data to refine the genome annotation of Nostoc 7120. We unambiguously identified 5,519 predicted protein-coding genes and revealed 26 novel genes, 75 revised genes, and 27 different kinds of posttranslational modifications in Nostoc 7120. A subset of these novel proteins were further validated at both the mRNA and peptide levels. Functional analysis suggested that many newly annotated proteins may participate in nitrogen or cadmium/mercury metabolism in Nostoc 7120. Moreover, we constructed an updated Nostoc 7120 database based on our proteogenomic results and presented examples of how the updated database could be used to improve the annotation of proteomic data. Our study provides the most comprehensive annotation of the Nostoc 7120 genome thus far and will serve as a valuable resource for the study of nitrogen metabolism in Nostoc 7120. IMPORTANCE Cyanobacteria are a large group of prokaryotes capable of oxygenic photosynthesis and play a vital role in nitrogen and carbon cycles on Earth. Nostoc 7120 is a commonly used model cyanobacterium for studying cell differentiation and nitrogen metabolism. In this study, we presented the first comprehensive draft map of the Nostoc 7120 proteome and a wide range of posttranslational modifications. In addition, we constructed an updated database of Nostoc 7120 based on our proteogenomic results and presented examples of how the updated database could be used for system-level studies of Nostoc 7120. 
Our study provides the most comprehensive annotation of Nostoc 7120 genome and a valuable resource for the study of nitrogen metabolism in this model cyanobacterium.
8
Renn D, Shepard L, Vancea A, Karan R, Arold ST, Rueping M. Novel Enzymes From the Red Sea Brine Pools: Current State and Potential. Front Microbiol 2021; 12:732856. [PMID: 34777282 PMCID: PMC8578733 DOI: 10.3389/fmicb.2021.732856] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/29/2021] [Accepted: 10/05/2021] [Indexed: 11/23/2022] Open
Abstract
The Red Sea is a marine environment with unique chemical characteristics and physical topographies. Among the various habitats offered by the Red Sea, the deep-sea brine pools are the most extreme in terms of salinity, temperature and metal contents. Nonetheless, the brine pools host rich polyextremophilic bacterial and archaeal communities. These microbial communities are promising sources for various classes of enzymes adapted to harsh environments - extremozymes. Extremozymes are emerging as novel biocatalysts for biotechnological applications due to their ability to perform catalytic reactions under harsh biophysical conditions, such as those used in many industrial processes. In this review, we provide an overview of the extremozymes from different Red Sea brine pools and discuss the overall biotechnological potential of the Red Sea proteome.
Affiliation(s)
- Dominik Renn
  - KAUST Catalysis Center (KCC), Division of Physical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Institute of Organic Chemistry, RWTH Aachen, Aachen, Germany
- Lera Shepard
  - KAUST Catalysis Center (KCC), Division of Physical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Alexandra Vancea
  - Computational Bioscience Research Center (CBRC), Division of Biological and Environmental Science and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Ram Karan
  - KAUST Catalysis Center (KCC), Division of Physical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Stefan T. Arold
  - Computational Bioscience Research Center (CBRC), Division of Biological and Environmental Science and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Centre de Biologie Structurale, CNRS, INSERM, Université de Montpellier, Montpellier, France
- Magnus Rueping
  - KAUST Catalysis Center (KCC), Division of Physical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  - Institute for Experimental Molecular Imaging (ExMI), University Clinic, RWTH Aachen, Aachen, Germany
9
Bussi Y, Kapon R, Reich Z. Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy. PLoS One 2021; 16:e0258693. [PMID: 34648558 PMCID: PMC8516232 DOI: 10.1371/journal.pone.0258693] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 04/30/2021] [Accepted: 10/02/2021] [Indexed: 12/24/2022] Open
Abstract
Information theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods, based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life with clear boundaries for superkingdom domains and high subtree similarity for named taxons at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
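Note: the pairwise k-mer Jaccard similarity mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain (non-canonical) k-mer sets, in which reverse complements are not merged, and an exact set-based Jaccard index. The function names `kmers` and `jaccard` are hypothetical and are not taken from the paper. The study used k = 21 and 31 on whole genomes; the toy sequences below use a small k only for readability.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all overlapping substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a: str, seq_b: str, k: int) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the two k-mer sets."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    # Two sequences shorter than k share no k-mers; report 0.0 by convention.
    return len(set_a & set_b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy sequences (assumed for illustration only, not real genomes).
    print(jaccard("ACGT", "ACGA", 2))
```

A matrix of such pairwise values could then be passed to any hierarchical clustering routine. Production tools typically approximate the same quantity with sketching to scale to thousands of genomes.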
Affiliation(s)
- Yuval Bussi
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
  - Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
  - Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
- Ruti Kapon
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
- Ziv Reich
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
10
Karimi E, Geslain E, Belcour A, Frioux C, Aïte M, Siegel A, Corre E, Dittami SM. Robustness analysis of metabolic predictions in algal microbial communities based on different annotation pipelines. PeerJ 2021; 9:e11344. [PMID: 33996285 PMCID: PMC8106915 DOI: 10.7717/peerj.11344] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/08/2020] [Accepted: 04/03/2021] [Indexed: 01/29/2023] Open
Abstract
Animals, plants, and algae rely on symbiotic microorganisms for their development and functioning. Genome sequencing and genomic analyses of these microorganisms provide opportunities to construct metabolic networks and to analyze the metabolism of the symbiotic communities they constitute. Genome-scale metabolic network reconstructions rest on information gained from genome annotation. As there are multiple annotation pipelines available, the question arises to what extent differences in annotation pipelines impact outcomes of these analyses. Here, we compare five commonly used pipelines (Prokka, MaGe, IMG, DFAST, RAST) from predicted annotation features (coding sequences, Enzyme Commission numbers, hypothetical proteins) to the metabolic network-based analysis of symbiotic communities (biochemical reactions, producible compounds, and selection of minimal complementary bacterial communities). While Prokka and IMG produced the most extensive networks, RAST and DFAST networks produced the fewest false positives and the most connected networks with the fewest dead-end metabolites. Our results underline differences between the outputs of the tested pipelines at all examined levels, with small differences in the draft metabolic networks resulting in the selection of different microbial consortia to expand the metabolic capabilities of the algal host. However, the consortia generated yielded similar predicted producible compounds and could therefore be considered functionally interchangeable. This contrast between selected communities and community functions depending on the annotation pipeline needs to be taken into consideration when interpreting the results of metabolic complementarity analyses. In the future, experimental validation of bioinformatic predictions will likely be crucial to both evaluate and refine the pipelines and needs to be coupled with increased efforts to expand and improve annotations in reference databases.
Affiliation(s)
- Elham Karimi
  - UMR8227, Integrative Biology of Marine Models, Sorbonne Université/CNRS, Station Biologique de Roscoff, Roscoff, France
- Enora Geslain
  - UMR8227, Integrative Biology of Marine Models, Sorbonne Université/CNRS, Station Biologique de Roscoff, Roscoff, France
  - FR2424, Sorbonne Université/CNRS, Station Biologique de Roscoff, Roscoff, France
- Arnaud Belcour
  - Equipe Dyliss, Univ Rennes, Inria, CNRS, IRISA, Rennes, France
- Méziane Aïte
  - Equipe Dyliss, Univ Rennes, Inria, CNRS, IRISA, Rennes, France
- Anne Siegel
  - Equipe Dyliss, Univ Rennes, Inria, CNRS, IRISA, Rennes, France
- Erwan Corre
  - FR2424, Sorbonne Université/CNRS, Station Biologique de Roscoff, Roscoff, France
- Simon M Dittami
  - UMR8227, Integrative Biology of Marine Models, Sorbonne Université/CNRS, Station Biologique de Roscoff, Roscoff, France
11
Fijalkowska D, Fijalkowski I, Willems P, Van Damme P. Bacterial riboproteogenomics: the era of N-terminal proteoform existence revealed. FEMS Microbiol Rev 2021; 44:418-431. [PMID: 32386204 DOI: 10.1093/femsre/fuaa013] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 12/19/2019] [Accepted: 05/07/2020] [Indexed: 12/17/2022] Open
Abstract
With the rapid increase in the number of sequenced prokaryotic genomes, relying on automated gene annotation became a necessity. Multiple lines of evidence, however, suggest that current bacterial genome annotations may contain inconsistencies and are incomplete, even for so-called well-annotated genomes. We here discuss underexplored sources of protein diversity and new methodologies for high-throughput genome reannotation. The expression of multiple molecular forms of proteins (proteoforms) from a single gene, particularly driven by alternative translation initiation, is gaining interest as a prominent contributor to bacterial protein diversity. In consequence, riboproteogenomic pipelines were proposed to comprehensively capture proteoform expression in prokaryotes by the complementary use of (positional) proteomics and the direct readout of translated genomic regions using ribosome profiling. To complement these discoveries, tailored strategies are required for the functional characterization of newly discovered bacterial proteoforms.
Affiliation(s)
- Daria Fijalkowska
  - Department of Biochemistry and Microbiology, Ghent University, K. L. Ledeganckstraat 35, B-9000 Ghent, Belgium
- Igor Fijalkowski
  - Department of Biochemistry and Microbiology, Ghent University, K. L. Ledeganckstraat 35, B-9000 Ghent, Belgium
- Patrick Willems
  - Department of Biochemistry and Microbiology, Ghent University, K. L. Ledeganckstraat 35, B-9000 Ghent, Belgium
- Petra Van Damme
  - Department of Biochemistry and Microbiology, Ghent University, K. L. Ledeganckstraat 35, B-9000 Ghent, Belgium
12
Koonin EV, Makarova KS, Wolf YI. Evolution of Microbial Genomics: Conceptual Shifts over a Quarter Century. Trends Microbiol 2021; 29:582-592. [PMID: 33541841 DOI: 10.1016/j.tim.2021.01.005] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 12/03/2020] [Revised: 01/07/2021] [Accepted: 01/08/2021] [Indexed: 12/20/2022]
Abstract
Prokaryote genomics started in earnest in 1995, with the complete sequences of two small bacterial genomes, those of Haemophilus influenzae and Mycoplasma genitalium. During the next quarter century, the prokaryote genome database has been growing exponentially, with no saturation in sight. For most of these 25 years, genome sequencing remained limited to cultivable microbes. Together with next-generation sequencing methods, advances in metagenomics and single-cell genomics have lifted this limitation, providing for an increasingly unbiased characterization of the global prokaryote diversity. Advances in computational genomics followed the progress of genome sequencing, even if occasionally lagging behind. Several major new branches of bacteria and archaea were discovered, including Asgard archaea, the apparent closest relatives of eukaryotes and expansive groups of bacteria and archaea with small genomes thought to be symbionts of other prokaryotes. Comparative analysis of numerous prokaryote genomes spanning a wide range of evolutionary distances changed the conceptual foundations of microbiology, supplanting the notion of species genomes with fixed gene sets with that of dynamic pangenomes and the notion of a single Tree of Life (ToL) with a statistical tree-like trend among individual gene trees. Strides were also made towards a theory and quantitative laws of prokaryote genome evolution.
Affiliation(s)
- Eugene V Koonin
  - National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA
- Kira S Makarova
  - National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA
- Yuri I Wolf
  - National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA
|
13
|
de Oliveira Almeida R, Valente GT. Predicting metabolic pathways of plant enzymes without using sequence similarity: Models from machine learning. The Plant Genome 2020; 13:e20043. [PMID: 33217216 DOI: 10.1002/tpg2.20043] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/08/2019] [Revised: 06/03/2020] [Accepted: 06/10/2020] [Indexed: 06/11/2023]
Abstract
Most bioinformatics tools for enzyme annotation focus on assigning enzymatic function. Sequence similarity to well-characterized enzymes is often used for functional annotation and to assign metabolic pathways. However, these approaches are not feasible for all sequences, leading to inaccurate annotations or a lack of metabolic pathway information. Here we present mApLe (metabolic pathway predictor of plant enzymes), a high-performance machine learning-based tool with models that label the metabolic pathway of enzymes rather than specifying the enzymes' reactions. mApLe uses molecular descriptors of the enzyme sequences to make predictions without considering sequence similarity to reference sequences. Hence, mApLe can classify a diversity of enzymes, even those without any homolog or with incomplete EC numbers. The tool can be used to improve the quality of genomic annotation of plants or to narrow down the number of candidate genes for metabolic engineering research. mApLe is available online, and the GUI can be installed locally.
Affiliation(s)
- Rodrigo de Oliveira Almeida
  - Instituto Federal de Educação, Ciência e Tecnologia do Sudeste de Minas Gerais, Muriaé, Brazil
  - Department of Bioprocess and Biotechnology, School of Agriculture, São Paulo State University (Unesp), Botucatu, Brazil
- Guilherme Targino Valente
  - Department of Bioprocess and Biotechnology, School of Agriculture, São Paulo State University (Unesp), Botucatu, Brazil
  - Department of Developmental Genetics, Max Planck Institut für Herz- und Lungenforschung, Bad Nauheim, Germany
|
14
|
Lost and Found: Re-searching and Re-scoring Proteomics Data Aids Genome Annotation and Improves Proteome Coverage. mSystems 2020; 5:5/5/e00833-20. [PMID: 33109751 PMCID: PMC7593589 DOI: 10.1128/msystems.00833-20] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 01/12/2023]
Abstract
Prokaryotic genome annotation is heavily dependent on automated gene annotation pipelines that are prone to propagate errors and underestimate genome complexity. We describe an optimized proteogenomic workflow that uses ribosome profiling (ribo-seq) and proteomic data for Salmonella enterica serovar Typhimurium to identify unannotated proteins or alternative protein forms. This data analysis encompasses the searching of cofragmenting peptides and postprocessing with extended peptide-to-spectrum quality features, including comparison to predicted fragment ion intensities. When this strategy is applied, an enhanced proteome depth is achieved, as well as greater confidence for unannotated peptide hits. We demonstrate the general applicability of our pipeline by reanalyzing public Deinococcus radiodurans data sets. Taken together, our results show that systematic reanalysis using available prokaryotic (proteome) data sets holds great promise to assist in experimentally based genome annotation.
IMPORTANCE: Delineation of open reading frames (ORFs) causes persistent inconsistencies in prokaryote genome annotation. We demonstrate that by advanced (re)analysis of omics data, a higher proteome coverage and sensitive detection of unannotated ORFs can be achieved, which can be exploited for conditional bacterial genome (re)annotation, which is especially relevant in view of annotating the wealth of sequenced prokaryotic genomes obtained in recent years.
|
15
|
Steinegger M, Salzberg SL. Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. Genome Biol 2020; 21:115. [PMID: 32398145 PMCID: PMC7218494 DOI: 10.1186/s13059-020-02023-1] [Citation(s) in RCA: 126] [Impact Index Per Article: 25.2] [Received: 01/27/2020] [Accepted: 04/16/2020] [Indexed: 12/20/2022]
Abstract
Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator
Affiliation(s)
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul, 08826, South Korea
  - Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, 21218, Maryland, USA
  - Institute of Molecular Biology and Genetics, Seoul National University, Seoul, 08826, South Korea
- Steven L Salzberg
  - Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, 21218, Maryland, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, 21218, Maryland, USA
  - Departments of Computer Science and Biostatistics, Johns Hopkins University, Baltimore, 21218, Maryland, USA
|
16
|
Unadkat K, Whittall JB. Unexpected predicted length variation for the coding sequence of the sleep related gene, BHLHE41 in gorilla amidst strong purifying selection across mammals. PLoS One 2020; 15:e0223203. [PMID: 32287315 PMCID: PMC7156063 DOI: 10.1371/journal.pone.0223203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2019] [Accepted: 03/26/2020] [Indexed: 12/05/2022]
Abstract
There is a molecular basis for many sleep patterns and disorders involving circadian clock genes. In humans, "short-sleeper" behavior has been linked to specific amino acid substitutions in BHLHE41 (DEC2), yet little is known about variation at these sites and across this gene in mammals. We compare BHLHE41 coding sequences for 27 mammals. Approximately half of the coding sequence was invariable at the nucleotide level and close to three-quarters of the amino acid alignment was identical. No other mammals had the same "short-sleeper" amino acid substitutions previously described from humans. Phylogenetic analyses based on the nucleotides of the coding sequence alignment are consistent with established mammalian relationships confirming orthology among the sampled sequences. Significant purifying selection was detected in about two-thirds of the variable codons and no codons exhibited significant signs of positive selection. Unexpectedly, the gorilla BHLHE41 sequence has a 318 bp insertion at the 5' end of the coding sequence and a deletion of 195 bp near the 3' end of the coding sequence (including the two short sleeper variable sites). Given the strong signal of purifying selection across this gene, phylogenetic congruence with expected relationships and generally conserved function among mammals investigated thus far, we suggest the indels predicted in the gorilla BHLHE41 may represent an annotation error and warrant experimental validation.
Affiliation(s)
- Krishna Unadkat
  - Department of Biology, Santa Clara University, Santa Clara, California, United States of America
- Justen B. Whittall
  - Department of Biology, Santa Clara University, Santa Clara, California, United States of America
|
17
|
Prifti E, Chevaleyre Y, Hanczar B, Belda E, Danchin A, Clément K, Zucker JD. Interpretable and accurate prediction models for metagenomics data. Gigascience 2020; 9:giaa010. [PMID: 32150601 PMCID: PMC7062144 DOI: 10.1093/gigascience/giaa010] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 05/17/2019] [Revised: 09/12/2019] [Accepted: 01/27/2020] [Indexed: 01/28/2023]
Abstract
BACKGROUND: Microbiome biomarker discovery for patient diagnosis, prognosis, and risk evaluation is attracting broad interest. Selected groups of microbial features provide signatures that characterize host disease states such as cancer or cardio-metabolic diseases. Yet, the current predictive models stemming from machine learning still behave as black boxes and seldom generalize well. Their interpretation is challenging for physicians and biologists, which makes them difficult to trust and use routinely in the physician-patient decision-making process. Novel methods that provide interpretability and biological insight are needed. Here, we introduce "predomics", an original machine learning approach inspired by microbial ecosystem interactions that is tailored for metagenomics data. It discovers accurate predictive signatures and provides unprecedented interpretability. The decision provided by the predictive model is based on a simple, yet powerful score computed by adding, subtracting, or dividing cumulative abundance of microbiome measurements.
RESULTS: Tested on >100 datasets, we demonstrate that predomics models are simple and highly interpretable. Even with such simplicity, they are at least as accurate as state-of-the-art methods. The family of best models, discovered during the learning process, offers the ability to distil biological information and to decipher the predictability signatures of the studied condition. In a proof-of-concept experiment, we successfully predicted body corpulence and metabolic improvement after bariatric surgery using pre-surgery microbiome data.
CONCLUSIONS: Predomics is a new algorithm that helps in providing reliable and trustworthy diagnostic decisions in the microbiome field. Predomics is in accord with societal and legal requirements that plead for an explainable artificial intelligence approach in the medical field.
Affiliation(s)
- Edi Prifti
  - IRD, Sorbonne University, UMMISCO, 32 Avenue Henri Varagnat, F-93143 Bondy, France
  - Institute of Cardiometabolism and Nutrition, ICAN, Integromics, 91 Boulevard de l'Hopital, F-75013, Paris, France
- Yann Chevaleyre
  - Paris-Dauphine University, PSL Research University, CNRS, UMR 7243, LAMSADE, place du Mal. de Lattre de Tassigny, F-75016, Paris, France
- Blaise Hanczar
  - IBISC, University Paris-Saclay, University Evry, Evry, 23 Boulevard de France, F-91034, France
- Eugeni Belda
  - Institute of Cardiometabolism and Nutrition, ICAN, Integromics, 91 Boulevard de l'Hopital, F-75013, Paris, France
- Antoine Danchin
  - Institut Cochin INSERM U1016−CNRS UMR8104−Université Paris Descartes, 24 Rue du Faubourg Saint-Jacques, F-75014, Paris, France
- Karine Clément
  - Sorbonne University, INSERM, Nutrition and Obesities; Systemic Approach Research Unit (NutriOmics), 91 Boulevard de l'Hopital, F-75013, Paris, France
  - Assistance Publique-Hôpitaux de Paris, Nutrition Department, CRNH Ile de France, Pitié-Salpêtrière Hospital, 91 Boulevard de l'Hopital, F-75013, Paris, France
- Jean-Daniel Zucker
  - IRD, Sorbonne University, UMMISCO, 32 Avenue Henri Varagnat, F-93143 Bondy, France
  - Institute of Cardiometabolism and Nutrition, ICAN, Integromics, 91 Boulevard de l'Hopital, F-75013, Paris, France
  - Sorbonne University, INSERM, Nutrition and Obesities; Systemic Approach Research Unit (NutriOmics), 91 Boulevard de l'Hopital, F-75013, Paris, France
|
18
|
Amoako DG, Somboro AM, Abia ALK, Allam M, Ismail A, Bester LA, Essack SY. Genome Mining and Comparative Pathogenomic Analysis of An Endemic Methicillin-Resistant Staphylococcus Aureus (MRSA) Clone, ST612-CC8-t1257-SCCmec_IVd(2B), Isolated in South Africa. Pathogens 2019; 8:E166. [PMID: 31569754 PMCID: PMC6963616 DOI: 10.3390/pathogens8040166] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 07/29/2019] [Revised: 09/16/2019] [Accepted: 09/17/2019] [Indexed: 12/19/2022]
Abstract
This study undertook genome mining and comparative genomics to gain genetic insights into the dominance of the methicillin-resistant Staphylococcus aureus (MRSA) endemic clone ST612-CC8-t1257-SCCmec_IVd(2B), obtained from the poultry food chain in South Africa. Functional annotation of the genome revealed a vast array of similar central metabolic, cellular and biochemical networks within the endemic clone that are crucial for its survival in the microbial community. In silico analysis of the clone revealed uniform defense systems: restriction-modification systems (types I and IV), an accessory gene regulator (type I), an arginine catabolic mobile element (type II), and a type 1 clustered regularly interspaced short palindromic repeat (CRISPR)-Cas array (N = 7 ± 1), which offer protection against exogenous attacks. The estimated pathogenic potential predicted a high probability (average Pscore ≈ 0.927) of the clone being pathogenic to its host. The clone carried a battery of putative virulence determinants whose expression is critical for establishing infection. However, there were slight differences in the possession of adherence factors (biofilm operon system) and toxins (hemolysins and enterotoxins). Further analysis revealed conserved environmental tolerance and persistence mechanisms related to stress (oxidative and osmotic), heat shock, sporulation, bacteriocins, and detoxification, which enable the clone to withstand lethal threats and contribute to its success in diverse ecological niches. Phylogenomic analysis with close sister lineages revealed that the clone was closely related to the MRSA isolate SHV713 from Australia. The results of this bioinformatic analysis provide valuable insights into the biology of this endemic clone.
Affiliation(s)
- Daniel Gyamfi Amoako
  - Infection Genomics and Applied Bioinformatics Division, Antimicrobial Research Unit, College of Health Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
  - Biomedical Resource Unit, School of Laboratory Medicine and Medical Sciences, College of Health Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
- Anou M Somboro
  - Biomedical Resource Unit, School of Laboratory Medicine and Medical Sciences, College of Health Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
  - Antimicrobial Research Unit, College of Health Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
- Akebe Luther King Abia
  - Antimicrobial Research Unit, College of Health Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
- Mushal Allam
  - Sequencing Core Facility, National Institute for Communicable Diseases, National Health Laboratory Service, Johannesburg 2131, South Africa
- Arshad Ismail
  - Sequencing Core Facility, National Institute for Communicable Diseases, National Health Laboratory Service, Johannesburg 2131, South Africa
- Linda A Bester
  - Biomedical Resource Unit, School of Laboratory Medicine and Medical Sciences, College of Health Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
- Sabiha Y Essack
  - Antimicrobial Research Unit, College of Health Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
|
19
|
Caballero M, Wegrzyn J. gFACs: Gene Filtering, Analysis, and Conversion to Unify Genome Annotations Across Alignment and Gene Prediction Frameworks. Genomics, Proteomics & Bioinformatics 2019; 17:305-310. [PMID: 31437583 PMCID: PMC6818179 DOI: 10.1016/j.gpb.2019.04.002] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 10/24/2018] [Revised: 03/21/2019] [Accepted: 04/29/2019] [Indexed: 11/26/2022]
Abstract
Published genomes frequently contain erroneous gene models that represent issues associated with identification of open reading frames, start sites, splice sites, and related structural features. The source of these inconsistencies is often traced back to integration across text file formats designed to describe long read alignments and predicted gene structures. In addition, the majority of gene prediction frameworks do not provide robust downstream filtering to remove problematic gene annotations, nor do they represent these annotations in a format consistent with current file standards. These frameworks also lack consideration for functional attributes, such as the presence or absence of protein domains that can be used for gene model validation. To provide oversight to the increasing number of published genome annotations, we present a software package, the Gene Filtering, Analysis, and Conversion (gFACs), to filter, analyze, and convert predicted gene models and alignments. The software operates across a wide range of alignment, analysis, and gene prediction files with a flexible framework for defining gene models with reliable structural and functional attributes. gFACs supports common downstream applications, including genome browsers, and generates extensive details on the filtering process, including distributions that can be visualized to further assess the proposed gene space. gFACs is freely available and implemented in Perl with support from BioPerl libraries at https://gitlab.com/PlantGenomicsLab/gFACs.
Affiliation(s)
- Madison Caballero
  - Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 06269, USA
- Jill Wegrzyn
  - Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT 06269, USA
|
20
|
Abstract
Over 100 whole-genome sequences from algae are published or soon to be published. The rapidly increasing availability of these fundamental resources is changing how we understand one of the most diverse, complex, and understudied groups of photosynthetic eukaryotes. Genome sequences provide a window into the functional potential of individual algae, with phylogenomics and functional genomics as tools for contextualizing and transferring knowledge from reference organisms into less well-characterized systems. Remarkably, over half of the proteins encoded by algal genomes are of unknown function, highlighting the volume of functional capabilities yet to be discovered. In this review, we provide an overview of publicly available algal genomes, their associated protein inventories, and their quality, with a summary of the statuses of protein function understanding and predictions.
Affiliation(s)
- Sabeeha S Merchant
  - Departments of Plant and Microbial Biology and Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
  - Institute for Genomics and Proteomics, University of California, Los Angeles, California 90095, USA
|
21
|
Lockwood S, Brayton KA, Daily JA, Broschat SL. Whole Proteome Clustering of 2,307 Proteobacterial Genomes Reveals Conserved Proteins and Significant Annotation Issues. Front Microbiol 2019; 10:383. [PMID: 30873148 PMCID: PMC6403173 DOI: 10.3389/fmicb.2019.00383] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 10/12/2018] [Accepted: 02/13/2019] [Indexed: 11/24/2022]
Abstract
We clustered 8.76 M protein sequences deduced from 2,307 completely sequenced Proteobacterial genomes resulting in 707,311 clusters of one or more sequences of which 224,442 ranged in size from 2 to 2,894 sequences. To our knowledge this is the first study of this scale. We were surprised to find that no single cluster contained a representative sequence from all the organisms in the study. Given the minimal genome concept, we expected to find a shared set of proteins. To determine why the clusters did not have universal representation we chose four essential proteins, the chaperonin GroEL, DNA dependent RNA polymerase subunits beta and beta′ (RpoB/RpoB′), and DNA polymerase I (PolA), representing fundamental cellular functions, and examined their cluster distribution. We found these proteins to be remarkably conserved with certain caveats. Although the groEL gene was universally conserved in all the organisms in the study, the protein was not represented in all the deduced proteomes. The genes for RpoB and RpoB′ were missing from two genomes and merged in 88, and the sequences were sufficiently divergent that they formed separate clusters for 18 RpoB proteins (seven clusters) and 14 RpoB′ proteins (three clusters). For PolA, 52 organisms lacked an identifiable sequence, and seven sequences were sufficiently divergent that they formed five separate clusters. Interestingly, organisms lacking an identifiable PolA and those with divergent RpoB/RpoB′ were predominantly endosymbionts. Furthermore, we present a range of examples of annotation issues that caused the deduced proteins to be incorrectly represented in the proteome. These annotation issues made our task of determining protein conservation more difficult than expected and also represent a significant obstacle for high-throughput analyses.
Affiliation(s)
- Svetlana Lockwood
  - School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, United States
- Kelly A Brayton
  - School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, United States
  - Department of Veterinary Microbiology and Pathology, Washington State University, Pullman, WA, United States
  - Paul G. Allen School for Global Animal Health, Washington State University, Pullman, WA, United States
- Jeff A Daily
  - Pacific Northwest National Laboratory, Richland, WA, United States
- Shira L Broschat
  - School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, United States
  - Department of Veterinary Microbiology and Pathology, Washington State University, Pullman, WA, United States
  - Paul G. Allen School for Global Animal Health, Washington State University, Pullman, WA, United States
|
22
|
Salazar AN, Abeel T. Approximate, simultaneous comparison of microbial genome architectures via syntenic anchoring of quiver representations. Bioinformatics 2018; 34:i732-i742. [PMID: 30423098 PMCID: PMC6129293 DOI: 10.1093/bioinformatics/bty614] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
Abstract
Motivation: A long-standing limitation in comparative genomic studies is the dependency on a reference genome, which limits the spectrum of genetic diversity that can be identified across a population of organisms. This is especially true in the microbial world, where genome architectures can vary significantly. There is therefore a need for computational methods that can simultaneously analyze the architectures of multiple genomes without introducing bias from a reference.
Results: In this article, we present Ptolemy, a novel method for studying the diversity of genome architectures (such as structural variation and pan-genomes) across a collection of microbial assemblies without the need for a reference. Ptolemy is a 'top-down' approach to compare whole genome assemblies. Genomes are represented as labeled multi-directed graphs, known as quivers, which are then merged into a single, canonical quiver by identifying 'gene anchors' via synteny analysis. The canonical quiver represents an approximate, structural alignment of all genomes in a given collection, encoding structural variation across (sub-)populations within the collection. We highlight various applications of Ptolemy by analyzing structural variation and the pan-genomes of different datasets composed of Mycobacterium, Saccharomyces, Escherichia and Shigella species. Our results show that Ptolemy is flexible and can handle both conserved and highly dynamic genome architectures. Ptolemy is user-friendly (it requires only a FASTA-formatted assembly along with a corresponding GFF-formatted file) and resource-friendly (it can align 24 genomes in ∼10 min with four CPUs and <2 GB of RAM).
Availability and implementation: GitHub: https://github.com/AbeelLab/ptolemy.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alex N Salazar
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Thomas Abeel
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
|
23
|
Moitra K. Releasing the "GENI": integrating authentic microbial genomics research into the classroom through GENI-ACT. FEMS Microbiol Lett 2018; 364:4443195. [PMID: 29040493 DOI: 10.1093/femsle/fnx215] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/19/2017] [Accepted: 10/04/2017] [Indexed: 11/15/2022]
Abstract
The integration of genomics research into the undergraduate biology curriculum provides students with the opportunity to become familiar with bioinformatics tools and answer original research questions. Our purpose with this research project was to upscale the research experience through integration with classroom experience giving students access to authentic research projects. Students annotated 60 predicted ABC genes of Methanothermobacter thermautotrophicus and Methanobacterium sp. SWAN-1, and they were required to present a research poster to demonstrate their understanding of the project. During this research project a number of tests, assessments and surveys were conducted to assess familiarity with technical and conceptual understanding of genome annotation, satisfaction with annotation instruction, gain in bioinformatics research skills, scientific communications skills and increased student interest in research. We found that students gained significant skills in bioinformatics, specifically genome annotation skills and also gained confidence in their abilities to carry out scientific research. As a result of this authentic undergraduate research experience under-represented students were motivated to pursue future careers in STEM fields.
Affiliation(s)
- Karobi Moitra
  - Department of Biology, Trinity Washington University, College of Arts and Sciences, 125 Michigan Avenue NE, Washington DC 20017, USA
|
24
|
A Simple and Universal System for Gene Manipulation in Aspergillus fumigatus: In Vitro-Assembled Cas9-Guide RNA Ribonucleoproteins Coupled with Microhomology Repair Templates. mSphere 2017; 2:mSphere00446-17. [PMID: 29202040 PMCID: PMC5700375 DOI: 10.1128/msphere.00446-17] [Citation(s) in RCA: 129] [Impact Index Per Article: 16.1] [Received: 09/26/2017] [Accepted: 11/06/2017] [Indexed: 01/09/2023]
Abstract
CRISPR (clustered regularly interspaced short palindromic repeat)-Cas9 is a novel genome-editing system that has been successfully established in Aspergillus fumigatus. However, the current state of the technology relies heavily on DNA-based expression cassettes for delivering Cas9 and the guide RNA (gRNA) to the cell. Therefore, the power of the technology is limited to strains that are engineered to express Cas9 and gRNA. To overcome such limitations, we developed a simple and universal CRISPR-Cas9 system for gene deletion that works across different genetic backgrounds of A. fumigatus. The system employs in vitro assembly of dual Cas9 ribonucleoproteins (RNPs) for targeted gene deletion. Additionally, our CRISPR-Cas9 system utilizes 35 to 50 bp of flanking regions for mediating homologous recombination at Cas9 double-strand breaks (DSBs). As a proof of concept, we first tested our system in the ΔakuB (ΔakuBku80) laboratory strain and generated high rates (97%) of gene deletion using 2 µg of the repair template flanked by homology regions as short as 35 bp. Next, we inspected the portability of our system across other genetic backgrounds of A. fumigatus, namely, the wild-type strain Af293 and a clinical isolate, A. fumigatus DI15-102. In the Af293 strain, 2 µg of the repair template flanked by 35 and 50 bp of homology resulted in highly efficient gene deletion (46% and 74%, respectively) in comparison to classical gene replacement systems. Similar deletion efficiencies were also obtained in the clinical isolate DI15-102. Taken together, our data show that in vitro-assembled Cas9 RNPs coupled with microhomology repair templates are an efficient and universal system for gene manipulation in A. fumigatus.
IMPORTANCE: Tackling the multifactorial nature of virulence and antifungal drug resistance in A. fumigatus requires the mechanistic interrogation of a multitude of genes, sometimes across multiple genetic backgrounds. Classical fungal gene replacement systems can be laborious and time-consuming and, in wild-type isolates, are impeded by low rates of homologous recombination. Our simple and universal CRISPR-Cas9 system for gene manipulation generates efficient gene targeting across different genetic backgrounds of A. fumigatus. We anticipate that our system will simplify genome editing in A. fumigatus, allowing for the generation of single- and multigene knockout libraries. In addition, our system will facilitate the delineation of virulence factors and antifungal drug resistance genes in different genetic backgrounds of A. fumigatus.
Collapse
|
25
|
Bouadjenek MR, Verspoor K, Zobel J. Automated detection of records in biological sequence databases that are inconsistent with the literature. J Biomed Inform 2017. [PMID: 28624643 DOI: 10.1016/j.jbi.2017.06.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/25/2023]
Abstract
We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show that assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with the literature, they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
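To put the reported figures in context, the following Python sketch works through what a precision of 10% and a recall of 30% would mean at a faulty-record rate of 0.25%. It is only an illustration: the collection size of 100,000 records is a hypothetical number, the abstract gives all three percentages as upper bounds, and it does not state that the best precision and recall occur at the same operating point.

```python
def flagged_record_counts(total_records, faulty_rate, precision, recall):
    """Translate rates into record counts for a single operating point.

    Returns (faulty, true_positives, flagged, enrichment), where enrichment
    is the precision divided by the base rate of faulty records.
    """
    faulty = total_records * faulty_rate      # records that are actually faulty
    true_positives = faulty * recall          # faulty records the detector finds
    flagged = true_positives / precision      # all records a curator must review
    enrichment = precision / faulty_rate      # gain over reviewing at random
    return faulty, true_positives, flagged, enrichment


# Hypothetical collection size; the rates are the upper bounds from the abstract.
faulty, tp, flagged, enrichment = flagged_record_counts(
    total_records=100_000,
    faulty_rate=0.0025,
    precision=0.10,
    recall=0.30,
)
print(f"faulty records:        {faulty:.0f}")
print(f"found by the detector: {tp:.0f}")
print(f"records to review:     {flagged:.0f}")
print(f"enrichment vs. random: {enrichment:.0f}x")
```

Under these assumptions a curator would review about 750 flagged records to find 75 of the 250 faulty ones, which is roughly a 40-fold enrichment over picking records at random.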
Affiliation(s)
- Mohamed Reda Bouadjenek
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
- Karin Verspoor
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
- Justin Zobel
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
|
26
|
Mining of Microbial Genomes for the Novel Sources of Nitrilases. BIOMED RESEARCH INTERNATIONAL 2017; 2017:7039245. [PMID: 28497061 PMCID: PMC5405348 DOI: 10.1155/2017/7039245] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 10/25/2016] [Revised: 02/14/2017] [Accepted: 03/07/2017] [Indexed: 12/14/2022]
Abstract
Next-generation DNA sequencing (NGS) has made it feasible to sequence large numbers of microbial genomes, and advancements in computational biology have opened enormous opportunities to mine genome sequence data for novel genes and enzymes or their sources. In the present communication, in silico mining of microbial genomes has been carried out to find novel sources of nitrilases. The sequences selected were analyzed for homology and considered for designing motifs. The manually designed motifs based on amino acid sequences of nitrilases were used to screen 2000 microbial genomes (translated to proteomes). This resulted in the identification of one hundred thirty-eight putative/hypothetical sequences which could potentially code for nitrilase activity. In vitro validation of nine predicted sources of nitrilases was done for nitrile/cyanide hydrolyzing activity. Out of nine predicted nitrilases, those from Gluconacetobacter diazotrophicus, Sphingopyxis alaskensis, Saccharomonospora viridis, and Shimwellia blattae were specific for aliphatic nitriles, whereas nitrilases from Geodermatophilus obscurus, Nocardiopsis dassonvillei, Runella slithyformis, and Streptomyces albus possessed activity for aromatic nitriles. Flavobacterium indicum was specific towards potassium cyanide (KCN), which revealed the presence of a nitrilase homolog, that is, cyanide dihydratase, with no activity for either aliphatic, aromatic, or aryl nitriles. The present study reports novel sources of nitrilases and cyanide dihydratase which were not reported hitherto by in silico or in vitro studies.
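The motif-based screening step described in this abstract can be sketched roughly as follows. This is a minimal Python illustration, not the authors' pipeline: the motif `G[AC]W[ED]N`, the function name `screen_proteome`, and the toy sequences are all hypothetical and chosen only to show how a manually designed amino acid pattern might be matched against translated protein sequences.

```python
import re

# Hypothetical motif for illustration only; it is NOT one of the motifs
# designed in the study. Square brackets list the residues allowed at a position.
EXAMPLE_MOTIF = re.compile(r"G[AC]W[ED]N")


def screen_proteome(proteins, motif):
    """Return {protein_id: [0-based start positions]} for sequences with a hit.

    `proteins` maps an identifier to an amino acid sequence (one-letter code).
    Only non-overlapping matches are reported, as with re.finditer.
    """
    hits = {}
    for protein_id, sequence in proteins.items():
        positions = [m.start() for m in motif.finditer(sequence.upper())]
        if positions:
            hits[protein_id] = positions
    return hits


# Toy sequences standing in for a translated genome.
toy_proteome = {
    "prot_A": "MKTGCWENLLA",
    "prot_B": "MSSPLLVVRRA",
    "prot_C": "mgaWdnkkGAWEN",
}

hits = screen_proteome(toy_proteome, EXAMPLE_MOTIF)
for protein_id, positions in hits.items():
    print(protein_id, positions)
```

Sequences flagged this way would only be candidates; as in the study, activity still has to be confirmed experimentally.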
|
27
|
Altermann E, Lu J, McCulloch A. GAMOLA2, a Comprehensive Software Package for the Annotation and Curation of Draft and Complete Microbial Genomes. Front Microbiol 2017; 8:346. [PMID: 28386247 PMCID: PMC5362640 DOI: 10.3389/fmicb.2017.00346] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 08/14/2016] [Accepted: 02/20/2017] [Indexed: 11/13/2022]
Abstract
Expert-curated annotation remains one of the critical steps in achieving a reliable, biologically relevant annotation. Here we announce the release of GAMOLA2, a user-friendly and comprehensive software package to process, annotate and curate draft and complete bacterial, archaeal, and viral genomes. GAMOLA2 represents a wrapping tool to combine gene model determination, functional Blast, COG, Pfam, and TIGRfam analyses with structural predictions including detection of tRNAs, rRNA genes, non-coding RNAs, signal protein cleavage sites, transmembrane helices, CRISPR repeats and vector sequence contaminations. GAMOLA2 has already been validated in a wide range of bacterial and archaeal genomes, and its modular concept allows easy addition of further functionality in future releases. A modified and adapted version of the Artemis Genome Viewer (Sanger Institute) has been developed to leverage the additional features and underlying information provided by the GAMOLA2 analysis, and is part of the software distribution. In addition to genome annotations, GAMOLA2 features, among others, supplemental modules that assist in the creation of custom Blast databases, annotation transfers between genome versions, and the preparation of Genbank files for submission via the NCBI Sequin tool. GAMOLA2 is intended to be run under a Linux environment, whereas the subsequent visualization and manual curation in Artemis is mobile and platform independent. The development of GAMOLA2 is ongoing and community driven. New functionality can easily be added upon user requests, ensuring that GAMOLA2 provides information relevant to microbiologists. The software is available free of charge for academic use.
Affiliation(s)
- Eric Altermann
- AgResearch Limited, Grasslands Research Centre, Palmerston North, New Zealand; Riddet Institute, Massey University, Palmerston North, New Zealand
- Jingli Lu
- AgResearch Limited, Grasslands Research Centre, Palmerston North, New Zealand
- Alan McCulloch
- AgResearch Limited, Invermay Agricultural Centre, Mosgiel, New Zealand
|
28
|
Zeng W, Fang F, Liu S, Du G, Chen J, Zhou J. Comparative genomics analysis of a series of Yarrowia lipolytica WSH-Z06 mutants with varied capacity for α-ketoglutarate production. J Biotechnol 2016; 239:76-82. [PMID: 27732868 DOI: 10.1016/j.jbiotec.2016.10.008] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 05/25/2016] [Revised: 09/16/2016] [Accepted: 10/07/2016] [Indexed: 01/23/2023]
Abstract
Yarrowia lipolytica is one of the most intensively investigated α-ketoglutaric acid (α-KG) producers, and metabolic engineering has proven effective for enhancing production. However, regulation of α-KG metabolism remains poorly understood. Genetic engineering of new strains is accompanied by potential safety concerns in some countries and regions. A series of mutants with varied capacity for α-KG production were obtained using random mutagenesis of Y. lipolytica WSH-Z06. Comparative genomics analysis was implemented to identify candidate genes associated with α-KG production. Manipulation of genes regulating mitochondrial biogenesis and energy metabolism could improve α-KG production, while genes involved in regulating transformation between keto acids and amino acids may decrease production. One gene associated with cell cycle control was well represented in all mutants, whereas this gene, which is involved in cell concentration, does not appear to influence α-KG production. The results shed light on α-KG production in eukaryotic cells, and pave the way for a high-throughput screening and random mutagenesis method for enhancing α-KG production.
Affiliation(s)
- Weizhu Zeng
- School of Biotechnology and Key Laboratory of Industrial Biotechnology, Ministry of Education, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Fang Fang
- School of Biotechnology and Key Laboratory of Industrial Biotechnology, Ministry of Education, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Synergetic Innovation Center of Food Safety and Nutrition, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Song Liu
- School of Biotechnology and Key Laboratory of Industrial Biotechnology, Ministry of Education, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Synergetic Innovation Center of Food Safety and Nutrition, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Guocheng Du
- School of Biotechnology and Key Laboratory of Industrial Biotechnology, Ministry of Education, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Synergetic Innovation Center of Food Safety and Nutrition, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Jian Chen
- School of Biotechnology and Key Laboratory of Industrial Biotechnology, Ministry of Education, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Synergetic Innovation Center of Food Safety and Nutrition, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Jingwen Zhou
- School of Biotechnology and Key Laboratory of Industrial Biotechnology, Ministry of Education, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Synergetic Innovation Center of Food Safety and Nutrition, 1800 Lihu Road, Wuxi, Jiangsu 214122, China.
|
29
|
Determination of potential metabolic pathways of human intestinal bacteria by modeling growth kinetics from cross-feeding dynamics. Food Res Int 2016. [DOI: 10.1016/j.foodres.2016.02.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 01/09/2023]
|
30
|
Zallot R, Harrison KJ, Kolaczkowski B, de Crécy-Lagard V. Functional Annotations of Paralogs: A Blessing and a Curse. Life (Basel) 2016; 6:life6030039. [PMID: 27618105 PMCID: PMC5041015 DOI: 10.3390/life6030039] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 07/01/2016] [Revised: 08/29/2016] [Accepted: 09/02/2016] [Indexed: 12/15/2022]
Abstract
Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines.
Affiliation(s)
- Rémi Zallot
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Katherine J Harrison
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Bryan Kolaczkowski
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Valérie de Crécy-Lagard
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
|
31
|
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 2016; 44:6614-24. [PMID: 27342282 PMCID: PMC5001611 DOI: 10.1093/nar/gkw569] [Citation(s) in RCA: 4885] [Impact Index Per Article: 542.8] [Received: 03/18/2016] [Revised: 06/08/2016] [Accepted: 06/13/2016] [Indexed: 12/01/2022]
Abstract
Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.
Affiliation(s)
- Tatiana Tatusova
- National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD 20894, USA
- Michael DiCuccio
- National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD 20894, USA
- Azat Badretdin
- National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD 20894, USA
- Vyacheslav Chetvernin
- National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD 20894, USA
- Eric P Nawrocki
- National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD 20894, USA
- Leonid Zaslavsky
- National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD 20894, USA
- Alexandre Lomsadze
- Wallace H. Coulter Department of Biomedical Engineering, Georgia Tech, Atlanta, GA 30332, USA
- Kim D Pruitt
- National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD 20894, USA
- Mark Borodovsky
- Wallace H. Coulter Department of Biomedical Engineering, Georgia Tech, Atlanta, GA 30332, USA; School of Computational Science and Engineering, Georgia Tech, Atlanta, GA 30332, USA
- James Ostell
- National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD 20894, USA
|
32
|
Abstract
The identification of translation initiation sites (TISs) constitutes an important aspect of sequence-based genome analysis. An erroneous TIS annotation can impair the identification of regulatory elements and N-terminal signal peptides, and also may flaw the determination of descent, for any particular gene. We have formulated a reference-free method to score the TIS annotation quality. The method is based on a comparison of the observed and expected distribution of all TISs in a particular genome given prior gene-calling. We have assessed the TIS annotations for all available NCBI RefSeq microbial genomes and found that approximately 87% are of appropriate quality, whereas 13% need substantial improvement. We have analyzed a number of factors that could affect TIS annotation quality such as GC-content, taxonomy, the fraction of genes with a Shine-Dalgarno sequence and the year of publication. The analysis showed that only the first factor has a clear effect. We have then formulated a straightforward Principal Component Analysis-based TIS identification strategy to self-organize and score potential TISs. The strategy is independent of reference data and a priori calculations. A representative set of 277 genomes was subjected to the analysis and we found a clear increase in TIS annotation quality for the genomes with a low quality score. The PCA-based annotation was also compared with annotation with the current tool of reference, Prodigal. The comparison for the model genome of Escherichia coli K12 showed that both methods supplement each other and that prediction agreement can be used as an indicator of a correct TIS annotation. Importantly, the data suggest that the addition of a PCA-based strategy to a Prodigal prediction can be used to ‘flag’ TIS annotations for re-evaluation and in addition can be used to evaluate a given annotation in case a Prodigal annotation is lacking.
Collapse
|
33
|
Scaria J, Suzuki H, Ptak CP, Chen JW, Zhu Y, Guo XK, Chang YF. Comparative genomic and phenomic analysis of Clostridium difficile and Clostridium sordellii, two related pathogens with differing host tissue preference. BMC Genomics 2015; 16:448. [PMID: 26059449 PMCID: PMC4462011 DOI: 10.1186/s12864-015-1663-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 12/17/2014] [Accepted: 05/29/2015] [Indexed: 01/05/2023]
Abstract
Background: Clostridium difficile and C. sordellii are two anaerobic, spore-forming, Gram-positive pathogens with a broad host range and the ability to cause lethal infections. Despite strong similarities between the two Clostridial strains, differences in their host tissue preference place C. difficile infections in the gastrointestinal tract and C. sordellii infections in soft tissues. Results: In this study, to improve our understanding of C. sordellii and C. difficile virulence and pathogenesis, we have performed a comparative genomic and phenomic analysis of the two. The global phenomes of C. difficile and C. sordellii were compared using Biolog Phenotype microarrays. When compared to C. difficile, C. sordellii was found to better utilize more complex sources of carbon and nitrogen, including peptides. Phenotype microarray comparison also revealed that C. sordellii was better able to grow in acidic pH conditions. Using next-generation sequencing technology, we determined the draft genome of C. sordellii strain 8483 and performed comparative genome analysis with C. difficile and other Clostridial genomes. Comparative genome analysis revealed the presence of several enzymes, including the urease gene cluster, specific to the C. sordellii genome that confer the ability of expanded peptide utilization and survival in acidic pH. Conclusions: The identified phenotypes of C. sordellii might be important in causing wound and vaginal infections, respectively. Proteins involved in the metabolic differences between C. sordellii and C. difficile should be targets for further studies aimed at understanding C. difficile and C. sordellii infection site specificity and pathogenesis.
Affiliation(s)
- Joy Scaria
- Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, 14853, USA; Department of Veterinary and Biomedical Sciences, South Dakota State University, Brookings, SD, 57007, USA.
- Haruo Suzuki
- Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, 14853, USA; Graduate School of Science and Engineering, Yamaguchi University, Yamaguchi, Japan.
- Christopher P Ptak
- Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, 14853, USA.
- Jenn-Wei Chen
- Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, 14853, USA.
- Yongzhang Zhu
- Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, 14853, USA; Department of Medical Microbiology and Parasitology, Institutes of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China.
- Xiao-Kui Guo
- Department of Medical Microbiology and Parasitology, Institutes of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China.
- Yung-Fu Chang
- Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, 14853, USA.
34
|
Matthews TD, Schmieder R, Silva GGZ, Busch J, Cassman N, Dutilh BE, Green D, Matlock B, Heffernan B, Olsen GJ, Farris Hanna L, Schifferli DM, Maloy S, Dinsdale EA, Edwards RA. Genomic Comparison of the Closely-Related Salmonella enterica Serovars Enteritidis, Dublin and Gallinarum. PLoS One 2015; 10:e0126883. [PMID: 26039056 PMCID: PMC4454671 DOI: 10.1371/journal.pone.0126883] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Received: 10/17/2014] [Accepted: 04/08/2015] [Indexed: 11/18/2022]
Abstract
The Salmonella enterica serovars Enteritidis, Dublin, and Gallinarum are closely related but differ in virulence and host range. To identify the genetic elements responsible for these differences and to better understand how these serovars are evolving, we sequenced the genomes of Enteritidis strain LK5 and Dublin strain SARB12 and compared these genomes to the publicly available Enteritidis P125109, Dublin CT 02021853 and Dublin SD3246 genome sequences. We also compared the publicly available Gallinarum genome sequences from biotype Gallinarum 287/91 and Pullorum RKS5078. Using bioinformatic approaches, we identified single nucleotide polymorphisms, insertions, deletions, and differences in prophage and pseudogene content between strains belonging to the same serovar. Through our analysis we also identified several prophage cargo genes and pseudogenes that affect virulence and may contribute to a host-specific, systemic lifestyle. These results strongly argue that the Enteritidis, Dublin and Gallinarum serovars of Salmonella enterica evolve by acquiring new genes through horizontal gene transfer, followed by the formation of pseudogenes. The loss of genes necessary for a gastrointestinal lifestyle ultimately leads to a systemic lifestyle and niche exclusion in the host-specific serovars.
Affiliation(s)
- T. David Matthews
- Department of Biology, San Diego State University, San Diego, California, 92182, United States of America
- Robert Schmieder
- Department of Computer Science, San Diego State University, San Diego, California, 92182, United States of America
- Genivaldo G. Z. Silva
- Computational Science Research Center, San Diego State University, San Diego, California, 92182, United States of America
- Julia Busch
- Department of Biology, San Diego State University, San Diego, California, 92182, United States of America
- Noriko Cassman
- Department of Biology, San Diego State University, San Diego, California, 92182, United States of America
- Bas E. Dutilh
- Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, The Netherlands
- Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Centre, Nijmegen, The Netherlands
- Dawn Green
- Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Brian Matlock
- Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Brian Heffernan
- Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Gary J. Olsen
- Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Leigh Farris Hanna
- Molecular Sciences Department, University of Tennessee Health Sciences Center, 858 Madison Ave, Memphis, Tennessee, United States of America
- Dieter M. Schifferli
- University of Pennsylvania School of Veterinary Medicine, 3800 Spruce St, Philadelphia, Pennsylvania, 19104, United States of America
- Stanley Maloy
- Department of Biology, San Diego State University, San Diego, California, 92182, United States of America
- Elizabeth A. Dinsdale
- Department of Biology, San Diego State University, San Diego, California, 92182, United States of America
- Robert A. Edwards
- Department of Biology, San Diego State University, San Diego, California, 92182, United States of America
- Department of Computer Science, San Diego State University, San Diego, California, 92182, United States of America
- Department of Marine Biology, Institute of Biology, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
- Argonne National Laboratory, 9700 S. Cass Ave, Argonne, Illinois, 60349, United States of America
35
|
Yu JF, Guo J, Liu QB, Hou Y, Xiao K, Chen QL, Wang JH, Sun X. A hybrid strategy for comprehensive annotation of the protein coding genes in prokaryotic genome. Genes Genomics 2015. [DOI: 10.1007/s13258-014-0263-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/28/2022]
|
36
|
López-Campos G, Aguado-Urda M, Blanco MM, Gibello A, Cutuli MT, López-Alonso V, Martín-Sánchez F, Fernández-Garayzábal JF. Lactococcus garvieae: a small bacteria and a big data world. Health Inf Sci Syst 2015; 3:S5. [PMID: 25960872 PMCID: PMC4416232 DOI: 10.1186/2047-2501-3-s1-s5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/15/2022]
Abstract
Objective: To describe the importance of bioinformatics tools for analyzing the big data yielded by new "omics" methods, with the aim of unraveling the biology of the pathogenic bacterium Lactococcus garvieae. Methods: The paper provides a view of the large volume of data generated from genome sequences, gene expression profiles by microarrays and other experimental methods that require biomedical informatics methods for management and analysis. Results: The use of biomedical informatics methods improves the analysis of big data in order to obtain a comprehensive characterization and understanding of the biology of pathogenic organisms, such as L. garvieae. Conclusions: The "Big Data" concepts of high volume, veracity and variety are nowadays part of the research in microbiology associated with the use of multiple methods in the "omic" era. The use of biomedical informatics methods is necessary to improve the analysis of these data.
Affiliation(s)
- Guillermo López-Campos
- Health and Biomedical Informatics Centre (HABIC), The University of Melbourne, Melbourne, Victoria, 3010, Australia
- Mónica Aguado-Urda
- Faculty of Veterinary Sciences, Department of Animal Health, Complutense University, Madrid, 28040, Spain
- María Mar Blanco
- Faculty of Veterinary Sciences, Department of Animal Health, Complutense University, Madrid, 28040, Spain
- Alicia Gibello
- Faculty of Veterinary Sciences, Department of Animal Health, Complutense University, Madrid, 28040, Spain
- María Teresa Cutuli
- Faculty of Veterinary Sciences, Department of Animal Health, Complutense University, Madrid, 28040, Spain
- Victoria López-Alonso
- Computational Biology Unit, National Institute of Health "Carlos III", Madrid, 28220, Spain
- Fernando Martín-Sánchez
- Health and Biomedical Informatics Centre (HABIC), The University of Melbourne, Melbourne, Victoria, 3010, Australia
37
|
Eastman AW, Yuan ZC. Development and validation of an rDNA operon based primer walking strategy applicable to de novo bacterial genome finishing. Front Microbiol 2015; 5:769. [PMID: 25653642 PMCID: PMC4301005 DOI: 10.3389/fmicb.2014.00769] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/07/2014] [Accepted: 12/16/2014] [Indexed: 01/10/2023]
Abstract
Advances in sequencing technology have drastically increased the depth and feasibility of bacterial genome sequencing. However, little information is available that details the specific techniques and procedures employed during genome sequencing despite the large numbers of published genomes. Shotgun approaches employed by second-generation sequencing platforms have necessitated the development of robust bioinformatics tools for in silico assembly, and complete assembly is limited by the presence of repetitive DNA sequences and multi-copy operons. Typically, re-sequencing with multiple platforms and laborious, targeted Sanger sequencing are employed to finish a draft bacterial genome. Here we describe a novel strategy based on the identification and targeted sequencing of repetitive rDNA operons to expedite bacterial genome assembly and finishing. Our strategy was validated by finishing the genome of Paenibacillus polymyxa strain CR1, a bacterium with potential in sustainable agriculture and bio-based processes. An analysis of the 38 contigs contained in the P. polymyxa strain CR1 draft genome revealed 12 repetitive rDNA operons with varied intragenic and flanking regions of variable length, unanimously located at contig boundaries and within contig gaps. These highly similar but not identical rDNA operons were experimentally verified and sequenced simultaneously with multiple, specially designed primer sets. This approach also identified and corrected significant sequence rearrangement generated during the initial in silico assembly of sequencing reads. Our approach reduces the required effort associated with blind primer walking for contig assembly, increasing both the speed and feasibility of genome finishing. Our study further reinforces the notion that repetitive DNA elements are major limiting factors for genome finishing. Moreover, we provide a step-by-step workflow for genome finishing, which may guide future bacterial genome finishing projects.
Affiliation(s)
- Alexander W Eastman
- Southern Crop Protection and Food Research Centre, Agriculture and Agri-Food Canada, Government of Canada, London, ON, Canada; Department of Microbiology and Immunology, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
- Ze-Chun Yuan
- Southern Crop Protection and Food Research Centre, Agriculture and Agri-Food Canada, Government of Canada, London, ON, Canada; Department of Microbiology and Immunology, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
|
38
|
Guillén Y, Rius N, Delprat A, Williford A, Muyas F, Puig M, Casillas S, Ràmia M, Egea R, Negre B, Mir G, Camps J, Moncunill V, Ruiz-Ruano FJ, Cabrero J, de Lima LG, Dias GB, Ruiz JC, Kapusta A, Garcia-Mas J, Gut M, Gut IG, Torrents D, Camacho JP, Kuhn GCS, Feschotte C, Clark AG, Betrán E, Barbadilla A, Ruiz A. Genomics of ecological adaptation in cactophilic Drosophila. Genome Biol Evol 2014; 7:349-66. [PMID: 25552534 PMCID: PMC4316639 DOI: 10.1093/gbe/evu291] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Indexed: 01/06/2023]
Abstract
Cactophilic Drosophila species provide a valuable model to study gene–environment interactions and ecological adaptation. Drosophila buzzatii and Drosophila mojavensis are two cactophilic species that belong to the repleta group, but have very different geographical distributions and primary host plants. To investigate the genomic basis of ecological adaptation, we sequenced the genome and developmental transcriptome of D. buzzatii and compared its gene content with that of D. mojavensis and two other noncactophilic Drosophila species in the same subgenus. The newly sequenced D. buzzatii genome (161.5 Mb) comprises 826 scaffolds (>3 kb) and contains 13,657 annotated protein-coding genes. Using RNA sequencing data of five life-stages we found expression of 15,026 genes, 80% protein-coding genes, and 20% noncoding RNA genes. In total, we detected 1,294 genes putatively under positive selection. Interestingly, among genes under positive selection in the D. mojavensis lineage, there is an excess of genes involved in metabolism of heterocyclic compounds that are abundant in Stenocereus cacti and toxic to nonresident Drosophila species. We found 117 orphan genes in the shared D. buzzatii–D. mojavensis lineage. In addition, gene duplication analysis identified lineage-specific expanded families with functional annotations associated with proteolysis, zinc ion binding, chitin binding, sensory perception, ethanol tolerance, immunity, physiology, and reproduction. In summary, we identified genetic signatures of adaptation in the shared D. buzzatii–D. mojavensis lineage, and in the two separate D. buzzatii and D. mojavensis lineages. Many of the novel lineage-specific genomic features are promising candidates for explaining the adaptation of these species to their distinct ecological niches.
Affiliation(s)
- Yolanda Guillén
- Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Spain
- Núria Rius
- Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Spain
- Alejandra Delprat
- Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Spain
- Francesc Muyas
- Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Spain
- Marta Puig
- Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Spain
- Sònia Casillas
- Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Spain; Institut de Biotecnologia i de Biomedicina, Universitat Autònoma de Barcelona, Spain
- Miquel Ràmia
- Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Spain; Institut de Biotecnologia i de Biomedicina, Universitat Autònoma de Barcelona, Spain
- Raquel Egea
- Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Spain; Institut de Biotecnologia i de Biomedicina, Universitat Autònoma de Barcelona, Spain
- Barbara Negre
- EMBL/CRG Research Unit in Systems Biology, Centre for Genomic Regulation (CRG), Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Gisela Mir
- IRTA, Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, Edifici CRAG, Barcelona, Spain; The Peter MacCallum Cancer Centre, East Melbourne, Victoria, Australia
- Jordi Camps
- Centro Nacional de Análisis Genómico (CNAG), Parc Científic de Barcelona, Torre I, Barcelona, Spain
- Valentí Moncunill
- Barcelona Supercomputing Center (BSC), Edifici TG (Torre Girona), Barcelona, Spain and Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
- Josefa Cabrero
- Departamento de Genética, Facultad de Ciencias, Universidad de Granada, Spain
- Leonardo G de Lima
- Instituto de Ciências Biológicas, Departamento de Biologia Geral, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil
- Guilherme B Dias
- Instituto de Ciências Biológicas, Departamento de Biologia Geral, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil
- Jeronimo C Ruiz
- Informática de Biossistemas, Centro de Pesquisas René Rachou-Fiocruz Minas, Belo Horizonte, MG, Brazil
- Aurélie Kapusta
- Department of Human Genetics, University of Utah School of Medicine
- Jordi Garcia-Mas
- IRTA, Centre for Research in Agricultural Genomics (CRAG) CSIC-IRTA-UAB-UB, Campus UAB, Edifici CRAG, Barcelona, Spain
- Marta Gut
- Centro Nacional de Análisis Genómico (CNAG), Parc Científic de Barcelona, Torre I, Barcelona, Spain
- Ivo G Gut
- Centro Nacional de Análisis Genómico (CNAG), Parc Científic de Barcelona, Torre I, Barcelona, Spain
- David Torrents
- Barcelona Supercomputing Center (BSC), Edifici TG (Torre Girona), Barcelona, Spain and Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
- Juan P Camacho
- Departamento de Genética, Facultad de Ciencias, Universidad de Granada, Spain
- Gustavo C S Kuhn
- Instituto de Ciências Biológicas, Departamento de Biologia Geral, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil
- Cédric Feschotte
- Department of Human Genetics, University of Utah School of Medicine
- Andrew G Clark
- Department of Molecular Biology and Genetics, Cornell University
- Esther Betrán
- Department of Biology, University of Texas at Arlington
- Antonio Barbadilla
- Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Spain; Institut de Biotecnologia i de Biomedicina, Universitat Autònoma de Barcelona, Spain
- Alfredo Ruiz
- Departament de Genètica i de Microbiologia, Universitat Autònoma de Barcelona, Spain
|
39
|
Tatusova T, Ciufo S, Federhen S, Fedorov B, McVeigh R, O'Neill K, Tolstoy I, Zaslavsky L. Update on RefSeq microbial genomes resources. Nucleic Acids Res 2014; 43:D599-605. [PMID: 25510495 DOI: 10.1093/nar/gku1062] [Citation(s) in RCA: 104] [Impact Index Per Article: 9.5] [Indexed: 11/13/2022]
Abstract
NCBI RefSeq genome collection http://www.ncbi.nlm.nih.gov/genome represents all three major domains of life: Eukarya, Bacteria and Archaea as well as Viruses. Prokaryotic genome sequences are the most rapidly growing part of the collection. During the year of 2014 more than 10,000 microbial genome assemblies have been publicly released bringing the total number of prokaryotic genomes close to 30,000. We continue to improve the quality and usability of the microbial genome resources by providing easy access to the data and the results of the pre-computed analysis, and improving analysis and visualization tools. A number of improvements have been incorporated into the Prokaryotic Genome Annotation Pipeline. Several new features have been added to RefSeq prokaryotic genomes data processing pipeline including the calculation of genome groups (clades) and the optimization of protein clusters generation using pan-genome approach.
Affiliation(s)
- Tatiana Tatusova
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A 8600 Rockville Pike, Bethesda, MD 20894, USA
- Stacy Ciufo
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A 8600 Rockville Pike, Bethesda, MD 20894, USA
- Scott Federhen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A 8600 Rockville Pike, Bethesda, MD 20894, USA
- Boris Fedorov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A 8600 Rockville Pike, Bethesda, MD 20894, USA
- Richard McVeigh
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kathleen O'Neill
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A 8600 Rockville Pike, Bethesda, MD 20894, USA
- Igor Tolstoy
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A 8600 Rockville Pike, Bethesda, MD 20894, USA
- Leonid Zaslavsky
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A 8600 Rockville Pike, Bethesda, MD 20894, USA
|
40
|
Lacroix T, Loux V, Gendrault A, Hoebeke M, Gibrat JF. Insyght: navigating amongst abundant homologues, syntenies and gene functional annotations in bacteria, it's that symbol! Nucleic Acids Res 2014; 42:gku867. [PMID: 25249626 PMCID: PMC4245967 DOI: 10.1093/nar/gku867] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 02/21/2013] [Revised: 08/28/2014] [Accepted: 09/10/2014] [Indexed: 11/14/2022]
Abstract
High-throughput techniques have considerably increased the potential of comparative genomics whilst simultaneously posing many new challenges. One of those challenges involves efficiently mining the large amount of data produced and exploring the landscape of both conserved and idiosyncratic genomic regions across multiple genomes. Domains of application of these analyses are diverse: identification of evolutionary events, inference of gene functions, detection of niche-specific genes or phylogenetic profiling. Insyght is a comparative genomic visualization tool that combines three complementary displays: (i) a table for thoroughly browsing amongst homologues, (ii) a comparator of orthologue functional annotations and (iii) a genomic organization view designed to improve the legibility of rearrangements and distinctive loci. The latter display combines symbolic and proportional graphical paradigms. Synchronized navigation across multiple species and interoperability between the views are core features of Insyght. A gene filter mechanism is provided that helps the user to build a biologically relevant gene set according to multiple criteria such as presence/absence of homologues and/or various annotations. We illustrate the use of Insyght with scenarios. Currently, only Bacteria and Archaea are supported. A public instance is available at http://genome.jouy.inra.fr/Insyght. The tool is freely downloadable for private data set analysis.
Affiliation(s)
- Thomas Lacroix
- INRA, UR 1077 Mathématique Informatique et Génome, 78352 Jouy-en-Josas, France
- Valentin Loux
- INRA, UR 1077 Mathématique Informatique et Génome, 78352 Jouy-en-Josas, France
- Annie Gendrault
- INRA, UR 1077 Mathématique Informatique et Génome, 78352 Jouy-en-Josas, France
- Mark Hoebeke
- CNRS, UPMC, FR2424, ABiMS, Station Biologique, 29680 Roscoff, France
|
41
|
Toby IT, Widmer J, Dyer DW. Divergence of protein-coding capacity and regulation in the Bacillus cereus sensu lato group. BMC Bioinformatics 2014; 15 Suppl 11:S8. [PMID: 25350501 PMCID: PMC4251056 DOI: 10.1186/1471-2105-15-s11-s8] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/23/2022]
Abstract
BACKGROUND The Bacillus cereus sensu lato group contains ubiquitous facultative anaerobic soil-borne Gram-positive spore-forming bacilli. Molecular phylogeny and comparative genome sequencing have suggested that these organisms should be classified as a single species. While clonal in nature, there do not appear to be species-specific clonal lineages, excepting B. anthracis, in spite of the wide array of phenotypes displayed by these organisms. RESULTS We compared the protein-coding content of 201 B. cereus sensu lato genomes to characterize differences and understand the consequences of these differences on biological function. From this larger group we selected a subset consisting of 25 whole genomes for deeper analysis. Cluster analysis of orthologous proteins grouped these genomes into five distinct clades. Each clade could be characterized by unique genes shared among the group, with consequences for the phenotype of each clade. Surprisingly, this population structure recapitulates our recent observations on the divergence of the generalized stress response (SigB) regulons in these organisms. Divergence of the SigB regulon among these organisms is primarily due to the placement of SigB-dependent promoters that bring genes from a common gene pool into/out of the SigB regulon. CONCLUSIONS Collectively, our observations suggest the hypothesis that the evolution of these closely related bacteria is a consequence of two distinct processes. Horizontal gene transfer, gene duplication/divergence and deletion dictate the underlying coding capacity in these genomes. Regulatory divergence overlays this protein coding reservoir and shapes the expression of both the unique and shared coding capacity of these organisms, resulting in phenotypic divergence. Data from other organisms suggests that this is likely a common pattern in prokaryotic evolution.
Affiliation(s)
- Inimary T Toby
- University of Oklahoma Health Sciences Center, 975 NE 10th Street, BRC-1106, Oklahoma City, OK 73104, USA
- Jonah Widmer
- University of Oklahoma Health Sciences Center, 975 NE 10th Street, BRC-1106, Oklahoma City, OK 73104, USA
- David W Dyer
- University of Oklahoma Health Sciences Center, 975 NE 10th Street, BRC-1106, Oklahoma City, OK 73104, USA
|
42
|
Schliebner I, Becher R, Hempel M, Deising HB, Horbach R. New gene models and alternative splicing in the maize pathogen Colletotrichum graminicola revealed by RNA-Seq analysis. BMC Genomics 2014; 15:842. [PMID: 25281481 PMCID: PMC4194422 DOI: 10.1186/1471-2164-15-842] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 04/17/2014] [Accepted: 09/09/2014] [Indexed: 11/27/2022]
Abstract
BACKGROUND An annotated genomic sequence of the corn anthracnose fungus Colletotrichum graminicola has been published previously, but correct identification of gene models by means of automated gene annotation remains a challenge. RNA-Seq offers the potential for substantially improved gene annotations and for the identification of posttranscriptional RNA modifications, such as alternative splicing and RNA editing. RESULTS Based on the nucleotide sequence information of transcripts, we identified 819 novel transcriptionally active regions (nTARs) and revised 906 incorrectly predicted gene models, including revisions of exon-intron structure, gene orientation and sequencing errors. Among the nTARs, 146 share significant similarity with proteins that have been identified in other species suggesting that they are hitherto unidentified genes in C. graminicola. Moreover, 5'- and 3'-UTR sequences of 4378 genes have been retrieved and alternatively spliced variants of 69 genes have been identified. Comparative analysis of RNA-Seq data and the genome sequence did not provide evidence for RNA editing in C. graminicola. CONCLUSIONS We successfully employed deep sequencing RNA-Seq data in combination with an elaborate bioinformatics strategy in order to identify novel genes, incorrect gene models and mechanisms of transcript processing in the corn anthracnose fungus C. graminicola. Sequence data of the revised genome annotation including several hundreds of novel transcripts, improved gene models and candidate genes for alternative splicing have been made accessible in a comprehensive database. Our results significantly contribute to both routine laboratory experiments and large-scale genomics or transcriptomic studies in C. graminicola.
Affiliation(s)
- Ivo Schliebner
- Interdisciplinary Center for Crop Plant Research, Martin-Luther-University Halle-Wittenberg, Betty-Heimann-Str. 3, D-06120 Halle (Saale), Germany
- Rayko Becher
- Interdisciplinary Center for Crop Plant Research, Martin-Luther-University Halle-Wittenberg, Betty-Heimann-Str. 3, D-06120 Halle (Saale), Germany
- Marcus Hempel
- Interdisciplinary Center for Crop Plant Research, Martin-Luther-University Halle-Wittenberg, Betty-Heimann-Str. 3, D-06120 Halle (Saale), Germany
- Holger B Deising
- Interdisciplinary Center for Crop Plant Research, Martin-Luther-University Halle-Wittenberg, Betty-Heimann-Str. 3, D-06120 Halle (Saale), Germany
- Institute for Agricultural and Nutritional Sciences, Martin-Luther-University Halle-Wittenberg, Betty-Heimann-Str. 3, D-06120 Halle (Saale), Germany
- Ralf Horbach
- Interdisciplinary Center for Crop Plant Research, Martin-Luther-University Halle-Wittenberg, Betty-Heimann-Str. 3, D-06120 Halle (Saale), Germany
|
43
|
Guo FB, Lin Y, Chen LL. Recognition of Protein-coding Genes Based on Z-curve Algorithms. Curr Genomics 2014; 15:95-103. [PMID: 24822027 PMCID: PMC4009845 DOI: 10.2174/1389202915999140328162724] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/15/2013] [Revised: 11/19/2013] [Accepted: 11/20/2013] [Indexed: 01/18/2023]
Abstract
Recognition of protein-coding genes, a classical bioinformatics problem, is an essential step in annotating newly sequenced genomes. The Z-curve algorithm, one of the most effective methods for this task, has been successfully applied in annotating or re-annotating many genomes, including those of bacteria, archaea and viruses. Two Z-curve based ab initio gene-finding programs have been developed: ZCURVE (for bacteria and archaea) and ZCURVE_V (for viruses and phages). ZCURVE_C (for 57 bacteria) and Zfisher (for any bacterium) are web servers for re-annotation of bacterial and archaeal genomes. The above four tools can be used for genome annotation or re-annotation, either independently or combined with other gene-finding programs. In addition to recognizing protein-coding genes and exons, Z-curve algorithms are also effective in recognizing promoters and translation start sites. Here, we summarize the applications of Z-curve algorithms in gene finding and genome annotation.
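The abstract above does not define the Z-curve itself, so the following is only a minimal sketch of the standard Z-curve transform as it is commonly described: each base moves a point in three dimensions along purine/pyrimidine (x), amino/keto (y) and weak/strong hydrogen-bond (z) axes. The function name `z_curve` is hypothetical, and this is not the ZCURVE gene finder, which builds a classifier on top of features derived from such coordinates.

```python
def z_curve(seq):
    """Return the cumulative (x, y, z) Z-curve coordinates of a DNA sequence.

    x_n = (A_n + G_n) - (C_n + T_n)   purine vs. pyrimidine
    y_n = (A_n + C_n) - (G_n + T_n)   amino vs. keto
    z_n = (A_n + T_n) - (C_n + G_n)   weak vs. strong hydrogen bonding
    where A_n, C_n, G_n, T_n count each base in the first n positions.
    """
    # Per-base increments along the three axes (illustrative helper table).
    steps = {
        "A": (1, 1, 1),
        "C": (-1, 1, -1),
        "G": (1, -1, -1),
        "T": (-1, -1, 1),
    }
    x = y = z = 0
    coords = []
    for base in seq.upper():
        # Ambiguous bases such as N leave the point where it is.
        dx, dy, dz = steps.get(base, (0, 0, 0))
        x, y, z = x + dx, y + dy, z + dz
        coords.append((x, y, z))
    return coords
```

Calling `z_curve` on a candidate ORF gives one point per nucleotide; gene finders of this family typically summarize such coordinates per codon position rather than using the raw curve.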
Affiliation(s)
- Feng-Biao Guo
- Center of Bioinformatics and Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Yan Lin
- Department of Physics, Tianjin University, Tianjin 300072, China
- Ling-Ling Chen
- College of Life Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
|
44
|
SearchDOGS bacteria, software that provides automated identification of potentially missed genes in annotated bacterial genomes. J Bacteriol 2014; 196:2030-42. [PMID: 24659774 DOI: 10.1128/jb.01368-13] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 01/15/2023]
Abstract
We report the development of SearchDOGS Bacteria, software to automatically detect missing genes in annotated bacterial genomes by combining BLAST searches with comparative genomics. Having successfully applied the approach to yeast genomes, we redeveloped SearchDOGS to function as a standalone, downloadable package, requiring only a set of GenBank annotation files as input. The software automatically generates a homology structure using reciprocal BLAST and a synteny-based method; this is followed by a scan of the entire genome of each species for unannotated genes. Results are provided in an HTML interface, providing coordinates, BLAST results, syntenic location, omega values (Ka/Ks, where Ks is the number of synonymous substitutions per synonymous site and Ka is the number of nonsynonymous substitutions per nonsynonymous site) for protein conservation estimates, and other information for each candidate gene. Using SearchDOGS Bacteria, we identified 155 gene candidates in the Shigella boydii sb227 genome, including 56 candidates of length < 60 codons. SearchDOGS Bacteria has two major advantages over currently available annotation software. First, it outperforms current methods in terms of sensitivity and is highly effective at identifying small or highly diverged genes. Second, as a freely downloadable package, it can be used with unpublished or confidential data.
|
45
|
Wozniak M, Wong L, Tiuryn J. eCAMBer: efficient support for large-scale comparative analysis of multiple bacterial strains. BMC Bioinformatics 2014; 15:65. [PMID: 24597904 PMCID: PMC4023553 DOI: 10.1186/1471-2105-15-65] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 08/19/2013] [Accepted: 02/24/2014] [Indexed: 01/12/2023]
Abstract
BACKGROUND Inconsistencies are often observed in the genome annotations of bacterial strains. Moreover, these inconsistencies are often not reflected by sequence discrepancies, but are caused by wrongly annotated gene starts as well as mis-identified gene presence. Thus, tools are needed for improving annotation consistency and accuracy among sets of bacterial strain genomes. RESULTS We have developed eCAMBer, a tool for efficiently supporting comparative analysis of multiple bacterial strains within the same species. eCAMBer is a highly optimized revision of our earlier tool, CAMBer, scaling it up for significantly larger datasets comprising hundreds of bacterial strains. eCAMBer works in two phases. First, it transfers gene annotations among all considered bacterial strains. In this phase, it also identifies homologous gene families and annotation inconsistencies. Second, eCAMBer tries to improve the quality of annotations by resolving the gene start inconsistencies and filtering out gene families arising from annotation errors propagated in the previous phase. CONCLUSIONS [corrected] eCAMBer efficiently identifies and resolves annotation inconsistencies among closely related bacterial genomes. It outperforms other competing tools both in terms of running time and accuracy of produced annotations. Software, user manual, and case study results are available at the project website: http://bioputer.mimuw.edu.pl/ecamber.
Affiliation(s)
- Michal Wozniak
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland
|
46
|
Abstract
DNAs encoding polypeptides often contain design errors that cause experiments to prematurely fail. One class of design errors is incorrect or missing elements in the DNA, here termed syntax errors. We have identified three major causes of syntax errors: point mutations from sequencing or manual data entry, gene structure misannotation, and unintended open reading frames (ORFs). The Engineered DNA Sequence Syntax Inspector (EDSSI) is an online bioinformatics pipeline that checks for syntax errors through three steps. First, ORF prediction in input DNA sequences is done by GeneMark; next, homologous sequences are retrieved by BLAST, and finally, syntax errors in the protein sequence are predicted by using the SIFT algorithm. We show that the EDSSI is able to identify previously published examples of syntactical errors and also show that our indel addition to the SIFT program is 97% accurate on a test set of Escherichia coli proteins. The EDSSI is available at http://andersonlab.qb3.berkeley.edu/Software/EDSSI/ .
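As a rough illustration of the "unintended open reading frames" class of syntax error mentioned above, the sketch below scans the forward strand of a DNA string for ATG-to-stop ORFs in all three frames. It is a naive stand-in written for this listing, not GeneMark and not the EDSSI pipeline; the function name `find_orfs` and the `min_codons` threshold are assumptions chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon, and frame is 0, 1 or 2. min_codons counts coding
    codons (start codon included, stop codon excluded).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

A design intended to encode one polypeptide would be expected to yield a single long ORF; additional hits in other frames are candidates for the unintended ORFs the abstract describes. A fuller check would also scan the reverse complement.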
Affiliation(s)
- J. Christopher Anderson
- Bioengineering Department, University of California, Berkeley, California 94720, United States
|
47
|
Bland C, Hartmann EM, Christie-Oleza JA, Fernandez B, Armengaud J. N-Terminal-oriented proteogenomics of the marine bacterium Roseobacter denitrificans OCh114 using (N-Succinimidyloxycarbonylmethyl)tris(2,4,6-trimethoxyphenyl)phosphonium bromide (TMPP) labeling and diagonal chromatography. Mol Cell Proteomics 2014; 13:1369-81. [PMID: 24536027 DOI: 10.1074/mcp.o113.032854] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Indexed: 11/06/2022]
Abstract
Given the ease of whole genome sequencing with next-generation sequencers, structural and functional gene annotation is now purely based on automated prediction. However, errors in gene structure are frequent, the correct determination of start codons being one of the main concerns. Here, we combine protein N termini derivatization using (N-Succinimidyloxycarbonylmethyl)tris(2,4,6-trimethoxyphenyl)phosphonium bromide (TMPP Ac-OSu) as a labeling reagent with the COmbined FRActional DIagonal Chromatography (COFRADIC) sorting method to enrich labeled N-terminal peptides for mass spectrometry detection. Protein digestion was performed in parallel with three proteases to obtain a reliable automatic validation of protein N termini. The analysis of these N-terminal enriched fractions by high-resolution tandem mass spectrometry allowed the annotation refinement of 534 proteins of the model marine bacterium Roseobacter denitrificans OCh114. This study is especially efficient regarding mass spectrometry analytical time. From the 534 validated N termini, 480 confirmed existing gene annotations, 41 highlighted erroneous start codon annotations, five revealed totally new mis-annotated genes; the mass spectrometry data also suggested the existence of multiple start sites for eight different genes, a result that challenges the current view of protein translation initiation. Finally, we identified several proteins for which classical genome homology-driven annotation was inconsistent, questioning the validity of automatic annotation pipelines and emphasizing the need for complementary proteomic data. All data have been deposited to the ProteomeXchange with identifier PXD000337.
Affiliation(s)
- Céline Bland
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze, F-30207, France
|
48
|
Armengaud J, Trapp J, Pible O, Geffard O, Chaumot A, Hartmann EM. Non-model organisms, a species endangered by proteogenomics. J Proteomics 2014; 105:5-18. [PMID: 24440519 DOI: 10.1016/j.jprot.2014.01.007] [Citation(s) in RCA: 100] [Impact Index Per Article: 9.1] [Received: 11/25/2013] [Revised: 12/24/2013] [Accepted: 01/07/2014] [Indexed: 10/25/2022]
Abstract
Previously, large-scale proteomics was possible only for organisms whose genomes were sequenced, meaning the most common model organisms. The use of next-generation sequencers is now changing the deal. With "proteogenomics", the use of experimental proteomics data to refine genome annotations, a higher integration of omics data is gaining ground. By extension, combining genomic and proteomic data is becoming routine in many research projects. "Proteogenomic"-flavored approaches are currently expanding, enabling the molecular studies of non-model organisms at an unprecedented depth. Today draft genomes can be obtained using next-generation sequencers in a rather straightforward way and at a reasonable cost for any organism. Unfinished genome sequences can be used to interpret tandem mass spectrometry proteomics data without the need for time-consuming genome annotation, and the use of RNA-seq to establish nucleotide sequences that are directly translated into protein sequences appears promising. There are, however, certain drawbacks that deserve further attention for RNA-seq to become more efficient. Here, we discuss the opportunities of working with non-model organisms, the proteomic methods that have been used until now, and the dramatic improvements proffered by proteogenomics. These put the distinction between model and non-model organisms in great danger, at least in terms of proteomics! BIOLOGICAL SIGNIFICANCE Model organisms have been crucial for in-depth analysis of cellular and molecular processes of life. Focusing the efforts of thousands of researchers on the Escherichia coli bacterium, Saccharomyces cerevisiae yeast, Arabidopsis thaliana plant, Danio rerio fish and other models for which genetic manipulation was possible was certainly worthwhile in terms of fundamental and invaluable biological insights.
Until recently, proteomics of non-model organisms was limited to tedious, homology-based techniques, but today draft genomes or RNA-seq data can be straightforwardly obtained using next-generation sequencers, allowing the establishment of a draft protein database for any organism. Thus, proteogenomics opens new perspectives for molecular studies of non-model organisms, although they are still difficult experimental organisms. This article is part of a Special Issue entitled: Proteomics of non-model organisms.
Affiliation(s)
- Jean Armengaud
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze F-30207, France
- Judith Trapp
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze F-30207, France; Irstea, UR MALY, F-69626 Villeurbanne, France
- Olivier Pible
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze F-30207, France
- Erica M Hartmann
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze F-30207, France
|
49
|
Coker OO, Warit S, Rukseree K, Summpunn P, Prammananan T, Palittapongarnpim P. Functional characterization of two members of histidine phosphatase superfamily in Mycobacterium tuberculosis. BMC Microbiol 2013; 13:292. [PMID: 24330471 PMCID: PMC3866925 DOI: 10.1186/1471-2180-13-292] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 07/05/2013] [Accepted: 12/07/2013] [Indexed: 01/19/2023]
Abstract
Background Functional characterization of genes in important pathogenic bacteria such as Mycobacterium tuberculosis is imperative. Rv2135c, which was originally annotated as conserved hypothetical, has been found to be associated with membrane protein fractions of the H37Rv strain. The gene appears to contain a histidine phosphatase motif common to both cofactor-dependent phosphoglycerate mutases and acid phosphatases in the histidine phosphatase superfamily. The functions of many of the members of this superfamily are annotated based only on similarity to known proteins using automatic annotation systems, which can be erroneous. In addition, the motif at the N terminus of Rv2135c is ‘RHA’, unlike the ‘RHG’ found in most members of the histidine phosphatase superfamily. These observations call for its experimental characterization. The crystal structure of Rv0489, another member of the histidine phosphatase superfamily in M. tuberculosis, has been previously reported. However, its biochemical characteristics remain unknown. In this study, Rv2135c and Rv0489 from M. tuberculosis were cloned and expressed in Escherichia coli with 6 histidine residues tagged at the C terminus. Results Characterization of the purified recombinant proteins revealed that Rv0489 possesses phosphoglycerate mutase activity while Rv2135c does not. However, Rv2135c has an acid phosphatase activity with an optimal pH of 5.8. Kinetic parameters of Rv2135c and Rv0489 were studied, confirming that Rv0489 is a cofactor-dependent phosphoglycerate mutase of M. tuberculosis. Additional characterization showed that Rv2135c exists as a tetramer while Rv0489 exists as a dimer in solution. Conclusion Most of the proteins orthologous to Rv2135c in other bacteria are annotated as phosphoglycerate mutases or hypothetical proteins. It is possible that they are actually phosphatases.
Experimental characterization of a sufficiently large number of bacterial histidine phosphatases will increase the accuracy of the automatic annotation systems towards a better understanding of this important group of enzymes.
Affiliation(s)
- Prasit Palittapongarnpim
- Department of Microbiology, Faculty of Science, Mahidol University, Rama 6 Road, Bangkok 10400, Thailand.
|
50
|
Tatusova T, Ciufo S, Fedorov B, O'Neill K, Tolstoy I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res 2013; 42:D553-9. [PMID: 24316578 PMCID: PMC3965038 DOI: 10.1093/nar/gkt1274] [Citation(s) in RCA: 318] [Impact Index Per Article: 26.5] [Indexed: 11/12/2022]
Abstract
The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for bioinformatics tools for annotation, analysis and visualization. New strategies have been developed for the annotation and representation of reference genomes and of sequence variations derived from population studies and clinical outbreaks.
Affiliation(s)
- Tatiana Tatusova
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg. 38A 8600 Rockville Pike, Bethesda, MD 20894, USA
|