1
Erkelens AM, van Erp B, Meijer WJJ, Dame RT. Rok from B. subtilis: Bridging genome structure and transcription regulation. Mol Microbiol 2025; 123:109-123. [PMID: 38511404 PMCID: PMC11841835 DOI: 10.1111/mmi.15250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 03/02/2024] [Accepted: 03/07/2024] [Indexed: 03/22/2024]
Abstract
Bacterial genomes are folded and organized into compact yet dynamic structures, called nucleoids. Nucleoid orchestration involves many factors at multiple length scales, such as nucleoid-associated proteins and liquid-liquid phase separation, and has to be compatible with replication and transcription. Possibly, genome organization plays an intrinsic role in transcription regulation, in addition to classical transcription factors. In this review, we provide arguments supporting this view using the Gram-positive bacterium Bacillus subtilis as a model. Proteins BsSMC, HBsu and Rok all impact the structure of the B. subtilis chromosome. Particularly for Rok, there is compelling evidence that it combines its structural function with a role as global gene regulator. Many studies describe either function of Rok, but rarely both are addressed at the same time. Here, we review both sides of the coin and integrate them into one model. Rok forms unusually stable DNA-DNA bridges and this ability likely underlies its repressive effect on transcription by either preventing RNA polymerase from binding to DNA or trapping it inside DNA loops. Partner proteins are needed to change or relieve Rok-mediated gene repression. Lastly, we investigate which features characterize H-NS-like proteins, a family that, at present, lacks a clear definition.
Affiliation(s)
- Amanda M. Erkelens
  - Leiden Institute of Chemistry, Leiden University, Leiden, the Netherlands
  - Centre for Microbial Cell Biology, Leiden University, Leiden, the Netherlands
  - Centre for Interdisciplinary Genome Research, Leiden University, Leiden, the Netherlands
  - Present address: Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands
- Bert van Erp
  - Leiden Institute of Chemistry, Leiden University, Leiden, the Netherlands
  - Centre for Microbial Cell Biology, Leiden University, Leiden, the Netherlands
  - Centre for Interdisciplinary Genome Research, Leiden University, Leiden, the Netherlands
- Wilfried J. J. Meijer
  - Centro de Biología Molecular Severo Ochoa (CSIC-UAM), C. Nicolás Cabrera 1, Universidad Autónoma, Madrid, Spain
- Remus T. Dame
  - Leiden Institute of Chemistry, Leiden University, Leiden, the Netherlands
  - Centre for Microbial Cell Biology, Leiden University, Leiden, the Netherlands
  - Centre for Interdisciplinary Genome Research, Leiden University, Leiden, the Netherlands
2
Burks DJ, Azad RK. Mapping Strengths and Weaknesses of Different Clustering Approaches to Deciphering Bacterial Chimerism. OMICS: A Journal of Integrative Biology 2022; 26:422-439. [PMID: 35925817 DOI: 10.1089/omi.2022.0062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Bacterial genomes are chimeras of DNA of different ancestries. Deconstructing chimeric genomes is central to understanding the evolutionary trajectories of their disparate components and thus the organisms as a whole in the light of their evolutionary contexts. Of specific interest is to delineate and quantify native (vertically inherited) and alien (horizontally acquired) components of bacterial genomes and also specify genomic fractions that represent different donor sources. An agglomerative clustering procedure that prioritizes grouping of proximal similar genomic segments has previously been invoked for this purpose in conjunction with a recursive segmentation procedure. Surprisingly, however, the relative strengths and weaknesses of different clustering approaches to deciphering bacterial chimerism have not yet been investigated, despite the need to robustly interpret tens of thousands of completely sequenced bacterial genomes and nearly complete genome assemblies available in the public databases. To bridge this knowledge gap and develop more robust approaches, we assessed different clustering methods, including segment order based (proximal) clustering, hierarchical clustering, affinity propagation clustering, and a novel network clustering approach on chimeric genomes modeled after bacterial genomes representing a broad spectrum of compositional complexity. Although segment order-based clustering and network clustering compared favorably with the other approaches in discriminating between native and alien DNA at genome optimized settings, network clustering did consistently better than other methods at parametric settings optimized on all test genomes together. Segment order-based clustering and hierarchical clustering outperformed other methods in alien DNA identification while preserving donor identity in the genomes. 
Our study highlights the strengths and weaknesses of different approaches and suggests how this can be leveraged to achieve a more robust deconstruction of bacterial chimerism.
Affiliation(s)
- David J Burks
  - Department of Biological Sciences, BioDiscovery Institute, University of North Texas, Denton, Texas, USA
- Rajeev K Azad
  - Department of Biological Sciences, BioDiscovery Institute, University of North Texas, Denton, Texas, USA
  - Department of Mathematics, University of North Texas, Denton, Texas, USA
3
Pandey RS, Azad RK. Factors That Influence the Choice of Markov Model Order in Discriminating DNA Sequences from Different Sources. OMICS: A Journal of Integrative Biology 2022; 26:348-355. [PMID: 35648077 DOI: 10.1089/omi.2022.0043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Markov models have frequently been used in genetic sequence analysis. The number of parameters of a Markov model increases exponentially with model order, so it is often recommended that the order be chosen based on the size of data being modeled, lower orders for small and higher orders for large dataset sizes. Approaches based on model selection criterion have also been proposed. An important problem in microbiology and evolutionary biology is to decipher chimeric genomes of microbes, particularly, identify segments of distinct ancestries in genomes and reconstruct the plausible evolutionary scenarios that might have shaped the chimeric genomes in the microbial world. In this study, we assessed a Markov model-based segmentation method for its ability to detect compositionally disparate segments in chimeric sequence constructs as a function of model order, sequence length, and phylogenetic divergence. Our results show that the choice of Markov model order depends on both sequence size and composition. Higher order Markov models were found to be more effective in delineating sequence segments arising from closely related organisms in longer constructs; on the other hand, lower order Markov models were found to be more appropriate in delineating sequence segments arising from distantly related organisms in shorter constructs. These findings are important and timely, with broad implications in fields such as epidemiology that has to deal with the emergence of novel pathogenic chimeras that arise by foreign DNA acquisition, and ecology where chimeric structures may arise in various ecosystems, necessitating more robust approaches for their deconstruction and interpretation.
Affiliation(s)
- Ravi S Pandey
  - Department of Biological Sciences, BioDiscovery Institute, University of North Texas, Denton, Texas, USA
- Rajeev K Azad
  - Department of Biological Sciences, BioDiscovery Institute, University of North Texas, Denton, Texas, USA
  - Department of Mathematics, University of North Texas, Denton, Texas, USA
4
Multi-omics Approach Reveals How Yeast Extract Peptides Shape Streptococcus thermophilus Metabolism. Appl Environ Microbiol 2020; 86:AEM.01446-20. [PMID: 32769193 DOI: 10.1128/aem.01446-20] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 06/18/2020] [Accepted: 08/04/2020] [Indexed: 12/28/2022]
Abstract
Peptides present in growth media are essential for nitrogen nutrition and optimal growth of lactic acid bacteria. In addition, according to their amino acid composition, they can also directly or indirectly play regulatory roles and influence global metabolism. This is especially relevant during the propagation phase to produce high cell counts of active lactic acid bacteria used as starters in the dairy industry. In the present work, we aimed at investigating how the respective compositions of two different yeast extracts, with a specific focus on peptide content, influenced Streptococcus thermophilus metabolism during growth under pH-controlled conditions. In addition to free amino acid quantification, we used a multi-omics approach (peptidomics, proteomics, and transcriptomics) to identify peptides initially present in the two culture media and to follow S. thermophilus gene expression and bacterial protein production during growth. The free amino acid and peptide compositions of the two yeast extracts differed qualitatively and quantitatively. Nevertheless, the two yeast extracts sustained similar levels of growth of S. thermophilus and led to equivalent final biomasses. However, transcriptomics and proteomics showed differential gene expression and protein production in several S. thermophilus metabolic pathways, especially amino acid, citrate, urease, purine, and pyrimidine metabolisms. The probable role of the regulator CodY is discussed in this context. Moreover, we observed significant differences in the production of regulators and of a quorum sensing regulatory system. The possible roles of yeast extract peptides on the modulation of the quorum sensing system expression are evaluated.
IMPORTANCE: Improving the performance and industrial robustness of bacteria used in fermentations and food industry remains a challenge. We showed here that two Streptococcus thermophilus fermentations, performed with the same strain in media that differ only by their yeast extract compositions and, more especially, their peptide contents, led to similar growth kinetics and final biomasses, but several genes and proteins were differentially expressed/produced. In other words, subtle variations in peptide composition of the growth medium can finely tune the metabolism status of the starter. Our work, therefore, suggests that acting on growth medium components and especially on their peptide content, we could modulate bacterial metabolism and produce bacteria differently programmed for further purposes. This might have applications for preparing active starter cultures.
5
Lingeswaran A, Metton C, Henry C, Monnet V, Juillard V, Gardan R. Export of Rgg Quorum Sensing Peptides is Mediated by the PptAB ABC Transporter in Streptococcus Thermophilus Strain LMD-9. Genes (Basel) 2020; 11:genes11091096. [PMID: 32961685 PMCID: PMC7564271 DOI: 10.3390/genes11091096] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/23/2020] [Revised: 09/10/2020] [Accepted: 09/17/2020] [Indexed: 12/26/2022]
Abstract
In streptococci, intracellular quorum sensing pathways are based on quorum-sensing systems that are responsible for peptide secretion, maturation, and reimport. These peptides then interact with Rgg or ComR transcriptional regulators in the Rap, Rgg, NprR, PlcR, and PrgX (RRNPP) family, whose members are found in Gram-positive bacteria. Short hydrophobic peptides (SHP) interact with Rgg whereas ComS peptides interact with ComR regulators. To date, in Streptococcus thermophilus, peptide secretion, maturation, and extracellular fate have received little attention, even though this species has several (at least five) genes encoding Rgg regulators and one encoding a ComR regulator. We studied pheromone export in this species, focusing our attention on PptAB, which is an exporter of signaling peptides previously identified in Enterococcus faecalis, pathogenic streptococci and Staphylococcus aureus. In the S. thermophilus strain LMD-9, we showed that PptAB controlled three regulation systems, two SHP/Rgg systems (SHP/Rgg1358 and SHP/Rgg1299), and the ComS/ComR system, while using transcriptional fusions and that PptAB helped to produce and export at least three different mature SHPs (SHP1358, SHP1299, and SHP279) peptides while using liquid chromatography-tandem mass spectrometry (LC-MS/MS). Using a deep sequencing approach (RNAseq), we showed that the exporter PptAB, the membrane protease Eep, and the oligopeptide importer Ami controlled the transcription of the genes that were located downstream from the five non-truncated rgg genes as well as few distal genes. This led us to propose that the five non-truncated shp/rgg loci were functional. Only three shp genes were expressed in our experimental condition. Thus, this transcriptome analysis also highlighted the complex interconnected network that exists between SHP/Rgg systems, where a few homologous signaling peptides likely interact with different regulators.
6
Arndt D, Marcu A, Liang Y, Wishart DS. PHAST, PHASTER and PHASTEST: Tools for finding prophage in bacterial genomes. Brief Bioinform 2020; 20:1560-1567. [PMID: 29028989 DOI: 10.1093/bib/bbx121] [Citation(s) in RCA: 136] [Impact Index Per Article: 27.2] [Received: 05/30/2017] [Revised: 07/31/2017] [Indexed: 11/13/2022]
Abstract
PHAST (PHAge Search Tool) and its successor PHASTER (PHAge Search Tool - Enhanced Release) have become two of the most widely used web servers for identifying putative prophages in bacterial genomes. Here we review the main capabilities of these web resources, provide some practical guidance regarding their use and discuss possible future improvements. PHAST, which was first described in 2011, made its debut just as whole bacterial genome sequencing was becoming inexpensive and relatively routine. PHAST quickly gained popularity among bacterial genome researchers because of its web accessibility, its ease of use along with its enhanced accuracy and rapid processing times. PHASTER, which appeared in 2016, provided a number of much-needed enhancements to the PHAST server, including greater processing speed (to cope with very large submission volumes), increased database sizes, a more modern user interface, improved graphical displays and support for metagenomic submissions. Continuing developments in the field, along with increased interest in automated phage and prophage finding, have already led to several improvements to the PHASTER server and will soon lead to the development of a successor to PHASTER (to be called PHASTEST).
7
Tang K, Lu YY, Sun F. Background Adjusted Alignment-Free Dissimilarity Measures Improve the Detection of Horizontal Gene Transfer. Front Microbiol 2018; 9:711. [PMID: 29713314 PMCID: PMC5911508 DOI: 10.3389/fmicb.2018.00711] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 01/05/2018] [Accepted: 03/27/2018] [Indexed: 11/20/2022]
Abstract
Horizontal gene transfer (HGT) plays an important role in the evolution of microbial organisms including bacteria. Alignment-free methods based on single genome compositional information have been used to detect HGT. Currently, Manhattan and Euclidean distances based on tetranucleotide frequencies are the most commonly used alignment-free dissimilarity measures to detect HGT. By testing on simulated bacterial sequences and real data sets with known horizontal transferred genomic regions, we found that more advanced alignment-free dissimilarity measures such as CVTree and d2* that take into account the background Markov sequences can solve HGT detection problems with significantly improved performance. We also studied the influence of different factors such as evolutionary distance between host and donor sequences, size of sliding window, and host genome composition on the performances of alignment-free methods to detect HGT. Our study showed that alignment-free methods can predict HGT accurately when host and donor genomes are in different order levels. Among all methods, CVTree with word length of 3, d2* with word length 3, Markov order 1 and d2* with word length 4, Markov order 1 outperform others in terms of their highest F1-score and their robustness under the influence of different factors.
Affiliation(s)
- Kujin Tang
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
- Yang Young Lu
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
- Fengzhu Sun
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
  - Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
8
Duchaud E, Rochat T, Habib C, Barbier P, Loux V, Guérin C, Dalsgaard I, Madsen L, Nilsen H, Sundell K, Wiklund T, Strepparava N, Wahli T, Caburlotto G, Manfrin A, Wiens GD, Fujiwara-Nagata E, Avendaño-Herrera R, Bernardet JF, Nicolas P. Genomic Diversity and Evolution of the Fish Pathogen Flavobacterium psychrophilum. Front Microbiol 2018; 9:138. [PMID: 29467746 PMCID: PMC5808330 DOI: 10.3389/fmicb.2018.00138] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 07/27/2017] [Accepted: 01/22/2018] [Indexed: 12/04/2022]
Abstract
Flavobacterium psychrophilum, the etiological agent of rainbow trout fry syndrome and bacterial cold-water disease in salmonid fish, is currently one of the main bacterial pathogens hampering the productivity of salmonid farming worldwide. In this study, the genomic diversity of the F. psychrophilum species is analyzed using a set of 41 genomes, including 30 newly sequenced isolates. These were selected on the basis of available MLST data with the two-fold objective of maximizing the coverage of the species diversity and of allowing a focus on the main clonal complex (CC-ST10) infecting farmed rainbow trout (Oncorhynchus mykiss) worldwide. The results reveal a bacterial species harboring a limited genomic diversity both in terms of nucleotide diversity, with ~0.3% nucleotide divergence inside CDSs in pairwise genome comparisons, and in terms of gene repertoire, with the core genome accounting for ~80% of the genes in each genome. The pan-genome seems nevertheless “open” according to the scaling exponent of a power-law fitted on the rate of new gene discovery when genomes are added one-by-one. Recombination is a key component of the evolutionary process of the species as seen in the high level of apparent homoplasy in the core genome. Using a Hidden Markov Model to delineate recombination tracts in pairs of closely related genomes, the average recombination tract length was estimated to ~4.0 Kbp and the typical ratio of the contributions of recombination and mutations to nucleotide-level differentiation (r/m) was estimated to ~13. Within CC-ST10, evolutionary distances computed on non-recombined regions and comparisons between 22 isolates sampled up to 27 years apart suggest a most recent common ancestor in the second half of the nineteenth century in North America with subsequent diversification and transmission of this clonal complex coinciding with the worldwide expansion of rainbow trout farming. 
With the goal to promote the development of tools for the genetic manipulation of F. psychrophilum, a particular attention was also paid to plasmids. Their extraction and sequencing to completion revealed plasmid diversity that remained hidden to classical plasmid profiling due to size similarities.
Affiliation(s)
- Eric Duchaud
  - Unité de Virologie et Immunologie Moléculaires (VIM), Institut National de la Recherche Agronomique, Université Paris-Saclay, Jouy-en-Josas, France
- Tatiana Rochat
  - Unité de Virologie et Immunologie Moléculaires (VIM), Institut National de la Recherche Agronomique, Université Paris-Saclay, Jouy-en-Josas, France
- Christophe Habib
  - Unité de Virologie et Immunologie Moléculaires (VIM), Institut National de la Recherche Agronomique, Université Paris-Saclay, Jouy-en-Josas, France
  - Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement (MaIAGE), Institut National de la Recherche Agronomique, Université Paris-Saclay, Jouy-en-Josas, France
- Paul Barbier
  - Unité de Virologie et Immunologie Moléculaires (VIM), Institut National de la Recherche Agronomique, Université Paris-Saclay, Jouy-en-Josas, France
- Valentin Loux
  - Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement (MaIAGE), Institut National de la Recherche Agronomique, Université Paris-Saclay, Jouy-en-Josas, France
- Cyprien Guérin
  - Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement (MaIAGE), Institut National de la Recherche Agronomique, Université Paris-Saclay, Jouy-en-Josas, France
- Inger Dalsgaard
  - Section for Bacteriology and Pathology, National Veterinary Institute, Technical University of Denmark, Kgs. Lyngby, Denmark
- Lone Madsen
  - Section for Bacteriology and Pathology, National Veterinary Institute, Technical University of Denmark, Kgs. Lyngby, Denmark
- Hanne Nilsen
  - Department of Aquatic Animal Health, Norwegian Veterinary Institute, Bergen, Norway
- Krister Sundell
  - Laboratory of Aquatic Pathobiology, Environmental and Marine Biology, Faculty of Science and Engineering, Åbo Akademi University, Turku, Finland
- Tom Wiklund
  - Laboratory of Aquatic Pathobiology, Environmental and Marine Biology, Faculty of Science and Engineering, Åbo Akademi University, Turku, Finland
- Nicole Strepparava
  - Laboratory of Applied Microbiology, Department for Environment Constructions and Design, University of Applied Sciences and Arts of Southern Switzerland (SUPSI), Bellinzona, Switzerland
- Thomas Wahli
  - Centre for Fish and Wildlife Health (FIWI), University of Bern, Bern, Switzerland
- Greta Caburlotto
  - Department of Fish Pathology, Istituto Zooprofilattico Sperimentale delle Venezie, Legnaro, Italy
- Amedeo Manfrin
  - Department of Fish Pathology, Istituto Zooprofilattico Sperimentale delle Venezie, Legnaro, Italy
- Gregory D Wiens
  - National Center for Cool and Cold Water Aquaculture, Agricultural Research Service, United States Department of Agriculture, Kearneysville, WV, United States
- Ruben Avendaño-Herrera
  - Departamento Facultad de Ciencias Biológicas, Universidad Andres Bello, Viña del Mar; Interdisciplinary Center for Aquaculture Research, Concepción, Chile
- Jean-François Bernardet
  - Unité de Virologie et Immunologie Moléculaires (VIM), Institut National de la Recherche Agronomique, Université Paris-Saclay, Jouy-en-Josas, France
- Pierre Nicolas
  - Unité Mathématiques et Informatique Appliquées du Génome à l'Environnement (MaIAGE), Institut National de la Recherche Agronomique, Université Paris-Saclay, Jouy-en-Josas, France
9
Lapuyade-Lahorgue J, Xue JH, Ruan S. Segmenting Multi-Source Images Using Hidden Markov Fields With Copula-Based Multivariate Statistical Distributions. IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society 2017; 26:3187-3195. [PMID: 28333631 DOI: 10.1109/tip.2017.2685345] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Indexed: 06/06/2023]
Abstract
Nowadays, multi-source image acquisition attracts an increasing interest in many fields, such as multi-modal medical image segmentation. Such acquisition aims at considering complementary information to perform image segmentation, since the same scene has been observed by various types of images. However, strong dependence often exists between multi-source images. This dependence should be taken into account when we try to extract joint information for precisely making a decision. In order to statistically model this dependence between multiple sources, we propose a novel multi-source fusion method based on the Gaussian copula. The proposed fusion model is integrated in a statistical framework with the hidden Markov field inference in order to delineate a target volume from multi-source images. Estimation of parameters of the models and segmentation of the images are jointly performed by an iterative algorithm based on Gibbs sampling. Experiments are performed on multi-sequence MRI to segment tumors. The results show that the proposed method based on the Gaussian copula is effective to accomplish multi-source image segmentation.
10
Jani M, Mathee K, Azad RK. Identification of Novel Genomic Islands in Liverpool Epidemic Strain of Pseudomonas aeruginosa Using Segmentation and Clustering. Front Microbiol 2016; 7:1210. [PMID: 27536294 PMCID: PMC4971588 DOI: 10.3389/fmicb.2016.01210] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 03/08/2016] [Accepted: 07/20/2016] [Indexed: 02/03/2023]
Abstract
Pseudomonas aeruginosa is an opportunistic pathogen implicated in a myriad of infections and a leading pathogen responsible for mortality in patients with cystic fibrosis (CF). Horizontal transfers of genes among the microorganisms living within CF patients have led to highly virulent and multi-drug resistant strains such as the Liverpool epidemic strain of P. aeruginosa, namely the LESB58 strain that has the propensity to acquire virulence and antibiotic resistance genes. Often these genes are acquired in large clusters, referred to as "genomic islands (GIs)." To decipher GIs and understand their contributions to the evolution of virulence and antibiotic resistance in P. aeruginosa LESB58, we utilized a recursive segmentation and clustering procedure, presented here as a genome-mining tool, "GEMINI." GEMINI was validated on experimentally verified islands in the LESB58 strain before examining its potential to decipher novel islands. Of the 6062 genes in P. aeruginosa LESB58, 596 genes were identified to be resident on 20 GIs of which 12 have not been previously reported. Comparative genomics provided evidence in support of our novel predictions. Furthermore, GEMINI unraveled the mosaic structure of islands that are composed of segments of likely different evolutionary origins, and demonstrated its ability to identify potential strain biomarkers. These newly found islands likely have contributed to the hyper-virulence and multidrug resistance of the Liverpool epidemic strain of P. aeruginosa.
Affiliation(s)
- Mehul Jani
  - Department of Biological Sciences, University of North Texas, Denton, TX, USA
- Kalai Mathee
  - Department of Human and Molecular Genetics, Herbert Wertheim College of Medicine Global Health Consortium, and Biomolecular Sciences Institute, Florida International University, Miami, FL, USA
- Rajeev K Azad
  - Department of Biological Sciences, University of North Texas, Denton, TX, USA
  - Department of Mathematics, University of North Texas, Denton, TX, USA
11
Caprari S, Metzler S, Lengauer T, Kalinina OV. Sequence and Structure Analysis of Distantly-Related Viruses Reveals Extensive Gene Transfer between Viruses and Hosts and among Viruses. Viruses 2015; 7:5388-409. [PMID: 26492264 PMCID: PMC4632390 DOI: 10.3390/v7102882] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 08/03/2015] [Revised: 10/08/2015] [Accepted: 10/09/2015] [Indexed: 12/20/2022]
Abstract
The origin and evolution of viruses is a subject of ongoing debate. In this study, we provide a full account of the evolutionary relationships between proteins of significant sequence and structural similarity found in viruses that belong to different classes according to the Baltimore classification. We show that such proteins can be found in viruses from all Baltimore classes. For protein families that include these proteins, we observe two patterns of the taxonomic spread. In the first pattern, they can be found in a large number of viruses from all implicated Baltimore classes. In the other pattern, the instances of the corresponding protein in species from each Baltimore class are restricted to a few compact clades. Proteins with the first pattern of distribution are products of so-called viral hallmark genes reported previously. Additionally, this pattern is displayed by the envelope glycoproteins from Flaviviridae and Bunyaviridae and helicases of superfamilies 1 and 2 that have homologs in cellular organisms. The second pattern can often be explained by horizontal gene transfer from the host or between viruses, an example being Orthomyxoviridae and Coronaviridae hemagglutinin esterases. Another facet of horizontal gene transfer comprises multiple independent introduction events of genes from cellular organisms into otherwise unrelated viruses.
Affiliation(s)
- Silvia Caprari
  - Department for Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany
- Saskia Metzler
  - Department for Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany
  - Saarbrücken Graduate School of Computer Science, University of Saarland, Campus E1 3, 66123 Saarbrücken, Germany
- Thomas Lengauer
  - Department for Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany
- Olga V Kalinina
  - Department for Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany
12
Rey J, Deschavanne P, Tuffery P. BactPepDB: a database of predicted peptides from a exhaustive survey of complete prokaryote genomes. Database: The Journal of Biological Databases and Curation 2014; 2014:bau106. [PMID: 25377257 PMCID: PMC4221844 DOI: 10.1093/database/bau106] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 11/30/2022]
Abstract
With the recent progress in complete genome sequencing, mining the increasing amount of genomic information available should in theory provide the means to discover new classes of peptides. However, annotation pipelines often do not consider small reading frames likely to be expressed. BactPepDB, available online at http://bactpepdb.rpbs.univ-paris-diderot.fr, is a database that aims at providing an exhaustive re-annotation of all complete prokaryotic genomes (chromosomal and plasmid DNA) available in RefSeq for coding sequences ranging between 10 and 80 amino acids. The identified peptides are classified as (i) previously identified in RefSeq, (ii) entity-overlapping (intragenic) or intergenic, and (iii) potential pseudogenes, that is, intergenic sequences corresponding to a portion of a previously annotated larger gene. Additional information is related to homologs within order, predicted signal sequence, transmembrane segments, disulfide bonds, secondary structure, and the existence of a related 3D structure in the Protein Databank. As a result, BactPepDB provides insights about candidate peptides, and provides information about their conservation, together with some of their expected biological/structural features. The BactPepDB interface allows users to search for candidate peptides in the database, or to search for peptides similar to a query, according to the multiple properties predicted or related to genomic localization. Database URL: http://www.yeastgenome.org/
Affiliation(s)
- Julien Rey
- INSERM, U973, MTi, F-75205 Paris, France, Université Paris Diderot, Sorbonne Paris Cité, F-75205 Paris, France and RPBS, F-75205 Paris, France
- Patrick Deschavanne
- INSERM, U973, MTi, F-75205 Paris, France, Université Paris Diderot, Sorbonne Paris Cité, F-75205 Paris, France and RPBS, F-75205 Paris, France
- Pierre Tuffery
- INSERM, U973, MTi, F-75205 Paris, France, Université Paris Diderot, Sorbonne Paris Cité, F-75205 Paris, France and RPBS, F-75205 Paris, France
|
13
|
Algama M, Keith JM. Investigating genomic structure using changept: A Bayesian segmentation model. Comput Struct Biotechnol J 2014; 10:107-15. [PMID: 25349679 PMCID: PMC4204429 DOI: 10.1016/j.csbj.2014.08.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Abstract
Genomes are composed of a wide variety of elements with distinct roles and characteristics. Some of these elements are well-characterised functional components such as protein-coding exons. Other elements play regulatory or structural roles, encode functional non-protein-coding RNAs, or perform some other function yet to be characterised. Still others may have no functional importance, though they may nevertheless be of interest to biologists. One technique for investigating the composition of genomes is to segment sequences into compositionally homogenous blocks. This technique, known as 'sequence segmentation' or 'change-point analysis', is used to identify patterns of variation across genomes such as GC-rich and GC-poor regions, coding and non-coding regions, slowly evolving and rapidly evolving regions and many other types of variation. In this mini-review we outline many of the genome segmentation methods currently available and then focus on a Bayesian DNA segmentation algorithm, with examples of its various applications.
Affiliation(s)
- Manjula Algama
- School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
- Jonathan M Keith
- School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
|
14
|
Abstract
Since the emergence of high-throughput genome sequencing platforms and more recently the next-generation platforms, the genome databases are growing at an astronomical rate. Tremendous efforts have been invested in recent years in understanding intriguing complexities beneath the vast ocean of genomic data. This is apparent in the spurt of computational methods for interpreting these data in the past few years. Genomic data interpretation is notoriously difficult, partly owing to the inherent heterogeneities appearing at different scales. Methods developed to interpret these data often suffer from their inability to adequately measure the underlying heterogeneities and thus lead to confounding results. Here, we present an information entropy-based approach that unravels the distinctive patterns underlying genomic data efficiently and thus is applicable in addressing a variety of biological problems. We show the robustness and consistency of the proposed methodology in addressing three different biological problems of significance—identification of alien DNAs in bacterial genomes, detection of structural variants in cancer cell lines and alignment-free genome comparison.
Affiliation(s)
- Rajeev K Azad
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA.
|
15
|
Nicolas P, Mäder U, Dervyn E, Rochat T, Leduc A, Pigeonneau N, Bidnenko E, Marchadier E, Hoebeke M, Aymerich S, Becher D, Bisicchia P, Botella E, Delumeau O, Doherty G, Denham EL, Fogg MJ, Fromion V, Goelzer A, Hansen A, Härtig E, Harwood CR, Homuth G, Jarmer H, Jules M, Klipp E, Le Chat L, Lecointe F, Lewis P, Liebermeister W, March A, Mars RAT, Nannapaneni P, Noone D, Pohl S, Rinn B, Rügheimer F, Sappa PK, Samson F, Schaffer M, Schwikowski B, Steil L, Stülke J, Wiegert T, Devine KM, Wilkinson AJ, van Dijl JM, Hecker M, Völker U, Bessières P, Noirot P. Condition-dependent transcriptome reveals high-level regulatory architecture in Bacillus subtilis. Science 2012; 335:1103-6. [PMID: 22383849 DOI: 10.1126/science.1206848] [Citation(s) in RCA: 690] [Impact Index Per Article: 53.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Abstract
Bacteria adapt to environmental stimuli by adjusting their transcriptomes in a complex manner, the full potential of which has yet to be established for any individual bacterial species. Here, we report the transcriptomes of Bacillus subtilis exposed to a wide range of environmental and nutritional conditions that the organism might encounter in nature. We comprehensively mapped transcription units (TUs) and grouped 2935 promoters into regulons controlled by various RNA polymerase sigma factors, accounting for ~66% of the observed variance in transcriptional activity. This global classification of promoters and detailed description of TUs revealed that a large proportion of the detected antisense RNAs arose from potentially spurious transcription initiation by alternative sigma factors and from imperfect control of transcription termination.
Affiliation(s)
- Pierre Nicolas
- INRA, UR1077, Mathématique Informatique et Génome, Jouy-en-Josas, France
|
16
|
Complete genome sequence of the fish pathogen Flavobacterium branchiophilum. Appl Environ Microbiol 2011; 77:7656-62. [PMID: 21926215 DOI: 10.1128/aem.05625-11] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Members of the genus Flavobacterium occur in a variety of ecological niches and represent an interesting diversity of lifestyles. Flavobacterium branchiophilum is the main causative agent of bacterial gill disease, a severe condition affecting various cultured freshwater fish species worldwide, in particular salmonids in Canada and Japan. We report here the complete genome sequence of strain FL-15 isolated from a diseased sheatfish (Silurus glanis) in Hungary. The analysis of the F. branchiophilum genome revealed putative mechanisms of pathogenicity strikingly different from those of the other, closely related fish pathogen Flavobacterium psychrophilum, including the first cholera-like toxin in a non-Proteobacteria and a wealth of adhesins. The comparison with available genomes of other Flavobacterium species revealed a small genome size, large differences in chromosome organization, and fewer rRNA and tRNA genes, in line with its more fastidious growth. In addition, horizontal gene transfer shaped the evolution of F. branchiophilum, as evidenced by its virulence factors, genomic islands, and CRISPR (clustered regularly interspaced short palindromic repeats) systems. Further functional analysis should help in the understanding of host-pathogen interactions and in the development of rational diagnostic tools and control strategies in fish farms.
|
17
|
Abstract
PHAge Search Tool (PHAST) is a web server designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage ‘cornerstone’ feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and Genbank files, provide richly annotated tables on prophage features and prophage ‘quality’ and distinguish between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views. PHAST is available at (http://phast.wishartlab.com).
Affiliation(s)
- You Zhou
- Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada T6G 2E8
|
18
|
Eng C, Thibessard A, Danielsen M, Rasmussen TB, Mari JF, Leblond P. In silico prediction of horizontal gene transfer in Streptococcus thermophilus. Arch Microbiol 2011; 193:287-97. [DOI: 10.1007/s00203-010-0671-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2010] [Revised: 12/20/2010] [Accepted: 12/21/2010] [Indexed: 10/18/2022]
|
19
|
Akakpo N. Estimating a discrete distribution via histogram selection. ESAIM-PROBAB STAT 2011. [DOI: 10.1051/ps/2009007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
|
20
|
Smits WK, Grossman AD. The transcriptional regulator Rok binds A+T-rich DNA and is involved in repression of a mobile genetic element in Bacillus subtilis. PLoS Genet 2010; 6:e1001207. [PMID: 21085634 PMCID: PMC2978689 DOI: 10.1371/journal.pgen.1001207] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2010] [Accepted: 10/13/2010] [Indexed: 11/25/2022] Open
Abstract
The rok gene of Bacillus subtilis was identified as a negative regulator of competence development. It also controls expression of several genes not related to competence. We found that Rok binds to extended regions of the B. subtilis genome. These regions are characterized by a high A+T content and are known or believed to have been acquired by horizontal gene transfer. Some of the Rok binding regions are in known mobile genetic elements. A deletion of rok resulted in higher excision of one such element, ICEBs1, a conjugative transposon found integrated in the B. subtilis genome. When expressed in the Gram negative E. coli, Rok also associated with A+T-rich DNA and a conserved C-terminal region of Rok contributed to this association. Together with previous work, our findings indicate that Rok is a nucleoid associated protein that serves to help repress expression of A+T-rich genes, many of which appear to have been acquired by horizontal gene transfer. In these ways, Rok appears to be functionally analogous to H-NS, a nucleoid associated protein found in Gram negative bacteria and Lsr2 of high G+C Mycobacteria.
There are several mechanisms by which bacteria acquire exogenous DNA. Sometimes this genetic material is advantageous for bacterial cells, for example, by making them resistant to antibiotics. Other times, foreign DNA has genes that are deleterious to the new host. Bacteria have mechanisms for helping to silence exogenously (horizontally) acquired genes. Many horizontally acquired genes are A+T-rich, a feature which can be important in distinguishing these loci from the host genes. We found that the transcriptional regulator Rok in the bacterium Bacillus subtilis preferentially binds to A+T-rich DNA. Together with previous work, our findings indicate that Rok helps repress expression of A+T-rich genes, many of which are likely to have been acquired by horizontal gene transfer. In these ways, Rok appears to be a functional analogue of the H-NS protein found in Gram negative bacteria (e.g., E. coli) and Lsr2 found in the high G+C Mycobacterium tuberculosis.
Affiliation(s)
- Wiep Klaas Smits
- Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Alan D. Grossman
- Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
|
21
|
Irnov I, Sharma CM, Vogel J, Winkler WC. Identification of regulatory RNAs in Bacillus subtilis. Nucleic Acids Res 2010; 38:6637-51. [PMID: 20525796 PMCID: PMC2965217 DOI: 10.1093/nar/gkq454] [Citation(s) in RCA: 165] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2010] [Revised: 04/27/2010] [Accepted: 05/10/2010] [Indexed: 01/05/2023] Open
Abstract
Post-transcriptional regulatory mechanisms are widespread in bacteria. Interestingly, current published data hint that some of these mechanisms may be non-random with respect to their phylogenetic distribution. Although small, trans-acting regulatory RNAs commonly occur in bacterial genomes, they have been better characterized in Gram-negative bacteria, leaving the impression that they may be less important for Firmicutes. It has been presumed that Gram-positive bacteria, in particular the Firmicutes, are likely to utilize cis-acting regulatory RNAs located within the 5' mRNA leader region more often than trans-acting regulatory RNAs. In this analysis we catalog, by a deep sequencing-based approach, both classes of regulatory RNA candidates for Bacillus subtilis, the model microorganism for Firmicutes. We successfully recover most of the known small RNA regulators while also identifying a greater number of new candidate RNAs. We anticipate these data to be a broadly useful resource for analysis of post-transcriptional regulatory strategies in B. subtilis and other Firmicutes.
Affiliation(s)
- Irnov Irnov
- Department of Biochemistry, The University of Texas Southwestern Medical Center, Dallas, TX, 75390-9038, USA and Max Planck Institute for Infection Biology, RNA Biology, Charitéplatz 1, D-10117 Berlin, Germany
- Cynthia M. Sharma
- Department of Biochemistry, The University of Texas Southwestern Medical Center, Dallas, TX, 75390-9038, USA and Max Planck Institute for Infection Biology, RNA Biology, Charitéplatz 1, D-10117 Berlin, Germany
- Jörg Vogel
- Department of Biochemistry, The University of Texas Southwestern Medical Center, Dallas, TX, 75390-9038, USA and Max Planck Institute for Infection Biology, RNA Biology, Charitéplatz 1, D-10117 Berlin, Germany
- Wade C. Winkler
- Department of Biochemistry, The University of Texas Southwestern Medical Center, Dallas, TX, 75390-9038, USA and Max Planck Institute for Infection Biology, RNA Biology, Charitéplatz 1, D-10117 Berlin, Germany
|
22
|
Wu H, Caffo B, Jaffee HA, Irizarry RA, Feinberg AP. Redefining CpG islands using hidden Markov models. Biostatistics 2010; 11:499-514. [PMID: 20212320 PMCID: PMC2883304 DOI: 10.1093/biostatistics/kxq005] [Citation(s) in RCA: 121] [Impact Index Per Article: 8.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2009] [Revised: 01/14/2010] [Accepted: 01/15/2010] [Indexed: 11/13/2022] Open
Abstract
The DNA of most vertebrates is depleted in CpG dinucleotide: a C followed by a G in the 5' to 3' direction. CpGs are the target for DNA methylation, a chemical modification of cytosine (C) heritable during cell division and the most well-characterized epigenetic mechanism. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI). Knowing CGI locations is important because they mark functionally relevant epigenetic loci in development and disease. For various mammals, including human, a readily available and widely used list of CGI is available from the UCSC Genome Browser. This list was derived using algorithms that search for regions satisfying a definition of CGI proposed by Gardiner-Garden and Frommer more than 20 years ago. Recent findings, enabled by advances in technology that permit direct measurement of epigenetic endpoints at a whole-genome scale, motivate the need to adapt the current CGI definition. In this paper, we propose a procedure, guided by hidden Markov models, that permits an extensible approach to detecting CGI. The main advantage of our approach over others is that it summarizes the evidence for CGI status as probability scores. This provides flexibility in the definition of a CGI and facilitates the creation of CGI lists for other species. The utility of this approach is demonstrated by generating the first CGI lists for invertebrates, and the fact that we can create CGI lists that substantially increases overlap with recently discovered epigenetic marks. A CGI list and the probability scores, as a function of genome location, for each species are available at http://www.rafalab.org.
Affiliation(s)
- Hao Wu
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA
|
23
|
Mallet LV, Becq J, Deschavanne P. Whole genome evaluation of horizontal transfers in the pathogenic fungus Aspergillus fumigatus. BMC Genomics 2010; 11:171. [PMID: 20226043 PMCID: PMC2848249 DOI: 10.1186/1471-2164-11-171] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2010] [Accepted: 03/12/2010] [Indexed: 12/14/2022] Open
Abstract
Background Numerous cases of horizontal transfers (HTs) have been described for eukaryote genomes, but in contrast to prokaryote genomes, no whole genome evaluation of HTs has been carried out. This is mainly due to a lack of parametric methods specially designed to take the intrinsic heterogeneity of eukaryote genomes into account. We applied a simple and tested method based on local variations of genomic signatures to analyze the genome of the pathogenic fungus Aspergillus fumigatus. Results We detected 189 atypical regions containing 214 genes, accounting for about 1 Mb of DNA sequences. However, the fraction of atypical DNA detected was smaller than the average amount detected in the same conditions in prokaryote genomes (3.1% vs 5.6%). It appeared that about one third of these regions contained no annotated genes, a proportion far greater than in prokaryote genomes. When analyzing the origin of these HTs by comparing their signatures to a home made database of species signatures, 3 groups of donor species emerged: bacteria (40%), fungi (25%), and viruses (22%). It is to be noticed that though inter-domain exchanges are confirmed, we only put in evidence very few exchanges between eukaryotic kingdoms. Conclusions In conclusion, we demonstrated that HTs are not negligible in eukaryote genomes, bearing in mind that in our stringent conditions this amount is a floor value, though of a lesser extent than in prokaryote genomes. The biological mechanisms underlying those transfers remain to be elucidated as well as the biological functions of the transferred genes.
Affiliation(s)
- Ludovic V Mallet
- Molécules thérapeutiques in silico (MTI), INSERM UMR-M 973, Université Paris Diderot-Paris 7, Bât Lamarck, 35 rue Hélène Brion, 75205 Paris Cedex 13, France
|
24
|
Nuel G, Regad L, Martin J, Camproux AC. Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithms Mol Biol 2010; 5:15. [PMID: 20205909 PMCID: PMC2828453 DOI: 10.1186/1748-7188-5-15] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2009] [Accepted: 01/26/2010] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. RESULTS The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence. 
CONCLUSIONS Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements.
Affiliation(s)
- Gregory Nuel
- LSG, Laboratoire Statistique et Génome, CNRS UMR-8071, INRA UMR-1152, University of Evry, Evry, France
- CNRS, Paris, France
- MAP5, Department of Applied Mathematics, CNRS UMR-8145, University Paris Descartes, Paris, France
- Leslie Regad
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MTi, Molecules Thérapeutique in silico, INSERM UMRS-973, University Paris Diderot, Paris, France
- Juliette Martin
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MIG, Mathématique Informatique et Genome, INRA UR-1077, Jouy-en-Josas, France
- IBCP, Institut de Biologie et Chimie des Protéines, IFR 128, CNRS UMR 5086, University of Lyon 1, Lyon, France
- Anne-Claude Camproux
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MTi, Molecules Thérapeutique in silico, INSERM UMRS-973, University Paris Diderot, Paris, France
|
25
|
Rasmussen S, Nielsen HB, Jarmer H. The transcriptionally active regions in the genome of Bacillus subtilis. Mol Microbiol 2009; 73:1043-57. [PMID: 19682248 PMCID: PMC2784878 DOI: 10.1111/j.1365-2958.2009.06830.x] [Citation(s) in RCA: 130] [Impact Index Per Article: 8.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/23/2009] [Indexed: 12/29/2022]
Abstract
The majority of all genes have so far been identified and annotated systematically through in silico gene finding. Here we report the finding of 3662 strand-specific transcriptionally active regions (TARs) in the genome of Bacillus subtilis by the use of tiling arrays. We have measured the genome-wide expression during mid-exponential growth on rich (LB) and minimal (M9) medium. The identified TARs account for 77.3% of the genes as they are currently annotated and additionally we find 84 putative non-coding RNAs (ncRNAs) and 127 antisense transcripts. One ncRNA, ncr22, is predicted to act as a translational control on cstA and an antisense transcript was observed opposite the housekeeping sigma factor sigA. Through this work we have discovered a long conserved 3' untranslated region (UTR) in a group of membrane-associated genes that is predicted to fold into a large and highly stable secondary structure. One of the genes having this tail is efeN, which encodes a target of the twin-arginine translocase (Tat) protein translocation system.
Affiliation(s)
- Simon Rasmussen
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Lyngby, Denmark
- Henrik Bjørn Nielsen
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Lyngby, Denmark
- Hanne Jarmer
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Lyngby, Denmark
|
26
|
Eng C, Asthana C, Aigle B, Hergalant S, Mari JF, Leblond P. A New Data Mining Approach for the Detection of Bacterial Promoters Combining Stochastic and Combinatorial Methods. J Comput Biol 2009; 16:1211-25. [DOI: 10.1089/cmb.2008.0122] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022] Open
Affiliation(s)
- Catherine Eng
- LORIA, UMR CNRS 7503 et INRIA Grand Est, Campus Scientifique, Vandœuvre-lès-Nancy, France
- Laboratoire de Génétique et Microbiologie, UMR UHP-INRA 1128, IFR 110, Nancy Université, Faculté des Sciences et Techniques, Vandœuvre-lès-Nancy, France
- Charu Asthana
- LORIA, UMR CNRS 7503 et INRIA Grand Est, Campus Scientifique, Vandœuvre-lès-Nancy, France
- Bertrand Aigle
- Laboratoire de Génétique et Microbiologie, UMR UHP-INRA 1128, IFR 110, Nancy Université, Faculté des Sciences et Techniques, Vandœuvre-lès-Nancy, France
- Sébastien Hergalant
- LORIA, UMR CNRS 7503 et INRIA Grand Est, Campus Scientifique, Vandœuvre-lès-Nancy, France
- Jean-François Mari
- LORIA, UMR CNRS 7503 et INRIA Grand Est, Campus Scientifique, Vandœuvre-lès-Nancy, France
- Pierre Leblond
- Laboratoire de Génétique et Microbiologie, UMR UHP-INRA 1128, IFR 110, Nancy Université, Faculté des Sciences et Techniques, Vandœuvre-lès-Nancy, France
|
27
|
Barbe V, Cruveiller S, Kunst F, Lenoble P, Meurice G, Sekowska A, Vallenet D, Wang T, Moszer I, Médigue C, Danchin A. From a consortium sequence to a unified sequence: the Bacillus subtilis 168 reference genome a decade later. MICROBIOLOGY (READING, ENGLAND) 2009; 155:1758-1775. [PMID: 19383706 PMCID: PMC2885750 DOI: 10.1099/mic.0.027839-0] [Citation(s) in RCA: 266] [Impact Index Per Article: 16.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/26/2009] [Revised: 02/25/2009] [Accepted: 02/25/2009] [Indexed: 11/18/2022]
Abstract
Comparative genomics is the cornerstone of identification of gene functions. The immense number of living organisms precludes experimental identification of functions except in a handful of model organisms. The bacterial domain is split into large branches, among which the Firmicutes occupy a considerable space. Bacillus subtilis has been the model of Firmicutes for decades and its genome has been a reference for more than 10 years. Sequencing the genome involved more than 30 laboratories, with different expertises, in an attempt to make the most of the experimental information that could be associated with the sequence. This had the expected drawback that the sequencing expertise was quite varied among the groups involved, especially at a time when sequencing genomes was extremely hard work. The recent development of very efficient, fast and accurate sequencing techniques, in parallel with the development of high-level annotation platforms, motivated the present resequencing work. The updated sequence has been reannotated in agreement with the UniProt protein knowledge base, keeping in perspective the split between the paleome (genes necessary for sustaining and perpetuating life) and the cenome (genes required for occupation of a niche, suggesting here that B. subtilis is an epiphyte). This should permit investigators to make reliable inferences to prepare validation experiments in a variety of domains of bacterial growth and development as well as build up accurate phylogenies.
Affiliation(s)
- Valérie Barbe
- CEA, Institut de Génomique, Génoscope, 2 rue Gaston Crémieux, 91057 Évry, France
- Stéphane Cruveiller
- CEA, Institut de Génomique, Laboratoire de Génomique Comparative/CNRS UMR8030, Génoscope, 2 rue Gaston Crémieux, 91057 Évry, France
- Frank Kunst
- CEA, Institut de Génomique, Génoscope, 2 rue Gaston Crémieux, 91057 Évry, France
- Patricia Lenoble
- CEA, Institut de Génomique, Génoscope, 2 rue Gaston Crémieux, 91057 Évry, France
- Guillaume Meurice
- Institut Pasteur, Intégration et Analyse Génomiques, 28 rue du Docteur Roux, 75724 Paris Cedex 15, France
- Agnieszka Sekowska
- Institut Pasteur, Génétique des Génomes Bactériens/CNRS URA2171, 28 rue du Docteur Roux, 75724 Paris Cedex 15, France
- David Vallenet
- CEA, Institut de Génomique, Laboratoire de Génomique Comparative/CNRS UMR8030, Génoscope, 2 rue Gaston Crémieux, 91057 Évry, France
- Tingzhang Wang
- Institut Pasteur, Génétique des Génomes Bactériens/CNRS URA2171, 28 rue du Docteur Roux, 75724 Paris Cedex 15, France
- Ivan Moszer
- Institut Pasteur, Intégration et Analyse Génomiques, 28 rue du Docteur Roux, 75724 Paris Cedex 15, France
- Claudine Médigue
- CEA, Institut de Génomique, Laboratoire de Génomique Comparative/CNRS UMR8030, Génoscope, 2 rue Gaston Crémieux, 91057 Évry, France
- Antoine Danchin
- Institut Pasteur, Génétique des Génomes Bactériens/CNRS URA2171, 28 rue du Docteur Roux, 75724 Paris Cedex 15, France
|
28
|
Chambaz A, Matias C. Number of hidden states and memory: a joint order estimation problem for Markov chains with Markov regime. ESAIM-PROBAB STAT 2009. [DOI: 10.1051/ps:2007048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
|
29
|
Ibrahim M, Nicolas P, Bessières P, Bolotin A, Monnet V, Gardan R. A genome-wide survey of short coding sequences in streptococci. MICROBIOLOGY-SGM 2008; 153:3631-3644. [PMID: 17975071 DOI: 10.1099/mic.0.2007/006205-0] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Identification of short genes that encode peptides of fewer than 60 aa is challenging, both experimentally and in silico. As a consequence, the universe of these short coding sequences (CDSs) remains largely unknown, although some are acknowledged to play important roles in cell-cell communication, particularly in Gram-positive bacteria. This paper reports a thorough search for short CDSs across streptococcal genomes. Our bioinformatic approach relied on a combination of advanced intrinsic and extrinsic methods. In the first step, intrinsic sequence information (nucleotide composition and presence of RBSs) served to identify new short putative CDSs (spCDSs) and to eliminate the differences between annotation policies. In the second step, pseudogene fragments and false predictions were filtered out. The last step consisted of screening the remaining spCDSs for lines of extrinsic evidence involving sequence and gene-context comparisons. A total of 789 spCDSs across 20 complete genomes (19 Streptococcus and one Enterococcus) received the support of at least one line of extrinsic evidence, which corresponds to an average of 20 short CDSs per million base pairs. Most of these had no known function, and a significant fraction (31%) are not even annotated as hypothetical genes in GenBank records. As an illustration of the value of this list, we describe a new family of CDSs, encoding very short hydrophobic peptides (20-23 aa) situated just upstream of some of the positive transcriptional regulators of the Rgg family. The expression of seven other short CDSs from Streptococcus thermophilus CNRZ1066 that encode peptides ranging in length from 41 to 56 aa was confirmed by real-time quantitative RT-PCR and revealed a variety of expression patterns. Finally, one peptide from this list, encoded by a gene that is not annotated in GenBank, was identified in a cell-envelope-enriched fraction of S. thermophilus CNRZ1066.
Affiliation(s)
- Mariam Ibrahim
- Unité de Biochimie Bactérienne, UR477, INRA, 78350 Jouy-en-Josas, France
- Pierre Nicolas
- Unité Mathématique Informatique et Génome, UR1077, INRA, 78350 Jouy-en-Josas, France
- Philippe Bessières
- Unité Mathématique Informatique et Génome, UR1077, INRA, 78350 Jouy-en-Josas, France
- Alexander Bolotin
- Unité de Génétique Microbienne, UR895, INRA, 78350 Jouy-en-Josas, France
- Véronique Monnet
- Unité de Biochimie Bactérienne, UR477, INRA, 78350 Jouy-en-Josas, France
- Rozenn Gardan
- Unité de Biochimie Bactérienne, UR477, INRA, 78350 Jouy-en-Josas, France
|
30
|
|
31
|
In silico segmentations of lentivirus envelope sequences. BMC Bioinformatics 2007; 8:99. [PMID: 17376229 PMCID: PMC1847453 DOI: 10.1186/1471-2105-8-99] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/11/2006] [Accepted: 03/21/2007] [Indexed: 11/20/2022]
Abstract
Background: The gene encoding the envelope of lentiviruses exhibits a considerable plasticity, particularly the region which encodes the surface (SU) glycoprotein. Interestingly, mutations do not appear uniformly along the sequence of SU, but they are clustered in restricted areas, called variable (V) regions, which are interspersed with relatively more stable regions, called constant (C) regions. We look for specific signatures of C/V regions, using hidden Markov models constructed with SU sequences of the equine, human, small ruminant and simian lentiviruses. Results: Our models yield clear and accurate delimitations of the C/V regions, when the test set and the training set were made up of sequences of the same lentivirus, but also when they were made up of sequences of different lentiviruses. Interestingly, the models predicted the different regions of lentiviruses such as the bovine and feline lentiviruses, not used in the training set. Models based on composite training sets produce accurate segmentations of sequences of all these lentiviruses. Conclusion: Our results suggest that each C/V region has a specific statistical oligonucleotide composition, and that the C (respectively V) regions of one of these lentiviruses are statistically more similar to the C (respectively V) regions of the other lentiviruses, than to the V (respectively C) regions of the same lentivirus.
|
32
|
Thakur V, Azad RK, Ramaswamy R. Markov models of genome segmentation. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2007; 75:011915. [PMID: 17358192 DOI: 10.1103/physreve.75.011915] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 03/02/2006] [Revised: 06/19/2006] [Indexed: 05/14/2023]
Abstract
We introduce Markov models for segmentation of symbolic sequences, extending a segmentation procedure based on the Jensen-Shannon divergence that has been introduced earlier. Higher-order Markov models are more sensitive to the details of local patterns and in application to genome analysis, this makes it possible to segment a sequence at positions that are biologically meaningful. We show the advantage of higher-order Markov-model-based segmentation procedures in detecting compositional inhomogeneity in chimeric DNA sequences constructed from genomes of diverse species, and in application to the E. coli K12 genome, boundaries of genomic islands, cryptic prophages, and horizontally acquired regions are accurately identified.
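As a rough illustration of the kind of Jensen-Shannon-divergence segmentation this abstract builds on, the sketch below finds the single cut point that maximizes the weighted Jensen-Shannon divergence between the symbol compositions of the two resulting halves. It is an order-0 (independent symbols) simplification, not the higher-order Markov extension the authors describe; the function names and the `min_len` parameter are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (bits) of a symbol-count table over n symbols."""
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)


def best_split(seq, min_len=10):
    """Return (position, divergence) of the cut maximizing the weighted
    Jensen-Shannon divergence between left and right compositions.

    Order-0 sketch: each symbol is treated as independent. `min_len`
    (hypothetical parameter) keeps both segments from being too short.
    Returns (None, 0.0) when no cut gives a positive divergence.
    """
    n = len(seq)
    total = Counter(seq)
    h_total = entropy(total, n)
    left = Counter()
    best = (None, 0.0)
    for i in range(1, n):
        left[seq[i - 1]] += 1          # left segment is seq[:i]
        if i < min_len or n - i < min_len:
            continue
        right = total - left           # Counter subtraction keeps positive counts only
        js = h_total - (i / n) * entropy(left, i) - ((n - i) / n) * entropy(right, n - i)
        if js > best[1]:
            best = (i, js)
    return best
```

Applying `best_split` recursively to each resulting segment, with a significance threshold on the divergence, would give a full segmentation; that stopping rule is left out here because the abstract does not specify it.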
Affiliation(s)
- Vivek Thakur
- Center for Computational Biology and Bioinformatics, School of Information Technology, Jawaharlal Nehru University, New Delhi 110 067, India
|
33
|
Bryson K, Loux V, Bossy R, Nicolas P, Chaillou S, van de Guchte M, Penaud S, Maguin E, Hoebeke M, Bessières P, Gibrat JF. AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system. Nucleic Acids Res 2006; 34:3533-45. [PMID: 16855290 PMCID: PMC1524909 DOI: 10.1093/nar/gkl471] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.2] [Indexed: 11/23/2022]
Abstract
We have implemented a genome annotation system for prokaryotes called AGMIAL. Our approach embodies a number of key principles. First, expert manual annotators are seen as a critical component of the overall system; user interfaces were cyclically refined to satisfy their needs. Second, the overall process should be orchestrated in terms of a global annotation strategy; this facilitates coordination between a team of annotators and automatic data analysis. Third, the annotation strategy should allow progressive and incremental annotation from a time when only a few draft contigs are available, to when a final finished assembly is produced. The overall architecture employed is modular and extensible, being based on the W3 standard Web services framework. Specialized modules interact with two independent core modules that are used to annotate, respectively, genomic and protein sequences. AGMIAL is currently being used by several INRA laboratories to analyze genomes of bacteria relevant to the food-processing industry, and is distributed under an open source license.
Affiliation(s)
- S. Chaillou
- Flore Lactique et Environnement Carné, INRA, 78352 Jouy-en-Josas Cedex, France
- S. Penaud
- Génétique Microbienne, INRA, 78352 Jouy-en-Josas Cedex, France
- E. Maguin
- Génétique Microbienne, INRA, 78352 Jouy-en-Josas Cedex, France
- J-F Gibrat
- To whom correspondence should be addressed. Tel: +33 1 34 65 28 97; Fax: +33 1 34 65 29 01; E-mail:
|
34
|
Waack S, Keller O, Asper R, Brodag T, Damm C, Fricke WF, Surovcik K, Meinicke P, Merkl R. Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models. BMC Bioinformatics 2006; 7:142. [PMID: 16542435 PMCID: PMC1489950 DOI: 10.1186/1471-2105-7-142] [Citation(s) in RCA: 279] [Impact Index Per Article: 14.7] [Received: 12/09/2005] [Accepted: 03/16/2006] [Indexed: 01/25/2023]
Abstract
Background: Horizontal gene transfer (HGT) is considered a strong evolutionary force shaping the content of microbial genomes in a substantial manner. It is the difference in speed enabling the rapid adaptation to changing environmental demands that distinguishes HGT from gene genesis, duplications or mutations. For a precise characterization, algorithms are needed that identify transfer events with high reliability. Frequently, the transferred pieces of DNA have a considerable length, comprise several genes and are called genomic islands (GIs) or more specifically pathogenicity or symbiotic islands. Results: We have implemented the program SIGI-HMM that predicts GIs and the putative donor of each individual alien gene. It is based on the analysis of codon usage (CU) of each individual gene of a genome under study. CU of each gene is compared against a carefully selected set of CU tables representing microbial donors or highly expressed genes. Multiple tests are used to identify putatively alien genes, to predict putative donors and to mask putatively highly expressed genes. Thus, we determine the states and emission probabilities of an inhomogeneous hidden Markov model working on gene level. For the transition probabilities, we draw upon classical test theory with the intention of integrating a sensitivity controller in a consistent manner. SIGI-HMM was written in Java and is publicly available. It accepts as input any file created according to the EMBL format. It generates output in the common GFF format readable by genome browsers. Benchmark tests showed that the output of SIGI-HMM is in agreement with known findings. Its predictions were consistent both with annotated GIs and with predictions generated by different methods. Conclusion: SIGI-HMM is a sensitive tool for the identification of GIs in microbial genomes. It allows genomes to be analyzed interactively in detail and hypotheses about the origin of acquired genes to be generated or tested.
Affiliation(s)
- Stephan Waack
- Institut für Informatik, Universität Göttingen, Lotzestr. 16–18, 37083 Göttingen, Germany
- Oliver Keller
- Institut für Informatik, Universität Göttingen, Lotzestr. 16–18, 37083 Göttingen, Germany
- Roman Asper
- Institut für Informatik, Universität Göttingen, Lotzestr. 16–18, 37083 Göttingen, Germany
- Thomas Brodag
- Institut für Informatik, Universität Göttingen, Lotzestr. 16–18, 37083 Göttingen, Germany
- Carsten Damm
- Institut für Numerische und Angewandte Mathematik, Universität Göttingen, Lotzestr. 16–18, 37083 Göttingen, Germany
- Wolfgang Florian Fricke
- Göttingen Genomics Laboratory, Universität Göttingen, Grisebachstr. 8, 37077 Göttingen, Germany
- Katharina Surovcik
- Institut für Informatik, Universität Göttingen, Lotzestr. 16–18, 37083 Göttingen, Germany
- Peter Meinicke
- Institut für Mikrobiologie und Genetik, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Rainer Merkl
- Institut für Biophysik und Physikalische Biochemie, Universität Regensburg, Universitätsstr. 31, 93053 Regensburg, Germany
|
35
|
Bekaert M, Richard H, Prum B, Rousset JP. Identification of programmed translational -1 frameshifting sites in the genome of Saccharomyces cerevisiae. Genome Res 2006; 15:1411-20. [PMID: 16204194 PMCID: PMC1240084 DOI: 10.1101/gr.4258005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
Abstract
Frameshifting is a recoding event that allows the expression of two polypeptides from the same mRNA molecule. Most recoding events described so far are used by viruses and transposons to express their replicase protein. The very few cellular proteins known to be expressed by -1 ribosomal frameshifting have been identified by chance. The goal of the present work was to set up a systematic strategy, based on complementary bioinformatics, molecular biology, and functional approaches, without a priori knowledge of the mechanism involved. Two independent methods were devised. The first looks for genomic regions in which two ORFs, each carrying a protein pattern, are in a frameshifted arrangement. The second uses Hidden Markov Models and likelihood in a two-step approach. When this strategy was applied to the Saccharomyces cerevisiae genome, 189 candidate regions were found, of which 58 were further functionally investigated. Twenty-eight of them expressed a full-length mRNA covering the two ORFs, and 11 showed a -1 frameshift efficiency varying from 5% to 13% (50-fold higher than background), some of which correspond to genes with known functions. In other ascomycetes, four frameshifted ORFs are found to be fully conserved. Strikingly, most of the candidates do not display a classical viral-like frameshift signal and would have escaped a search based on current models of frameshifting. These results strongly suggest that -1 frameshifting might be more widely distributed than previously thought.
Affiliation(s)
- Michaël Bekaert
- Institut de Génétique et Microbiologie CNRS UMR 8621, Université Paris-Sud, 91405 Orsay Cedex, France
|
36
|
Merkl R. A comparative categorization of protein function encoded in bacterial or archeal genomic islands. J Mol Evol 2005; 62:1-14. [PMID: 16341468 DOI: 10.1007/s00239-004-0311-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Received: 11/01/2004] [Accepted: 06/14/2005] [Indexed: 01/11/2023]
Abstract
Genomes of prokaryotes harbor genomic islands (GIs), which are frequently acquired via horizontal gene transfer (HGT). Here I present an analysis of GIs with respect to gene-encoded functions. GIs were identified by statistical analysis of codon usage and clustering. Genes classified as putatively alien (pA) or putatively native (pN) were categorized according to the COG database. Among pA and pN genes, the distribution of COG functions and classes were studied for different groupings of prokaryotes. Groups were formed according to taxonomical relation or habitats. In all groups, genes related to class L (replication, recombination, and repair) were statistically significantly overrepresented in GIs. GIs of bacteria and archaea showed a distinct pattern of preferences. In archeal GIs, genes belonging to COG class M (cell wall/membrane/envelope biogenesis) or Q (secondary metabolites biosynthesis, transport, and catabolism) were more frequent. In bacterial GIs, genes of classes U (intracellular trafficking, secretion, and vesicular transport), N (cell motility), and V (defense mechanisms) were predominant. Underrepresentation was strongest for genes belonging to class J (translation, ribosomal structure, and biogenesis). Among single COG functions overrepresented in GIs were transferases and transporters. In both superkingdoms, HGT enhances genomic content by meeting demands that are independent of the studied habitats. These findings are in agreement with the complexity theory, which predicts the preferential import of operational genes. However, only specific subsets of operational genes were enriched in GIs. Modification of the cell envelope, cell motility, secretion, and protection of cellular DNA are major issues in HGT.
Affiliation(s)
- Rainer Merkl
- Institut für Biophysik und physikalische Biochemie, Universität Regensburg, D-93040 Regensburg, Germany.
|
37
|
Carbone A, Madden R. Insights on the evolution of metabolic networks of unicellular translationally biased organisms from transcriptomic data and sequence analysis. J Mol Evol 2005; 61:456-69. [PMID: 16187158 DOI: 10.1007/s00239-004-0317-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 11/09/2004] [Accepted: 04/20/2005] [Indexed: 11/27/2022]
Abstract
Codon bias is related to metabolic functions in translationally biased organisms, and two facts are argued about. First, genes with high codon bias describe in meaningful ways the metabolic characteristics of the organism; important metabolic pathways corresponding to crucial characteristics of the lifestyle of an organism, such as photosynthesis, nitrification, anaerobic versus aerobic respiration, sulfate reduction, methanogenesis, and others, happen to involve especially biased genes. Second, gene transcriptional levels of sets of experiments representing a significant variation of biological conditions strikingly confirm, in the case of Saccharomyces cerevisiae, that metabolic preferences are detectable by purely statistical analysis: the high metabolic activity of yeast during fermentation is encoded in the high bias of enzymes involved in the associated pathways, suggesting that this genome was affected by a strong evolutionary pressure that favored a predominantly fermentative metabolism of yeast in the wild. The ensemble of metabolic pathways involving enzymes with high codon bias is rather well defined and remains consistent across many species, even those that have not been considered as translationally biased, such as Helicobacter pylori, for instance, reveal some weak form of translational bias for this genome. We provide numerical evidence, supported by experimental data, of these facts and conclude that the metabolic networks of translationally biased genomes, observable today as projections of eons of evolutionary pressure, can be analyzed numerically and predictions of the role of specific pathways during evolution can be derived. The new concepts of Comparative Pathway Index, used to compare organisms with respect to their metabolic networks, and Evolutionary Pathway Index, used to detect evolutionarily meaningful bias in the genetic code from transcriptional data, are introduced.
Affiliation(s)
- Alessandra Carbone
- Génomique Analytique, Université Pierre et Marie Curie, INSERM U511, 91 Bd de l'Hôpital, 75013 Paris, France.
|
38
|
Calteau A, Gouy M, Perrière G. Horizontal transfer of two operons coding for hydrogenases between bacteria and archaea. J Mol Evol 2005; 60:557-65. [PMID: 15983865 DOI: 10.1007/s00239-004-0094-8] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Received: 03/24/2004] [Accepted: 11/19/2004] [Indexed: 11/27/2022]
Abstract
Using a phylogenetic approach, we discovered three putative horizontal transfers between bacterial and archaeal species involving large clusters of genes. One transfer involves an operon of 13 genes, called mbx, which probably was transferred into the genome of Thermotoga maritima from a species belonging or close to the Pyrococcus genus. The two others implied an operon of six genes, called ech, transferred independently to the genomes of Thermoanaerobacter tengcongensis and Desulfovibrio gigas, from a species belonging or close to the Methanosarcina genus. All these transfers affected operons coding for multisubunit membrane-bound (NiFe) hydrogenases involved in the energy metabolism of the donor genomes. The functionality of the transferred operons has not been experimentally demonstrated for T. maritima, whereas in D. gigas and T. tengcongensis the encoded multisubunit hydrogenase could have a role in energy conservation. This report adds several cases of horizontal gene transfers among hydrogenases already described.
Affiliation(s)
- Alexandra Calteau
- Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard--Lyon 1, Villeurbanne, France
|
39
|
Ledent S, Robin S. Checking homogeneity of motifs' distribution in heterogenous sequences. J Comput Biol 2005; 12:672-85. [PMID: 16108710 DOI: 10.1089/cmb.2005.12.672] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
Studying the distribution of a motif along sequences may help in the understanding of its biological function, or to detect regions of interest. A statistical model is needed to assess the significance of the observed distribution. We propose a heterogenous compound Poisson process to model the possibility of overlap between occurrences and some heterogeneity of the sequence known a priori. The estimation procedure of the parameters is described and tests of homogenous sub-models are proposed. We also consider the detection of rich regions using either cumulated distances or moving intervals, via a homogenization technique. Illustrations of the method are given with applications to bacterial genomes.
Affiliation(s)
- Sabrina Ledent
- Unité Mathématique, Informatique et Génome, Institut National de la Recherche Agronomique (INRA), F-78350 Jouy-en-Josas, France
|
40
|
Fertil B, Massin M, Lespinats S, Devic C, Dumee P, Giron A. GENSTYLE: exploration and analysis of DNA sequences with genomic signature. Nucleic Acids Res 2005; 33:W512-5. [PMID: 15980524 PMCID: PMC1160249 DOI: 10.1093/nar/gki489] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022]
Abstract
GENSTYLE (http://Genstyle.imed.jussieu.fr) is a workspace designed for the characterization and classification of nucleotide sequences. Based on the genomic signature paradigm, GENSTYLE focuses on oligonucleotide frequencies in DNA sequences. Users can select sequences of interest in the GENSTYLE companion database, where the whole set of GenBank sequences is grouped per species, or upload their own sequences to work with. Tools for the exploration and analysis of signatures allow (i) identification of the origin of DNA segments (detection of rare species or species for which technical problems prevent fast characterization, such as micro-organisms with slow growth), (ii) analysis of the homogeneity of a genome and isolation of areas with novel functionality (horizontal transfers, for example), and (iii) molecular phylogeny and taxonomy.
Affiliation(s)
- Bernard Fertil
- INSERM U. 678, 91 boulevard de l'Hôpital, 75634 Paris, France.
|
41
|
Abstract
Sarment is a package of Python modules for easy building and manipulation of sequence segmentations. It provides efficient implementations of the usual algorithms for hidden Markov model computation, as well as for maximal predictive partitioning. Owing to its very large variety of criteria for computing segmentations, Sarment can handle many kinds of models. Because of object-oriented programming, the results of the segmentation are very easy to manipulate.
Affiliation(s)
- Laurent Guéguen
- Laboratoire Biométrie et Biologie Evolutive (UMR 5558), CNRS, Univ Lyon 1, 43 bd 11 Nov, 69622 Villeurbanne cedex, France.
|
42
|
Dufraigne C, Fertil B, Lespinats S, Giron A, Deschavanne P. Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res 2005; 33:e6. [PMID: 15653627 PMCID: PMC546175 DOI: 10.1093/nar/gni004] [Citation(s) in RCA: 99] [Impact Index Per Article: 5.0] [Indexed: 11/18/2022]
Abstract
Horizontal DNA transfer is an important factor of evolution and participates in biological diversity. Unfortunately, the location and length of horizontal transfers (HTs) are known for very few species. The usage of short oligonucleotides in a sequence (the so-called genomic signature) has been shown to be species-specific even in DNA fragments as short as 1 kb. The genomic signature is therefore proposed as a tool to detect HTs. Since DNA transfers originate from species with a signature different from those of the recipient species, the analysis of local variations of signature along the recipient genome may allow for detecting exogenous DNA. The strategy consists of (i) scanning the genome with a sliding window and calculating the corresponding local signature, (ii) evaluating its deviation from the signature of the whole genome, and (iii) looking for similar signatures in a database of genomic signatures. A total of 22 prokaryote genomes are analyzed in this way. It has been observed that atypical regions make up ∼6% of each genome on average. Most of the claimed HTs as well as new ones are detected. The origin of putative DNA transfers is looked for among ∼12 000 species. Donor species are proposed and sometimes strongly suggested, considering similarity of signatures. Among the species studied, Bacillus subtilis, Haemophilus influenzae and Escherichia coli are investigated by many authors and give the opportunity to perform a thorough comparison of most of the bioinformatics methods used to detect HTs.
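Steps (i) and (ii) of the strategy described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the signature is taken here as the vector of k-mer frequencies on a single strand, the deviation is measured with a plain Euclidean distance, and the function names, window size, step and `k` are all hypothetical choices. Step (iii), the database lookup, is not covered.

```python
from collections import Counter
from itertools import product


def signature(seq, k=4):
    """Frequency vector of all 4**k oligonucleotides of length k.

    Assumption: single-strand counts over the alphabet ACGT; k-mers
    containing other symbols are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def window_deviations(genome, window=5000, step=2500, k=4):
    """Slide a window along the genome and return (start, distance) pairs,
    where distance is the Euclidean distance between the local signature
    and the whole-genome signature (an assumed metric for illustration)."""
    reference = signature(genome, k)
    deviations = []
    last_start = max(len(genome) - window, 0)
    for start in range(0, last_start + 1, step):
        local = signature(genome[start:start + window], k)
        dist = sum((a - b) ** 2 for a, b in zip(local, reference)) ** 0.5
        deviations.append((start, dist))
    return deviations
```

Windows whose distance stands out from the rest of the genome would be the candidate atypical regions; how to set that threshold is not stated in the abstract and is left open here.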
Affiliation(s)
- Patrick Deschavanne
- To whom correspondence should be addressed. Tel: 33 1 44 27 77 12; Fax: +33 1 43 26 38 30;
|
43
|
Qiu D, Fujita K, Sakuma Y, Tanaka T, Ohashi Y, Ohshima H, Tomita M, Itaya M. Comparative analysis of physical maps of four Bacillus subtilis (natto) genomes. Appl Environ Microbiol 2004; 70:6247-56. [PMID: 15466572 PMCID: PMC522138 DOI: 10.1128/aem.70.10.6247-6256.2004] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.0] [Received: 12/23/2003] [Accepted: 06/10/2004] [Indexed: 11/20/2022]
Abstract
The complete SfiI and I-CeuI physical maps of four Bacillus subtilis (natto) strains, which were previously isolated as natto (fermented soybean) starters, were constructed to elucidate the genome structure. Not only the similarity in genome size and organization but also the microheterogeneity of the gene context was revealed. No large-scale genome rearrangements among the four strains were indicated by mapping of the genes, including 10 rRNA operons (rrn) and relevant genes required for natto production, to the loci corresponding to those of the B. subtilis strain Marburg 168. However, restriction fragment length polymorphism and the presence or absence of strain-specific DNA sequences, such as the prophages SP beta, skin element, and PBSX, as well as the insertion element IS4Bsu1, could be used to identify one of these strains as a Marburg type and the other three strains as natto types. The genome structure and gene heterogeneity were also consistent with the type of indigenous plasmids harbored by the strains.
Affiliation(s)
- Dongru Qiu
- Institute for Advanced Biosciences and Bioinformatics Program, Keio University, 403-1 Nipponkoku, Daihoji, Tsuruoka, Yamagata 997-0017, Japan
|
44
|
Speed T. Discussions on “A Bayesian Approach to DNA Sequence Segmentation”. Biometrics 2004. [DOI: 10.1111/j.0006-341x.2004.206_4.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022]
|
45
|
Abstract
Many deoxyribonucleic acid (DNA) sequences display compositional heterogeneity in the form of segments of similar structure. This article describes a Bayesian method that identifies such segments by using a Markov chain governed by a hidden Markov model. Markov chain Monte Carlo (MCMC) techniques are employed to compute all posterior quantities of interest and, in particular, allow inferences to be made regarding the number of segment types and the order of Markov dependence in the DNA sequence. The method is applied to the segmentation of the bacteriophage lambda genome, a common benchmark sequence used for the comparison of statistical segmentation algorithms.
Affiliation(s)
- Richard J Boys
- School of Mathematics and Statistics, Newcastle University, Newcastle upon Tyne, UK.
|
46
|
Merkl R. SIGI: score-based identification of genomic islands. BMC Bioinformatics 2004; 5:22. [PMID: 15113412 PMCID: PMC394314 DOI: 10.1186/1471-2105-5-22] [Citation(s) in RCA: 70] [Impact Index Per Article: 3.3] [Received: 12/10/2003] [Accepted: 03/03/2004] [Indexed: 01/03/2023]
Abstract
Background: Genomic islands can be observed in many microbial genomes. These stretches of DNA have a conspicuous composition with regard to sequence or encoded functions. Genomic islands are assumed to be frequently acquired via horizontal gene transfer. For the analysis of genome structure and the study of horizontal gene transfer, it is necessary to reliably identify and characterize these islands. Results: A scoring scheme on codon frequencies, Score_G1G2(cdn) = log(f_G2(cdn) / f_G1(cdn)), was utilized. To analyse genes of a species G1 and to test their relatedness to species G2, scores were determined by applying the formula to log-odds derived from mean codon frequencies of the two genomes. A non-redundant set of nearly 400 codon usage tables comprising microbial species was derived; its members were used alternatively at position G2. Genes having at least one score value above a species-specific and dynamically determined cut-off value were analysed further. By means of cluster analysis, genes were identified that comprise clusters of statistically significant size. These clusters were predicted as genomic islands. Finally and individually for each of these genes, the taxonomical relation among those species responsible for significant scores was interpreted. The validity of the approach and its limitations were made plausible by an extensive analysis of natural genes and synthetic ones aimed at modelling the process of gene amelioration. Conclusions: The method reliably identifies genomic islands and the likely origin of alien genes.
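The per-codon score quoted in this abstract, Score_G1G2(cdn) = log(f_G2(cdn) / f_G1(cdn)), can be illustrated with a short sketch. Only that formula comes from the abstract; everything else here is an assumption made for the example: the function names, the small pseudocount that avoids taking the log of zero, and the choice to score a gene by summing the per-codon scores over its codons. The cut-off determination and cluster analysis are not reproduced.

```python
import math
from collections import Counter


def split_codons(seq):
    """Split an in-frame coding sequence into codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_frequencies(seq):
    """Relative codon frequencies of a coding sequence (a stand-in for a codon usage table)."""
    counts = Counter(split_codons(seq))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_score(codon, f_g1, f_g2, pseudo=1e-6):
    """Score_G1G2(cdn) = log(f_G2(cdn) / f_G1(cdn)).

    `pseudo` is an assumed pseudocount so that codons missing from
    one table do not produce log(0) or a division by zero.
    """
    return math.log((f_g2.get(codon, 0.0) + pseudo) / (f_g1.get(codon, 0.0) + pseudo))


def gene_score(gene, f_g1, f_g2):
    """Assumed aggregation: sum of per-codon scores over the gene.
    Positive values mean the gene's codons look more like G2 than G1."""
    return sum(codon_score(c, f_g1, f_g2) for c in split_codons(gene))
```

For example, with `f_g1 = {"AAA": 0.5, "GGG": 0.5}` and `f_g2 = {"AAA": 0.9, "GGG": 0.1}`, a gene made only of `AAA` codons gets a positive score and one made only of `GGG` codons gets a negative score.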
Affiliation(s)
- Rainer Merkl
- Abteilung Molekulare Genetik und Präparative Molekularbiologie, Institut für Mikrobiologie und Genetik, Georg-August-Universität Göttingen and Göttingen Genomics Laboratory, Grisebachstr, 8, 37077 Göttingen, Germany.
|
47
|
Samuels DC, Boys RJ, Henderson DA, Chinnery PF. A compositional segmentation of the human mitochondrial genome is related to heterogeneities in the guanine mutation rate. Nucleic Acids Res 2003; 31:6043-52. [PMID: 14530452 PMCID: PMC219467 DOI: 10.1093/nar/gkg784] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Received: 05/20/2003] [Revised: 08/01/2003] [Accepted: 08/20/2003] [Indexed: 11/12/2022]
Abstract
We applied a hidden Markov model segmentation method to the human mitochondrial genome to identify patterns in the sequence, to compare these patterns to the gene structure of mtDNA and to see whether these patterns reveal additional characteristics important for our understanding of genome evolution, structure and function. Our analysis identified three segmentation categories based upon the sequence transition probabilities. Category 2 segments corresponded to the tRNA and rRNA genes, with a greater strand-symmetry in these segments. Category 1 and 3 segments covered the protein- coding genes and almost all of the non-coding D-loop. Compared to category 1, the mtDNA segments assigned to category 3 had much lower guanine abundance. A comparison to two independent databases of mitochondrial mutations and polymorphisms showed that the high substitution rate of guanine in human mtDNA is largest in the category 3 segments. Analysis of synonymous mutations showed the same pattern. This suggests that this heterogeneity in the mutation rate is partly independent of respiratory chain function and is a direct property of the genome sequence itself. This has important implications for our understanding of mtDNA evolution and its use as a 'molecular clock' to determine the rate of population and species divergence.
Affiliation(s)
- David C Samuels
- Virginia Bioinformatics Institute, Virginia Polytechnic and State University, Blacksburg, VA 24061, USA.
48
Sandberg R, Bränden CI, Ernberg I, Cöster J. Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and G+C content. Gene 2003; 311:35-42. [PMID: 12853136 DOI: 10.1016/s0378-1119(03)00581-x] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.8] [Indexed: 11/25/2022]
Abstract
Each prokaryote has a unique genomic signature as evidenced by a set of species-specific frequencies of short oligonucleotides. With respect to genomic signatures a bacterial genome is homogenous and the variation within a genome is smaller than the variations between genomes of different species. This study quantifies the species-specificity of genomic signatures in the complete genomes of 57 prokaryotes. The species-specificity in the genomic signature was related to the quantification of other sequence biases, such as G+C content, synonymous codon choice and amino acid usage. The results confirm that the genomic signature is genome-wide with high species-specificity in both coding and non-coding regions. In coding regions the species-specific bias in synonymous codon choice was comparable to the genomic signature, while the bias in amino acid usage only captured about 50% of the species-specific bias in the genomic signature. A correlation between the species-specificity in synonymous codon choice and amino acid usage was identified, in which proteins with species-specific amino acid usage were also coded with species-specific synonymous codon choice. However, we demonstrated that the G+C content captures only approximately 40% of the species-specificity in the genomic signature, and is insufficient to explain the species specificity in the non-coding regions. Thus, the species-specific bias in non-coding regions remains largely unknown. Further, we compared the genomic signature in relation to phylogenetic distance. This was performed in order to illustrate the feasibility of a hierarchical classification scheme in future applications of the described classification methodology in screening for horizontal gene transfer and biodiversity studies.
Affiliation(s)
- Rickard Sandberg
- Microbiology and Tumor Biology Center, Karolinska Institute, S-171 77 Stockholm, Sweden.
49
Abstract
It is probable that, increasingly, genome investigations are going to be based on statistical formalization. This review summarizes the state of the art and potentiality of using statistics in microbial genome analysis. First, I focus on recent advances in functional genomics, such as finding genes and operons, identifying gene conversion events, detecting DNA replication origins and analysing regulatory sites. Then I describe how to use phylogenetic methods in genome analysis and methods for genome-wide scanning for positively selected amino acids. I conclude with speculations on the future course of genome statistical modeling.
Affiliation(s)
- Pietro Liò
- Department of Zoology, University of Cambridge, UK.
50
Klaenhammer T, Altermann E, Arigoni F, Bolotin A, Breidt F, Broadbent J, Cano R, Chaillou S, Deutscher J, Gasson M, van de Guchte M, Guzzo J, Hartke A, Hawkins T, Hols P, Hutkins R, Kleerebezem M, Kok J, Kuipers O, Lubbers M, Maguin E, McKay L, Mills D, Nauta A, Overbeek R, Pel H, Pridmore D, Saier M, van Sinderen D, Sorokin A, Steele J, O'Sullivan D, de Vos W, Weimer B, Zagorec M, Siezen R. Discovering lactic acid bacteria by genomics. Antonie Van Leeuwenhoek 2002; 82:29-58. [PMID: 12369195 DOI: 10.1007/978-94-017-2029-8_3] [Citation(s) in RCA: 63] [Impact Index Per Article: 2.7] [Indexed: 10/26/2022]
Abstract
This review summarizes a collection of lactic acid bacteria that are now undergoing genomic sequencing and analysis. Summaries are presented on twenty different species, with each overview discussing the organism's fundamental and practical significance, environmental habitat, and its role in fermentation, bioprocessing, or probiotics. For those projects where genome sequence data were available by March 2002, summaries include a listing of key statistics and interesting genomic features. These efforts will revolutionize our molecular view of Gram-positive bacteria, as up to 15 genomes from the low GC content lactic acid bacteria are expected to be available in the public domain by the end of 2003. Our collective view of the lactic acid bacteria will be fundamentally changed as we rediscover the relationships and capabilities of these organisms through genomics.
Affiliation(s)
- Todd Klaenhammer
- Department of Food Science, Southeast Dairy Foods Research Center, North Carolina State University, Raleigh, NC 27695-7624, USA.