1
|
Ailloud F, Gottschall W, Suerbaum S. Methylome evolution suggests lineage-dependent selection in the gastric pathogen Helicobacter pylori. Commun Biol 2023; 6:839. [PMID: 37573385 PMCID: PMC10423294 DOI: 10.1038/s42003-023-05218-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/09/2023] [Accepted: 08/04/2023] [Indexed: 08/14/2023] Open
Abstract
The bacterial pathogen Helicobacter pylori, the leading cause of gastric cancer, is genetically highly diverse and harbours a large and variable portfolio of restriction-modification systems. Our understanding of the evolution and function of DNA methylation in bacteria is limited. Here, we performed a comprehensive analysis of the methylome diversity in H. pylori, using a dataset of 541 genomes that included all known phylogeographic populations. The frequency of 96 methyltransferases and the abundance of their cognate recognition sequences were strongly influenced by phylogeographic structure and were inter-correlated, positively or negatively, for 20% of type II methyltransferases. Low density motifs were more likely to be affected by natural selection, as reflected by higher genomic instability and compositional bias. Importantly, direct correlation implied that methylation patterns can be actively enriched by positive selection and suggests that specific sites have important functions in methylation-dependent phenotypes. Finally, we identified lineage-specific selective pressures modulating the contraction and expansion of the motif ACGT, revealing that the genetic load of methylation could be dependent on local ecological factors. Taken together, natural selection may shape both the abundance and distribution of methyltransferases and their specific recognition sequences, likely permitting a fine-tuning of genome-encoded functions not achievable by genetic variation alone.
Affiliation(s)
- Florent Ailloud
- Medical Microbiology and Hospital Epidemiology, Max von Pettenkofer Institute, Faculty of Medicine, LMU Munich, Munich, Germany.
- German Center for Infection Research (DZIF), Partner Site Munich, Munich, Germany.
| | - Wilhelm Gottschall
- Medical Microbiology and Hospital Epidemiology, Max von Pettenkofer Institute, Faculty of Medicine, LMU Munich, Munich, Germany
| | - Sebastian Suerbaum
- Medical Microbiology and Hospital Epidemiology, Max von Pettenkofer Institute, Faculty of Medicine, LMU Munich, Munich, Germany.
- German Center for Infection Research (DZIF), Partner Site Munich, Munich, Germany.
|
2
|
Callens M, Pradier L, Finnegan M, Rose C, Bedhomme S. Read between the lines: Diversity of non-translational selection pressures on local codon usage. Genome Biol Evol 2021; 13:6263832. [PMID: 33944930 PMCID: PMC8410138 DOI: 10.1093/gbe/evab097] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Accepted: 04/28/2021] [Indexed: 12/14/2022] Open
Abstract
Protein coding genes can contain specific motifs within their nucleotide sequence that function as a signal for various biological pathways. The presence of such sequence motifs within a gene can have beneficial or detrimental effects on the phenotype and fitness of an organism, and this can lead to the enrichment or avoidance of this sequence motif. The degeneracy of the genetic code allows for the existence of alternative synonymous sequences that exclude or include these motifs, while keeping the encoded amino acid sequence intact. This implies that locally, there can be a selective pressure for preferentially using a codon over its synonymous alternative in order to avoid or enrich a specific sequence motif. This selective pressure could -in addition to mutation, drift and selection for translation efficiency and accuracy- contribute to shape the codon usage bias. In this review, we discuss patterns of avoidance of (or enrichment for) the various biological signals contained in specific nucleotide sequence motifs: transcription and translation initiation and termination signals, mRNA maturation signals, and antiviral immune system targets. Experimental data on the phenotypic or fitness effects of synonymous mutations in these sequence motifs confirm that they can be targets of local selection pressures on codon usage. We also formulate the hypothesis that transposable elements could have a similar impact on codon usage through their preferred integration sequences. Overall, selection on codon usage appears to be a combination of a global selection pressure imposed by the translation machinery, and a patchwork of local selection pressures related to biological signals contained in specific sequence motifs.
Affiliation(s)
- Martijn Callens
- Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche pour le Développement, 34000 Montpellier, France
| | - Léa Pradier
- Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche pour le Développement, 34000 Montpellier, France
| | - Michael Finnegan
- Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche pour le Développement, 34000 Montpellier, France
| | - Caroline Rose
- Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche pour le Développement, 34000 Montpellier, France
| | - Stéphanie Bedhomme
- Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, Université de Montpellier, Université Paul Valéry Montpellier 3, Ecole Pratique des Hautes Etudes, Institut de Recherche pour le Développement, 34000 Montpellier, France
|
3
|
Xu H, Zhao Y, Wu X, Wu Z. Quick assessment of the potato chip crispness using the mechanical-acoustic measurement method. International Journal of Food Engineering 2020. [DOI: 10.1515/ijfe-2020-0135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
The traditional method for assessing food crispness is sensory analysis, which is time-consuming and requires experienced panelists. To develop a quick evaluation of food crispness, a mechanical-acoustic testing method was proposed in which two parameters, maximum force (Fmax) and maximum acoustic energy in unit time (SEmax), were applied to assess the crispness of dried potato chips. The mechanical-acoustic test was completed in about 1.2 s, and the potato chips showed statistical distributions for Fmax and SEmax. The brand A potato chips had a statistical average Fmax of 13.48 N and SEmax of 93.51 mV·ms. Three kinds of potato chips could be effectively differentiated according to the statistical average SEmax and Fmax. Sensory “crispness” correlated well with the statistical average SEmax. This work shows that a quick measurement of food crispness using this mechanical-acoustic method is feasible.
Affiliation(s)
- Huili Xu
- College of Mechanical Engineering, Tianjin Key Laboratory of Integrated Design and On-line Monitoring for Light Industry & Food Machinery and Equipment, Tianjin University of Science and Technology, Tianjin, China
| | - Yong Zhao
- College of Mechanical Engineering, Tianjin Key Laboratory of Integrated Design and On-line Monitoring for Light Industry & Food Machinery and Equipment, Tianjin University of Science and Technology, Tianjin, China
| | - Xuyao Wu
- College of Mechanical Engineering, Tianjin Key Laboratory of Integrated Design and On-line Monitoring for Light Industry & Food Machinery and Equipment, Tianjin University of Science and Technology, Tianjin, China
| | - Zhonghua Wu
- College of Mechanical Engineering, Tianjin Key Laboratory of Integrated Design and On-line Monitoring for Light Industry & Food Machinery and Equipment, Tianjin University of Science and Technology, Tianjin, China
- International Science and Technology Cooperation Base of Low-Carbon Green Process Equipment, Tianjin, China
|
4
|
Van Leuven JT, Ederer MM, Burleigh K, Scott L, Hughes RA, Codrea V, Ellington AD, Wichman HA, Miller CR. ΦX174 Attenuation by Whole-Genome Codon Deoptimization. Genome Biol Evol 2020; 13:5921183. [PMID: 33045052 PMCID: PMC7881332 DOI: 10.1093/gbe/evaa214] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Accepted: 10/07/2020] [Indexed: 12/11/2022] Open
Abstract
Natural selection acting on synonymous mutations in protein-coding genes influences genome composition and evolution. In viruses, introducing synonymous mutations in genes encoding structural proteins can drastically reduce viral growth, providing a means to generate potent, live-attenuated vaccine candidates. However, an improved understanding of what compositional features are under selection and how combinations of synonymous mutations affect viral growth is needed to predictably attenuate viruses and make them resistant to reversion. We systematically recoded all nonoverlapping genes of the bacteriophage ΦX174 with codons rarely used in its Escherichia coli host. The fitness of recombinant viruses decreases as additional deoptimizing mutations are made to the genome, although not always linearly, and not consistently across genes. Combining deoptimizing mutations may reduce viral fitness more or less than expected from the effect size of the constituent mutations and we point out difficulties in untangling correlated compositional features. We test our model by optimizing the same genes and find that the relationship between codon usage and fitness does not hold for optimization, suggesting that wild-type ΦX174 is at a fitness optimum. This work highlights the need to better understand how selection acts on patterns of synonymous codon usage across the genome and provides a convenient system to investigate the genetic determinants of virulence.
Affiliation(s)
- James T Van Leuven
- Department of Biological Science, University of Idaho.,Institute for Modeling Collaboration and Innovation, University of Idaho
| | | | - Katelyn Burleigh
- Department of Biological Science, University of Idaho.,Present address: Seattle Children's Research Institute, Seattle, WA
| | - LuAnn Scott
- Department of Biological Science, University of Idaho
| | - Randall A Hughes
- Applied Research Laboratories, University of Texas, Austin.,Present address: Biotechnology Branch, CCDC US Army Research Laboratory, Adelphi, MD
| | - Vlad Codrea
- Institute for Cellular and Molecular Biology, University of Texas, Austin
| | - Andrew D Ellington
- Applied Research Laboratories, University of Texas, Austin.,Institute for Cellular and Molecular Biology, University of Texas, Austin
| | - Holly A Wichman
- Department of Biological Science, University of Idaho.,Institute for Modeling Collaboration and Innovation, University of Idaho
| | - Craig R Miller
- Department of Biological Science, University of Idaho.,Institute for Modeling Collaboration and Innovation, University of Idaho
|
5
|
Zarai Y, Zafrir Z, Siridechadilok B, Suphatrakul A, Roopin M, Julander J, Tuller T. Evolutionary selection against short nucleotide sequences in viruses and their related hosts. DNA Res 2020; 27:dsaa008. [PMID: 32339222 PMCID: PMC7320823 DOI: 10.1093/dnares/dsaa008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/02/2020] [Accepted: 04/20/2020] [Indexed: 11/13/2022] Open
Abstract
Viruses are under constant evolutionary pressure to effectively interact with the host intracellular factors, while evading its immune system. Understanding how viruses co-evolve with their hosts is a fundamental topic in molecular evolution and may also aid in developing novel viral-based applications such as vaccines, oncologic therapies, and anti-bacterial treatments. Here, based on a novel statistical framework and a large-scale genomic analysis of 2,625 viruses from all classes infecting 439 host organisms from all kingdoms of life, we identify short nucleotide sequences that are under-represented in the coding regions of viruses and their hosts. These sequences cannot be explained by the coding regions' amino acid content, codon, and dinucleotide frequencies. We specifically show that short homooligonucleotide and palindromic sequences tend to be under-represented in many viruses, probably due to their effect on gene expression regulation and the interaction with the host immune system. In addition, we show that more sequences tend to be under-represented in dsDNA viruses than in other viral groups. Finally, we demonstrate, based on in vitro and in vivo experiments, how under-represented sequences can be used to attenuate Zika virus strains.
Affiliation(s)
- Yoram Zarai
- Biomedical Engineering Department, Tel Aviv University, Tel Aviv 69978, Israel
| | - Zohar Zafrir
- Biomedical Engineering Department, Tel Aviv University, Tel Aviv 69978, Israel
- SynVaccine Ltd., Ramat Hachayal, Tel Aviv, Israel
| | | | - Amporn Suphatrakul
- National Center for Genetic Engineering and Biotechnology, Pathumthani 12120, Thailand
| | - Modi Roopin
- Biomedical Engineering Department, Tel Aviv University, Tel Aviv 69978, Israel
- SynVaccine Ltd., Ramat Hachayal, Tel Aviv, Israel
| | - Justin Julander
- Institute for Antiviral Research, Utah State University, Logan, UT, USA
| | - Tamir Tuller
- Biomedical Engineering Department, Tel Aviv University, Tel Aviv 69978, Israel
- SynVaccine Ltd., Ramat Hachayal, Tel Aviv, Israel
|
6
|
Asymptotic Analysis of the kth Subword Complexity. Entropy 2020; 22:e22020207. [PMID: 33285983 PMCID: PMC7516637 DOI: 10.3390/e22020207] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/25/2019] [Revised: 01/28/2020] [Accepted: 02/04/2020] [Indexed: 11/18/2022]
Abstract
Patterns within strings enable us to extract vital information regarding a string’s randomness. Whether a string is random (showing little to no repetition in patterns) or periodic (showing repetitions in patterns) is described by a value called the kth Subword Complexity of the character string. By definition, the kth Subword Complexity is the number of distinct substrings of length k that appear in a given string. In this paper, we evaluate the expected value and the second factorial moment (followed by a corollary on the second moment) of the kth Subword Complexity for binary strings over memoryless sources. We first take a combinatorial approach to derive a probability generating function for the number of occurrences of patterns in strings of finite length. This gives an exact expression for the two moments in terms of the patterns’ auto-correlation and correlation polynomials. We then investigate the asymptotic behavior for values of k = Θ(log n). In the proof, we compare the distribution of the kth Subword Complexity of binary strings to the distribution of distinct prefixes of independent strings stored in a trie. The methodology that we use involves complex analysis, analytical poissonization and depoissonization, the Mellin transform, and saddle point analysis.
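The definition quoted in this abstract (the number of distinct substrings of length k in a string) can be illustrated with a minimal sketch. The function name `subword_complexity` and the two toy binary strings are assumptions made for illustration; they are not taken from the paper, whose contribution is the asymptotic analysis rather than this direct count.

```python
def subword_complexity(s: str, k: int) -> int:
    """Count the distinct substrings of length k that occur in s.

    Hypothetical helper: a direct, set-based count of the quantity
    defined in the abstract, not the paper's analytical machinery.
    """
    if k <= 0 or k > len(s):
        return 0
    # Slide a window of width k over s and collect each distinct window.
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


# Toy inputs (assumed for illustration): a strictly periodic string and
# a less repetitive one of the same length.
periodic = "01" * 8
less_repetitive = "0110100110010110"

for name, text in (("periodic", periodic), ("less repetitive", less_repetitive)):
    print(name, [subword_complexity(text, k) for k in (1, 2, 3)])
```

A periodic string reuses the same few windows, so its count stays small as k grows, while a less repetitive string of equal length yields more distinct windows.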
|
7
|
Zhou Y, Zhang W, Wu H, Huang K, Jin J. A high-resolution genomic composition-based method with the ability to distinguish similar bacterial organisms. BMC Genomics 2019; 20:754. [PMID: 31638897 PMCID: PMC6805505 DOI: 10.1186/s12864-019-6119-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/26/2019] [Accepted: 09/20/2019] [Indexed: 12/03/2022] Open
Abstract
Background Genomic composition has been found to be species specific and is used to differentiate bacterial species. To date, almost no published composition-based approaches are able to distinguish between most closely related organisms, including intra-genus species and intra-species strains. Thus, it is necessary to develop a novel approach to address this problem. Results Here, we initially determine that the “tetranucleotide-derived z-value Pearson correlation coefficient” (TETRA) approach is representative of other published statistical methods. Then, we devise a novel method called “Tetranucleotide-derived Z-value Manhattan Distance” (TZMD) and compare it with the TETRA approach. Our results show that TZMD reflects the maximal genome difference, while TETRA does not in most conditions, demonstrating in theory that TZMD provides improved resolution. Additionally, our analysis of real data shows that TZMD improves species differentiation and clearly differentiates similar organisms, including similar species belonging to the same genospecies, subspecies and intraspecific strains, most of which cannot be distinguished by TETRA. Furthermore, TZMD is able to determine clonal strains with the TZMD = 0 criterion, which intrinsically encompasses identical composition, high average nucleotide identity and high percentage of shared genomes. Conclusions Our extensive assessment demonstrates that TZMD has high resolution. This study is the first to propose a composition-based method for differentiating bacteria at the strain level and to demonstrate that composition is also strain specific. TZMD is a powerful tool and the first easy-to-use approach for differentiating clonal and non-clonal strains. Therefore, as the first composition-based algorithm for strain typing, TZMD will facilitate bacterial studies in the future.
Affiliation(s)
- Yizhuang Zhou
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China. .,Peking-Tsinghua Center for Life Science, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, People's Republic of China.
| | - Wenting Zhang
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
| | - Huixian Wu
- China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.,Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
| | - Kai Huang
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.,China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.,Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
| | - Junfei Jin
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China. .,China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China. .,Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.
|
8
|
Jariah ROA, Hakim MS. Interaction of phages, bacteria, and the human immune system: Evolutionary changes in phage therapy. Rev Med Virol 2019; 29:e2055. [PMID: 31145517 DOI: 10.1002/rmv.2055] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 03/26/2019] [Revised: 05/01/2019] [Accepted: 05/02/2019] [Indexed: 12/26/2022]
Abstract
Phages and bacteria are known to undergo dynamic and co-evolutionary arms race interactions in order to survive. Recent advances from in vitro and in vivo studies have improved our understanding of the complex interactions between phages, bacteria, and the human immune system. This insight is essential for the development of phage therapy to battle the growing problems of antibiotic resistance. It is also pivotal to prevent the development of phage-resistance during the implementation of phage therapy in the clinic. In this review, we discuss recent progress of the interactions between phages, bacteria, and the human immune system and its clinical application for phage therapy. Proper phage therapy design will ideally produce large burst sizes, short latent periods, broad host ranges, and a low tendency to select resistance.
Affiliation(s)
- Rizka O A Jariah
- Department of Health Science, Faculty of Vocational Studies, Universitas Airlangga, Surabaya, Indonesia
| | - Mohamad S Hakim
- Department of Microbiology, Faculty of Medicine, Public Health and Nursing, Universitas Gadjah Mada, Yogyakarta, Indonesia
|
9
|
Brownell D, King J, Caliando B, Sycheva L, Koeris M. Engineering Bacteriophage-Based Biosensors. Methods Mol Biol 2019; 1898:37-50. [PMID: 30570721 DOI: 10.1007/978-1-4939-8940-9_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Bacteriophages have been used for diagnostic purposes in the past, but a lack of parallelizable engineering methods had limited their applicability to a narrow subset of diagnostic settings. More recently, however, advances in DNA sequencing and the introduction of more sensitive reporter systems have enabled novel engineering methods, which in turn have broadened the scope of modern phage diagnostics. Here we describe advanced methods to engineer the genomes of bacteriophages in a modular and rapid fashion.
Affiliation(s)
- Daniel Brownell
- Sample 6 Technologies, 15300 Bothell Way NE Lake Forest Park, WA, 98155, Woburn, MA, USA
| | - John King
- Sample 6 Technologies, 15300 Bothell Way NE Lake Forest Park, WA, 98155, Woburn, MA, USA
| | - Brian Caliando
- Sample 6 Technologies, 15300 Bothell Way NE Lake Forest Park, WA, 98155, Woburn, MA, USA
| | - Lada Sycheva
- Sample 6 Technologies, 15300 Bothell Way NE Lake Forest Park, WA, 98155, Woburn, MA, USA
| | - Michael Koeris
- Sample 6 Technologies, 15300 Bothell Way NE Lake Forest Park, WA, 98155, Woburn, MA, USA.
|
10
|
Rusinov IS, Ershova AS, Karyagina AS, Spirin SA, Alexeevski AV. Avoidance of recognition sites of restriction-modification systems is a widespread but not universal anti-restriction strategy of prokaryotic viruses. BMC Genomics 2018; 19:885. [PMID: 30526500 PMCID: PMC6286503 DOI: 10.1186/s12864-018-5324-3] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 04/19/2018] [Accepted: 11/28/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Restriction-modification (R-M) systems protect bacteria and archaea from attacks by bacteriophages and archaeal viruses. An R-M system specifically recognizes short sites in foreign DNA and cleaves it, while such sites in the host DNA are protected by methylation. Prokaryotic viruses have developed a number of strategies to overcome this host defense. The simplest anti-restriction strategy is the elimination of recognition sites in the viral genome: no sites, no DNA cleavage. Even a decrease of the number of recognition sites can help a virus to overcome this type of host defense. Recognition site avoidance has been a known anti-restriction strategy of prokaryotic viruses for decades. However, recognition site avoidance has not been systematically studied with the currently available sequence data. We analyzed the complete genomes of almost 4000 prokaryotic viruses with known host species and more than 17,000 restriction endonucleases with known specificities in terms of recognition site avoidance. RESULTS We observed considerable limitations of recognition site avoidance as an anti-restriction strategy. Namely, the avoidance of recognition sites is specific for dsDNA and ssDNA prokaryotic viruses. Avoidance is much more pronounced in the genomes of non-temperate bacteriophages than in the genomes of temperate ones. Avoidance is not observed for the sites of Type I and Type IIG systems and is very rarely observed for the sites of Type III systems. The vast majority of avoidance cases concern recognition sites of orthodox Type II restriction-modification systems. Even under these constraints, complete or almost complete elimination of sites is observed for approximately one-tenth of viral genomes and a significant under-representation for approximately one-fourth of them. CONCLUSIONS Avoidance of recognition sites of restriction-modification systems is a widespread but not universal anti-restriction strategy of prokaryotic viruses.
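The basic quantity behind recognition site avoidance is how often a site occurs in a genome. The sketch below counts occurrences on both strands, assuming plain A/C/G/T sequences. The helper names and toy sequences are hypothetical, and this is not the authors' pipeline, which the abstract does not describe. The example site GAATTC is the EcoRI recognition sequence.

```python
# Translation table for complementing A/C/G/T (assumes no ambiguity codes).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_sites(genome: str, site: str) -> int:
    """Count overlapping occurrences of a recognition site on both strands.

    A palindromic site reads the same on both strands, so it is counted
    once per location; a non-palindromic site is also searched for via
    its reverse complement.
    """
    genome, site = genome.upper(), site.upper()

    def count(pattern: str) -> int:
        # Overlapping matches: test every start position.
        last_start = len(genome) - len(pattern)
        return sum(genome.startswith(pattern, i) for i in range(last_start + 1))

    rc = reverse_complement(site)
    return count(site) if rc == site else count(site) + count(rc)


# Toy sequence (assumed for illustration), not real phage data.
toy_genome = "GAATTCAAGAATTC"
print(count_sites(toy_genome, "GAATTC"))
```

A count of zero corresponds to the abstract's "no sites, no DNA cleavage" case. Judging under-representation would also need an expected count from a background model, which this sketch deliberately leaves out.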
Affiliation(s)
- I S Rusinov
- Belozersky Institute of Physical and Chemical Biology, Lomonosov Moscow State University, 119992, Moscow, Russia
| | - A S Ershova
- Belozersky Institute of Physical and Chemical Biology, Lomonosov Moscow State University, 119992, Moscow, Russia.,Gamaleya National Research Center of Epidemiology and Microbiology of the Ministry of Health of the Russian Federation, 123098, Moscow, Russia.,All-Russia Research Institute of Agricultural Biotechnology, 127550, Moscow, Russia
| | - A S Karyagina
- Belozersky Institute of Physical and Chemical Biology, Lomonosov Moscow State University, 119992, Moscow, Russia.,Gamaleya National Research Center of Epidemiology and Microbiology of the Ministry of Health of the Russian Federation, 123098, Moscow, Russia.,All-Russia Research Institute of Agricultural Biotechnology, 127550, Moscow, Russia
| | - S A Spirin
- Belozersky Institute of Physical and Chemical Biology, Lomonosov Moscow State University, 119992, Moscow, Russia.,Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia.,National Research University Higher School of Economics, 101000, Moscow, Russia.,Institute of System Studies, 117281, Moscow, Russia
| | - A V Alexeevski
- Belozersky Institute of Physical and Chemical Biology, Lomonosov Moscow State University, 119992, Moscow, Russia. .,Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia. .,Institute of System Studies, 117281, Moscow, Russia.
|
11
|
Pleška M, Guet CC. Effects of mutations in phage restriction sites during escape from restriction-modification. Biol Lett 2018; 13:rsbl.2017.0646. [PMID: 29237814 DOI: 10.1098/rsbl.2017.0646] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 10/17/2017] [Accepted: 11/15/2017] [Indexed: 01/21/2023] Open
Abstract
Restriction-modification systems are widespread genetic elements that protect bacteria from bacteriophage infections by recognizing and cleaving heterologous DNA at short, well-defined sequences called restriction sites. Bioinformatic evidence shows that restriction sites are significantly underrepresented in bacteriophage genomes, presumably because bacteriophages with fewer restriction sites are more likely to escape cleavage by restriction-modification systems. However, how mutations in restriction sites affect the likelihood of bacteriophage escape is unknown. Using the bacteriophage λ and the restriction-modification system EcoRI, we show that while mutation effects at different restriction sites are unequal, they are independent. As a result, the probability of bacteriophage escape increases with each mutated restriction site. Our results experimentally support the role of restriction site avoidance as a response to selection imposed by restriction-modification systems and offer an insight into the events underlying the process of bacteriophage escape.
Affiliation(s)
- Maroš Pleška
- Institute of Science and Technology Austria, Am Campus 1, Klosterneuburg 3400, Austria
| | - Călin C Guet
- Institute of Science and Technology Austria, Am Campus 1, Klosterneuburg 3400, Austria
|
12
|
Kondrashov A, Duc Hoang M, Smith JGW, Bhagwan JR, Duncan G, Mosqueira D, Munoz MB, Vo NTN, Denning C. Simplified Footprint-Free Cas9/CRISPR Editing of Cardiac-Associated Genes in Human Pluripotent Stem Cells. Stem Cells Dev 2018; 27:391-404. [PMID: 29402189 PMCID: PMC5882176 DOI: 10.1089/scd.2017.0268] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Indexed: 12/24/2022] Open
Abstract
Modeling disease with human pluripotent stem cells (hPSCs) is hindered because the impact on cell phenotype from genetic variability between individuals can be greater than from the pathogenic mutation. While “footprint-free” Cas9/CRISPR editing solves this issue, existing approaches are inefficient or lengthy. In this study, a simplified PiggyBac strategy shortened hPSC editing by 2 weeks and required one round of clonal expansion and genotyping rather than two, with similar efficiencies to the longer conventional process. Success was shown across four cardiac-associated loci (ADRB2, GRK5, RYR2, and ACTC1) by genomic cleavage and editing efficiencies of 8%–93% and 8%–67%, respectively, including mono- and/or biallelic events. Pluripotency was retained, as was differentiation into high-purity cardiomyocytes (CMs; 88%–99%). Using the GRK5 isogenic lines as an exemplar, chronic stimulation with the β-adrenoceptor agonist, isoprenaline, reduced beat rate in hPSC-CMs expressing GRK5-Q41 but not GRK5-L41; this was reversed by the β-blocker, propranolol. This shortened, footprint-free approach will be useful for mechanistic studies.
Affiliation(s)
- Alexander Kondrashov
  - Department of Stem Cell Biology, Centre of Biomolecular Sciences, University of Nottingham, Nottingham, United Kingdom
- Minh Duc Hoang
  - Department of Stem Cell Biology, Centre of Biomolecular Sciences, University of Nottingham, Nottingham, United Kingdom
- James G W Smith
  - Department of Stem Cell Biology, Centre of Biomolecular Sciences, University of Nottingham, Nottingham, United Kingdom
- Jamie R Bhagwan
  - Department of Stem Cell Biology, Centre of Biomolecular Sciences, University of Nottingham, Nottingham, United Kingdom
- Gary Duncan
  - Department of Stem Cell Biology, Centre of Biomolecular Sciences, University of Nottingham, Nottingham, United Kingdom
- Diogo Mosqueira
  - Department of Stem Cell Biology, Centre of Biomolecular Sciences, University of Nottingham, Nottingham, United Kingdom
- Maria Barbadillo Munoz
  - Department of Stem Cell Biology, Centre of Biomolecular Sciences, University of Nottingham, Nottingham, United Kingdom
- Nguyen T N Vo
  - Department of Stem Cell Biology, Centre of Biomolecular Sciences, University of Nottingham, Nottingham, United Kingdom
- Chris Denning
  - Department of Stem Cell Biology, Centre of Biomolecular Sciences, University of Nottingham, Nottingham, United Kingdom
|
13
|
Ershova AS, Rusinov IS, Spirin SA, Karyagina AS, Alexeevski AV. Role of Restriction-Modification Systems in Prokaryotic Evolution and Ecology. BIOCHEMISTRY (MOSCOW) 2016; 80:1373-86. [PMID: 26567582 DOI: 10.1134/s0006297915100193] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Abstract
Restriction-modification (R-M) systems are able to methylate or cleave DNA depending on methylation status of their recognition site. It allows them to protect bacterial cells from invasion by foreign DNA. Comparative analysis of a large number of available bacterial genomes and methylomes clearly demonstrates that the role of R-M systems in bacteria is wider than only defense. R-M systems maintain heterogeneity of a bacterial population and are involved in adaptation of bacteria to change in their environmental conditions. R-M systems can be essential for host colonization by pathogenic bacteria. Phase variation and intragenomic recombinations are sources of the fast evolution of the specificity of R-M systems. This review focuses on the influence of R-M systems on evolution and ecology of prokaryotes.
Affiliation(s)
- A S Ershova
  - Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, 119991, Russia.
|
14
|
On the First k Moments of the Random Count of a Pattern in a Multistate Sequence Generated by a Markov Source. J Appl Probab 2016. [DOI: 10.1017/s0021900200007403] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
In this paper we develop an explicit formula that allows us to compute the first k moments of the random count of a pattern in a multistate sequence generated by a Markov source. We derive efficient algorithms that allow us to deal with any pattern (low or high complexity) in any Markov model (homogeneous or not). We then apply these results to the distribution of DNA patterns in genomic sequences, and we show that moment-based developments (namely Edgeworth's expansion and Gram-Charlier type-B series) allow us to improve the reliability of common asymptotic approximations, such as Gaussian or Poisson approximations.
|
15
|
Nuel G. On the First k Moments of the Random Count of a Pattern in a Multistate Sequence Generated by a Markov Source. J Appl Probab 2016. [DOI: 10.1239/jap/1294170523] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
In this paper we develop an explicit formula that allows us to compute the first k moments of the random count of a pattern in a multistate sequence generated by a Markov source. We derive efficient algorithms that allow us to deal with any pattern (low or high complexity) in any Markov model (homogeneous or not). We then apply these results to the distribution of DNA patterns in genomic sequences, and we show that moment-based developments (namely Edgeworth's expansion and Gram-Charlier type-B series) allow us to improve the reliability of common asymptotic approximations, such as Gaussian or Poisson approximations.
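As a rough illustration of the simplest case only, the first moment (the expected count) of a single word can be written down directly when the sequence is assumed to come from a stationary, homogeneous first-order Markov chain. The sketch below is not the algorithm of the paper, which handles higher moments and general patterns; the function name `expected_count` and the dictionary layout of `stationary` and `transition` are hypothetical choices made for this example.

```python
def expected_count(pattern, n, stationary, transition):
    """Expected number of (possibly overlapping) occurrences of `pattern`
    in a sequence of length n, assuming a stationary first-order Markov chain.

    stationary[a]    : stationary probability of symbol a (assumed input)
    transition[a][b] : probability of symbol b following symbol a (assumed input)
    """
    m = len(pattern)
    if m == 0 or n < m:
        return 0.0
    # Probability that the word starts at one given position.
    p = stationary[pattern[0]]
    for a, b in zip(pattern, pattern[1:]):
        p *= transition[a][b]
    # Linearity of expectation over the n - m + 1 possible start positions.
    return (n - m + 1) * p


# Uniform chain over A, C, G, T used purely as a toy model.
uniform = {b: 0.25 for b in "ACGT"}
uniform_transition = {a: dict(uniform) for a in "ACGT"}
print(expected_count("ACGT", 1003, uniform, uniform_transition))
```

The variance and higher moments depend on how the word overlaps with itself, which is where the explicit formulas described in the abstract become necessary.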
|
16
|
Pleška M, Qian L, Okura R, Bergmiller T, Wakamoto Y, Kussell E, Guet C. Bacterial Autoimmunity Due to a Restriction-Modification System. Curr Biol 2016; 26:404-9. [DOI: 10.1016/j.cub.2015.12.041] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2015] [Revised: 11/08/2015] [Accepted: 12/10/2015] [Indexed: 01/25/2023]
|
17
|
Rusinov I, Ershova A, Karyagina A, Spirin S, Alexeevski A. Lifespan of restriction-modification systems critically affects avoidance of their recognition sites in host genomes. BMC Genomics 2015; 16:1084. [PMID: 26689194 PMCID: PMC4687349 DOI: 10.1186/s12864-015-2288-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2015] [Accepted: 12/11/2015] [Indexed: 01/10/2023] Open
Abstract
Background Avoidance of palindromic recognition sites of Type II restriction-modification (R-M) systems was shown for many R-M systems in dozens of prokaryotic genomes. However, the phenomenon has not been investigated systematically for all presently available genomes and annotated R-M systems. We have studied all known recognition sites in thousands of prokaryotic genomes and found factors that influence their avoidance. Results Only Type II R-M systems consisting of independently acting endonuclease and methyltransferase (called ‘orthodox’ here) cause avoidance of their sites, both palindromic and asymmetric, in corresponding prokaryotic genomes; the avoidance takes place for ~50% of 1774 studied cases. It is known that prokaryotes can acquire and lose R-M systems. Thus it is possible to talk about the lifespan of an R-M system in a genome. We have shown that the recognition site avoidance correlates with the lifespan of R-M systems. The sites of orthodox R-M systems that are encoded in host genomes for a long time are avoided more often (up to 100% in certain cohorts) than the sites of recently acquired ones. We also found cases of site avoidance in the absence of the corresponding R-M systems in the genome. An analysis of closely related bacteria shows that such avoidance can be a trace of lost R-M systems. Sites of Type I, IIC/G, IIM, III, and IV R-M systems are not avoided in the vast majority of cases. Conclusions The avoidance of orthodox Type II R-M system recognition sites in prokaryotic genomes is a widespread phenomenon. Presence of an R-M system without an underrepresentation of its site may indicate that the R-M system was acquired recently. At the same time, a significant underrepresentation of a site may be a sign of presence of the corresponding R-M system in this organism or in its ancestors for a long time.
The drastic difference between site avoidance for orthodox Type II R-M systems and R-M systems of other types can be explained by a higher rate of specificity changes or lower self-toxicity of the latter.
Affiliation(s)
- Ivan Rusinov
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, 119992, Russia.
- Anna Ershova
  - Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, 119992, Russia.
  - Gamaleya Center of Epidemiology and Microbiology, Moscow, 123098, Russia.
  - Institute of Agricultural Biotechnology, the Russian Academy of Sciences, Moscow, 127550, Russia.
- Anna Karyagina
  - Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, 119992, Russia.
  - Gamaleya Center of Epidemiology and Microbiology, Moscow, 123098, Russia.
  - Institute of Agricultural Biotechnology, the Russian Academy of Sciences, Moscow, 127550, Russia.
- Sergey Spirin
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, 119992, Russia.
  - Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, 119992, Russia.
  - Scientific Research Institute for System Studies, the Russian Academy of Science (NIISI RAS), Moscow, 117281, Russia.
- Andrei Alexeevski
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, 119992, Russia.
  - Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, 119992, Russia.
  - Scientific Research Institute for System Studies, the Russian Academy of Science (NIISI RAS), Moscow, 117281, Russia.
|
18
|
Herrera S, Reyes-Herrera PH, Shank TM. Predicting RAD-seq Marker Numbers across the Eukaryotic Tree of Life. Genome Biol Evol 2015; 7:3207-25. [PMID: 26537225 PMCID: PMC4700943 DOI: 10.1093/gbe/evv210] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/02/2023] Open
Abstract
High-throughput sequencing of reduced representation libraries obtained through digestion with restriction enzymes--generically known as restriction site associated DNA sequencing (RAD-seq)--is a common strategy to generate genome-wide genotypic and sequence data from eukaryotes. A critical design element of any RAD-seq study is knowledge of the approximate number of genetic markers that can be obtained for a taxon using different restriction enzymes, as this number determines the scope of a project, and ultimately defines its success. This number can only be directly determined if a reference genome sequence is available, or it can be estimated if the genome size and restriction recognition sequence probabilities are known. However, both scenarios are uncommon for nonmodel species. Here, we performed systematic in silico surveys of recognition sequences, for diverse and commonly used type II restriction enzymes across the eukaryotic tree of life. Our observations reveal that recognition sequence frequencies for a given restriction enzyme are strikingly variable among broad eukaryotic taxonomic groups, being largely determined by phylogenetic relatedness. We demonstrate that genome sizes can be predicted from cleavage frequency data obtained with restriction enzymes targeting "neutral" elements. Models based on genomic compositions are also effective tools to accurately calculate probabilities of recognition sequences across taxa, and can be applied to species for which reduced representation data are available (including transcriptomes and neutral RAD-seq data sets). The analytical pipeline developed in this study, PredRAD (https://github.com/phrh/PredRAD), and the resulting databases constitute valuable resources that will help guide the design of any study using RAD-seq or related methods.
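The estimate mentioned in the abstract (genome size combined with recognition sequence probabilities) can be sketched in a few lines. This is a minimal illustration that assumes independent bases with a single GC content parameter; it is not the PredRAD pipeline, whose composition models are richer, and the function names `site_probability` and `expected_sites` are hypothetical.

```python
def site_probability(site, gc_content):
    """Probability of seeing `site` at one position, assuming independent
    bases with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2."""
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    prob = 1.0
    for base in site.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unsupported base: {base!r}")
    return prob


def expected_sites(site, genome_size, gc_content):
    """Expected number of recognition sites on one strand of a genome."""
    positions = max(genome_size - len(site) + 1, 0)
    return positions * site_probability(site, gc_content)


# EcoRI site (GAATTC) in a toy genome with 50% GC content.
print(expected_sites("GAATTC", 1_000_005, 0.5))
```

Comparing such an expectation with the observed count for a sequenced relative gives a quick sense of how far a given enzyme departs from the simple composition model.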
Affiliation(s)
- Santiago Herrera
  - Biology Department, Woods Hole Oceanographic Institution
  - Biology Department, Massachusetts Institute of Technology
|
19
|
Karamichalis R, Kari L, Konstantinidis S, Kopecki S. An investigation into inter- and intragenomic variations of graphic genomic signatures. BMC Bioinformatics 2015; 16:246. [PMID: 26249837 PMCID: PMC4527362 DOI: 10.1186/s12859-015-0655-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2014] [Accepted: 06/30/2015] [Indexed: 11/30/2022] Open
Abstract
Background Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences. Results We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. 
Conclusion Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.
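For readers unfamiliar with Chaos Game Representation, the construction can be sketched as follows: each base moves the current point halfway toward that base's corner of the unit square, and the resulting points are binned into a grid that can be compared between sequences. The corner assignment, the grid resolution `k`, the skipping of ambiguous bases, and the function names are assumptions made for this illustration; only the Euclidean distance is shown, not the other five distances evaluated in the paper.

```python
import math

# Assumed corner layout; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory; ambiguous bases are skipped (a simplification)."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0  # move halfway to the corner
        points.append((x, y))
    return points


def cgr_grid(seq, k=3):
    """Count CGR points in a 2^k x 2^k grid, normalised to sum to 1."""
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    points = cgr_points(seq)
    for x, y in points:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1.0
    total = len(points) or 1
    return [[cell / total for cell in row] for row in grid]


def euclidean(grid_a, grid_b):
    """Euclidean distance between two grids of equal size."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(grid_a, grid_b)
                         for a, b in zip(row_a, row_b)))


g1 = cgr_grid("ACGTACGTGGCC" * 20)
g2 = cgr_grid("AATTAATTATAT" * 20)
print(euclidean(g1, g2))
```

Normalising each grid by the number of points keeps sequences of different lengths comparable.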
Affiliation(s)
- Rallis Karamichalis
  - Department of Computer Science, University of Western Ontario, London, ON, Canada.
- Lila Kari
  - Department of Computer Science, University of Western Ontario, London, ON, Canada.
- Stavros Konstantinidis
  - Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada.
- Steffen Kopecki
  - Department of Computer Science, University of Western Ontario, London, ON, Canada.
  - Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada.
|
20
|
Régnier M, Furletova E, Yakovlev V, Roytberg M. Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models. Algorithms Mol Biol 2015; 9:25. [PMID: 25648087 PMCID: PMC4307674 DOI: 10.1186/s13015-014-0025-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2013] [Accepted: 11/09/2014] [Indexed: 12/02/2022] Open
Abstract
Background Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability to find at least S occurrences of words from a pattern in a random text of length N generated according to a given probability model. All words of the pattern are supposed to be of the same length. Results We present a novel algorithm SufPref that computes an exact P-value for Hidden Markov models (HMM). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from the pattern. The edges are associated to the prefix and suffix relations between overlaps. An originality of our data structure is that the pattern need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The gain in size of the SufPref data structure leads to significant improvements in space and time complexity compared to existing algorithms. The algorithm SufPref was implemented as a C++ program; the program can be used both as a web server and as a stand-alone program for Linux and Windows. The program interface admits special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/.
The program was applied to compare sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two and HMMs. The experiments show that the methods have approximately the same qualities.
|
21
|
Siranosian B, Perera S, Williams E, Ye C, de Graffenried C, Shank P. Tetranucleotide usage highlights genomic heterogeneity among mycobacteriophages. F1000Res 2015; 4:36. [PMID: 27134721 PMCID: PMC4841201 DOI: 10.12688/f1000research.6077.2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 10/28/2015] [Indexed: 02/02/2023] Open
Abstract
Background The genomic sequences of mycobacteriophages, phages infecting mycobacterial hosts, are diverse and mosaic. Mycobacteriophages often share little nucleotide similarity, but most of them have been grouped into lettered clusters and further into subclusters. Traditionally, mycobacteriophage genomes are analyzed based on sequence alignment or knowledge of gene content. However, these approaches are computationally expensive and can be ineffective for significantly diverged sequences. As an alternative to alignment-based genome analysis, we evaluated tetranucleotide usage in mycobacteriophage genomes. These methods make it easier to characterize features of the mycobacteriophage population at many scales. Description We computed tetranucleotide usage deviation (TUD), the ratio of observed counts of 4-mers in a genome to the expected count under a null model. TUD values are comparable between members of a phage subcluster and distinct between subclusters. With few exceptions, neighbor joining phylogenetic trees and hierarchical clustering dendrograms constructed using TUD values place phages in a monophyletic clade with members of the same subcluster. Regions in a genome with exceptional TUD values can point to interesting features of genomic architecture. Finally, we found that subcluster B3 mycobacteriophages contain significantly overrepresented 4-mers and 6-mers that are atypical of phage genomes. Conclusions Statistics based on tetranucleotide usage support established clustering of mycobacteriophages and can uncover interesting relationships within and between sequenced phage genomes. These methods are efficient to compute and do not require sequence alignment or knowledge of gene content. The code to download mycobacteriophage genome sequences and reproduce our analysis is freely available at
https://github.com/bsiranosian/tango_final.
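The TUD statistic described above, the ratio of observed to expected 4-mer counts, can be sketched as follows. The null model used here (independent bases drawn with the genome's own nucleotide frequencies) is an assumption for illustration and may differ from the null model in the authors' repository; the function name `tud` is likewise hypothetical.

```python
from collections import Counter
from itertools import product


def tud(seq, k=4):
    """Observed/expected ratio for every k-mer over A, C, G, T.

    Expected counts assume independent bases with the sequence's own
    nucleotide frequencies (an illustrative null model).
    """
    seq = seq.upper()
    base_counts = Counter(b for b in seq if b in "ACGT")
    total_bases = sum(base_counts.values())
    if total_bases == 0:
        raise ValueError("sequence contains no A, C, G or T")
    # Keep only windows made entirely of unambiguous bases.
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    valid = [w for w in windows if all(b in "ACGT" for b in w)]
    observed = Counter(valid)
    ratios = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        prob = 1.0
        for base in kmer:
            prob *= base_counts[base] / total_bases
        expected = len(valid) * prob
        ratios[kmer] = observed[kmer] / expected if expected > 0 else float("nan")
    return ratios


values = tud("ACGT" * 100)
print(round(values["ACGT"], 2), values["AAAA"])
```

Vectors of such ratios, one per genome, are the kind of input that clustering or neighbor joining would then operate on.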
Affiliation(s)
- Benjamin Siranosian
  - Center for Computational Molecular Biology, Brown University, Providence, RI, 02912, USA
  - Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA
- Sudheesha Perera
  - Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA
- Edward Williams
  - Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA
- Chen Ye
  - Division of Biology and Medicine, Brown University, Providence, RI, 02912, USA
- Peter Shank
  - Department of Molecular Microbiology and Immunology, Brown University, Providence, RI, 02912, USA
|
22
|
Kupczok A, Bollback JP. Motif depletion in bacteriophages infecting hosts with CRISPR systems. BMC Genomics 2014; 15:663. [PMID: 25103210 PMCID: PMC4246573 DOI: 10.1186/1471-2164-15-663] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2014] [Accepted: 02/15/2014] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND CRISPR is a microbial immune system likely to be involved in host-parasite coevolution. It functions using target sequences encoded by the bacterial genome, which interfere with invading nucleic acids using a homology-dependent system. The system also requires protospacer associated motifs (PAMs), short motifs close to the target sequence that are required for interference in CRISPR types I and II. Here, we investigate whether PAMs are depleted in phage genomes due to selection pressure to escape recognition. RESULTS To this end, we analyzed two data sets. Phages infecting all bacterial hosts were analyzed first, followed by a detailed analysis of phages infecting the genus Streptococcus, where PAMs are best understood. We use two different measures of motif underrepresentation that control for codon bias and the frequency of submotifs. We compare phages infecting species with a particular CRISPR type to those infecting species without that type. Since only known PAMs were investigated, the analysis is restricted to CRISPR types I-C and I-E and in Streptococcus to types I-C and II. We found evidence for PAM depletion in Streptococcus phages infecting hosts with CRISPR type I-C, in Vibrio phages infecting hosts with CRISPR type I-E and in Streptococcus thermophilus phages infecting hosts with type II-A, known as CRISPR3. CONCLUSIONS The observed motif depletion in phages with hosts having CRISPR can be attributed to selection rather than to mutational bias, as mutational bias should affect the phages of all hosts. This observation implies that the CRISPR system has been efficient in the groups discussed here.
Affiliation(s)
- Anne Kupczok
  - IST Austria, Am Campus 1, 3400 Klosterneuburg, Austria
  - Institute of Microbiology, Christian-Albrechts-University of Kiel, 24118 Kiel, Germany
|
23
|
Clustering of giant virus-DNA based on variations in local entropy. Viruses 2014; 6:2259-67. [PMID: 24887142 PMCID: PMC4074927 DOI: 10.3390/v6062259] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2014] [Revised: 05/19/2014] [Accepted: 05/21/2014] [Indexed: 11/17/2022] Open
Abstract
We present a method for clustering genomic sequences based on variations in local entropy. We have analyzed the distributions of the block entropies of viruses and plant genomes. A distinct pattern for viruses and plant genomes is observed. These distributions, which describe the local entropic variability of the genomes, are used for clustering the genomes based on the Jensen-Shannon (JS) distances. The analysis of the JS distances between all genomes that infect the chlorella algae shows the host specificity of the viruses. We illustrate the efficacy of this entropy-based clustering technique by the segregation of plant and virus genomes into separate bins.
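A rough sketch of the idea follows: compute an entropy per block, summarise each genome by the distribution of those entropies, and compare distributions with the Jensen-Shannon distance. The block definition used here (Shannon entropy of single-nucleotide composition in non-overlapping blocks), the block size, the histogram binning, and all function names are assumptions for illustration and may differ from the authors' definition of block entropy.

```python
import math
from collections import Counter


def block_entropies(seq, block=100):
    """Shannon entropy (bits) of base composition in non-overlapping blocks."""
    entropies = []
    for start in range(0, len(seq) - block + 1, block):
        counts = Counter(seq[start:start + block])
        entropies.append(-sum((c / block) * math.log2(c / block)
                              for c in counts.values()))
    return entropies


def histogram(values, bins=20, lo=0.0, hi=2.0):
    """Normalised histogram of entropies on [lo, hi]."""
    if not values:
        raise ValueError("no values to bin")
    hist = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        hist[max(0, min(idx, bins - 1))] += 1
    return [h / len(values) for h in hist]


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS divergence)."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


uniform_like = histogram(block_entropies("ACGT" * 250))
skewed = histogram(block_entropies("AAAT" * 250))
print(js_distance(uniform_like, skewed))
```

A matrix of such pairwise distances could then be passed to any standard clustering routine.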
|
24
|
Maldonado-Contreras A, Mane SP, Zhang XS, Pericchi L, Alarcón T, Contreras M, Linz B, Blaser MJ, Domínguez-Bello MG. Phylogeographic evidence of cognate recognition site patterns and transformation efficiency differences in H. pylori: theory of strain dominance. BMC Microbiol 2013; 13:211. [PMID: 24050390 PMCID: PMC3849833 DOI: 10.1186/1471-2180-13-211] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2013] [Accepted: 08/28/2013] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND Helicobacter pylori has diverged in parallel to its human host, leading to distinct phylogeographic populations. Recent evidence suggests that in the current human mixing in Latin America, European H. pylori (hpEurope) are increasingly dominant at the expense of Amerindian haplotypes (hspAmerind). This phenomenon might occur via DNA recombination, modulated by restriction-modification systems (RMS), in which differences in cognate recognition sites (CRS) and in active methylases will determine direction and frequency of gene flow. We hypothesized that genomes from hspAmerind strains that evolved from a small founder population have lost CRS for RMS and active methylases, promoting hpEurope's DNA invasion. We determined the observed and expected frequencies of CRS for RMS in DNA from 7 H. pylori whole genomes and 110 multilocus sequences. We also measured the number of active methylases by resistance to in vitro digestion by 16 restriction enzymes of genomic DNA from 9 hpEurope and 9 hspAmerind strains, and determined the direction of DNA uptake in co-culture experiments of hspAmerind and hpEurope strains. RESULTS Most of the CRS were underrepresented with consistency between whole genomes and multilocus sequences. Although neither the frequency of CRS nor the number of active methylases differ among the bacterial populations (average 8.6 ± 2.6), hspAmerind strains had a restriction profile distinct from that in hpEurope strains, with 15 recognition sites accounting for the differences. Amerindian strains also exhibited higher transformation rates than European strains, and were more susceptible to being subverted by larger hpEurope DNA fragments than vice versa. CONCLUSIONS The geographical variation in the pattern of CRS provides evidence for ancestral differences in RMS representation and function, and the transformation findings support the hypothesis of Europeanization of the Amerindian strains in Latin America via DNA recombination.
|
25
|
Roberts GA, Houston PJ, White JH, Chen K, Stephanou AS, Cooper LP, Dryden DT, Lindsay JA. Impact of target site distribution for Type I restriction enzymes on the evolution of methicillin-resistant Staphylococcus aureus (MRSA) populations. Nucleic Acids Res 2013; 41:7472-84. [PMID: 23771140 PMCID: PMC3753647 DOI: 10.1093/nar/gkt535] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2013] [Revised: 05/16/2013] [Accepted: 05/22/2013] [Indexed: 12/16/2022] Open
Abstract
A limited number of Methicillin-resistant Staphylococcus aureus (MRSA) clones are responsible for MRSA infections worldwide, and those of different lineages carry unique Type I restriction-modification (RM) variants. We have identified the specific DNA sequence targets for the dominant MRSA lineages CC1, CC5, CC8 and ST239. We experimentally demonstrate that this RM system is sufficient to block horizontal gene transfer between clinically important MRSA, confirming the bioinformatic evidence that each lineage is evolving independently. Target sites are distributed randomly in S. aureus genomes, except in a set of large conjugative plasmids encoding resistance genes that show evidence of spreading between two successful MRSA lineages. This analysis of the identification and distribution of target sites explains evolutionary patterns in a pathogenic bacterium. We show that a lack of specific target sites enables plasmids to evade the Type I RM system thereby contributing to the evolution of increasingly resistant community and hospital MRSA.
Affiliation(s)
- Gareth A. Roberts
  - EaStCHEM School of Chemistry, University of Edinburgh, The King’s Buildings, Edinburgh EH9 3JJ, UK and Division of Clinical Sciences, St. George’s, University of London, Cranmer Terrace, London, SW17 0RE, UK
- Patrick J. Houston
  - EaStCHEM School of Chemistry, University of Edinburgh, The King’s Buildings, Edinburgh EH9 3JJ, UK and Division of Clinical Sciences, St. George’s, University of London, Cranmer Terrace, London, SW17 0RE, UK
- John H. White
  - EaStCHEM School of Chemistry, University of Edinburgh, The King’s Buildings, Edinburgh EH9 3JJ, UK and Division of Clinical Sciences, St. George’s, University of London, Cranmer Terrace, London, SW17 0RE, UK
- Kai Chen
  - EaStCHEM School of Chemistry, University of Edinburgh, The King’s Buildings, Edinburgh EH9 3JJ, UK and Division of Clinical Sciences, St. George’s, University of London, Cranmer Terrace, London, SW17 0RE, UK
- Augoustinos S. Stephanou
  - EaStCHEM School of Chemistry, University of Edinburgh, The King’s Buildings, Edinburgh EH9 3JJ, UK and Division of Clinical Sciences, St. George’s, University of London, Cranmer Terrace, London, SW17 0RE, UK
- Laurie P. Cooper
  - EaStCHEM School of Chemistry, University of Edinburgh, The King’s Buildings, Edinburgh EH9 3JJ, UK and Division of Clinical Sciences, St. George’s, University of London, Cranmer Terrace, London, SW17 0RE, UK
- David T.F. Dryden
  - EaStCHEM School of Chemistry, University of Edinburgh, The King’s Buildings, Edinburgh EH9 3JJ, UK and Division of Clinical Sciences, St. George’s, University of London, Cranmer Terrace, London, SW17 0RE, UK
- Jodi A. Lindsay
  - EaStCHEM School of Chemistry, University of Edinburgh, The King’s Buildings, Edinburgh EH9 3JJ, UK and Division of Clinical Sciences, St. George’s, University of London, Cranmer Terrace, London, SW17 0RE, UK
|
26
|
CpG underrepresentation and the bacterial CpG-specific DNA methyltransferase M.MpeI. Proc Natl Acad Sci U S A 2012; 110:105-10. [PMID: 23248272 DOI: 10.1073/pnas.1207986110] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/08/2023] Open
Abstract
Cytosine methylation promotes deamination. In eukaryotes, CpG methylation is thought to account for CpG underrepresentation. Whether scarcity of CpGs in prokaryotic genomes is diagnostic for methylation is not clear. Here, we report that Mycoplasms tend to be CpG depleted and to harbor a family of constitutively expressed or phase variable CpG-specific DNA methyltransferases. The very CpG poor Mycoplasma penetrans and its constitutively active CpG-specific methyltransferase M.MpeI were chosen for further characterization. Genome-wide sequencing of bisulfite-converted DNA indicated that M.MpeI methylated CpG target sites both in vivo and in vitro in a locus-nonselective manner. A crystal structure of M.MpeI with DNA at 2.15-Å resolution showed that the substrate base was flipped and that its place in the DNA stack was taken by a glutamine residue. A phenylalanine residue was intercalated into the "weak" CpG step of the nonsubstrate strand, indicating mechanistic similarities in the recognition of the short CpG target sequence by prokaryotic and eukaryotic DNA methyltransferases.
|
27
|
Regad L, Martin J, Camproux AC. Dissecting protein loops with a statistical scalpel suggests a functional implication of some structural motifs. BMC Bioinformatics 2011; 12:247. [PMID: 21689388 PMCID: PMC3158783 DOI: 10.1186/1471-2105-12-247] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2010] [Accepted: 06/20/2011] [Indexed: 12/24/2022] Open
Abstract
Background One of the strategies for protein function annotation is to search particular structural motifs that are known to be shared by proteins with a given function. Results Here, we present a systematic extraction of structural motifs of seven residues from protein loops and we explore their correspondence with functional sites. Our approach is based on the structural alphabet HMM-SA (Hidden Markov Model - Structural Alphabet), which allows simplification of protein structures into uni-dimensional sequences, and advanced pattern statistics adapted to short sequences. Structural motifs of interest are selected by looking for structural motifs significantly over-represented in SCOP superfamilies in protein loops. We discovered two types of structural motifs significantly over-represented in SCOP superfamilies: (i) ubiquitous motifs, shared by several superfamilies and (ii) superfamily-specific motifs, over-represented in few superfamilies. A comparison of ubiquitous words with known small structural motifs shows that they contain well-described motifs as turn, niche or nest motifs. A comparison between superfamily-specific motifs and biological annotations of Swiss-Prot reveals that some of them actually correspond to functional sites involved in the binding sites of small ligands, such as ATP/GTP, NAD(P) and SAH/SAM. Conclusions Our findings show that statistical over-representation in SCOP superfamilies is linked to functional features. The detection of over-represented motifs within structures simplified by HMM-SA is therefore a promising approach for prediction of functional sites and annotation of uncharacterized proteins.
|
28
|
Regad L, Saladin A, Maupetit J, Geneix C, Camproux AC. SA-Mot: a web server for the identification of motifs of interest extracted from protein loops. Nucleic Acids Res 2011; 39:W203-9. [PMID: 21665924 PMCID: PMC3125790 DOI: 10.1093/nar/gkr410] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open
Abstract
The detection of functional motifs is an important step for the determination of protein functions. We present here a new web server SA-Mot (Structural Alphabet Motif) for the extraction and location of structural motifs of interest from protein loops. Contrary to other methods, SA-Mot does not focus only on functional motifs, but it extracts recurrent and conserved structural motifs involved in structural redundancy of loops. SA-Mot uses the structural word notion to extract all structural motifs from uni-dimensional sequences corresponding to loop structures. Then, SA-Mot provides a description of these structural motifs using statistics computed in the loop data set and in SCOP superfamily, sequence and structural parameters. SA-Mot results correspond to an interactive table listing all structural motifs extracted from a target structure and their associated descriptors. Using this information, the users can easily locate loop regions that are important for the protein folding and function. The SA-Mot web server is available at http://sa-mot.mti.univ-paris-diderot.fr.
Affiliation(s)
- Leslie Regad
- INSERM, U973, Université Paris 7-Paris Diderot, UMR-S973, MTi F-75013 Paris, France.
|
29
|
Matthews S, Rao VS, Durvasula RV. Modeling horizontal gene transfer (HGT) in the gut of the Chagas disease vector Rhodnius prolixus. Parasit Vectors 2011; 4:77. [PMID: 21569540 PMCID: PMC3117810 DOI: 10.1186/1756-3305-4-77] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2010] [Accepted: 05/14/2011] [Indexed: 11/20/2022] Open
Abstract
Background Paratransgenesis is an approach to reducing arthropod vector competence using genetically modified symbionts. When applied to control of Chagas disease, the symbiont bacterium Rhodococcus rhodnii, resident in the gut lumen of the triatomine vector Rhodnius prolixus (Hemiptera: Reduviidae), is transformed to export cecropin A, an insect immune peptide. Cecropin A is active against Trypanosoma cruzi, the causative agent of Chagas disease. While proof of concept has been achieved in laboratory studies, a rigorous and comprehensive risk assessment is required prior to consideration of field release. An important part of this assessment involves estimating probability of transgene horizontal transfer to environmental organisms (HGT). This article presents a two-part risk assessment methodology: a theoretical model predicting HGT in the gut of R. prolixus from the genetically transformed symbiont R. rhodnii to a closely related non-target bacterium, Gordona rubropertinctus, in the absence of selection pressure, and a series of laboratory trials designed to test the model. Results The model predicted an HGT frequency of less than 1.14 × 10⁻¹⁶ per 100,000 generations at the 99% certainty level. The model was iterated twenty times, with the mean of the ten highest outputs evaluated at the 99% certainty level. Laboratory trials indicated no horizontal gene transfer, supporting the conclusions of the model. Conclusions The model treats HGT as a composite event, the probability of which is determined by the joint probability of three independent events: gene transfer through the modalities of transformation, transduction, and conjugation. Genes are represented in matrices and Monte Carlo method and Markov chain analysis are used to simulate and evaluate environmental conditions. The model is intended as a risk assessment instrument and predicts HGT frequency of less than 1.14 × 10⁻¹⁶ per 100,000 generations. With laboratory studies that support the predictions of this model, it may be possible to argue that HGT is a negligible consideration in risk assessment of genetically modified R. rhodnii released for control of Chagas disease.
Affiliation(s)
- Scott Matthews
- Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87108, USA
|
30
|
Ekisheva S, Borodovsky M. Uniform Accuracy of the Maximum Likelihood Estimates for Probabilistic Models of Biological Sequences. Methodol Comput Appl Probab 2011; 13:105-120. [PMID: 21318122 PMCID: PMC3035201 DOI: 10.1007/s11009-009-9125-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Probabilistic models for biological sequences (DNA and proteins) have many useful applications in bioinformatics. Normally, the values of parameters of these models have to be estimated from empirical data. However, even for the most common estimates, the maximum likelihood (ML) estimates, properties have not been completely explored. Here we assess the uniform accuracy of the ML estimates for models of several types: the independence model, the Markov chain and the hidden Markov model (HMM). Particularly, we derive rates of decay of the maximum estimation error by employing the measure concentration as well as the Gaussian approximation, and compare these rates.
Affiliation(s)
- Svetlana Ekisheva
- Department of Mathematics, Syktyvkar State University, Oktjabrskii pr., 55, Syktyvkar, 167000, Russia
- Mark Borodovsky
- Wallace H. Coulter Department of Biomedical Engineering and Computational Science and Engineering Division, Georgia Institute of Technology, Atlanta, GA 30332-0535, USA
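The maximum likelihood estimates assessed in the abstract above can be illustrated for the simplest non-trivial case, a first-order Markov chain over the DNA alphabet. The following is a minimal sketch, not code from the paper: the function name `ml_transition_matrix` and the toy sequence are assumptions made for illustration. The ML estimate of each transition probability is the observed transition count divided by the number of times the preceding symbol was followed by any symbol.

```python
from collections import Counter

def ml_transition_matrix(seq, alphabet="ACGT"):
    """Maximum likelihood estimate of first-order Markov transition probabilities.

    p_hat(b | a) = count(ab) / sum over b' of count(ab').
    Rows for symbols that never occur as a predecessor are left as zeros.
    """
    pair_counts = Counter(zip(seq, seq[1:]))
    matrix = {}
    for a in alphabet:
        row_total = sum(pair_counts[(a, b)] for b in alphabet)
        matrix[a] = {
            b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
            for b in alphabet
        }
    return matrix

# Toy sequence (hypothetical data, for illustration only)
P = ml_transition_matrix("ACGTACGTAACC")
```

With so short a sequence the estimates are coarse, which is the regime where bounds on the estimation error of the kind studied in the paper become relevant.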
|
31
|
Babbitt GA. Relaxed selection against accidental binding of transcription factors with conserved chromatin contexts. Gene 2010; 466:43-8. [PMID: 20637845 DOI: 10.1016/j.gene.2010.07.002] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2010] [Revised: 06/30/2010] [Accepted: 07/07/2010] [Indexed: 02/03/2023]
Abstract
The spurious (or nonfunctional) binding of transcription factors (TF) to the wrong locations on DNA presents a formidable challenge to genomes given the relatively low ceiling for sequence complexity within the short lengths of most binding motifs. The high potential for the occurrence of random motifs and subsequent nonfunctional binding of many transcription factors should theoretically lead to natural selection against the occurrence of spurious motifs throughout the genome. However, because of the active role that chromatin can play in eukaryotic gene regulation, it may also be expected that many supposed spurious binding sites could escape purifying selection if (A) they simply occur in regions of high nucleosome occupancy or (B) their surrounding chromatin was dynamically involved in their identity and function. We compared nucleosome occupancy and the presence/absence of functionally conserved chromatin context to the strength of selection against spurious binding of various TF binding motifs in Saccharomyces yeast. While we find no direct relationship with nucleosome occupancy, we find strong evidence that transcription factors spatially associated with evolutionarily conserved chromatin states are under relaxed selection against accidental binding. Transcription factors with and without a conserved chromatin context were found to occur, on average, at 87.7% and 49.3% of their expected frequencies, respectively. Functional binding motifs with conserved chromatin contexts were also significantly shorter in length and more often clustered. These results indicate a role of chromatin context dependency in relaxing selection against spurious binding in nearly half of all TF binding motifs throughout the yeast genome.
Affiliation(s)
- G A Babbitt
- School of Biological and Medical Sciences, Rochester Institute of Technology, USA.
|
32
|
Moineau S, Pandian S, Klaenhammer TR. Restriction/Modification systems and restriction endonucleases are more effective on lactococcal bacteriophages that have emerged recently in the dairy industry. Appl Environ Microbiol 1993; 59:197-202. [PMID: 16348842 PMCID: PMC202077 DOI: 10.1128/aem.59.1.197-202.1993] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Recently, eight lytic small isometric-headed bacteriophages were isolated from cheese-manufacturing plants throughout North America. The eight phages were different, but all propagated on one strain, Lactococcus lactis NCK203. On the basis of DNA homology, they were classified in the P335 species. Digestion of their genomes in vitro with restriction enzymes resulted in an unusually high number of type II endonuclease sites compared with the more common lytic phages of the 936 (small isometric-headed) and c2 (prolate-headed) species. In vivo, the P335 phages were more sensitive to four distinct lactococcal restriction and modification (R/M) systems than phages belonging to the 936 and c2 species. A significant correlation was found between the number of restriction sites for endonucleases (purified from other bacterial genera) and the relative susceptibility of phages to lactococcal R/M systems. Comparisons among these three phage species indicate that the P335 species may have emerged most recently in the dairy industry.
Affiliation(s)
- S Moineau
- Department of Food Science and Southeast Dairy Foods Research Center, North Carolina State University, Box 7624, Raleigh, North Carolina 27695-7624
|
33
|
Yang B, Peng Y, Leung HCM, Yiu SM, Chen JC, Chin FYL. Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers. BMC Bioinformatics 2010; 11 Suppl 2:S5. [PMID: 20406503 PMCID: PMC3165929 DOI: 10.1186/1471-2105-11-s2-s5] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND With the rapid development of genome sequencing techniques, traditional research methods based on the isolation and cultivation of microorganisms are being gradually replaced by metagenomics, which is also known as environmental genomics. The first step of metagenomics, which is still a major bottleneck, is the taxonomic characterization of DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred to as "binning". Existing binning methods are based on supervised or semi-supervised approaches which rely heavily on reference genomes of known microorganisms and phylogenetic marker genes. Due to the limited availability of reference genomes and the bias and instability of marker genes, existing binning methods may not be applicable in many cases. RESULTS In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in DNA fragments). From our experiments, we show that our method can accurately bin DNA fragments with various lengths and relative species abundance ratios without using any reference and training datasets. Another feature of our method is its error robustness. The binning accuracy decreases by less than 1% when the sequencing error rate increases from 0% to 5%. Note that the typical sequencing error rate of existing commercial sequencing platforms is less than 2%. CONCLUSIONS We provide a new and effective tool to solve the metagenome binning problem without using any reference datasets or marker information of any known reference genomes (species). The source code of our software tool, the reference genomes of the species for generating the test datasets and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/.
Affiliation(s)
- Bin Yang
- State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University, Nanjing, Jiangsu, 210096 PR China.
|
34
|
Zhai Z, Ku SY, Luan Y, Reinert G, Waterman MS, Sun F. The power of detecting enriched patterns: an HMM approach. J Comput Biol 2010; 17:581-92. [PMID: 20426691 PMCID: PMC3203519 DOI: 10.1089/cmb.2009.0218] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns. Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N_W(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N_W(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of N_W(n) for frequent patterns and compound Poisson approximation of N_W(n) for rare patterns are developed. Third, an easy-to-use web program is developed to calculate the power of detecting patterns and the program is used to study the power of detection in several interesting biological examples.
Affiliation(s)
- Zhiyuan Zhai
- School of Mathematics, Shandong University, Jinan, Shandong, P.R. China
- Shih-Yen Ku
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California
- Yihui Luan
- School of Mathematics, Shandong University, Jinan, Shandong, P.R. China
- Gesine Reinert
- Department of Statistics, Oxford University, Oxford, United Kingdom
- Michael S. Waterman
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California
- TNLIST/Department of Automation, Tsinghua University, Beijing, P.R. China
- Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California
- TNLIST/Department of Automation, Tsinghua University, Beijing, P.R. China
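The quantity at the centre of the abstract above, the number of possibly overlapping occurrences of a word W in a sequence of length n, can be made concrete with a small sketch. This is not the paper's method: the paper models the sequence with an HMM, whereas the expected count below assumes independent, identically distributed letters, a deliberate simplification. The function names and the toy sequence are hypothetical.

```python
from math import prod

def count_overlapping(seq, word):
    """Number of possibly overlapping occurrences of `word` in `seq`."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)

def expected_count_iid(n, word, probs):
    """Expected count of `word` in a length-n sequence under an i.i.d. letter
    model (simplifying assumption): (n - |W| + 1) * product of letter probabilities."""
    k = len(word)
    if n < k:
        return 0.0
    return (n - k + 1) * prod(probs[c] for c in word)

# Toy data (hypothetical): "AA" occurs at positions 0, 1, 2, 6 and 7
seq = "AAAACGAAA"
observed = count_overlapping(seq, "AA")
uniform = {c: 0.25 for c in "ACGT"}
expected = expected_count_iid(len(seq), "AA", uniform)
```

Comparing `observed` with `expected` is the starting point for deciding whether a word is over- or under-represented; judging significance additionally requires the variance or an approximating distribution, which is what the paper develops.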
|
35
|
Mining protein loops using a structural alphabet and statistical exceptionality. BMC Bioinformatics 2010; 11:75. [PMID: 20132552 PMCID: PMC2833150 DOI: 10.1186/1471-2105-11-75] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2009] [Accepted: 02/04/2010] [Indexed: 12/21/2022] Open
Abstract
Background Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites and catalytic pockets. However, the description of protein loops with conventional tools is not an easy task. Regular secondary structures, helices and strands, have been widely studied whereas loops, because they are highly variable in terms of sequence and structure, are difficult to analyze. Due to data sparsity, long loops have rarely been systematically studied. Results We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA allows the simplification of a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment, called structural letter. The difficult task of the structural grouping of huge data sets is thus easily accomplished by handling structural letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to the structural-letter sequence, named structural word. This approach permits a systematic analysis of loops of all sizes since we consider the structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop-lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but interestingly, two thirds are shared by short (less than 12 residues) and long loops. Moreover, half of recurrent motifs exhibit a significant level of amino-acid conservation with at least four significant positions and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints. Conclusions We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, it is the first time that pattern mining helps to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might make it possible to decrease the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.
|
36
|
Nuel G, Regad L, Martin J, Camproux AC. Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithms Mol Biol 2010; 5:15. [PMID: 20205909 PMCID: PMC2828453 DOI: 10.1186/1748-7188-5-15] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2009] [Accepted: 01/26/2010] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. RESULTS The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence. 
CONCLUSIONS Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements.
Affiliation(s)
- Gregory Nuel
- LSG, Laboratoire Statistique et Génome, CNRS UMR-8071, INRA UMR-1152, University of Evry, Evry, France
- CNRS, Paris, France
- MAP5, Department of Applied Mathematics, CNRS UMR-8145, University Paris Descartes, Paris, France
- Leslie Regad
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MTi, Molecules Thérapeutique in silico, INSERM UMRS-973, University Paris Diderot, Paris, France
- Juliette Martin
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MIG, Mathématique Informatique et Genome, INRA UR-1077, Jouy-en-Josas, France
- IBCP, Institut de Biologie et Chimie des Protéines, IFR 128, CNRS UMR 5086, University of Lyon 1, Lyon, France
- Anne-Claude Camproux
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MTi, Molecules Thérapeutique in silico, INSERM UMRS-973, University Paris Diderot, Paris, France
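The idea of embedding a pattern count into a Markov chain through a deterministic finite automaton, as described in the abstract above, can be sketched in a much reduced setting. The code below is not any of the paper's three algorithms: it assumes a single fixed word and i.i.d. letters (the paper handles Markov sources and complex patterns), and it combines independent sequences by plain convolution, which the paper's Algorithm 1 is specifically designed to avoid. All function names are hypothetical.

```python
def build_dfa(word, alphabet):
    """String-matching automaton: state q means the longest prefix of `word`
    that is a suffix of the text read so far has length q."""
    k = len(word)
    delta = [dict() for _ in range(k + 1)]
    for state in range(k + 1):
        for c in alphabet:
            s = word[:state] + c
            m = min(len(s), k)
            # shrink until the last m letters of s equal the first m of word
            while m > 0 and s[-m:] != word[:m]:
                m -= 1
            delta[state][c] = m
    return delta

def count_distribution(word, n, probs):
    """Exact P(N = c) for the number N of overlapping occurrences of `word`
    in an i.i.d. sequence of length n (i.i.d. is a simplifying assumption)."""
    alphabet = list(probs)
    k = len(word)
    delta = build_dfa(word, alphabet)
    max_count = max(n - k + 1, 0)
    # mass[state][count] = probability of that automaton state and hit count
    mass = [[0.0] * (max_count + 1) for _ in range(k + 1)]
    mass[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * (max_count + 1) for _ in range(k + 1)]
        for state in range(k + 1):
            for count, p in enumerate(mass[state]):
                if p == 0.0:
                    continue
                for c in alphabet:
                    nxt = delta[state][c]
                    hit = 1 if nxt == k else 0  # reaching state k completes a match
                    new[nxt][count + hit] += p * probs[c]
        mass = new
    return [sum(mass[s][c] for s in range(k + 1)) for c in range(max_count + 1)]

def convolve(a, b):
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# Toy set of two independent sequences of lengths 3 and 4 (hypothetical)
uniform = {c: 0.25 for c in "ACGT"}
dist3 = count_distribution("AA", 3, uniform)
total = convolve(dist3, count_distribution("AA", 4, uniform))
```

Treating the two sequences separately and then combining them keeps matches from spanning a sequence boundary, which is the edge effect the abstract attributes to the single-sequence concatenation approximation.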
|
37
|
Yokoyama KD, Ohler U, Wray GA. Measuring spatial preferences at fine-scale resolution identifies known and novel cis-regulatory element candidates and functional motif-pair relationships. Nucleic Acids Res 2009; 37:e92. [PMID: 19483094 PMCID: PMC2715254 DOI: 10.1093/nar/gkp423] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023] Open
Abstract
Transcriptional regulation is mediated by the collective binding of proteins called transcription factors to cis-regulatory elements. A handful of factors are known to function at particular distances from the transcription start site, although the extent to which this occurs is not well understood. Spatial dependencies can also exist between pairs of binding motifs, facilitating factor-pair interactions. We sought to determine to what extent spatial preferences measured at high-scale resolution could be utilized to predict cis-regulatory elements as well as motif-pairs binding interacting proteins. We introduce the ‘motif positional function’ model which predicts spatial biases using regression analysis, differentiating noise from true position-specific overrepresentation at single-nucleotide resolution. Our method predicts 48 consensus motifs exhibiting positional enrichment within human promoters, including fourteen motifs without known binding partners. We then extend the model to analyze distance preferences between pairs of motifs. We find that motif-pairs binding interacting factors often co-occur preferentially at multiple distances, with intervals between preferred distances often corresponding to the turn of the DNA double-helix. This offers a novel means by which to predict sequence elements with a collective role in gene regulation.
Affiliation(s)
- Ken Daigoro Yokoyama
- Biology Department, Institute for Genome Sciences and Policy, Duke University, Durham, NC 27708, USA
|
38
|
Tzahor S, Man-Aharonovich D, Kirkup BC, Yogev T, Berman-Frank I, Polz MF, Béjà O, Mandel-Gutfreund Y. A supervised learning approach for taxonomic classification of core-photosystem-II genes and transcripts in the marine environment. BMC Genomics 2009; 10:229. [PMID: 19445709 PMCID: PMC2696472 DOI: 10.1186/1471-2164-10-229] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2008] [Accepted: 05/16/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Cyanobacteria of the genera Synechococcus and Prochlorococcus play a key role in marine photosynthesis, which contributes to the global carbon cycle and to the world oxygen supply. Recently, genes encoding the photosystem II reaction center (psbA and psbD) were found in cyanophage genomes. This phenomenon suggested that the horizontal transfer of these genes may be involved in increasing phage fitness. To date, a very small percentage of marine bacteria and phages has been cultured. Thus, mapping genomic data extracted directly from the environment to its taxonomic origin is necessary for a better understanding of phage-host relationships and dynamics. RESULTS To achieve an accurate and rapid taxonomic classification, we employed a computational approach combining a multi-class Support Vector Machine (SVM) with a codon usage position specific scoring matrix (cuPSSM). Our method has been applied successfully to classify core-photosystem-II gene fragments, including partial sequences coming directly from the ocean, to seven different taxonomic classes. Applying the method on a large set of DNA and RNA psbA clones from the Mediterranean Sea, we studied the distribution of cyanobacterial psbA genes and transcripts in their natural environment. Using our approach, we were able to simultaneously examine taxonomic and ecological distributions in the marine environment. CONCLUSION The ability to accurately classify the origin of individual genes and transcripts coming directly from the environment is of great importance in studying marine ecology. The classification method presented in this paper could be applied further to classify other genes amplified from the environment, for which training data is available.
Affiliation(s)
- Shani Tzahor
- Faculty of Biology, Technion – Israel Institute of Technology, Haifa 32000, Israel
- Inter-Departmental Program for Biotechnology, Technion – Israel Institute of Technology, Haifa 32000, Israel
- Benjamin C Kirkup
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Tali Yogev
- Faculty of Life Sciences, Bar-Ilan University, Ramat Gan 52900, Israel
- Martin F Polz
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Oded Béjà
- Faculty of Biology, Technion – Israel Institute of Technology, Haifa 32000, Israel
|
39
|
Tzahor S, Man-Aharonovich D, Kirkup BC, Yogev T, Berman-Frank I, Polz MF, Béjà O, Mandel-Gutfreund Y. A supervised learning approach for taxonomic classification of core-photosystem-II genes and transcripts in the marine environment. BMC Genomics 2009. [PMID: 19445709 DOI: 10.1186/1471-2164-10-229.] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Cyanobacteria of the genera Synechococcus and Prochlorococcus play a key role in marine photosynthesis, which contributes to the global carbon cycle and to the world oxygen supply. Recently, genes encoding the photosystem II reaction center (psbA and psbD) were found in cyanophage genomes. This phenomenon suggested that the horizontal transfer of these genes may be involved in increasing phage fitness. To date, a very small percentage of marine bacteria and phages has been cultured. Thus, mapping genomic data extracted directly from the environment to its taxonomic origin is necessary for a better understanding of phage-host relationships and dynamics. RESULTS To achieve an accurate and rapid taxonomic classification, we employed a computational approach combining a multi-class Support Vector Machine (SVM) with a codon usage position specific scoring matrix (cuPSSM). Our method has been applied successfully to classify core-photosystem-II gene fragments, including partial sequences coming directly from the ocean, to seven different taxonomic classes. Applying the method on a large set of DNA and RNA psbA clones from the Mediterranean Sea, we studied the distribution of cyanobacterial psbA genes and transcripts in their natural environment. Using our approach, we were able to simultaneously examine taxonomic and ecological distributions in the marine environment. CONCLUSION The ability to accurately classify the origin of individual genes and transcripts coming directly from the environment is of great importance in studying marine ecology. The classification method presented in this paper could be applied further to classify other genes amplified from the environment, for which training data is available.
Affiliation(s)
- Shani Tzahor
- Faculty of Biology, Technion - Israel Institute of Technology, Haifa, Israel.
|
40
|
Pavlović-Lazetić GM, Mitić NS, Beljanski MV. n-Gram characterization of genomic islands in bacterial genomes. Comput Methods Programs Biomed 2009; 93:241-56. [PMID: 19101056 PMCID: PMC7185697 DOI: 10.1016/j.cmpb.2008.10.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 04/20/2008] [Revised: 09/10/2008] [Accepted: 10/21/2008] [Indexed: 05/27/2023]
Abstract
The paper presents a novel, n-gram-based method for analysis of bacterial genome segments known as genomic islands (GIs). Identification of GIs in bacterial genomes is an important task since many of them represent inserts that may contribute to bacterial evolution and pathogenesis. In order to characterize and distinguish GIs from the rest of the genome, a binary classification of islands based on n-gram frequency distributions has been performed. It consists of testing the agreement of the islands' n-gram frequency distributions with those of the complete genome and the backbone sequence. In addition, a statistic based on the maximal order Markov model is used to identify significantly overrepresented and underrepresented n-grams in islands. The results may be used as a basis for Zipf-like analysis suggesting that some of the n-grams are overrepresented in a subset of islands and underrepresented in the backbone, or vice versa, thus complementing the binary classification. The method is applied to strain-specific regions in the Escherichia coli O157:H7 EDL933 genome (O-islands), resulting in two groups of O-islands with different n-gram characteristics. It refines a characterization based on other compositional features such as G+C content and codon usage, and may help in the identification of GIs, and also in the research and development of adequate drugs targeting virulence genes in them.
|
41
|
Mitrophanov AY, Borodovsky M. Statistical significance in biological sequence analysis. Brief Bioinform 2008; 7:2-24. [PMID: 16761361 DOI: 10.1093/bib/bbk001] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022] Open
Abstract
One of the major goals of computational sequence analysis is to find sequence similarities, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Since the degree of similarity is usually assessed by the sequence alignment score, it is necessary to know if a score is high enough to indicate a biologically interesting alignment. A powerful approach to defining score cutoffs is based on the evaluation of the statistical significance of alignments. The statistical significance of an alignment score is frequently assessed by its P-value, which is the probability that this score or a higher one can occur simply by chance, given the probabilistic models for the sequences. In this review we discuss the general role of P-value estimation in sequence analysis, and give a description of theoretical methods and computational approaches to the estimation of statistical significance for important classes of sequence analysis problems. In particular, we concentrate on the P-value estimation techniques for single sequence studies (both score-based and score-free), global and local pairwise sequence alignments, multiple alignments, sequence-to-profile alignments and alignments built with hidden Markov models. We anticipate that the review will be useful both to researchers professionally working in bioinformatics and to biomedical scientists interested in using contemporary methods of DNA and protein sequence analysis.
|
42
|
Bakkali M. Genome dynamics of short oligonucleotides: the example of bacterial DNA uptake enhancing sequences. PLoS One 2007; 2:e741. [PMID: 17710141 PMCID: PMC1939737 DOI: 10.1371/journal.pone.0000741] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 06/05/2007] [Accepted: 06/29/2007] [Indexed: 11/19/2022] Open
Abstract
Among the many bacteria naturally competent for transformation by DNA uptake (a phenomenon with significant clinical and financial implications), Pasteurellaceae and Neisseriaceae species preferentially take up DNA containing specific short sequences. The genomic overrepresentation of these DNA uptake enhancing sequences (DUES) causes preferential uptake of conspecific DNA, but the function(s) behind this overrepresentation and its evolution are still a matter for discovery. Here I analyze DUES genome dynamics and evolution and test the validity of the results for other selectively constrained oligonucleotides. I use statistical methods and computer simulations to examine DUES accumulation in Haemophilus influenzae and Neisseria gonorrhoeae genomes. I analyze DUES sequence and nucleotide frequencies, as well as those of all their mismatched forms, and prove the dependence of DUES genomic overrepresentation on their preferential uptake by quantifying and correlating both characteristics. I then argue that mutation, uptake bias, and weak selection against DUESs in less constrained parts of the genome combined are sufficient to cause DUES accumulation in susceptible parts of the genome with no need for other DUES function. The distribution of overrepresentation values across sequences with different mismatch loads compared to the DUES suggests a gradual yet not linear molecular drive of DNA sequences depending on their similarity to the DUES. Other genomically overrepresented sequences, both pro- and eukaryotic, show a similar distribution of frequencies, suggesting that the molecular drive reported above applies to other frequent oligonucleotides. Rare oligonucleotides, however, seem to be gradually drawn to genomic underrepresentation, thus suggesting a molecular drag.
To my knowledge this work provides the first clear evidence of the gradual evolution of selectively constrained oligonucleotides, including repeated, palindromic and protein/transcription factor-binding DNAs.
Affiliation(s)
- Mohammed Bakkali
- Institute of Genetics, Queen's Medical Center, University of Nottingham, Nottingham, United Kingdom.
|
43
|
Pristas P, Piknova M. Underrepresentation of short palindromes in Selenomonas ruminantium DNA: evidence for horizontal gene transfer of restriction and modification systems? Can J Microbiol 2005; 51:315-8. [PMID: 15980893 DOI: 10.1139/w05-004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/22/2022]
Abstract
Molecular analysis of isolates of the rumen bacterium Selenomonas ruminantium revealed a high variety and frequency of site-specific (restriction) endonucleases. While all known S. ruminantium restriction and modification systems recognize hexanucleotide sequences only, consistently low counts of both 6-bp and 4-bp palindromes were found in DNA sequences of S. ruminantium. Statistical analysis indicated that there is some correlation between the degree of underrepresentation of tetranucleotide words and the number of known restriction endonucleases for a given sequence. Control analysis showed the same correlation in lambda DNA but not in human adenovirus DNA. Based on the data presented, it could be proposed that there is a much higher historical occurrence of restriction and modification systems in S. ruminantium and (or) frequent horizontal gene transfer of restriction and modification gene complexes.
Affiliation(s)
- Peter Pristas
- Institute of Animal Physiology, Slovak Academy of Sciences, Soltesovej 4-6, 04001 Kosice, Slovak Republic.
|
44
|
Abstract
Statistics on Markov chains are widely used for the study of patterns in biological sequences. Statistics on these models can be computed through several approaches. Gaussian approximations derived from the central limit theorem (CLT) are among the most popular. Unfortunately, in order to find a pattern of interest, these methods have to deal with tail distribution events, where the CLT performs especially poorly. In this paper, we propose a new approach based on large deviations theory to assess pattern statistics. We first recall theoretical results for empirical mean (level 1) as well as empirical distribution (level 2) large deviations on Markov chains. Then, we present the applications of these results, focusing on numerical issues. LD-SPatt is the name of the GPL software implementing these algorithms. We compare this approach to several existing ones in terms of complexity and reliability and show that the large deviations are more reliable than the Gaussian approximations in absolute values as well as in terms of ranking and are at least as reliable as compound Poisson approximations. Finally, we discuss some further possible improvements and applications of this new method.
Affiliation(s)
- G Nuel
- Laboratoire Statistique et Génome, Tour Evry 2, 523 place des Terrasses, 91034 Evry, France.
|
45
|
van Passel MWJ, Bart A, Waaijer RJA, Luyf ACM, van Kampen AHC, van der Ende A. An in vitro strategy for the selective isolation of anomalous DNA from prokaryotic genomes. Nucleic Acids Res 2004; 32:e114. [PMID: 15304543 PMCID: PMC514399 DOI: 10.1093/nar/gnh115] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Indexed: 12/13/2022] Open
Abstract
In sequenced genomes of prokaryotes, anomalous DNA (aDNA) can be recognized, among other features, by atypical clustering of dinucleotides. We hypothesized that atypical clustering of hexameric endonuclease recognition sites in aDNA allows the specific isolation of anomalous sequences in vitro. Clustering of endonuclease recognition sites in aDNA regions of eight published prokaryotic genome sequences was demonstrated. In silico digestion of the Neisseria meningitidis MC58 genome, using four selected endonucleases, revealed that of the 27 small fragments predicted (<5 kb), 21 were located in known genomic islands. Of the 24 calculated fragments (>300 bp and <5 kb), 22 met our criteria for aDNA, i.e. a high dinucleotide dissimilarity and/or aberrant GC content. The four enzymes also allowed the identification of aDNA fragments from the related Z2491 strain. Similarly, the sequenced genomes of three strains of Escherichia coli assessed by in silico digestion using XbaI yielded strain-specific sets of fragments of anomalous composition. In vitro applicability of the method was demonstrated by using adaptor-linked PCR, yielding the predicted fragments from the N. meningitidis MC58 genome. In conclusion, this strategy allows the selective isolation of aDNA from prokaryotic genomes by a simple restriction digest-amplification-cloning-sequencing scheme.
Affiliation(s)
- M W J van Passel
- Department of Medical Microbiology, Academic Medical Center, Amsterdam, The Netherlands
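The in silico digestion step described in the abstract above can be illustrated with a minimal Python sketch. It assumes a linear sequence and a single enzyme, and uses XbaI (recognition site TCTAGA, cut after the first T) as the example. The function names `digest` and `small_fragments` and the default size window are illustrative choices, not taken from the paper, which worked with circular genomes and several enzymes.

```python
def digest(seq, site="TCTAGA", cut_offset=1):
    """Cut a linear sequence at every occurrence of `site`.

    `cut_offset` is the number of bases of the site left on the upstream
    fragment (1 for XbaI, T^CTAGA). Returns (start, end) pairs, end exclusive.
    Assumption: the site is palindromic, so scanning one strand is enough.
    """
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


def small_fragments(fragments, min_len=300, max_len=5000):
    """Keep fragments whose length lies strictly between the two bounds."""
    return [(a, b) for a, b in fragments if min_len < b - a < max_len]


if __name__ == "__main__":
    toy = "AAAA" + "TCTAGA" + "GGGG" + "TCTAGA" + "CC"
    for a, b in digest(toy):
        print(a, b, toy[a:b])
```

Fragments that pass the size filter would then be screened for anomalous composition (for example GC content or dinucleotide usage), which this sketch does not attempt.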
|
46
|
Fuglsang A. The relationship between palindrome avoidance and intragenic codon usage variations: a Monte Carlo study. Biochem Biophys Res Commun 2004; 316:755-62. [PMID: 15033465 DOI: 10.1016/j.bbrc.2004.02.117] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Received: 02/03/2004] [Indexed: 10/26/2022]
Abstract
Several studies have shown that codon usage within genes varies, as it seems dependent on both codon context and codon position within the gene. Given that palindromes, in addition, are often avoided in genomes, this study aimed to find out whether intragenic variations in codon usage may be a way to control the number and location of palindromes. A Monte Carlo algorithm was written which resampled the codons in genes while keeping the amino acid sequence of the translation product constant. On the resampled sequences, palindromes were counted and their intragenic positions mapped. Escherichia coli K12 uses type II restriction-modification systems and displays pronounced codon usage phenomena. Using this as a reference organism, it was clearly shown that the number of palindromes in genes is generally lower than the number of palindromes in resampled genes; thus, the succession of codons seems to be a way to decrease the number of palindromes. The intragenic position of palindromes in resampled sequences, however, was largely equal to the position in the native genes, so codon usage phenomena are unlikely to be a way to control the intragenic position of palindromes. The analysis was repeated on two bacteriophages and gave similar results, even though the virus genomes are much smaller. Studies on the endosymbionts Buchnera sp. APS and Wigglesworthia sp., which seemingly have no type II restriction-modification systems, showed that in these species there is only weak evidence for codon usage acting to control the number of palindromes.
Affiliation(s)
- Anders Fuglsang
- Danish University of Pharmaceutical Sciences, Institute of Pharmacology, Copenhagen.
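The resampling idea in the abstract above (shuffle codons while keeping the protein constant, then count palindromes) can be sketched in Python as follows. This is one possible reading, not the author's algorithm: it assumes the standard genetic code, permutes the gene's own synonymous codons among the positions that encode the same amino acid, and counts words of length `k` that equal their reverse complement. All function names and defaults are illustrative.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def split_codons(cds):
    """Split an uppercase ACGT coding sequence into codons."""
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


def resample_codons(cds, rng):
    """Shuffle synonymous codons among positions of the same amino acid."""
    codons = split_codons(cds)
    positions_by_aa = defaultdict(list)
    for pos, codon in enumerate(codons):
        positions_by_aa[CODON_TABLE[codon]].append(pos)
    shuffled = codons[:]
    for positions in positions_by_aa.values():
        pool = [codons[p] for p in positions]
        rng.shuffle(pool)
        for p, codon in zip(positions, pool):
            shuffled[p] = codon
    return "".join(shuffled)


def count_palindromes(seq, k=4):
    """Count overlapping k-mers equal to their own reverse complement."""
    return sum(
        1
        for i in range(len(seq) - k + 1)
        if seq[i:i + k] == seq[i:i + k].translate(COMPLEMENT)[::-1]
    )


def palindrome_comparison(cds, n_iter=1000, k=4, seed=0):
    """Return (observed count, mean count over resampled sequences)."""
    rng = random.Random(seed)
    observed = count_palindromes(cds, k)
    resampled = [count_palindromes(resample_codons(cds, rng), k) for _ in range(n_iter)]
    return observed, sum(resampled) / n_iter
```

An observed count consistently below the resampled mean across many genes would point in the direction the abstract reports; a single short gene is far too little data to conclude anything.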
|
47
|
Mruk I, Cichowicz M, Kaczorowski T. Characterization of the LlaCI methyltransferase from Lactococcus lactis subsp. cremoris W15 provides new insights into the biology of type II restriction-modification systems. Microbiology 2004; 149:3331-3341. [PMID: 14600245 DOI: 10.1099/mic.0.26562-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022]
Abstract
The gene encoding the LlaCI methyltransferase (M.LlaCI) from Lactococcus lactis subsp. cremoris W15 was overexpressed in Escherichia coli. The enzyme was purified to apparent homogeneity using three consecutive steps of chromatography on phosphocellulose, blue-agarose and Superose 12HR, yielding a protein of M(r) 31 300+/-1000 under denaturing conditions. The exact position of the start codon AUG was determined by protein microsequencing. This enzyme recognizes the specific palindromic sequence 5'-AAGCTT-3'. Purified M.LlaCI was characterized. Unlike many other methyltransferases, M.LlaCI exists in solution predominantly as a dimer. It modifies the first adenine residue at the 5' end of the specific sequence to N(6)-methyladenine and thus is functionally identical to the corresponding methyltransferases of the HindIII (Haemophilus influenzae Rd) and EcoVIII (Escherichia coli E1585-68) restriction-modification systems. This is reflected in the identity of M.LlaCI with M.HindIII and M.EcoVIII noted at the amino acid sequence level (50 % and 62 %, respectively) and in the presence of nine sequence motifs conserved among N(6)-adenine beta-class methyltransferases. However, polyclonal antibodies raised against M.EcoVIII cross-reacted with M.LlaCI but not with M.HindIII. Restriction endonucleases require Mg(2+) for phosphodiester bond cleavage. Mg(2+) was shown to be a strong inhibitor of the M.LlaCI enzyme and its isospecific homologues. This observation suggests that sensitivity of the M.LlaCI to Mg(2+) may strengthen the restriction activity of the cognate endonuclease in the bacterial cell. Other biological implications of this finding are also discussed.
Affiliation(s)
- Iwona Mruk
- Department of Microbiology, University of Gdańsk, Kładki 24, 80-822 Gdańsk, Poland
- Magdalena Cichowicz
- Department of Microbiology, University of Gdańsk, Kładki 24, 80-822 Gdańsk, Poland
- Tadeusz Kaczorowski
- Department of Microbiology, University of Gdańsk, Kładki 24, 80-822 Gdańsk, Poland
|
48
|
Fuglsang A. Bias explorer: measurements of compositional bias in EMBL and GenBank sequence files. Antonie Van Leeuwenhoek 2004; 86:313-5. [PMID: 15702383 DOI: 10.1007/s10482-004-0353-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Abstract
A Windows application for compositional analysis of sequenced genomes (EMBL or GenBank flat files) is available as freeware. The application allows the user to quantify word bias using Markov chain analysis and it allows the user to generate sliding window data for GC-skew, AT-skew, purine excess, keto excess and discrete word counts. The mathematical routines reside in a dynamic link library (DLL), which can be used independently by other applications. The software is available for download at http://www.dfuni.dk/~anfu/Bioinformatics/Main.htm.
Affiliation(s)
- Anders Fuglsang
- Danish University of Pharmaceutical Sciences, Institute of Pharmacology, 2 Universitetsparken, DK-2100, Copenhagen Ø, Denmark.
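The sliding-window skew measures mentioned in the abstract above can be illustrated with a short Python sketch. It uses the common definitions GC skew = (G - C)/(G + C) and AT skew = (A - T)/(A + T); the function names, window size and step are illustrative assumptions and do not reproduce the application's own routines or its DLL interface.

```python
def skew(window, plus, minus):
    """(plus - minus) / (plus + minus), or 0.0 if neither base occurs."""
    p = window.count(plus)
    m = window.count(minus)
    return (p - m) / (p + m) if p + m else 0.0


def sliding_skew(seq, window=1000, step=500, plus="G", minus="C"):
    """Return (window start, skew) pairs for full windows along a linear sequence."""
    seq = seq.upper()
    return [
        (start, skew(seq[start:start + window], plus, minus))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    toy = "GGGGCCCC"
    print(sliding_skew(toy, window=4, step=4))                       # GC skew
    print(sliding_skew("AAAT", window=4, step=1, plus="A", minus="T"))  # AT skew
```

Passing `plus="A", minus="T"` gives AT skew with the same code; purine and keto excess would need a cumulative rather than windowed calculation and are left out here.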
|
49
|
Chew DSH, Choi KP, Heidner H, Leung MY. Palindromes in SARS and Other Coronaviruses. INFORMS J Comput 2004; 16:331-340. [PMID: 24966663 PMCID: PMC4066412 DOI: 10.1287/ijoc.1040.0087] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 05/11/2023]
Abstract
With the identification of a novel coronavirus associated with the severe acute respiratory syndrome (SARS), computational analysis of its RNA genome sequence is expected to give useful clues to help elucidate the origin, evolution, and pathogenicity of the virus. In this paper, we study the collective counts of palindromes in the SARS genome along with all the completely sequenced coronaviruses. Based on a Markov-chain model for the genome sequence, the mean and standard deviation for the number of palindromes at or above a given length are derived. These theoretical results are complemented by extensive simulations to provide empirical estimates. Using a z score obtained from these mathematical and empirical means and standard deviations, we have observed that palindromes of length four are significantly underrepresented in all the coronaviruses in our data set. In contrast, length-six palindromes are significantly underrepresented only in the SARS coronavirus. Two other features are unique to the SARS sequence. First, there is a length-22 palindrome TCTTTAACAAGCTTGTTAAAGA spanning positions 25962-25983. Second, there are two repeating length-12 palindromes TTATAATTATAA spanning positions 22712-22723 and 22796-22807. Some further investigations into possible biological implications of these palindrome features are proposed.
Affiliation(s)
- David S. H. Chew
- Department of Mathematics, National University of Singapore, Singapore 117543, Singapore
- Kwok Pui Choi
- Departments of Mathematics, and of Statistics and Applied Probability, National University of Singapore, Singapore 117543, Singapore
- Hans Heidner
- Department of Biology, University of Texas at San Antonio, San Antonio, Texas 78249, USA
- Ming-Ying Leung
- Department of Mathematical Sciences, University of Texas at El Paso, El Paso, Texas 79968, USA
|
50
|
Fuglsang A. Distribution of potential type II restriction sites (palindromes) in prokaryotes. Biochem Biophys Res Commun 2003; 310:280-5. [PMID: 14521907 DOI: 10.1016/j.bbrc.2003.09.014] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 12/14/2022]
Abstract
Restriction-modification systems are used as a defensive mechanism against inappropriate invasion of foreign DNA. The recognition sequences for the common type II restriction enzymes and their corresponding methylases are usually palindromes. In this study, we identified the most over- and underrepresented words in DNA of four bacteria: Escherichia coli, Bacillus subtilis, Clostridium perfringens, and Pseudomonas aeruginosa. Using maximum order Markov chain analysis, we found that palindromic words were most often more underrepresented than their non-palindromic counterparts. No strict rule for the intragenic palindrome content could be derived, but for three of the bacteria there was a weak correlation between codon usage bias and palindrome content. A clear drop in palindrome counts was observed in the Shine-Dalgarno region for B. subtilis and C. perfringens, but not in E. coli or P. aeruginosa. It was also shown that palindromes in eubacteria and archaebacteria seem to occur slightly more infrequently than expected on the basis of the genomic GC-content, but some exceptions to this principle exist.
Affiliation(s)
- Anders Fuglsang
- Institute of Pharmacology, Danish University of Pharmaceutical Sciences, Universitetsparken 2, DK-2100 Copenhagen Ø, Denmark.
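The maximum order Markov chain analysis named in the abstract above is commonly computed by estimating the expected count of a word w = w1...wk from its two (k-1)-letter sub-words and its (k-2)-letter core: E(w) = N(w1...wk-1) * N(w2...wk) / N(w2...wk-1). The Python sketch below follows that standard estimator on a single strand; it is an illustration under that assumption, not the paper's implementation, and the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_words(seq, k):
    """Count overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(seq, word):
    """Maximal-order Markov estimate: N(prefix) * N(suffix) / N(core)."""
    k = len(word)
    if k < 3:
        raise ValueError("word must have at least 3 letters")
    sub_words = count_words(seq, k - 1)
    cores = count_words(seq, k - 2)
    core = cores[word[1:-1]]
    if core == 0:
        return 0.0
    return sub_words[word[:-1]] * sub_words[word[1:]] / core


def obs_exp_ratio(seq, word):
    """Observed/expected ratio; values below 1 indicate underrepresentation."""
    observed = count_words(seq, len(word))[word]
    expected = expected_count(seq, word)
    return observed / expected if expected else float("nan")


def is_palindrome(word):
    """True if the word equals its own reverse complement."""
    return word == word.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    toy = "ACGTACGTACGT"
    print(expected_count(toy, "ACGT"), obs_exp_ratio(toy, "ACGT"), is_palindrome("ACGT"))
```

Comparing the ratios of palindromic and non-palindromic words of the same length across a genome is the kind of contrast the abstract describes; a significance measure such as a z-score would be needed on top of the raw ratio.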
|