1
Bernaola-Galván P, Carpena P, Gómez-Martín C, Oliver JL. Compositional Structure of the Genome: A Review. Biology 2023; 12:849. [PMID: 37372134 PMCID: PMC10295253 DOI: 10.3390/biology12060849] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 06/06/2023] [Accepted: 06/07/2023] [Indexed: 06/29/2023]
Abstract
As the genome carries the historical information of a species' biotic and environmental interactions, analyzing changes in genome structure over time by using powerful statistical physics methods (such as entropic segmentation algorithms, fluctuation analysis in DNA walks, or measures of compositional complexity) provides valuable insights into genome evolution. Nucleotide frequencies tend to vary along the DNA chain, resulting in a hierarchically patchy chromosome structure with heterogeneities at different length scales that range from a few nucleotides to tens of millions of them. Fluctuation analysis reveals that these compositional structures can be classified into three main categories: (1) short-range heterogeneities (below a few kilobase pairs (Kbp)) primarily attributed to the alternation of coding and noncoding regions, interspersed or tandem repeats densities, etc.; (2) isochores, spanning tens to hundreds of tens of Kbp; and (3) superstructures, reaching sizes of tens of megabase pairs (Mbp) or even larger. The obtained isochore and superstructure coordinates in the first complete T2T human sequence are now shared in a public database. In this way, interested researchers can use T2T isochore data, as well as the annotations for different genome elements, to check a specific hypothesis about genome structure. Similarly to other levels of biological organization, a hierarchical compositional structure is prevalent in the genome. Once the compositional structure of a genome is identified, various measures can be derived to quantify the heterogeneity of such structure. The distribution of segment G+C content has recently been proposed as a new genome signature that proves to be useful for comparing complete genomes. Another meaningful measure is the sequence compositional complexity (SCC), which has been used for genome structure comparisons. 
Lastly, we review the recent genome comparisons in species of the ancient phylum Cyanobacteria, conducted by phylogenetic regression of SCC against time, which have revealed positive trends towards higher genome complexity. These findings provide the first evidence for a driven progressive evolution of genome compositional structure.
Affiliation(s)
- Pedro Bernaola-Galván
  - Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain
- Pedro Carpena
  - Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain
- Cristina Gómez-Martín
  - Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
  - Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
- Jose L. Oliver
  - Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
2
de la Fuente R, Díaz-Villanueva W, Arnau V, Moya A. Genomic Signature in Evolutionary Biology: A Review. Biology 2023; 12:322. [PMID: 36829597 PMCID: PMC9953303 DOI: 10.3390/biology12020322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Revised: 02/11/2023] [Accepted: 02/13/2023] [Indexed: 02/19/2023]
Abstract
Organisms are unique physical entities in which information is stored and continuously processed. The digital nature of DNA sequences enables the construction of a dynamic information reservoir. However, the distinction between the hardware and software components in the information flow is crucial to identify the mechanisms generating specific genomic signatures. In this work, we perform a bibliometric analysis to identify the different purposes of looking for particular patterns in DNA sequences associated with a given phenotype. This study has enabled us to make a conceptual breakdown of the genomic signature and differentiate the leading applications. On the one hand, it refers to gene expression profiling associated with a biological function, which may be shared across taxa. This signature is the focus of study in precision medicine. On the other hand, it also refers to characteristic patterns in species-specific DNA sequences. This interpretation plays a key role in comparative genomics, identifying evolutionary relationships. Looking at the relevant studies in our bibliographic database, we highlight the main factors causing heterogeneities in genome composition and how they can be quantified. All these findings lead us to reformulate some questions relevant to evolutionary biology.
Affiliation(s)
- Rebeca de la Fuente
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Wladimiro Díaz-Villanueva
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Vicente Arnau
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Andrés Moya
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
  - Foundation for the Promotion of Sanitary and Biomedical Research of the Valencian Community (FISABIO), 46020 Valencia, Spain
  - CIBER in Epidemiology and Public Health (CIBEResp), 28029 Madrid, Spain
3
Greenberg G, Shomorony I. Improving bacterial genome assembly using a test of strand orientation. Bioinformatics 2022; 38:ii34-ii41. [PMID: 36124787 DOI: 10.1093/bioinformatics/btac516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/25/2022]
Abstract
Summary: The complexity of genome assembly is due in large part to the presence of repeats. In particular, large reverse-complemented repeats can lead to incorrect inversions of large segments of the genome. To detect and correct such inversions in finished bacterial genomes, we propose a statistical test based on tetranucleotide frequency (TNF), which determines whether two segments from the same genome are of the same or opposite orientation. In most cases, the test neatly partitions the genome into two segments of roughly equal length with seemingly opposite orientations. This corresponds to the segments between the DNA replication origin and terminus, which were previously known to have distinct nucleotide compositions. We show that, in several cases where this balanced partition is not observed, the test identifies a potential inverted misassembly, which is validated by the presence of a reverse-complemented repeat at the boundaries of the inversion. After inverting the sequence between the repeat, the balance of the misassembled genome is restored. Our method identifies 31 potential misassemblies in the NCBI database, several of which are further supported by a reassembly of the read data.
Availability and implementation: A GitHub repository is available at https://github.com/gcgreenberg/Oriented-TNF.git.
Supplementary information: Supplementary data are available at Bioinformatics online.
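To make the idea of a TNF-based orientation comparison concrete, here is a minimal sketch. It is an assumption-laden toy heuristic, not the statistical test proposed in the paper or the code in the linked repository: it simply asks whether a segment's tetranucleotide frequency vector is closer to another segment as given or to that segment's reverse complement. The names `tnf`, `revcomp`, and `same_orientation` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# All 256 tetranucleotides in a fixed order, so vectors are comparable.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def tnf(seq: str) -> list[float]:
    """Strand-specific tetranucleotide frequency vector (length 256)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)  # ignores k-mers containing N etc.
    if total == 0:
        raise ValueError("sequence contains no unambiguous 4-mers")
    return [counts[k] / total for k in KMERS]


def euclidean(p: list[float], q: list[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def same_orientation(seg_a: str, seg_b: str) -> bool:
    """Toy heuristic: True if seg_a's TNF is at least as close to seg_b
    as it is to the reverse complement of seg_b."""
    ta = tnf(seg_a)
    return euclidean(ta, tnf(seg_b)) <= euclidean(ta, tnf(revcomp(seg_b)))


a = "AAAAC" * 20
b = "AAAAC" * 15
print(same_orientation(a, b), same_orientation(a, revcomp(b)))
```

The key design point is that the frequency vector is strand-specific: many binning tools merge each k-mer with its reverse complement, which would erase exactly the orientation signal this kind of test relies on.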
Affiliation(s)
- Grant Greenberg
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
- Ilan Shomorony
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
4
Jiang Z, Li X, Guo L. MetaCRS: unsupervised clustering of contigs with the recursive strategy of reducing metagenomic dataset's complexity. BMC Bioinformatics 2022; 22:315. [PMID: 35045830 PMCID: PMC8772042 DOI: 10.1186/s12859-021-04227-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/23/2021] [Accepted: 06/01/2021] [Indexed: 01/02/2023]
Abstract
Background: Metagenomics technology can directly extract microbial genetic material from environmental samples to obtain sequencing reads, which can be further assembled into contigs through assembly tools. Contig clustering methods are subsequently applied to recover complete genomes from environmental samples. The main problem with current clustering methods is that they cannot recover more high-quality genes from complex environments. Firstly, there are multiple strains under the same species, resulting in the assembly of chimeras. Secondly, different strains under the same species are difficult to classify. Thirdly, it is difficult to determine the number of strains during the clustering process.
Results: In view of the shortcomings of current clustering methods, we propose an unsupervised clustering method that can improve the ability to recover genes from complex environments, and a new method for selecting the number of strains in a sample during the clustering process. The sequence composition characteristics (tetranucleotide frequency) and co-abundance are combined to train the probability model for clustering. A new recursive method that can continuously reduce the complexity of the samples is proposed to improve the ability to recover genes from complex environments. The new clustering method was tested on both simulated and real metagenomic datasets, and compared with five state-of-the-art methods: CONCOCT, Maxbin2.0, MetaBAT, MyCC and COCACOLA. In terms of the number and quality of recovered genes from metagenomic datasets, the results show that our proposed method is more effective.
Conclusions: A new contig clustering method is proposed, which can recover more high-quality genes from complex environmental samples.
Affiliation(s)
- Zhongjun Jiang
  - College of Information Science and Technology, Ningbo University, Ningbo, 315211, China
- Xiaobo Li
  - College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua, 321004, China
  - College of Engineering, Lishui University, Lishui, 323000, China
- Lijun Guo
  - College of Information Science and Technology, Ningbo University, Ningbo, 315211, China
5
Kirzhner V, Toledano-Kitai D, Volkovich Z. Evaluating the number of different genomes in a metagenome by means of the compositional spectra approach. PLoS One 2020; 15:e0237205. [PMID: 33156862 PMCID: PMC7647110 DOI: 10.1371/journal.pone.0237205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/19/2020] [Accepted: 10/22/2020] [Indexed: 01/02/2023]
Abstract
Determination of metagenome composition is still one of the most interesting problems of bioinformatics. It involves a wide range of mathematical methods, from probabilistic models of combinatorics to cluster analysis and pattern recognition techniques. The successful advance of rapid sequencing methods and fast and precise metagenome analysis will increase the diagnostic value of healthy or pathological human metagenomes. The article presents the theoretical foundations of the algorithm for calculating the number of different genomes in the medium under study. The approach is based on analysis of the compositional spectra of subsequently sequenced samples of the medium. Its essential feature is using random fluctuations in the bacteria number in different samples of the same metagenome. The possibility of effective implementation of the algorithm in the presence of data errors is also discussed. In the work, the algorithm of a metagenome evaluation is described, including the estimation of the genome number and the identification of the genomes with known compositional spectra. It should be emphasized that evaluating the genome number in a metagenome can be always helpful, regardless of the metagenome separation techniques, such as clustering the sequencing results or marker analysis.
Affiliation(s)
- Valery Kirzhner
  - Institute of Evolution, University of Haifa, Haifa, Israel
- Dvora Toledano-Kitai
  - Software Engineering Department, ORT Braude College of Engineering, Karmiel, Israel
- Zeev Volkovich
  - Software Engineering Department, ORT Braude College of Engineering, Karmiel, Israel
6
Abe T, Akazawa Y, Toyoda A, Niki H, Baba T. Batch-Learning Self-Organizing Map Identifies Horizontal Gene Transfer Candidates and Their Origins in Entire Genomes. Front Microbiol 2020; 11:1486. [PMID: 32719664 PMCID: PMC7350273 DOI: 10.3389/fmicb.2020.01486] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 02/05/2020] [Accepted: 06/08/2020] [Indexed: 02/05/2023]
Abstract
Horizontal gene transfer (HGT) has been widely suggested to play a critical role in the environmental adaptation of microbes; however, the number and origin of the genes in microbial genomes obtained through HGT remain unknown as the frequency of detected HGT events is generally underestimated, particularly in the absence of information on donor sequences. As an alternative to phylogeny-based methods that rely on sequence alignments, we have developed an alignment-free clustering method on the basis of an unsupervised neural network “Batch-Learning Self-Organizing Map (BLSOM)” in which sequence fragments are clustered based solely on oligonucleotide similarity without taxonomical information, to detect HGT candidates and their origin in entire genomes. By mapping the microbial genomic sequences on large-scale BLSOMs constructed with nearly all prokaryotic genomes, HGT candidates can be identified, and their origin assigned comprehensively, even for microbial genomes that exhibit high novelty. By focusing on two types of Alphaproteobacteria, specifically psychrotolerant Sphingomonas strains from an Antarctic lake, we detected HGT candidates using BLSOM and found higher proportions of HGT candidates from organisms belonging to Betaproteobacteria in the genomes of these two Antarctic strains compared with those of continental strains. Further, an origin difference was noted in the HGT candidates found in the two Antarctic strains. Although their origins were highly diversified, gene functions related to the cell wall or membrane biogenesis were shared among the HGT candidates. Moreover, analyses of amino acid frequency suggested that housekeeping genes and some HGT candidates of the Antarctic strains exhibited different characteristics to other continental strains. Lys, Ser, Thr, and Val were the amino acids found to be increased in the Antarctic strains, whereas Ala, Arg, Glu, and Leu were decreased. 
Our findings strongly suggest a low-temperature adaptation process for microbes that may have arisen convergently as an independent evolutionary strategy in each Antarctic strain. Hence, BLSOM analysis could serve as a powerful tool in not only detecting HGT candidates and their origins in entire genomes, but also in providing novel perspectives into the environmental adaptations of microbes.
Affiliation(s)
- Takashi Abe
  - Department of Information Engineering, Faculty of Engineering, Niigata University, Niigata, Japan
- Yu Akazawa
  - Department of Information Engineering, Faculty of Engineering, Niigata University, Niigata, Japan
- Atsushi Toyoda
  - Comparative Genomics Laboratory, National Institute of Genetics, Mishima, Japan
  - Advanced Genomics Center, National Institute of Genetics, Mishima, Japan
- Hironori Niki
  - Microbial Physiology Laboratory, National Institute of Genetics, Mishima, Japan
- Tomoya Baba
  - Advanced Genomics Center, National Institute of Genetics, Mishima, Japan
  - Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Tokyo, Japan
7
Tokuda M, Suzuki H, Yanagiya K, Yuki M, Inoue K, Ohkuma M, Kimbara K, Shintani M. Determination of Plasmid pSN1216-29 Host Range and the Similarity in Oligonucleotide Composition Between Plasmid and Host Chromosomes. Front Microbiol 2020; 11:1187. [PMID: 32582111 PMCID: PMC7296055 DOI: 10.3389/fmicb.2020.01187] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 03/14/2020] [Accepted: 05/11/2020] [Indexed: 12/17/2022]
Abstract
Plasmids are extrachromosomal DNA that can be horizontally transferred between different bacterial cells by conjugation. Horizontal gene transfer of plasmids can promote rapid evolution and adaptation of bacteria by imparting various traits involved in antibiotic resistance, virulence, and metabolism to their hosts. The host range of plasmids is an important feature for understanding how they spread in environmental microbial communities. Earlier bioinformatics studies have demonstrated that plasmids are likely to have similar oligonucleotide (k-mer) compositions to their host chromosomes and that evolutionary host ranges of plasmids could be predicted from this similarity. However, there are no complementary studies to assess the consistency between the predicted evolutionary host range and the experimentally determined replication/transfer host range of a plasmid. In the present study, the replication/transfer host range of a model plasmid, pSN1216-29, exogenously isolated from cow manure as a newly discovered self-transmissible plasmid, was experimentally determined within microbial communities extracted from soil and cow manure. In silico prediction of the evolutionary host range of pSN1216-29 was performed independently, using its oligonucleotide compositions. The results showed that the oligonucleotide compositions of pSN1216-29 were more similar to those of hosts (transconjugant genera) than to those of non-hosts (other genera). These findings can contribute to the understanding of how plasmids behave in microbial communities, and aid in designing appropriate plasmid vectors for different bacteria.
Affiliation(s)
- Maho Tokuda
  - Applied Chemistry and Biochemical Engineering Course, Department of Engineering, Graduate School of Integrated Science and Technology, Shizuoka University, Shizuoka, Japan
- Haruo Suzuki
  - Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan
  - Faculty of Environment and Information Studies, Keio University, Fujisawa, Japan
- Kosuke Yanagiya
  - Applied Chemistry and Biochemical Engineering Course, Department of Engineering, Graduate School of Integrated Science and Technology, Shizuoka University, Shizuoka, Japan
- Masahiro Yuki
  - Japan Collection of Microorganisms, RIKEN BioResource Research Center, Tsukuba, Japan
- Kengo Inoue
  - Faculty of Agriculture, University of Miyazaki, Miyazaki, Japan
- Moriya Ohkuma
  - Japan Collection of Microorganisms, RIKEN BioResource Research Center, Tsukuba, Japan
- Kazuhide Kimbara
  - Applied Chemistry and Biochemical Engineering Course, Department of Engineering, Graduate School of Integrated Science and Technology, Shizuoka University, Shizuoka, Japan
- Masaki Shintani
  - Applied Chemistry and Biochemical Engineering Course, Department of Engineering, Graduate School of Integrated Science and Technology, Shizuoka University, Shizuoka, Japan
  - Japan Collection of Microorganisms, RIKEN BioResource Research Center, Tsukuba, Japan
  - Research Institute of Green Science and Technology, Shizuoka University, Shizuoka, Japan
8
Zhou Y, Zhang W, Wu H, Huang K, Jin J. A high-resolution genomic composition-based method with the ability to distinguish similar bacterial organisms. BMC Genomics 2019; 20:754. [PMID: 31638897 PMCID: PMC6805505 DOI: 10.1186/s12864-019-6119-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/26/2019] [Accepted: 09/20/2019] [Indexed: 12/03/2022]
Abstract
Background: Genomic composition has been found to be species specific and is used to differentiate bacterial species. To date, almost no published composition-based approaches are able to distinguish between most closely related organisms, including intra-genus species and intra-species strains. Thus, it is necessary to develop a novel approach to address this problem.
Results: Here, we initially determine that the “tetranucleotide-derived z-value Pearson correlation coefficient” (TETRA) approach is representative of other published statistical methods. Then, we devise a novel method called “Tetranucleotide-derived Z-value Manhattan Distance” (TZMD) and compare it with the TETRA approach. Our results show that TZMD reflects the maximal genome difference, while TETRA does not in most conditions, demonstrating in theory that TZMD provides improved resolution. Additionally, our analysis of real data shows that TZMD improves species differentiation and clearly differentiates similar organisms, including similar species belonging to the same genospecies, subspecies and intraspecific strains, most of which cannot be distinguished by TETRA. Furthermore, TZMD is able to determine clonal strains with the TZMD = 0 criterion, which intrinsically encompasses identical composition, high average nucleotide identity and high percentage of shared genomes.
Conclusions: Our extensive assessment demonstrates that TZMD has high resolution. This study is the first to propose a composition-based method for differentiating bacteria at the strain level and to demonstrate that composition is also strain specific. TZMD is a powerful tool and the first easy-to-use approach for differentiating clonal and non-clonal strains. Therefore, as the first composition-based algorithm for strain typing, TZMD will facilitate bacterial studies in the future.
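One way to see why a Manhattan distance can resolve differences that a Pearson correlation misses is to compare the two on a pair of toy z-value vectors. The sketch below assumes a plain, unnormalized Manhattan distance and uses made-up four-element vectors in place of real 256-element tetranucleotide z-value profiles; the paper's exact TZMD definition and z-value computation are not reproduced here.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def manhattan(x: list[float], y: list[float]) -> float:
    """Sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical z-value profiles: z_b is z_a scaled by a factor of 2.
z_a = [1.0, -2.0, 0.5, 3.0]
z_b = [2.0, -4.0, 1.0, 6.0]

print(pearson(z_a, z_b))    # correlation is insensitive to scaling
print(manhattan(z_a, z_b))  # distance still registers the difference
print(manhattan(z_a, z_a))  # identical profiles give a distance of zero
```

Because correlation is invariant to shifting and rescaling, two profiles can be perfectly correlated while differing in magnitude at every position; a distance of zero, by contrast, is only reached when the profiles are identical, which matches the abstract's use of a zero-distance criterion for clonal strains.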
Affiliation(s)
- Yizhuang Zhou
  - Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
  - Peking-Tsinghua Center for Life Science, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, People's Republic of China
- Wenting Zhang
  - Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Huixian Wu
  - China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
  - Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Kai Huang
  - Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
  - China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
  - Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Junfei Jin
  - Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
  - China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
  - Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
9
Celis JS, Wibberg D, Ramírez-Portilla C, Rupp O, Sczyrba A, Winkler A, Kalinowski J, Wilke T. Binning enables efficient host genome reconstruction in cnidarian holobionts. Gigascience 2018; 7:5039706. [PMID: 29917104 PMCID: PMC6049006 DOI: 10.1093/gigascience/giy075] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 02/27/2018] [Accepted: 06/14/2018] [Indexed: 12/19/2022]
Abstract
Background: Many cnidarians, including stony corals, engage in complex symbiotic associations comprising the eukaryotic host, photosynthetic algae, and highly diverse microbial communities, together referred to as the holobiont. This taxonomic complexity makes sequencing and assembling coral host genomes extremely challenging. Therefore, previous cnidarian genomic projects were based on symbiont-free tissue samples. However, this approach may not be applicable to the majority of cnidarian species for ecological reasons. We therefore evaluated the performance of an alternative method based on sequence binning for reconstructing the genome of the stony coral Porites rus from a hologenomic sample and compared it to traditional approaches.
Results: Our results demonstrate that binning performs well for hologenomic data, producing sufficient reads for assembling the draft genome of P. rus. An assembly evaluation based on operational criteria showed results that were comparable to symbiont-free approaches in terms of completeness and usefulness, despite a high degree of fragmentation in our assembly. In addition, we found that binning provides sufficient data for exploratory k-mer estimation of genomic features, such as genome size and heterozygosity.
Conclusions: Binning constitutes a powerful approach for disentangling taxonomically complex coral hologenomes. Considering the recent decline of coral reefs on the one hand and previous limitations to coral genome sequencing on the other hand, binning may facilitate rapid and reliable genome assembly. This study also provides an important milestone in advancing binning from the metagenomic to the hologenomic and from the prokaryotic to the eukaryotic level.
Affiliation(s)
- Juan Sebastián Celis
  - Animal Ecology and Systematics, Justus Liebig University Giessen. Heinrich-Buff-Ring 26-32 (IFZ), 35392 Giessen, Germany
  - Corporation Center of Excellence in Marine Sciences, Cra 54 No 106-18, Bogotá, Colombia
- Daniel Wibberg
  - Center for Biotechnology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany
- Catalina Ramírez-Portilla
  - Animal Ecology and Systematics, Justus Liebig University Giessen. Heinrich-Buff-Ring 26-32 (IFZ), 35392 Giessen, Germany
  - Evolutionary Biology and Ecology, Université libre de Bruxelles, Av. Franklin D. Roosevelt 50, CP 160/12, B-1050 Brussels, Belgium
- Oliver Rupp
  - Bioinformatics and Systems Biology, Justus Liebig University Giessen, Heinrich-Buff-Ring 58, 35392 Giessen, Germany
- Alexander Sczyrba
  - Center for Biotechnology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany
- Anika Winkler
  - Center for Biotechnology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany
- Jörn Kalinowski
  - Center for Biotechnology, Bielefeld University, Universitätsstraße 27, 33615 Bielefeld, Germany
- Thomas Wilke
  - Animal Ecology and Systematics, Justus Liebig University Giessen. Heinrich-Buff-Ring 26-32 (IFZ), 35392 Giessen, Germany
  - Corporation Center of Excellence in Marine Sciences, Cra 54 No 106-18, Bogotá, Colombia
10
Serrano-Solís V, Toscano Soares PE, de Farías ST. Genomic Signatures Among Acanthamoeba polyphaga Entoorganisms Unveil Evidence of Coevolution. J Mol Evol 2018; 87:7-15. [PMID: 30456441 DOI: 10.1007/s00239-018-9877-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/06/2018] [Accepted: 11/09/2018] [Indexed: 11/30/2022]
Abstract
The definition of a genomic signature (GS) is "the total net response to selective pressure". Recent isolation and sequencing of naturally occurring organisms, hereby named entoorganisms, within Acanthamoeba polyphaga, raised the hypothesis of a common genomic signature despite their diverse and unrelated evolutionary origin. Widely accepted and implemented tests for GS detection are oligonucleotide relative frequencies (OnRF) and relative codon usage (RCU) surveys. A common pattern and strong correlations were unveiled from OnRFs among A. polyphaga's Mimivirus and virophage Sputnik. RCU showed a common A-T bias at third codon position. We expanded tests to the amoebal mitochondrial genome and amoeba-resistant bacteria, achieving strikingly coherent results to the aforementioned viral analyses. The GSs in these entoorganisms of diverse evolutionary origin are coevolutionarily conserved within an intracellular environment that provides sanctuary for species of ecological and biomedical relevance.
Affiliation(s)
- Víctor Serrano-Solís
  - Laboratório de Genética Evolutiva Paulo Leminsk, Departamento de Biologia Molecular, Centro de Ciencias Exatas e da Natureza, Universidade Federal da Paraíba, João Pessoa, Brazil
- Paulo Eduardo Toscano Soares
  - Laboratório de Genética Evolutiva Paulo Leminsk, Departamento de Biologia Molecular, Centro de Ciencias Exatas e da Natureza, Universidade Federal da Paraíba, João Pessoa, Brazil
- Sávio T de Farías
  - Laboratório de Genética Evolutiva Paulo Leminsk, Departamento de Biologia Molecular, Centro de Ciencias Exatas e da Natureza, Universidade Federal da Paraíba, João Pessoa, Brazil
11
MetLab: An In Silico Experimental Design, Simulation and Analysis Tool for Viral Metagenomics Studies. PLoS One 2016; 11:e0160334. [PMID: 27479078 PMCID: PMC4968819 DOI: 10.1371/journal.pone.0160334] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 03/24/2016] [Accepted: 07/18/2016] [Indexed: 02/07/2023]
Abstract
Metagenomics, the sequence characterization of all genomes within a sample, is widely used as a virus discovery tool as well as a tool to study viral diversity of animals. Metagenomics can be considered to have three main steps: sample collection and preparation, sequencing, and finally bioinformatics. Bioinformatic analysis of metagenomic datasets is in itself a complex process, involving few standardized methodologies, thereby hampering comparison of metagenomics studies between research groups. In this publication the new bioinformatics framework MetLab is presented, aimed at providing scientists with an integrated tool for experimental design and analysis of viral metagenomes. MetLab provides support in designing the metagenomics experiment by estimating the sequencing depth needed for the complete coverage of a species. This is achieved by applying a methodology to calculate the probability of coverage using an adaptation of Stevens’ theorem. It also provides scientists with several pipelines aimed at simplifying the analysis of viral metagenomes, including quality control, assembly and taxonomic binning. We also implement a tool for simulating metagenomics datasets from several sequencing platforms. The overall aim is to provide virologists with an easy-to-use tool for designing, simulating and analyzing viral metagenomes. The results presented here include a benchmark against other existing software, with emphasis on detection of viruses as well as speed of applications. MetLab is packaged as comprehensive software, readily available for Linux and OSX users at https://github.com/norling/metlab.
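The coverage calculation mentioned in this abstract can be illustrated with the classical form of Stevens' theorem, which gives the probability that n arcs of equal length, placed uniformly at random on a circle, cover it completely. The sketch below implements only that textbook formula for a circular genome and equal-length reads; it is not MetLab's adaptation, and the function name `stevens_coverage_probability` is a hypothetical helper.

```python
from math import comb, floor


def stevens_coverage_probability(n_arcs: int, arc_fraction: float) -> float:
    """Probability that n_arcs random arcs, each spanning arc_fraction of
    the circumference, fully cover a circle (classical Stevens formula):

        P = sum_{k=0}^{min(n, floor(1/a))} (-1)^k C(n, k) (1 - k a)^(n - 1)
    """
    if n_arcs < 1 or arc_fraction <= 0:
        return 0.0
    if arc_fraction >= 1:
        return 1.0  # a single arc already spans the whole circle
    k_max = min(n_arcs, floor(1 / arc_fraction))
    return sum(
        (-1) ** k * comb(n_arcs, k) * (1 - k * arc_fraction) ** (n_arcs - 1)
        for k in range(k_max + 1)
    )


# Three random half-circle arcs cover the circle with probability 1/4.
print(stevens_coverage_probability(3, 0.5))

# Coverage probability rises with the number of reads (arcs).
for n in (10, 20, 50):
    print(n, stevens_coverage_probability(n, 0.2))
```

The alternating sum loses floating-point precision when the number of arcs is large and each arc is short, so a practical depth estimate for real read counts would need exact rational arithmetic or an approximation; this sketch is only suitable for small illustrative values.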
12
Kang DD, Rubin EM, Wang Z. Reconstructing single genomes from complex microbial communities. it - Information Technology 2016. [DOI: 10.1515/itit-2016-0011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 02/07/2023]
Abstract
High throughput next generation sequencing technologies have enabled cultivation-independent approaches to study microbial communities in environmental samples. To date much of functional metagenomics has been limited to the gene or pathway level. Recent breakthroughs in metagenome binning have made it feasible to reconstruct high quality, individual microbial genomes from complex communities with thousands of species. In this review we aim to compare several automated metagenome binning software tools for their performance, and provide a practical guide for the metagenomics research community to carry out successful binning analyses.
Affiliation(s)
- Dongwan D. Kang
- Joint Genome Institute, Lawrence Berkeley National Laboratory, DOE, Walnut Creek, CA 94598, USA
- Edward M. Rubin
- Joint Genome Institute, Lawrence Berkeley National Laboratory, DOE, Walnut Creek, CA 94598, USA
13
Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 2015; 3:e1165. [PMID: 26336640 PMCID: PMC4556158 DOI: 10.7717/peerj.1165] [Citation(s) in RCA: 1185] [Impact Index Per Article: 118.5] [Received: 04/02/2015] [Accepted: 07/17/2015] [Indexed: 12/19/2022]
Abstract
Grouping large genomic fragments assembled from shotgun metagenomic sequences to deconvolute complex microbial communities, or metagenome binning, enables the study of individual organisms and their interactions. Because of the complex nature of these communities, existing metagenome binning methods often miss a large number of microbial species. In addition, most of the tools are not scalable to large datasets. Here we introduce automated software called MetaBAT that integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. MetaBAT outperforms alternative methods in accuracy and computational efficiency on both synthetic and real metagenome datasets. It automatically forms hundreds of high quality genome bins on a very large assembly consisting of millions of contigs in a matter of hours on a single node. MetaBAT is open source software and available at https://bitbucket.org/berkeleylab/metabat.
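To make the tetranucleotide frequency signal concrete, here is a minimal sketch that turns a contig into a 256-dimensional frequency vector and compares two vectors. The plain Euclidean distance is a stand-in chosen for illustration; MetaBAT itself uses empirical probabilistic distances that the abstract does not specify, and the function names here are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order, so vectors are comparable.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(contig):
    """Return the 256 tetranucleotide frequencies of a contig."""
    seq = contig.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases such as N are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        raise ValueError("contig has no unambiguous tetranucleotides")
    return [counts[t] / total for t in TETRAMERS]


def euclidean_distance(u, v):
    """Illustrative distance between two frequency vectors (an assumption)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    f = tetranucleotide_frequencies("ACGTACGTACGT")
    g = tetranucleotide_frequencies("AAAAAAAACCCC")
    print(euclidean_distance(f, g))
```

A binning tool would combine a composition distance of this kind with per-sample abundance information before clustering contigs.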
Affiliation(s)
- Dongwan D Kang
- Department of Energy Joint Genome Institute, Walnut Creek, CA, USA; Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Jeff Froula
- Department of Energy Joint Genome Institute, Walnut Creek, CA, USA; Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Rob Egan
- Department of Energy Joint Genome Institute, Walnut Creek, CA, USA; Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Zhong Wang
- Department of Energy Joint Genome Institute, Walnut Creek, CA, USA; Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; School of Natural Sciences, University of California at Merced, Merced, CA, USA
14
Yin C, Yau SST. An improved model for whole genome phylogenetic analysis by Fourier transform. J Theor Biol 2015; 382:99-110. [PMID: 26151589 DOI: 10.1016/j.jtbi.2015.06.033] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 03/03/2015] [Revised: 06/19/2015] [Accepted: 06/22/2015] [Indexed: 01/07/2023]
Abstract
DNA sequence similarity comparison is one of the major steps in computational phylogenetic studies. The sequence comparison of closely related DNA sequences and genomes is usually performed by multiple sequence alignments (MSA). While the MSA method is accurate for some types of sequences, it may produce incorrect results when DNA sequences have undergone rearrangements, as in many bacterial and viral genomes. It is also limited by its computational complexity for comparing large volumes of data. Previously, we proposed an alignment-free method that exploits the full information contents of DNA sequences by Discrete Fourier Transform (DFT), but still with some limitations. Here, we present a significantly improved method for the similarity comparison of DNA sequences by DFT. In this method, we map DNA sequences into 2-dimensional (2D) numerical sequences and then apply DFT to transform the 2D numerical sequences into frequency domain. In the 2D mapping, the nucleotide composition of a DNA sequence is a determinant factor and the 2D mapping reduces the nucleotide composition bias in distance measure, thus improving the similarity measure of DNA sequences. To compare the DFT power spectra of DNA sequences with different lengths, we propose an improved even scaling algorithm to extend shorter DFT power spectra to the longest length of the underlying sequences. After the DFT power spectra are evenly scaled, the spectra are in the same dimensionality of the Fourier frequency space, then the Euclidean distances of full Fourier power spectra of the DNA sequences are used as the dissimilarity metrics. The improved DFT method, with increased computational performance by 2D numerical representation, can be applicable to any DNA sequences of different length ranges. We assess the accuracy of the improved DFT similarity measure in hierarchical clustering of different DNA sequences including simulated and real datasets. The method yields accurate and reliable phylogenetic trees and demonstrates that the improved DFT dissimilarity measure is an efficient and effective similarity measure of DNA sequences. Due to its high efficiency and accuracy, the proposed DFT similarity measure is successfully applied on phylogenetic analysis for individual genes and large whole bacterial genomes.
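The pipeline described in this abstract (numerical mapping, DFT power spectrum, scaling to a common length, Euclidean distance) can be sketched roughly as follows. The base-to-coordinate mapping and the linear-interpolation `stretch` step are assumptions made for illustration; the abstract does not give the authors' 2D mapping or the details of their even scaling algorithm.

```python
import cmath
from math import floor, sqrt

# Hypothetical 2D mapping, with each point (x, y) encoded as x + iy.
MAPPING = {"A": 0 - 1j, "T": 0 + 1j, "C": -1 + 0j, "G": 1 + 0j}


def power_spectrum(seq):
    """DFT power spectrum |X(k)|^2 of the encoded sequence (naive O(n^2))."""
    x = [MAPPING[b] for b in seq.upper()]
    n = len(x)
    spectrum = []
    for k in range(n):
        xk = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(xk) ** 2)
    return spectrum


def stretch(spectrum, m):
    """Linearly interpolate a spectrum of length n onto m >= n points."""
    n = len(spectrum)
    out = []
    for k in range(m):
        q = k * (n - 1) / (m - 1) if m > 1 else 0.0
        r = floor(q)
        nxt = spectrum[min(r + 1, n - 1)]
        out.append(spectrum[r] + (q - r) * (nxt - spectrum[r]))
    return out


def dft_distance(seq_a, seq_b):
    """Euclidean distance between power spectra scaled to the longer length."""
    pa, pb = power_spectrum(seq_a), power_spectrum(seq_b)
    m = max(len(pa), len(pb))
    pa, pb = stretch(pa, m), stretch(pb, m)
    return sqrt(sum((a - b) ** 2 for a, b in zip(pa, pb)))


if __name__ == "__main__":
    print(dft_distance("ACGTACGTAC", "ACGGTTCA"))
```

The naive transform is only practical for short sequences; a genome-scale version would rely on a fast Fourier transform.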
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, USA
- Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
15
Necessary relations for nucleotide frequencies. J Theor Biol 2015; 374:179-82. [PMID: 25843217 DOI: 10.1016/j.jtbi.2015.03.025] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/11/2014] [Revised: 02/01/2015] [Accepted: 03/21/2015] [Indexed: 11/21/2022]
Abstract
Genome composition analysis of di-, tri- and tetra-nucleotide frequencies is known to be evolutionarily informative, and useful in metagenomic studies, where binning of raw sequence data is often an important first step. Patterns appearing in genome composition analysis may be due to evolutionary processes or purely mathematical relations. For example, the total number of dinucleotides in a sequence is equal to the sum of the individual totals of the sixteen types of dinucleotide, and this is entirely independent of any assumptions made regarding mutation or selection, or indeed any physical or chemical process. Before any statistical analysis can be attempted, a knowledge of all necessary mathematical relations is required. I show that 25% of di-, tri- and tetra-nucleotide frequencies can be written as simple sums and differences of the remainder. The vast majority of organisms have circular genomes, for which these relations are exact and necessary. In the case of linear molecules, the absolute error is very nearly zero, and does not grow with contiguous sequence length. As a result of the new, necessary relations presented here, the foundations of the statistical analysis of di-, tri- and tetra-nucleotide frequencies, and k-mer analysis in general, need to be revisited.
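One purely mathematical relation of the kind this abstract describes can be checked directly: in a circular sequence, the dinucleotides that start with a given base are exactly as numerous as those that end with it, and both totals equal the count of that base. The sketch below verifies this for a toy sequence. It illustrates a single family of identities and is not the full set of relations derived in the paper; the function names are hypothetical.

```python
from collections import Counter


def circular_dinucleotide_counts(seq):
    """Count overlapping dinucleotides, treating the sequence as circular."""
    s = seq.upper()
    return Counter(s[i] + s[(i + 1) % len(s)] for i in range(len(s)))


def row_and_column_sums(counts, base, alphabet="ACGT"):
    """Dinucleotides starting with `base` versus those ending with `base`."""
    starts = sum(counts[base + y] for y in alphabet)
    ends = sum(counts[y + base] for y in alphabet)
    return starts, ends


if __name__ == "__main__":
    toy = "ACGGTACCATGA"  # arbitrary toy sequence, not real data
    counts = circular_dinucleotide_counts(toy)
    for base in "ACGT":
        print(base, row_and_column_sums(counts, base), toy.count(base))
```

Because the AA term appears on both sides, the identity for A reduces to n(AC) + n(AG) + n(AT) = n(CA) + n(GA) + n(TA), so one of those six counts is fixed by the other five regardless of any biology.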
16
Abstract
Traditionally, microbial genome sequencing has been restricted to the small number of species that can be grown in pure culture. The progressive development of culture-independent methods over the last 15 years now allows researchers to sequence microbial communities directly from environmental samples. This approach is commonly referred to as "metagenomics" or "community genomics". However, the term metagenomics is applied liberally in the literature to describe any culture-independent analysis of microbial communities. Here, we define metagenomics as shotgun ("random") sequencing of the genomic DNA of a sample taken directly from the environment. The metagenome can be thought of as a sampling of the collective genome of the microbial community. We outline the considerations and analyses that should be undertaken to ensure the success of a metagenomic sequencing project, including the choice of sequencing platform and methods for assembly, binning, annotation, and comparative analysis.
Affiliation(s)
- Lauren Bragg
- Advanced Water Management Centre, The University of Queensland, St. Lucia, QLD, Australia
17
Thompson CC, Chimetto L, Edwards RA, Swings J, Stackebrandt E, Thompson FL. Microbial genomic taxonomy. BMC Genomics 2013; 14:913. [PMID: 24365132 PMCID: PMC3879651 DOI: 10.1186/1471-2164-14-913] [Citation(s) in RCA: 263] [Impact Index Per Article: 21.9] [Received: 09/06/2013] [Accepted: 12/18/2013] [Indexed: 01/23/2023]
Abstract
A need for a genomic species definition is emerging from several independent studies worldwide. In this commentary paper, we discuss recent studies on the genomic taxonomy of diverse microbial groups and a unified species definition based on genomics. Accordingly, strains from the same microbial species share >95% Average Amino Acid Identity (AAI) and Average Nucleotide Identity (ANI), >95% identity based on multiple alignment genes, <10 in Karlin genomic signature, and > 70% in silico Genome-to-Genome Hybridization similarity (GGDH). Species of the same genus will form monophyletic groups on the basis of 16S rRNA gene sequences, Multilocus Sequence Analysis (MLSA) and supertree analysis. In addition to the established requirements for species descriptions, we propose that new taxa descriptions should also include at least a draft genome sequence of the type strain in order to obtain a clear outlook on the genomic landscape of the novel microbe. The application of the new genomic species definition put forward here will allow researchers to use genome sequences to define simultaneously coherent phenotypic and genomic groups.
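The thresholds listed in this abstract amount to a simple conjunction of tests, which the following sketch encodes. The class and field names are hypothetical, and the sketch assumes all five measurements have already been computed with the authors' tools and in the units the abstract uses.

```python
from dataclasses import dataclass


@dataclass
class GenomePairMetrics:
    """Pairwise measurements between two strains (field names are assumptions)."""
    aai: float           # Average Amino Acid Identity, percent
    ani: float           # Average Nucleotide Identity, percent
    mla_identity: float  # identity over multiple alignment genes, percent
    karlin: float        # Karlin genomic signature value
    ggdh: float          # in silico genome-to-genome hybridization, percent


def same_species(m):
    """Apply the thresholds quoted in the abstract to one pair of genomes."""
    return (
        m.aai > 95
        and m.ani > 95
        and m.mla_identity > 95
        and m.karlin < 10
        and m.ggdh > 70
    )


if __name__ == "__main__":
    # Made-up values used only to exercise the function.
    pair = GenomePairMetrics(aai=97.2, ani=96.5, mla_identity=98.0, karlin=4.0, ggdh=81.0)
    print(same_species(pair))
```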
Affiliation(s)
- Cristiane C Thompson
- Institute of Biology, Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil.
18
Patil KR, McHardy AC. Alignment-free genome tree inference by learning group-specific distance metrics. Genome Biol Evol 2013; 5:1470-84. [PMID: 23843191 PMCID: PMC3762195 DOI: 10.1093/gbe/evt105] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 12/22/2022]
Abstract
Understanding the evolutionary relationships between organisms is vital for their in-depth study. Gene-based methods are often used to infer such relationships, but they are not without drawbacks. One can now attempt to use genome-scale information, because of the ever increasing number of genomes available. This opportunity also presents a challenge in terms of computational efficiency. Two fundamentally different methods are often employed for sequence comparisons, namely alignment-based and alignment-free methods. Alignment-free methods rely on the genome signature concept and provide a computationally efficient way that is also applicable to nonhomologous sequences. The genome signature contains evolutionary signal as it is more similar for closely related organisms than for distantly related ones. We used genome-scale sequence information to infer taxonomic distances between organisms without additional information such as gene annotations. We propose a method to improve genome tree inference by learning specific distance metrics over the genome signature for groups of organisms with similar phylogenetic, genomic, or ecological properties. Specifically, our method learns a Mahalanobis metric for a set of genomes and a reference taxonomy to guide the learning process. By applying this method to more than a thousand prokaryotic genomes, we showed that, indeed, better distance metrics could be learned for most of the 18 groups of organisms tested here. Once a group-specific metric is available, it can be used to estimate the taxonomic distances for other sequenced organisms from the group. This study also presents a large scale comparison between 10 methods: 9 alignment-free and 1 alignment-based.
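For readers unfamiliar with Mahalanobis metrics, the sketch below shows how a distance is evaluated once a metric matrix M is available. Learning M from genomes and a reference taxonomy, which is the contribution of the paper, is not shown; the function name and the example matrices are assumptions.

```python
from math import sqrt


def mahalanobis_distance(x, y, metric):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)) for a positive semidefinite M."""
    diff = [a - b for a, b in zip(x, y)]
    # Matrix-vector product M @ diff, written out with plain lists.
    m_diff = [sum(row[j] * diff[j] for j in range(len(diff))) for row in metric]
    return sqrt(sum(d * md for d, md in zip(diff, m_diff)))


if __name__ == "__main__":
    identity = [[1, 0], [0, 1]]
    weighted = [[4, 0], [0, 1]]  # hypothetical metric stressing the first feature
    print(mahalanobis_distance([0, 0], [3, 4], identity))
    print(mahalanobis_distance([0, 0], [1, 0], weighted))
```

With the identity matrix the distance reduces to the ordinary Euclidean distance, so a learned M can be read as a reweighting and rotation of the genome signature features.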
Affiliation(s)
- Kaustubh R Patil
- Max-Planck Research Group for Computational Genomics and Epidemiology, Max-Planck Institute for Informatics, Saarbrücken, Germany.
19
Phan TH, Nguyen DL. Species-specificity of DNA trimer densities in chromosomes and their use in the classification of closely related organisms. J Microbiol Methods 2012; 91:30-7. [PMID: 22820348 DOI: 10.1016/j.mimet.2012.07.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2012] [Revised: 07/09/2012] [Accepted: 07/10/2012] [Indexed: 11/27/2022]
Abstract
16S rDNA sequences are conventionally used for classification of organisms. However, the use of these sequences is sometimes not successful, especially for closely related species. For better classification of these organisms, several methods that are genome sequence-based have been developed. Sequence alignment-based methods are tedious and time-consuming, as they need conserved coding sequences to be identified and deduced prior to sequence alignment. Likewise, methods that rely on gene function need genes to be assessed for functional similarity. Other alignment-free methods, which are based on particular genome sequence properties, so far have been complex and not species-specific enough for classification of organisms below genus level. The present study found that the ratios of DNA trimer frequencies to chromosomal length were species-specific. Density of a trimer in a chromosomal sequence was defined as the average frequency of the trimer per 1 kbp. The species-specificity of trimer densities in chromosomes of many closely related bacteria was compared in parallel with 16S rDNA sequences in these same bacteria. The results of these comparisons indicate that trimer densities in chromosomes can be used to simply and efficiently classify the organisms below genus level.
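The density defined in this abstract, the frequency of a trimer per 1 kbp of chromosome, can be sketched as below. Counting overlapping occurrences on a single strand is an assumption, since the abstract does not say how occurrences were counted, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product

# All 64 trimers in a fixed order.
TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]


def trimer_densities(chromosome):
    """Occurrences of each trimer per 1 kbp of sequence (overlapping counts)."""
    seq = chromosome.upper()
    if len(seq) < 3:
        raise ValueError("sequence is shorter than one trimer")
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    length_kbp = len(seq) / 1000
    return {t: counts[t] / length_kbp for t in TRIMERS}


if __name__ == "__main__":
    toy = "ACG" * 1000  # 3 kbp toy sequence, not real data
    densities = trimer_densities(toy)
    print(densities["ACG"], densities["CGA"], densities["TTT"])
```

Two chromosomes could then be compared through their 64-value density profiles, for example with a correlation or a distance of the reader's choice.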
Affiliation(s)
- Thi Huyen Phan
- Department of Biotechnology, Ho Chi Minh City University of Technology, VNU-HCM, Ward 14, District 10, Ho Chi Minh City, Vietnam.
20
Abel J, Mrázek J. Differences in DNA curvature-related sequence periodicity between prokaryotic chromosomes and phages, and relationship to chromosomal prophage content. BMC Genomics 2012; 13:188. [PMID: 22587570 PMCID: PMC3431218 DOI: 10.1186/1471-2164-13-188] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 12/23/2011] [Accepted: 05/07/2012] [Indexed: 02/07/2023]
Abstract
Background: Periodic spacing of A-tracts (short runs of A or T) with the DNA helical period of ~10–11 bp is characteristic of intrinsically bent DNA. In eukaryotes, the DNA bending is related to chromatin structure and nucleosome positioning. However, the physiological role of strong sequence periodicity detected in many prokaryotic genomes is not clear. Results: We developed measures of intensity and persistency of DNA curvature-related sequence periodicity and applied them to prokaryotic chromosomes and phages. The results indicate that strong periodic signals present in chromosomes are generally absent in phage genomes. Moreover, chromosomes containing prophages are less likely to possess a persistent periodic signal than chromosomes with no prophages. Conclusions: Absence of DNA curvature-related sequence periodicity in phages could arise from constraints associated with DNA packaging in the viral capsid. Lack of prophages in chromosomes with persistent periodic signal suggests that the sequence periodicity and concomitant DNA curvature could play a role in protecting the chromosomes from integration of phage DNA.
Affiliation(s)
- Jacob Abel
- Department of Microbiology, University of Georgia, Athens, GA 30602, USA
21
Dutta C, Paul S. Microbial lifestyle and genome signatures. Curr Genomics 2012; 13:153-62. [PMID: 23024607 PMCID: PMC3308326 DOI: 10.2174/138920212799860698] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Received: 08/18/2011] [Revised: 09/13/2011] [Accepted: 09/28/2011] [Indexed: 12/29/2022]
Abstract
Microbes are known for their unique ability to adapt to varying lifestyles and environments, even extreme or adverse ones. The genomic architecture of a microbe may bear the signatures not only of its phylogenetic position, but also of the kind of lifestyle to which it is adapted. The present review aims to provide an account of the specific genome signatures observed in microbes acclimatized to distinct lifestyles or ecological niches. Niche-specific signatures identified at different levels of microbial genome organization, such as base composition, GC-skew, purine-pyrimidine ratio, dinucleotide abundance, codon bias and oligonucleotide composition, are discussed. Among the specific cases highlighted in the review are the phenomena of genome shrinkage in obligatory host-restricted microbes, genome expansion in strictly intra-amoebal pathogens, strand-specific codon usage in intracellular species, acquisition of genome islands in pathogenic or symbiotic organisms, discriminatory genomic traits of marine microbes with distinct trophic strategies, and conspicuous sequence features of certain extremophiles like those adapted to high temperature or high salinity.
Affiliation(s)
- Chitra Dutta
- Structural Biology & Bioinformatics Division, CSIR- Indian Institute of Chemical Biology, 4, Raja S. C. Mullick Road, Kolkata 700032, India
22
Comparative analyses of base compositions, DNA sizes, and dinucleotide frequency profiles in archaeal and bacterial chromosomes and plasmids. INTERNATIONAL JOURNAL OF EVOLUTIONARY BIOLOGY 2012; 2012:342482. [PMID: 22536540 PMCID: PMC3321278 DOI: 10.1155/2012/342482] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.8] [Received: 11/25/2011] [Revised: 01/11/2012] [Accepted: 01/19/2012] [Indexed: 01/16/2023]
Abstract
In the present paper, I compared guanine-cytosine (GC) contents, DNA sizes, and dinucleotide frequency profiles in 109 archaeal chromosomes, 59 archaeal plasmids, 1379 bacterial chromosomes, and 854 bacterial plasmids. In more than 80% of archaeal and bacterial plasmids, the GC content was lower than that of the host chromosome. Furthermore, most of the differences in GC content found between a plasmid and its host chromosome were less than 10%, and the GC content in plasmids and host chromosomes was highly correlated (Pearson's correlation coefficient r = 0.965 in bacteria and 0.917 in archaea). These results support the hypothesis that horizontal gene transfers have occurred frequently via plasmid distribution during evolution. GC content and chromosome size were more highly correlated in bacteria (r = 0.460) than in archaea (r = 0.195). Interestingly, there was a tendency for archaea with plasmids to have higher GC content in the chromosome and plasmid than those without plasmids. Thus, the dinucleotide frequency profile of the archaeal plasmids has a bias toward high GC content.
23
Genome Signature Difference between Deinococcus radiodurans and Thermus thermophilus. INTERNATIONAL JOURNAL OF EVOLUTIONARY BIOLOGY 2012; 2012:205274. [PMID: 22500246 PMCID: PMC3303625 DOI: 10.1155/2012/205274] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/07/2011] [Accepted: 12/08/2011] [Indexed: 01/13/2023]
Abstract
The extremely radioresistant bacteria of the genus Deinococcus and the extremely thermophilic bacteria of the genus Thermus belong to a common taxonomic group. Considering the distinct living environments of Deinococcus and Thermus, different genes would have been acquired through horizontal gene transfer after their divergence from a common ancestor. Their guanine-cytosine (GC) contents are similar; however, we hypothesized that their genomic signatures would be different. Our findings indicated that the genomes of Deinococcus radiodurans and Thermus thermophilus have different tetranucleotide frequencies. This analysis showed that the genome signature of D. radiodurans is most similar to that of Pseudomonas aeruginosa, whereas the genome signature of T. thermophilus is most similar to that of Thermanaerovibrio acidaminovorans. This difference in genome signatures may be related to the different evolutionary backgrounds of the 2 genera after their divergence from a common ancestor.
24
Saeed I, Tang SL, Halgamuge SK. Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition. Nucleic Acids Res 2011; 40:e34. [PMID: 22180538 PMCID: PMC3300000 DOI: 10.1093/nar/gkr1204] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Indexed: 11/14/2022]
Abstract
An approach to infer the unknown microbial population structure within a metagenome is to cluster nucleotide sequences based on common patterns in base composition, otherwise referred to as binning. When functional roles are assigned to the identified populations, a deeper understanding of microbial communities can be attained, more so than gene-centric approaches that explore overall functionality. In this study, we propose an unsupervised, model-based binning method with two clustering tiers, which uses a novel transformation of the oligonucleotide frequency-derived error gradient and GC content to generate coarse groups at the first tier of clustering; and tetranucleotide frequency to refine these groups at the secondary clustering tier. The proposed method has a demonstrated improvement over PhyloPythia, S-GSOM, TACOA and TaxSOM on all three benchmarks that were used for evaluation in this study. The proposed method is then applied to a pyrosequenced metagenomic library of mud volcano sediment sampled in southwestern Taiwan, with the inferred population structure validated against complementary sequencing of 16S ribosomal RNA marker genes. Finally, the proposed method was further validated against four publicly available metagenomes, including a highly complex Antarctic whale-fall bone sample, which was previously assumed to be too complex for binning prior to functional analysis.
Affiliation(s)
- Isaam Saeed
- MERIT Theme: Biomedical Engineering, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, VIC 3010, Australia.
25
Mrázek J, Chaudhari T, Basu A. PerPlot & PerScan: tools for analysis of DNA curvature-related periodicity in genomic nucleotide sequences. MICROBIAL INFORMATICS AND EXPERIMENTATION 2011; 1:13. [PMID: 22587738 PMCID: PMC3372288 DOI: 10.1186/2042-5783-1-13] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 06/01/2011] [Accepted: 11/28/2011] [Indexed: 04/12/2023]
Abstract
Background: Periodic spacing of short adenine or thymine runs phased with DNA helical period of ~10.5 bp is associated with intrinsic DNA curvature and deformability, which play important roles in DNA-protein interactions and in the organization of chromosomes in both eukaryotes and prokaryotes. Local differences in DNA sequence periodicity have been linked to differences in gene expression in some organisms. Despite the significance of these periodic patterns, there are virtually no publicly accessible tools for their analysis. Results: We present novel tools suitable for assessments of DNA curvature-related sequence periodicity in nucleotide sequences at the genome scale. Utility of the present software is demonstrated on a comparison of sequence periodicities in the genomes of Haemophilus influenzae, Methanocaldococcus jannaschii, Saccharomyces cerevisiae, and Arabidopsis thaliana. The software can be accessed through a web interface and the programs are also available for download. Conclusions: The present software is suitable for comparing DNA curvature-related sequence periodicity among different genomes as well as for analysis of intrachromosomal heterogeneity of the sequence periodicity. It provides a quick and convenient way to detect anomalous regions of chromosomes that could have unusual structural and functional properties and/or distinct evolutionary history.
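As a rough illustration of what a periodicity signal in short A or T runs looks like, the sketch below counts how often two AA or TT dinucleotides start a given number of bases apart. This is a simple lag histogram written for this listing and is not the algorithm used by PerPlot or PerScan, which the abstract does not describe; the function name and the synthetic test sequence are assumptions.

```python
def at_pair_lag_counts(seq, max_lag=30):
    """Count pairs of AA/TT dinucleotide starts separated by each lag (in bp)."""
    s = seq.upper()
    hits = [i for i in range(len(s) - 1) if s[i:i + 2] in ("AA", "TT")]
    hit_set = set(hits)
    return {
        lag: sum(1 for i in hits if i + lag in hit_set)
        for lag in range(2, max_lag + 1)
    }


if __name__ == "__main__":
    # Synthetic sequence with an AA dinucleotide every 10 bp, not real data.
    synthetic = "AACGCGCGCG" * 20
    counts = at_pair_lag_counts(synthetic)
    strongest = max(counts, key=counts.get)
    print(strongest, counts[strongest])
```

In real genomes the helical period is not an integer, so a spectral or smoothed analysis of such counts would likely be more appropriate than reading off a single lag.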
Affiliation(s)
- Jan Mrázek
- Department of Microbiology and Institute of Bioinformatics, University of Georgia, Athens, GA 30602-2605, USA.
26
Norberg P, Bergström M, Jethava V, Dubhashi D, Hermansson M. The IncP-1 plasmid backbone adapts to different host bacterial species and evolves through homologous recombination. Nat Commun 2011; 2:268. [PMID: 21468020 PMCID: PMC3104523 DOI: 10.1038/ncomms1267] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.6] [Received: 01/25/2011] [Accepted: 03/08/2011] [Indexed: 01/24/2023]
Abstract
Plasmids are important members of the bacterial mobile gene pool, and are among the most important contributors to horizontal gene transfer between bacteria. They typically harbour a wide spectrum of host beneficial traits, such as antibiotic resistance, inserted into their backbones. Although these inserted elements have drawn considerable interest, evolutionary information about the plasmid backbones, which encode plasmid related traits, is sparse. Here we analyse 25 complete backbone genomes from the broad-host-range IncP-1 plasmid family. Phylogenetic analysis reveals seven clades, in which two plasmids that we isolated from a marine biofilm represent a novel clade. We also found that homologous recombination is a prominent feature of the plasmid backbone evolution. Analysis of genomic signatures indicates that the plasmids have adapted to different host bacterial species. Globally circulating IncP-1 plasmids hence contain mosaic structures of segments derived from several parental plasmids that have evolved in, and adapted to, different, phylogenetically very distant host bacterial species. Plasmids are present in many bacteria and are often transferred between different species causing horizontal gene transfer. By comparing the sequences of 25 plasmid DNA backbones, the authors show that homologous recombination is prevalent in plasmids and that the plasmids have adapted to persist in different host bacteria.
Affiliation(s)
- Peter Norberg
- Department of Cell and Molecular Biology, Microbiology, University of Gothenburg, Box 462, SE 413 46, Gothenburg, Sweden.
27
Kelley DR, Salzberg SL. Clustering metagenomic sequences with interpolated Markov models. BMC Bioinformatics 2010; 11:544. [PMID: 21044341 PMCID: PMC3098094 DOI: 10.1186/1471-2105-11-544] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.3] [Received: 05/13/2010] [Accepted: 11/02/2010] [Indexed: 01/16/2023]
Abstract
BACKGROUND: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects. RESULTS: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets, and suggest a hybrid of SCIMM and the supervised learning method Phymm, called PHYSCIMM, that performs better when evolutionarily close training genomes are available. CONCLUSIONS: SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm.
Affiliation(s)
- David R Kelley
- Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, College Park, MD 20742, USA
- Department of Computer Science, University of Maryland, A.V. Williams Building College Park, MD 20742, USA
- Steven L Salzberg
- Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, College Park, MD 20742, USA
- Department of Computer Science, University of Maryland, A.V. Williams Building College Park, MD 20742, USA
28
Comparative analysis of sequence periodicity among prokaryotic genomes points to differences in nucleoid structure and a relationship to gene expression. J Bacteriol 2010; 192:3763-72. [PMID: 20494989 DOI: 10.1128/jb.00149-10] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Indexed: 12/13/2022]
Abstract
Regular spacing of short runs of A or T nucleotides in DNA sequences with a period close to the helical period of the DNA double helix has been associated with intrinsic DNA bending and nucleosome positioning in eukaryotes. Analogous periodic signals were also observed in prokaryotic genomes. While the exact role of this periodicity in prokaryotes is not known, it has been proposed to facilitate the DNA packaging in the prokaryotic nucleoid and/or to promote negative or positive supercoiling. We developed a methodology for assessments of intragenomic heterogeneity of these periodic patterns and applied it in analysis of 1,025 prokaryotic chromosomes. This technique allows more detailed analysis of sequence periodicity than previous methods where sequence periodicity was assessed in an integral form across the whole chromosome. We found that most genomes have the periodic signal confined to several chromosomal segments while most of the chromosome lacks a strong sequence periodicity. Moreover, there are significant differences among different prokaryotes in both the intensity and persistency of sequence periodicity related to DNA curvature. We proffer that the prokaryotic nucleoid consists of relatively rigid sections stabilized by short intrinsically bent DNA segments and characterized by locally strong periodic patterns alternating with regions featuring a weak periodic signal, which presumably permits higher structural flexibility. This model applies to most bacteria and archaea. In genomes with an exceptionally persistent periodic signal, highly expressed genes tend to concentrate in aperiodic sections, suggesting that structural heterogeneity of the nucleoid is related to local differences in transcriptional activity.
|
29
|
Chaffron S, Rehrauer H, Pernthaler J, von Mering C. A global network of coexisting microbes from environmental and whole-genome sequence data. Genome Res 2010; 20:947-59. [PMID: 20458099 DOI: 10.1101/gr.104521.109] [Citation(s) in RCA: 308] [Impact Index Per Article: 20.5] [Indexed: 01/12/2023]
Abstract
Microbes are the most abundant and diverse organisms on Earth. In contrast to macroscopic organisms, their environmental preferences and ecological interdependencies remain difficult to assess, requiring laborious molecular surveys at diverse sampling sites. Here, we present a global meta-analysis of previously sampled microbial lineages in the environment. We grouped publicly available 16S ribosomal RNA sequences into operational taxonomic units at various levels of resolution and systematically searched these for co-occurrence across environments. Naturally occurring microbes, indeed, exhibited numerous, significant interlineage associations. These ranged from relatively specific groupings encompassing only a few lineages, to larger assemblages of microbes with shared habitat preferences. Many of the coexisting lineages were phylogenetically closely related, but a significant number of distant associations were observed as well. The increased availability of completely sequenced genomes allowed us, for the first time, to search for genomic correlates of such ecological associations. Genomes from coexisting microbes tended to be more similar than expected by chance, both with respect to pathway content and genome size, and outliers from these trends are discussed. We hypothesize that groupings of lineages are often ancient, and that they may have significantly impacted on genome evolution.
Affiliation(s)
- Samuel Chaffron
- Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, CH-8057 Zürich, Switzerland
|
30
|
Perry SC, Beiko RG. Distinguishing microbial genome fragments based on their composition: evolutionary and comparative genomic perspectives. Genome Biol Evol 2010; 2:117-31. [PMID: 20333228 PMCID: PMC2839357 DOI: 10.1093/gbe/evq004] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Accepted: 01/19/2010] [Indexed: 01/23/2023] Open
Abstract
It is well known that patterns of nucleotide composition vary within and among genomes, although the reasons why these variations exist are not completely understood. Between-genome compositional variation has been exploited to assign environmental shotgun sequences to their most likely originating genomes, whereas within-genome variation has been used to identify recently acquired genetic material such as pathogenicity islands. Recent sequence assignment techniques have achieved high levels of accuracy on artificial data sets, but the relative difficulty of distinguishing lineages with varying degrees of relatedness, and different types of genomic sequence, has not been examined in depth. We investigated the compositional differences in a set of 774 sequenced microbial genomes, finding rapid divergence among closely related genomes, but also convergence of compositional patterns among genomes with similar habitats. Support vector machines were then used to distinguish all pairs of genomes based on genome fragments 500 nucleotides in length. The nearly 300,000 accuracy scores obtained from these trials were used to construct general models of distinguishability versus taxonomic and compositional indices of genomic divergence. Unusual genome pairs were evident from their large residuals relative to the fitted model, and we identified several factors including genome reduction, putative lateral genetic transfer, and habitat convergence that influence the distinguishability of genomes. The positional, compositional, and functional context of a fragment within a genome has a strong influence on its likelihood of correct classification, but in a way that depends on the taxonomic and ecological similarity of the comparator genome.
Affiliation(s)
- Scott C Perry
- Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
|
31
|
Thompson CC, Vicente ACP, Souza RC, Vasconcelos ATR, Vesth T, Alves N, Ussery DW, Iida T, Thompson FL. Genomic taxonomy of Vibrios. BMC Evol Biol 2009; 9:258. [PMID: 19860885 PMCID: PMC2777879 DOI: 10.1186/1471-2148-9-258] [Citation(s) in RCA: 132] [Impact Index Per Article: 8.3] [Received: 01/26/2009] [Accepted: 10/27/2009] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND: Vibrio taxonomy has been based on a polyphasic approach. In this study, we retrieve useful taxonomic information (i.e., data that can be used to distinguish different taxonomic levels, such as species and genera) from 32 genome sequences of different vibrio species. We use a variety of tools to explore the taxonomic relationship between the sequenced genomes, including Multilocus Sequence Analysis (MLSA), supertrees, Average Amino Acid Identity (AAI), genomic signatures, and Genome BLAST atlases. Our aim is to analyse the usefulness of these tools for species identification in vibrios. RESULTS: We have generated four new genome sequences of three Vibrio species, i.e., V. alginolyticus 40B, V. harveyi-like 1DA3, and V. mimicus strains VM573 and VM603, and present a broad analysis of these genomes along with other sequenced Vibrio species. The genome atlas and pangenome plots provide a tantalizing image of the genomic differences that occur between closely related sister species, e.g., V. cholerae and V. mimicus. The vibrio pangenome contains around 26504 genes. The V. cholerae core genome and pangenome consist of 1520 and 6923 genes, respectively. Pangenomes might allow different strains of V. cholerae to occupy different niches. MLSA and supertree analyses resulted in a similar phylogenetic picture, with a clear distinction of four groups (Vibrio core group, V. cholerae-V. mimicus, Aliivibrio spp., and Photobacterium spp.). A Vibrio species is defined as a group of strains that share > 95% DNA identity in MLSA and supertree analysis, > 96% AAI, ≤ 10 genome signature dissimilarity, and > 61% proteome identity. Strains of the same species and species of the same genus will form monophyletic groups on the basis of MLSA and supertree. CONCLUSION: The combination of different analytical and bioinformatics tools will enable the most accurate species identification through genomic computational analysis. This endeavour will culminate in the birth of the online genomic taxonomy whereby researchers and end-users of taxonomy will be able to identify their isolates through a web-based server. This novel approach to microbial systematics will result in a tremendous advance concerning biodiversity discovery, description, and understanding.
Affiliation(s)
- Cristiane C Thompson
- Laboratory of Molecular Genetics of Microorganisms, Oswaldo Cruz Institute, FIOCRUZ, Rio de Janeiro, Brazil
- Ana Carolina P Vicente
- Laboratory of Molecular Genetics of Microorganisms, Oswaldo Cruz Institute, FIOCRUZ, Rio de Janeiro, Brazil
- Rangel C Souza
- National Laboratory for Scientific Computing, Department of Applied and Computational Mathematics, Laboratory of Bioinformatics, Av. Getúlio Vargas 333, Quitandinha, 25651-070, Petropolis, RJ, Brazil
- Ana Tereza R Vasconcelos
- National Laboratory for Scientific Computing, Department of Applied and Computational Mathematics, Laboratory of Bioinformatics, Av. Getúlio Vargas 333, Quitandinha, 25651-070, Petropolis, RJ, Brazil
- Tammi Vesth
- Center for Biological Sequence Analysis, Department of Biotechnology, Building 208, The Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
- Nelson Alves
- Department of Genetics, Institute of Biology, Federal University of Rio de Janeiro, UFRJ, Brazil
- David W Ussery
- Center for Biological Sequence Analysis, Department of Biotechnology, Building 208, The Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
- Tetsuya Iida
- Laboratory of Genomic Research on Pathogenic Bacteria, International Research Center for Infectious Diseases, Research Institute for Microbial Diseases, Osaka University, Suita, Osaka 565-0871, Japan
- Fabiano L Thompson
- Department of Genetics, Institute of Biology, Federal University of Rio de Janeiro, UFRJ, Brazil
|