1
|
Sawada Y, Minei R, Tabata H, Ikemura T, Wada K, Wada Y, Nagata H, Iwasaki Y. Unsupervised AI reveals insect species-specific genome signatures. PeerJ 2024; 12:e17025. [PMID: 38464746 PMCID: PMC10924456 DOI: 10.7717/peerj.17025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 02/07/2024] [Indexed: 03/12/2024]
Abstract
Insects are a phylogenetically diverse group with a wide variety of traits, including the presence or absence of wings and of metamorphosis. These diverse traits are of great interest for studying genome evolution, and numerous comparative genomic studies have examined a wide phylogenetic range of insects. Here, we analyzed 22 insects belonging to a wide phylogenetic range (Endopterygota, Paraneoptera, Polyneoptera, Palaeoptera, and other insects) using a batch-learning self-organizing map (BLSOM), an unsupervised machine learning algorithm, applied to the oligonucleotide compositions of their genomic fragments (100-kb or 1-Mb sequences); the BLSOM can extract species-specific characteristics of the oligonucleotide compositions (genome signatures). The genome signature is of particular interest in terms of the mechanisms and biological significance underlying the species-specific differences, and it can serve as a powerful probe for exploring the various roles of genome sequences other than protein coding and for uncovering features hidden in the genome sequence. Since BLSOM is an unsupervised clustering method, sequences were clustered based on oligonucleotide composition alone, without providing information about the species from which each fragment was derived. Therefore, not only interspecies separation but also intraspecies separation can be achieved. We identified specific genomic regions whose oligonucleotide compositions are distinct from those of the typical sequences of each insect genome, e.g., Mb-level structures found in the grasshopper Schistocerca americana. One aim of this study was to compare the genome characteristics of insects with those of vertebrates, especially humans, which are phylogenetically distant from insects. Humans have recently become a de facto "model organism" for which a large amount of information has been accumulated using a variety of cutting-edge, high-throughput technologies.
Therefore, it is reasonable to use the abundant information from humans to study insect lineages. Specific Mb-length regions with distinct oligonucleotide compositions have also been observed previously in the human genome. These regions were enriched in transcription factor binding sequences (TFBSs) and were hypothesized to be involved in the three-dimensional arrangement of chromosomal DNA in interphase nuclei. The present study characterized the species-specific oligonucleotide compositions (i.e., genome signatures) in insect genomes and identified specific genomic regions with distinct oligonucleotide compositions.
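The fragment-wise oligonucleotide composition that serves as input in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the non-overlapping windowing, and the handling of ambiguous bases (k-mers containing anything other than A, C, G, or T are ignored) are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k=4):
    """Relative frequencies of all 4**k k-mers, in lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Assumption: k-mers with ambiguous bases (e.g. N) are simply not counted.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def fragment_compositions(genome, window=100_000, k=4):
    """Cut a sequence into non-overlapping windows and profile each one."""
    return [
        kmer_composition(genome[start:start + window], k)
        for start in range(0, len(genome) - window + 1, window)
    ]
```

Each fragment becomes a vector of length 4^k (256 for tetranucleotides), and a clustering method such as a self-organizing map would then operate on these vectors without any species labels.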
Affiliation(s)
- Yui Sawada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Ryuhei Minei
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Hiromasa Tabata
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Kennosuke Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Hiroshi Nagata
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Yuki Iwasaki
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
|
2
|
Madival SD, Mishra DC, Sharma A, Kumar S, Maji AK, Budhlakoti N, Sinha D, Rai A. A Deep Clustering-based Novel Approach for Binning of Metagenomics Data. Curr Genomics 2022; 23:353-368. [PMID: 36778191 PMCID: PMC9878855 DOI: 10.2174/1389202923666220928150100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2022] [Revised: 08/30/2022] [Accepted: 09/02/2022] [Indexed: 11/22/2022]
Abstract
Background One major challenge in binning metagenomics data is the limited availability of reference datasets, as only about 1% of the total microbial population has been cultured so far. This has motivated the use of unsupervised methods, which can bin sequences in the absence of any reference dataset. Objective To develop a deep clustering-based binning approach for metagenomics data and to evaluate the results with suitable measures. Methods In this study, a deep learning-based approach was taken for binning metagenomics data. The results were validated on different datasets using features such as tetranucleotide frequency (TNF), hexanucleotide frequency (HNF), and GC content. A convolutional autoencoder is used for feature extraction, and the K-means clustering method is used for binning. Results In most cases, the Silhouette index and the Rand index exceeded 0.5 and 0.8, respectively, which indicates that the proposed approach gives satisfactory results. The performance of the developed approach was compared with current methods and tools using benchmarked low-complexity simulated and real metagenomic datasets. It was found to perform better than unsupervised methods and on par with semi-supervised methods. Conclusion An unsupervised deep learning-based approach for binning has been proposed, and the developed method shows promising results for various datasets. This is a novel approach to the problem of missing reference data in metagenomic binning.
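The two evaluation measures named in this abstract can be sketched in plain Python. This is an illustrative implementation of the standard Rand index and silhouette definitions, not the code used in the paper; the function names and the conventions for degenerate cases (a score of 0.0 for singleton clusters or a single-cluster labeling) are assumptions.

```python
import math
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two labelings agree (same/different bin)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def silhouette_index(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if j != i and labels[j] == c]
            if d:
                mean_dist[c] = sum(d) / len(d)
        a = mean_dist.pop(labels[i], None)  # cohesion within the point's own bin
        if a is None or not mean_dist:
            scores.append(0.0)  # assumed convention for degenerate cases
            continue
        b = min(mean_dist.values())  # separation from the nearest other bin
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

The Rand index needs reference labels, so it applies to simulated or benchmarked data, whereas the silhouette index uses only the feature vectors and the predicted bins.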
Affiliation(s)
- Anu Sharma
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi-110012, India
- Sanjeev Kumar
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi-110012, India
- Arpan Kumar Maji
- Division of Computer Applications, ICAR-IASRI, New Delhi-110012, India
- Neeraj Budhlakoti
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi-110012, India
- Dipro Sinha
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi-110012, India
- Anil Rai
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi-110012, India
|
3
|
AI-based search for convergently expanding, advantageous mutations in SARS-CoV-2 by focusing on oligonucleotide frequencies. PLoS One 2022; 17:e0273860. [PMID: 36044525 PMCID: PMC9432735 DOI: 10.1371/journal.pone.0273860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Accepted: 08/16/2022] [Indexed: 11/19/2022]
Abstract
Among mutations that occur in SARS-CoV-2, efficient identification of mutations advantageous for viral replication and transmission is important to characterize and defeat this rampant virus. Mutations that rapidly expand in frequency in a viral population are candidates for advantageous mutations, but neutral mutations hitchhiking with advantageous mutations are also likely to be included. To distinguish these, we focus on mutations that appear to occur independently in different lineages and expand in frequency in a convergent evolutionary manner. Batch-learning SOM (BLSOM) can separate SARS-CoV-2 genome sequences according to lineage when given only the oligonucleotide composition. Focusing on remarkably expanding 20-mers, each of which is represented by only one copy in the viral genome, allows us to correlate the expanding 20-mers with mutations. Using visualization functions in BLSOM, we can efficiently identify mutations that have expanded remarkably both in the Omicron lineage, which is phylogenetically distinct from other lineages, and in other lineages. Most of these mutations involved changes in amino acids, but there were a few that did not, such as an intergenic mutation.
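The link between single-copy 20-mers and mutations can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the procedure from the paper: the function names are hypothetical, and comparing a variant genome against a single reference sequence is an assumption made to keep the example short.

```python
from collections import Counter


def unique_kmers(genome, k=20):
    """k-mers that occur exactly once in the given sequence."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer for kmer, n in counts.items() if n == 1}


def novel_unique_kmers(reference, variant, k=20):
    """Single-copy k-mers of the variant that are absent from the reference."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    return unique_kmers(variant, k) - ref_kmers
```

A single base substitution changes every k-mer that overlaps the mutated position, so up to k novel k-mers point back to one mutation; because each occurs only once in the genome, its position is unambiguous.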
|
4
|
Iwasaki Y, Ikemura T, Wada K, Wada Y, Abe T. Comparative genomic analysis of the human genome and six bat genomes using unsupervised machine learning: Mb-level CpG and TFBS islands. BMC Genomics 2022; 23:497. [PMID: 35804296 PMCID: PMC9264310 DOI: 10.1186/s12864-022-08664-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/03/2022] [Accepted: 05/31/2022] [Indexed: 11/25/2022]
Abstract
BACKGROUND Emerging infectious disease-causing RNA viruses, such as the SARS-CoV-2 and Ebola viruses, are thought to rely on bats as natural reservoir hosts. Since these zoonotic viruses pose a great threat to humans, it is important to characterize the bat genome from multiple perspectives. Unsupervised machine learning methods for extracting novel information from big sequence data without prior knowledge or particular models are highly desirable for obtaining unexpected insights. We previously established a batch-learning self-organizing map (BLSOM) of the oligonucleotide composition that reveals novel genome characteristics from big sequence data. RESULTS In this study, using the oligonucleotide BLSOM, we conducted a comparative genomic study of humans and six bat species. BLSOM is an explainable-type machine learning algorithm that reveals the diagnostic oligonucleotides contributing to sequence clustering (self-organization). When unsupervised machine learning reveals unexpected and/or characteristic features, these features can be studied in more detail via the much simpler and more direct standard distribution map method. Based on this combined strategy, we identified the Mb-level enrichment of CG dinucleotide (Mb-level CpG islands) around the termini of bat long-scaffold sequences. In addition, a class of CG-containing oligonucleotides were enriched in the centromeric and pericentromeric regions of human chromosomes. Oligonucleotides longer than tetranucleotides often represent binding motifs for a wide variety of proteins (e.g., transcription factor binding sequences (TFBSs)). By analyzing the penta- and hexanucleotide composition, we observed the evident enrichment of a wide range of hexanucleotide TFBSs in centromeric and pericentromeric heterochromatin regions on all human chromosomes. 
CONCLUSION The functions of transcription factors (TFs) beyond their known regulation of gene expression (e.g., TF-mediated looping interactions between two different genomic regions) have received wide attention. The Mb-level TFBS and CpG islands are thought to be involved in large-scale nuclear organization, such as centromere and telomere clustering. TFBSs, which are enriched in centromeric and pericentromeric heterochromatin regions, are thought to play an important role in the formation of nuclear 3D structures. Our machine learning-based analysis will help us to understand the differential features of nuclear 3D structures in the human and bat genomes.
Affiliation(s)
- Yuki Iwasaki
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Kennosuke Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Takashi Abe
- Smart Information Systems, Faculty of Engineering, Niigata University, Niigata-ken, 950-2181, Japan
|
5
|
An integrated approach using growing self-organizing map-based genetic K-means clustering and tolerance rough set in occupational risk analysis. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-06956-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
6
|
Iwasaki Y, Abe T, Wada K, Wada Y, Ikemura T. Unsupervised explainable AI for molecular evolutionary study of forty thousand SARS-CoV-2 genomes. BMC Microbiol 2022; 22:73. [PMID: 35272618 PMCID: PMC8907386 DOI: 10.1186/s12866-022-02484-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/30/2021] [Accepted: 02/28/2022] [Indexed: 12/17/2022]
Abstract
BACKGROUND Unsupervised AI (artificial intelligence) can obtain novel knowledge from big data without particular models or prior knowledge and is highly desirable for unveiling hidden features in big data. SARS-CoV-2 poses a serious threat to public health, and one important issue in characterizing this fast-evolving virus is to elucidate various aspects of its genome sequence changes. We previously established an unsupervised AI, the BLSOM (batch-learning SOM), which can analyze five million genomic sequences simultaneously. The present study applied the BLSOM to the oligonucleotide compositions of forty thousand SARS-CoV-2 genomes. RESULTS Although only the oligonucleotide composition was given, the obtained clusters of genomes corresponded primarily to known main clades and to internal divisions within the main clades. Since the BLSOM is explainable AI, it reveals which features of the oligonucleotide composition are responsible for clade clustering. Additionally, the BLSOM provided information concerning a special genomic region possibly undergoing RNA modifications. CONCLUSIONS The BLSOM has powerful image display capabilities and enables efficient knowledge discovery about viral evolutionary processes, and it can complement phylogenetic methods based on sequence alignment.
Affiliation(s)
- Yuki Iwasaki
- Nagahama Institute of Bio-Science and Technology, Shiga-ken, Nagahama, 526-0829, Japan
- Takashi Abe
- Faculty of Engineering, Niigata University, Niigata-ken, 950-2181, Japan
- Kennosuke Wada
- Nagahama Institute of Bio-Science and Technology, Shiga-ken, Nagahama, 526-0829, Japan
- Yoshiko Wada
- Nagahama Institute of Bio-Science and Technology, Shiga-ken, Nagahama, 526-0829, Japan
- Toshimichi Ikemura
- Nagahama Institute of Bio-Science and Technology, Shiga-ken, Nagahama, 526-0829, Japan; National Institute of Genetics, Mishima, Shizuoka-ken, 411-8540, Japan
|
7
|
Ikemura T, Iwasaki Y, Wada K, Wada Y, Abe T. AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome. Genes Genet Syst 2021; 96:165-176. [PMID: 34565757 DOI: 10.1266/ggs.21-00025] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/23/2022]
Abstract
In genetics and related fields, huge amounts of data, such as genome sequences, are accumulating, and the use of artificial intelligence (AI) suitable for big data analysis has become increasingly important. Unsupervised AI that can reveal novel knowledge from big data without prior knowledge or particular models is highly desirable for analyses of genome sequences, particularly for obtaining unexpected insights. We have developed a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions that can reveal various novel genome characteristics. Here, we explain the data mining by the BLSOM: an unsupervised AI. As a specific target, we first selected SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) because a large number of viral genome sequences have been accumulated via worldwide efforts. We analyzed more than 0.6 million sequences collected primarily in the first year of the pandemic. BLSOMs for short oligonucleotides (e.g., 4-6-mers) allowed separation into known clades, but longer oligonucleotides further increased the separation ability and revealed subgrouping within known clades. In the case of 15-mers, there is mostly one copy in the genome; thus, 15-mers that appeared after the epidemic started could be connected to mutations, and the BLSOM for 15-mers revealed the mutations that contributed to separation into known clades and their subgroups. After introducing the detailed methodological strategies, we explain BLSOMs for various topics, such as the tetranucleotide BLSOM for over 5 million 5-kb fragment sequences derived from almost all microorganisms currently available and its use in metagenome studies. We also explain BLSOMs for various eukaryotes, including fishes, frogs and Drosophila species, and found a high separation ability among closely related species. When analyzing the human genome, we found enrichments in transcription factor-binding sequences in centromeric and pericentromeric heterochromatin regions. 
The tDNAs (tRNA genes) could be separated according to their corresponding amino acid.
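The idea of batch learning in a self-organizing map can be sketched as follows. This is a generic batch SOM written for illustration, not the published BLSOM: the initialization from evenly spaced inputs, the Gaussian neighborhood, its linear shrinking schedule, and all function names are assumptions, and the actual method may differ in each of these respects.

```python
import math


def best_matching_node(x, weights):
    """Index of the node whose weight vector is closest to x (Euclidean)."""
    return min(range(len(weights)), key=lambda n: math.dist(x, weights[n]))


def train_batch_som(data, rows, cols, epochs=20):
    """Train a rows x cols map; returns one weight vector per node."""
    n_nodes, dim = rows * cols, len(data[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumed deterministic initialization: seed nodes with evenly spaced inputs.
    weights = [list(data[(n * len(data)) // n_nodes]) for n in range(n_nodes)]
    sigma0 = max(rows, cols) / 2
    for epoch in range(epochs):
        sigma = max(sigma0 * (1 - epoch / epochs), 0.5)
        # Step 1: assign all inputs using the weights fixed for this epoch.
        bmus = [best_matching_node(x, weights) for x in data]
        # Step 2: update every node at once as a neighborhood-weighted mean.
        updated = []
        for n in range(n_nodes):
            num, den = [0.0] * dim, 0.0
            for x, b in zip(data, bmus):
                d2 = (coords[n][0] - coords[b][0]) ** 2 + (coords[n][1] - coords[b][1]) ** 2
                h = math.exp(-d2 / (2 * sigma * sigma))
                den += h
                for d in range(dim):
                    num[d] += h * x[d]
            updated.append([v / den for v in num] if den > 0 else weights[n])
        weights = updated
    return weights
```

Because all assignments in an epoch are made before any weight changes, the result of this sketch does not depend on the order in which the input vectors are presented, which is the main practical difference from an online SOM update.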
Affiliation(s)
- Yuki Iwasaki
- Faculty of Bioscience, Nagahama Institute of Bio-Science and Technology
- Kennosuke Wada
- Faculty of Bioscience, Nagahama Institute of Bio-Science and Technology
- Yoshiko Wada
- Faculty of Bioscience, Nagahama Institute of Bio-Science and Technology
- Takashi Abe
- Department of Information Engineering, Faculty of Engineering, Niigata University
|
8
|
Katsura Y, Ikemura T, Kajitani R, Toyoda A, Itoh T, Ogata M, Miura I, Wada K, Wada Y, Satta Y. Comparative genomics of Glandirana rugosa using unsupervised AI reveals a high CG frequency. Life Sci Alliance 2021; 4:4/5/e202000905. [PMID: 33712508 PMCID: PMC7994367 DOI: 10.26508/lsa.202000905] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/09/2020] [Revised: 02/16/2021] [Accepted: 02/17/2021] [Indexed: 11/30/2022]
Abstract
Genome sequencing of a unique frog (Glandirana rugosa) having XY/ZW systems within the species and comparative genomics with six other frogs were performed using a batch-learning self-organizing map, which is an unsupervised AI for oligonucleotide compositions, to clarify its genome characteristics. The Japanese wrinkled frog (Glandirana rugosa) is unique in having both XX-XY and ZZ-ZW types of sex chromosomes within the species. Genome sequencing and comparative genomics with other frogs should be important for understanding the mechanisms of sex chromosome turnover within one species or during a short period. In this study, we analyzed the newly sequenced genome of G. rugosa using a batch-learning self-organizing map, which is an unsupervised artificial intelligence for oligonucleotide compositions. To clarify the genome characteristics of G. rugosa, we compared its short oligonucleotide compositions in all 1-Mb genomic fragments with those of six other frog species (Pyxicephalus adspersus, Rhinella marina, Spea multiplicata, Leptobrachium leishanense, Xenopus laevis, and Xenopus tropicalis). In G. rugosa, we found Mb-level repeat sequences having a high identity with the W chromosome of the African bullfrog (P. adspersus). Our study concluded that G. rugosa has unique genome characteristics with a high CG frequency, and its genome is assumed to heterochromatinize a large portion of the genome via methylation of CG.
Affiliation(s)
- Yukako Katsura
- Primate Research Institute, Kyoto University, Inuyama-shi, Japan; Amphibian Research Center, Hiroshima University, Hiroshima-shi, Japan; Department of Evolutionary Studies of Biosystems, School of Advanced Sciences, The Graduate University For Advanced Studies (SOKENDAI), Shonankokuraimura, Hayama-machi, Japan
- Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Japan
- Rei Kajitani
- Department of Life Science and Technology, School of Life Science and Technology, Tokyo Institute of Technology, Tokyo-to, Japan
- Atsushi Toyoda
- Department of Genomics and Evolutionary Biology, National Institute of Genetics, Mishima-shi, Japan
- Takehiko Itoh
- Department of Life Science and Technology, School of Life Science and Technology, Tokyo Institute of Technology, Tokyo-to, Japan
- Ikuo Miura
- Amphibian Research Center, Hiroshima University, Hiroshima-shi, Japan
- Kennosuke Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Japan
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Japan
- Yoko Satta
- Department of Evolutionary Studies of Biosystems, School of Advanced Sciences, The Graduate University For Advanced Studies (SOKENDAI), Shonankokuraimura, Hayama-machi, Japan
|
9
|
Wada K, Wada Y, Ikemura T. Mb-level CpG and TFBS islands visualized by AI and their roles in the nuclear organization of the human genome. Genes Genet Syst 2020; 95:29-41. [PMID: 32161227 DOI: 10.1266/ggs.19-00027] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/23/2022]
Abstract
Unsupervised machine learning that can discover novel knowledge from big sequence data without prior knowledge or particular models is highly desirable for current genome study. We previously established a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions, which can reveal various novel genome characteristics from big sequence data, and found that transcription factor binding sequences (TFBSs) and CpG-containing oligonucleotides are enriched in human centromeric and pericentromeric regions, which support centromere clustering and form the condensed heterochromatin "chromocenter" in interphase nuclei. The number and size of chromocenters, as well as the type of centromeres gathered in individual chromocenters, vary depending on cell type. To study molecular mechanisms of cell type-dependent chromocenter formation, we analyzed distribution patterns of occurrence per Mb of hexa- and heptanucleotide TFBSs, which have been compiled by the SwissRegulon Portal, and of CpG-containing oligonucleotides. We found Mb-level islands enriched for TFBSs and CpG-containing oligonucleotides in centromeric and pericentromeric regions on all human chromosomes except chrY. Considering molecular mechanisms for cell type-dependent centromere clustering, the chromosome-dependent enrichment of a set of TFBSs and CpG-containing oligonucleotides is of particular interest, since the cellular content of TFs and methyl-CpG-binding proteins exhibits cell type-dependent regulation. A newly introduced BLSOM, which analyzed occurrences of a total of 3,946 octanucleotide TFBSs compiled by the SwissRegulon Portal, has self-organized (separated) the sequences that are characteristically enriched in TFBSs and shown that these sequences are derived primarily from centromeric and pericentromeric constitutive heterochromatin regions. Furthermore, the BLSOM identified and visualized characteristic TFBSs that are enriched in these regions. 
By analyzing Hi-C data for interchromosomal interactions, the present study showed that the chromatin segments supporting the interchromosomal interactions locate primarily in Mb-level TFBS and CpG islands and are thus enriched for a wide variety of TFBSs and CG-containing oligonucleotides.
Affiliation(s)
- Kennosuke Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology
- Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology
|
10
|
Wada Y, Iwasaki Y, Abe T, Wada K, Tooyama I, Ikemura T. CG-containing oligonucleotides and transcription factor-binding motifs are enriched in human pericentric regions. Genes Genet Syst 2016; 90:43-53. [PMID: 26119665 DOI: 10.1266/ggs.90.43] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/23/2022]
Abstract
Unsupervised data mining capable of extracting a wide range of information from big sequence data without prior knowledge or particular models is highly desirable in an era of big data accumulation for research on genes, genomes and genetic systems. By handling oligonucleotide compositions in genomic sequences as high-dimensional data, we have previously modified the conventional SOM (self-organizing map) for genome informatics and established BLSOM for oligonucleotide composition, which can analyze more than ten million sequences simultaneously and is thus suitable for big data analyses. Oligonucleotides often represent motif sequences responsible for sequence-specific binding of proteins such as transcription factors. The distribution of such functionally important oligonucleotides is probably biased in genomic sequences, and may differ among genomic regions. When constructing BLSOMs to analyze pentanucleotide composition in 50-kb sequences derived from the human genome in this study, we found that BLSOMs did not classify human sequences according to chromosome but revealed several specific zones, which are enriched for a class of CG-containing pentanucleotides; these zones are composed primarily of sequences derived from pericentric regions. The biological significance of enrichment of these pentanucleotides in pericentric regions is discussed in connection with cell type- and stage-dependent formation of the condensed heterochromatin in the chromocenter, which is formed through association of pericentric regions of multiple chromosomes.
Affiliation(s)
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology
|
11
|
Iwasaki Y, Abe T, Okada N, Wada K, Wada Y, Ikemura T. Evolutionary changes in vertebrate genome signatures with special focus on coelacanth. DNA Res 2014; 21:459-67. [PMID: 24800745 PMCID: PMC4195492 DOI: 10.1093/dnares/dsu012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]
Abstract
With a remarkable increase in genomic sequence data from a wide range of species, novel tools are needed for comprehensive analyses of big sequence data. The self-organizing map (SOM) is a powerful tool for clustering high-dimensional data on one plane. For oligonucleotide compositions handled as high-dimensional data, we have previously modified the conventional SOM for genome informatics: BLSOM. In the present study, we constructed BLSOMs for oligonucleotide compositions in fragment sequences (e.g. 100 kb) from a wide range of vertebrates, including coelacanth, and found that the sequences were clustered primarily according to species without species information. As one of the nearest living relatives of tetrapod ancestors, coelacanth is believed to provide access to the phenotypic and genomic transitions leading to the emergence of tetrapods. The characteristic oligonucleotide composition found for coelacanth was connected with the lowest dinucleotide CG occurrence (i.e. the highest CG suppression) among fishes, which was nearly equivalent to that of tetrapods. This evident CG suppression in coelacanth should reflect molecular evolutionary processes of epigenetic systems, including DNA methylation, during vertebrate evolution. The sequence of a de novo DNA methylase (Dnmt3a) of coelacanth was found to be more closely related to that of tetrapods than to that of other fishes.
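CG suppression is commonly quantified as the ratio of observed to expected CG dinucleotides. The sketch below uses the widely used observed/expected formula, N_CG * N / (N_C * N_G); whether the paper used exactly this measure is not stated in the abstract, so the choice of formula and the function name are assumptions.

```python
def cpg_obs_exp(seq):
    """Observed/expected CG ratio; values well below 1 indicate CG suppression."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    # "CG" cannot overlap itself, so str.count gives the exact number of sites.
    n_cg = seq.count("CG")
    if n_c == 0 or n_g == 0:
        return 0.0  # assumed convention when the expectation is undefined
    return n_cg * len(seq) / (n_c * n_g)
```

Applied per fragment (e.g. per 100-kb window), the ratio gives a simple one-dimensional view of the CG component of an oligonucleotide composition.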
Affiliation(s)
- Yuki Iwasaki
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga 526-0829, Japan
- Takashi Abe
- Department of Information Engineering, Faculty of Engineering, Institute of Science and Technology, Niigata University, Niigata-ken 950-2181, Japan
- Norihiro Okada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga 526-0829, Japan; Faculty of Bioscience and Biotechnology, Tokyo Institute of Technology, Yokohama, Kanagawa 226, Japan; Department of Life Sciences, National Cheng Kung University, Tainan 701, Taiwan
- Kennosuke Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga 526-0829, Japan
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga 526-0829, Japan
- Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga 526-0829, Japan
|
12
|
Laczny CC, Pinel N, Vlassis N, Wilmes P. Alignment-free visualization of metagenomic data by nonlinear dimension reduction. Sci Rep 2014; 4:4516. [PMID: 24682077 PMCID: PMC3970189 DOI: 10.1038/srep04516] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Received: 07/10/2013] [Accepted: 03/13/2014] [Indexed: 11/10/2022]
Abstract
The visualization of metagenomic data, especially without prior taxonomic identification of reconstructed genomic fragments, is a challenging problem in computational biology. An ideal visualization method should, among other things, enable clear distinction of congruent groups of sequences of closely related taxa, be applicable to fragments of lengths typically achievable following assembly, and allow the efficient analysis of the growing amounts of community genomic sequence data. Here, we report a scalable approach for the visualization of metagenomic data that is based on nonlinear dimension reduction via Barnes-Hut Stochastic Neighbor Embedding of centered log-ratio transformed oligonucleotide signatures extracted from assembled genomic sequence fragments. The approach allows for alignment-free assessment of the data-inherent taxonomic structure, and it can potentially facilitate the downstream binning of genomic fragments into uniform clusters reflecting organismal origin. We demonstrate the performance of our approach by visualizing community genomic sequence data from simulated as well as groundwater, human-derived and marine microbial communities.
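The centered log-ratio (CLR) transformation mentioned in this abstract maps a vector of oligonucleotide counts to log values centered on their mean, clr(x)_i = ln(x_i) - mean_j ln(x_j). A minimal Python sketch follows; the pseudocount used to avoid taking the logarithm of zero is an assumption of this example, and the function name is hypothetical.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of a vector of k-mer counts."""
    # Assumption: a pseudocount is added so that zero counts stay finite.
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]
```

The transformed components sum to zero, which removes the unit-sum constraint of compositional data before a distance-based embedding is applied.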
Affiliation(s)
- Cedric C Laczny
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Nicolás Pinel
- [1] Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg; [2] Institute for Systems Biology, Seattle, Washington, USA
- Nikos Vlassis
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Paul Wilmes
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
|
13
|
Visualization of genome signatures of eukaryote genomes by batch-learning self-organizing map with a special emphasis on Drosophila genomes. BIOMED RESEARCH INTERNATIONAL 2014; 2014:985706. [PMID: 24741568 PMCID: PMC3967822 DOI: 10.1155/2014/985706] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 11/01/2013] [Accepted: 02/04/2014] [Indexed: 11/24/2022]
Abstract
A strategy of evolutionary studies that can compare vast numbers of genome sequences is becoming increasingly important with the remarkable progress of high-throughput DNA sequencing methods. We previously established a sequence alignment-free clustering method “BLSOM” for di-, tri-, and tetranucleotide compositions in genome sequences, which can characterize sequence characteristics (genome signatures) of a wide range of species. In the present study, we generated BLSOMs for tetra- and pentanucleotide compositions in approximately one million sequence fragments derived from 101 eukaryotes, for which almost complete genome sequences were available. BLSOM recognized phylotype-specific characteristics (e.g., key combinations of oligonucleotide frequencies) in the genome sequences, permitting phylotype-specific clustering of the sequences without any information regarding the species. In our detailed examination of 12 Drosophila species, the correlation between their phylogenetic classification and the classification on the BLSOMs was observed to visualize oligonucleotides diagnostic for species-specific clustering.
|
14
|
Iwasaki Y, Abe T, Wada K, Wada Y, Ikemura T. A Novel Bioinformatics Strategy to Analyze Microbial Big Sequence Data for Efficient Knowledge Discovery: Batch-Learning Self-Organizing Map (BLSOM). Microorganisms 2013; 1:137-157. [PMID: 27694768 PMCID: PMC5029494 DOI: 10.3390/microorganisms1010137] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 09/26/2013] [Revised: 11/05/2013] [Accepted: 11/08/2013] [Indexed: 11/24/2022]
Abstract
With the remarkable increase of genomic sequence data of microorganisms, novel tools are needed for comprehensive analyses of the big sequence data available. The self-organizing map (SOM) is an effective tool for clustering and visualizing high-dimensional data, such as oligonucleotide composition on one map. By modifying the conventional SOM, we developed batch-learning SOM (BLSOM), which allowed classification of sequence fragments (e.g., 1 kb) according to phylotypes, solely depending on oligonucleotide composition. Metagenomics studies of uncultivable microorganisms in clinical and environmental samples should allow extensive surveys of genes important in life sciences. BLSOM is most suitable for phylogenetic assignment of metagenomic sequences, because fragmental sequences can be clustered according to phylotypes, solely depending on oligonucleotide composition. We first constructed oligonucleotide BLSOMs for all available sequences from genomes of known species, and by mapping metagenomic sequences on these large-scale BLSOMs, we can predict phylotypes of individual metagenomic sequences, revealing a microbial community structure of uncultured microorganisms, including viruses. BLSOM has shown that influenza viruses isolated from humans and birds clearly differ in oligonucleotide composition. Based on this host-dependent oligonucleotide composition, we have proposed strategies for predicting directional changes of virus sequences and for surveilling potentially hazardous strains when introduced into humans from non-human sources.
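To make the "batch-learning" idea concrete, the sketch below shows one epoch of a generic batch SOM update on a one-dimensional lattice: every input is assigned to its best-matching node using the weights from the start of the epoch, and only then are all nodes updated together, so the result does not depend on the order of the inputs. This is a simplified textbook-style batch SOM and should not be read as the BLSOM implementation described in the paper; the function name `batch_som_epoch`, the Gaussian neighborhood, the 1-D lattice, the hand-picked initial weights, and the toy data are assumptions of this sketch.

```python
import math


def batch_som_epoch(weights, data, sigma):
    """Run one batch epoch on a 1-D lattice of nodes.

    weights: list of node vectors; data: list of input vectors (same dimension).
    Returns the updated weights and the best-matching node index of each input.
    """
    n_nodes, dim = len(weights), len(weights[0])
    # Step 1: best-matching unit (smallest squared Euclidean distance) per input,
    # computed against the weights as they were at the start of the epoch.
    bmus = [
        min(range(n_nodes),
            key=lambda j: sum((x[d] - weights[j][d]) ** 2 for d in range(dim)))
        for x in data
    ]
    # Step 2: each node becomes the neighborhood-weighted mean of all inputs.
    new_weights = []
    for j in range(n_nodes):
        num = [0.0] * dim
        den = 0.0
        for x, b in zip(data, bmus):
            h = math.exp(-((j - b) ** 2) / (2.0 * sigma ** 2))  # Gaussian neighborhood
            den += h
            for d in range(dim):
                num[d] += h * x[d]
        new_weights.append([v / den for v in num])
    return new_weights, bmus


# Toy 2-D inputs standing in for oligonucleotide-composition vectors.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
for sigma in (1.0, 0.7, 0.5):  # shrink the neighborhood over epochs
    weights, bmus = batch_som_epoch(weights, data, sigma)
```

In practice the inputs would be high-dimensional composition vectors and the lattice two-dimensional; the structure of the update stays the same.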
Affiliation(s)
- Yuki Iwasaki
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
  - Japan Society for the Promotion of Science, Chiyoda-ku, Tokyo 102-0083, Japan.
- Takashi Abe
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
  - Department of Information Engineering, Faculty of Engineering, Niigata University, Niigata-ken 950-2181, Japan.
- Kennosuke Wada
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
- Yoshiko Wada
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
  - Faculty of Medicine, Shiga University of Medical Science, Shiga-ken 520-2121, Japan.
- Toshimichi Ikemura
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken 526-0829, Japan.
|
15
|
Iwasaki Y, Wada K, Wada Y, Abe T, Ikemura T. Notable clustering of transcription-factor-binding motifs in human pericentric regions and its biological significance. Chromosome Res 2013; 21:461-74. [PMID: 23896648 PMCID: PMC3761090 DOI: 10.1007/s10577-013-9371-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 05/10/2013] [Revised: 06/14/2013] [Accepted: 06/14/2013] [Indexed: 11/29/2022]
Abstract
Since oligonucleotide composition in the genome sequence varies significantly among species, even among those possessing the same genome G + C%, the composition has been used to distinguish a wide range of genomes and is called the “genome signature”. Oligonucleotides often represent motif sequences responsible for sequence-specific protein binding (e.g., transcription-factor binding). Occurrences of such motif oligonucleotides in the genome should be biased compared to those observed in random sequences and may differ among genomes and genomic portions. The Self-Organizing Map (SOM) is a powerful tool for clustering high-dimensional data such as oligonucleotide composition on one plane. We previously modified the conventional SOM for genome informatics to a batch-learning SOM, or “BLSOM”. When we constructed BLSOMs to analyze pentanucleotide composition in 20-, 50-, and 100-kb sequences derived from the human genome, BLSOMs did not classify human sequences according to chromosome but revealed several specific zones composed primarily of sequences derived from pericentric regions. Interestingly, various transcription-factor-binding motifs were characteristically overrepresented in pericentric regions but underrepresented in most genomic sequences. When we focused on much shorter sequences (e.g., 1 kb), the clustering of transcription-factor-binding motifs was evident in pericentric, subtelomeric and sex chromosome pseudoautosomal regions. The biological significance of the clustering in these regions was discussed in connection with cell-type and -stage-dependent chromocenter formation and nuclear organization.
Affiliation(s)
- Yuki Iwasaki
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken, 526-0829, Japan
|
16
|
Alsop EB, Raymond J. Resolving prokaryotic taxonomy without rRNA: longer oligonucleotide word lengths improve genome and metagenome taxonomic classification. PLoS One 2013; 8:e67337. [PMID: 23840870 PMCID: PMC3698125 DOI: 10.1371/journal.pone.0067337] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 01/28/2013] [Accepted: 05/16/2013] [Indexed: 11/19/2022]
Abstract
Oligonucleotide signatures, especially tetranucleotide signatures, have been used as a method for homology binning by exploiting an organism’s inherent biases towards the use of specific oligonucleotide words. Tetranucleotide signatures have been especially useful in environmental metagenomics samples, as many of these samples contain organisms from poorly classified phyla which cannot be easily identified using traditional homology methods, including NCBI BLAST. This study examines oligonucleotide signatures across 1,424 completed genomes from across the tree of life, substantially expanding upon previous work. A comprehensive analysis of mononucleotide through nonanucleotide word lengths suggests that longer word lengths substantially improve the classification of DNA fragments across a range of sizes of relevance to high throughput sequencing. We find that, at present, heptanucleotide signatures represent an optimal balance between prediction accuracy and computational time for resolving taxonomy using both genomic and metagenomic fragments. We directly compare the ability of tetranucleotide and heptanucleotide word lengths (tetranucleotide signatures are the current standard for oligonucleotide word usage analyses) for taxonomic binning of metagenome reads. We present evidence that heptanucleotide word lengths consistently provide more taxonomic resolving power, particularly in distinguishing between closely related organisms that are often present in metagenomic samples. This implies that longer oligonucleotide word lengths should replace tetranucleotide signatures for most analyses. Finally, we show that the application of longer word lengths to metagenomic datasets leads to more accurate taxonomic binning of DNA scaffolds and has the potential to substantially improve taxonomic assignment and assembly of metagenomic data.
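The trade-off between word length and computational cost mentioned in this abstract follows from simple arithmetic: there are 4^k possible DNA words of length k, so the signature grows from 256 components for tetranucleotides to 16,384 for heptanucleotides and 262,144 for nonanucleotides. The sketch below works this out and also estimates how thinly a fragment's words are spread across the signature. The helper names and the 10 kb fragment length are assumptions chosen for illustration, and the count ignores any merging of reverse-complementary words.

```python
def signature_size(k):
    """Number of distinct DNA words of length k over A, C, G, T."""
    return 4 ** k


def mean_count_per_word(fragment_len, k):
    """Average occurrences per word if a fragment's overlapping k-mers were spread evenly."""
    windows = max(fragment_len - k + 1, 0)
    return windows / signature_size(k)


# Hypothetical 10 kb fragment: longer words give a larger, sparser signature.
for k in (4, 7, 9):
    print(k, signature_size(k), round(mean_count_per_word(10_000, k), 3))
```

For short fragments and long words, most components of the signature are zero, which is one reason the best word length can depend on fragment size.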
Affiliation(s)
- Eric B Alsop
  - School of Earth and Space Exploration, Arizona State University, Tempe, Arizona, United States of America.
|
17
|
Mitsumori M, Nakagawa S, Matsui H, Shinkai T, Takenaka A. Phylogenetic diversity of gene sequences isolated from the rumen as analysed using a self-organizing map (SOM). J Appl Microbiol 2010; 109:763-70. [PMID: 20233261 DOI: 10.1111/j.1365-2672.2010.04703.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 12/01/2022]
Abstract
AIMS To determine the origins of DNA sequences isolated from the rumen microbial ecosystem using a self-organizing map (SOM). METHODS AND RESULTS DNA sequences other than 16S small subunit ribosomal RNA (SSU rRNA) gene sequences that were detected from the rumen were analysed by the SOM method reported by Abe et al. [2000, Self-Organizing Map (SOM) unveils and visualizes hidden sequence characteristics of a wide range of eukaryote genomes. Gene 365, 27-34]. Because query sequences positioned by SOM were scattered on the master drawing of SOM, it was suggested that many DNA sequences isolated from the rumen were collected from a broad range of micro-organisms. Although the results obtained by SOM were similar to those obtained by the neighbour-joining (NJ) method, SOM was able to presume the phylotypes of the query sequences without information about the 16S SSU rRNA gene sequences and homology searches, and to reveal existence of novel micro-organisms deduced to be cellulolytic bacteria, archaea and methanotrophic bacterium. CONCLUSIONS As the SOM method defined phylotypes of unreported rumen micro-organisms, it is presumed that these phylotypes would be involved in rumen fermentation in cooperation with known rumen micro-organisms. Moreover, it is demonstrated that SOM is a useful tool for affiliating DNA sequences, which have no matches in databases. SIGNIFICANCE AND IMPACT OF STUDY Through SOM analysis, a better means of identifying rumen micro-organisms and estimating their roles in rumen function was provided.
Affiliation(s)
- M Mitsumori
  - National Institute of Livestock and Grassland Science, Tsukuba, Ibaraki, Japan.
|
18
|
Kato S, Watanabe K. Ecological and Evolutionary Interactions in Syntrophic Methanogenic Consortia. Microbes Environ 2010; 25:145-51. [DOI: 10.1264/jsme2.me10122] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.1] [Indexed: 11/12/2022]
Affiliation(s)
- Kazuya Watanabe
  - Hashimoto Light Energy Conversion Project, ERATO, JST
  - Research Center for Advanced Science and Technology, University of Tokyo
|
19
|
Abe T, Kanaya S, Uehara H, Ikemura T. A novel bioinformatics strategy for function prediction of poorly-characterized protein genes obtained from metagenome analyses. DNA Res 2009; 16:287-97. [PMID: 19801558 PMCID: PMC2762413 DOI: 10.1093/dnares/dsp018] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022]
Abstract
As a result of remarkable progress in DNA sequencing technology, vast quantities of genomic sequences have been decoded. Homology search for amino acid sequences, such as BLAST, has become a basic tool for assigning functions of genes/proteins when genomic sequences are decoded. Although the homology search has clearly been a powerful and irreplaceable method, the functions of only 50% or fewer of genes can be predicted when a novel genome is decoded. A prediction method independent of the homology search is urgently needed. By analyzing oligonucleotide compositions in genomic sequences, we previously developed a modified Self-Organizing Map ‘BLSOM’ that clustered genomic fragments according to phylotype with no advance knowledge of phylotype. Using BLSOM for di-, tri- and tetrapeptide compositions, we developed a system to enable separation (self-organization) of proteins by function. Analyzing oligopeptide frequencies in proteins previously classified into COGs (clusters of orthologous groups of proteins), BLSOMs could faithfully reproduce the COG classifications. This indicated that proteins whose functions are unknown because of a lack of significant sequence similarity with function-known proteins can be related to function-known proteins based on similarity in oligopeptide composition. BLSOM was applied to predict functions of vast quantities of proteins derived from mixed genomes in environmental samples.
Affiliation(s)
- Takashi Abe
  - Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Shiga-ken, Japan.
|
20
|
Tzahor S, Man-Aharonovich D, Kirkup BC, Yogev T, Berman-Frank I, Polz MF, Béjà O, Mandel-Gutfreund Y. A supervised learning approach for taxonomic classification of core-photosystem-II genes and transcripts in the marine environment. BMC Genomics 2009; 10:229. [PMID: 19445709 PMCID: PMC2696472 DOI: 10.1186/1471-2164-10-229] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 09/18/2008] [Accepted: 05/16/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND Cyanobacteria of the genera Synechococcus and Prochlorococcus play a key role in marine photosynthesis, which contributes to the global carbon cycle and to the world oxygen supply. Recently, genes encoding the photosystem II reaction center (psbA and psbD) were found in cyanophage genomes. This phenomenon suggested that the horizontal transfer of these genes may be involved in increasing phage fitness. To date, a very small percentage of marine bacteria and phages has been cultured. Thus, mapping genomic data extracted directly from the environment to its taxonomic origin is necessary for a better understanding of phage-host relationships and dynamics. RESULTS To achieve an accurate and rapid taxonomic classification, we employed a computational approach combining a multi-class Support Vector Machine (SVM) with a codon usage position specific scoring matrix (cuPSSM). Our method has been applied successfully to classify core-photosystem-II gene fragments, including partial sequences coming directly from the ocean, to seven different taxonomic classes. Applying the method on a large set of DNA and RNA psbA clones from the Mediterranean Sea, we studied the distribution of cyanobacterial psbA genes and transcripts in their natural environment. Using our approach, we were able to simultaneously examine taxonomic and ecological distributions in the marine environment. CONCLUSION The ability to accurately classify the origin of individual genes and transcripts coming directly from the environment is of great importance in studying marine ecology. The classification method presented in this paper could be applied further to classify other genes amplified from the environment, for which training data is available.
Affiliation(s)
- Shani Tzahor
  - Faculty of Biology, Technion – Israel Institute of Technology, Haifa 32000, Israel
  - Inter-Departmental Program for Biotechnology, Technion – Israel Institute of Technology, Haifa 32000, Israel
- Benjamin C Kirkup
  - Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Tali Yogev
  - Faculty of Life Sciences, Bar-Ilan University, Ramat Gan 52900, Israel
- Martin F Polz
  - Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Oded Béjà
  - Faculty of Biology, Technion – Israel Institute of Technology, Haifa 32000, Israel
|
22
|
Classification and regression tree (CART) analyses of genomic signatures reveal sets of tetramers that discriminate temperature optima of archaea and bacteria. ARCHAEA-AN INTERNATIONAL MICROBIOLOGICAL JOURNAL 2009; 2:159-67. [PMID: 19054742 DOI: 10.1155/2008/829730] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
Abstract
Classification and regression tree (CART) analysis was applied to genome-wide tetranucleotide frequencies (genomic signatures) of 195 archaea and bacteria. Although genomic signatures have typically been used to classify evolutionary divergence, in this study, convergent evolution was the focus. Temperature optima for most of the organisms examined could be distinguished by CART analyses of tetranucleotide frequencies. This suggests that pervasive (nonlinear) qualities of genomes may reflect certain environmental conditions (such as temperature) in which those genomes evolved. The predominant use of GAGA and AGGA as the discriminating tetramers in CART models suggests that purine-loading and codon biases of thermophiles may explain some of the results.
|
23
|
Batch-Learning Self-Organizing Map for Predicting Functions of Poorly-Characterized Proteins Massively Accumulated. 2009. [DOI: 10.1007/978-3-642-02397-2_1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1]
|
24
|
Using growing self-organising maps to improve the binning process in environmental whole-genome shotgun sequencing. J Biomed Biotechnol 2008; 2008:513701. [PMID: 18288261 PMCID: PMC2235928 DOI: 10.1155/2008/513701] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Received: 08/31/2007] [Accepted: 11/18/2007] [Indexed: 11/17/2022]
Abstract
Metagenomic projects using whole-genome shotgun (WGS) sequencing produce many unassembled DNA sequences and small contigs. The step of clustering these sequences, based on biological and molecular features, is called binning. A reported strategy for binning that combines oligonucleotide frequency and self-organising maps (SOM) shows high potential. We improve this strategy by identifying suitable training features, implementing a better clustering algorithm, and defining quantitative measures for assessing results. We investigated the suitability of each of di-, tri-, tetra-, and pentanucleotide frequencies. The results show that dinucleotide frequency is not a sufficiently strong signature for binning 10 kb long DNA sequences, compared to the other three. Furthermore, we observed that increased order of oligonucleotide frequency may deteriorate the assignment result in some cases, which indicates the possible existence of an optimal species-specific oligonucleotide frequency. We replaced SOM with a growing self-organising map (GSOM), with which comparable results are obtained while gaining a 7%-15% speed improvement.
|
25
|
Kosaka T, Kato S, Shimoyama T, Ishii S, Abe T, Watanabe K. The genome of Pelotomaculum thermopropionicum reveals niche-associated evolution in anaerobic microbiota. Genome Res 2008; 18:442-8. [PMID: 18218977 DOI: 10.1101/gr.7136508] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.7] [Indexed: 11/24/2022]
Abstract
The anaerobic biodegradation of organic matter is accomplished by sequential syntrophic catabolism by microbes in different niches. Pelotomaculum thermopropionicum is a representative syntrophic bacterium that catalyzes the intermediate bottleneck step in the anaerobic-biodegradation process, whereby volatile fatty acids (VFAs) and alcohols produced by upstream fermenting bacteria are converted to acetate, hydrogen, and carbon dioxide (substrates for downstream methanogenic archaea). To reveal genomic features that contribute to our understanding of the ecological niche and evolution of P. thermopropionicum, we sequenced its 3,025,375-bp genome and performed comparative analyses with genomes of other community members available in the databases. In the genome, 2920 coding sequences (CDSs) were identified. These CDSs showed a distinct distribution pattern in the functional categories of the Clusters of Orthologous Groups database, which is considered to reflect the niche of this organism. P. thermopropionicum has simple catabolic pathways, in which the propionate-oxidizing methylmalonyl-CoA pathway constitutes the backbone and is linked to several peripheral pathways. Genes for most of the important catabolic enzymes are physically linked to those for PAS-domain-containing regulators, suggesting that the catabolic pathways are regulated in response to environmental conditions and/or global cellular situations rather than specific substrates. Comparative analyses of codon usages revealed close evolutionary relationships between P. thermopropionicum and other niche members, while it was distant from phylogenetically related sugar-fermenting bacteria. These analyses suggest that P. thermopropionicum has evolved as a syntrophy specialist by interacting with niche-associated microbes.
Affiliation(s)
- Tomoyuki Kosaka
  - Laboratory of Applied Microbiology, Marine Biotechnology Institute, Kamaishi, Iwate 026-0001, Japan
|