1
Azziz A, Edely M, Liu Q, Majdinasab M, Arib C, Xiang Y, Fu W, de la Chapelle ML. Study of the DNA structure and orientation using SERS: Influence of the hybridisation and mismatches. Int J Biol Macromol 2025; 307:141859. [PMID: 40058423 DOI: 10.1016/j.ijbiomac.2025.141859] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2024] [Revised: 02/25/2025] [Accepted: 03/06/2025] [Indexed: 05/07/2025]
Abstract
In this article we study the structure and the orientation of DNA strands by Surface Enhanced Raman Scattering (SERS). We study the influence of two parameters on the structure of strands containing 20 adenines: hybridisation with the complementary strand and the presence of a mismatch within the sequence. By varying the concentration of complementary strands, we show that hybridisation induces a change in strand orientation and a loss of flexibility, indicating that the formation of the double helix freezes the conformation of the DNA strand. The introduction of a mismatch has the same effects on strand orientation and flexibility but also induces hybridisation defects in the formation of the double helix. We therefore highlight the presence of non-hybridised adenine bases, this effect being all the more visible when the mismatch is close to the centre of the strand. We also highlight spectral markers of these structural changes and of the evolution of hybridisation. For example, we observe a shift of the main adenine band from 734 to 747 cm⁻¹, indicating a reorientation of the base during hybridisation from a configuration perpendicular to the surface to one parallel to it.
Affiliation(s)
- Aicha Azziz
  - Institut des Molécules et Matériaux du Mans (IMMM UMR 6283 CNRS), Le Mans Université, Avenue Olivier Messiaen, CEDEX 9, 72085 Le Mans, France
- Mathieu Edely
  - Institut des Molécules et Matériaux du Mans (IMMM UMR 6283 CNRS), Le Mans Université, Avenue Olivier Messiaen, CEDEX 9, 72085 Le Mans, France
- Qiqian Liu
  - Institut des Molécules et Matériaux du Mans (IMMM UMR 6283 CNRS), Le Mans Université, Avenue Olivier Messiaen, CEDEX 9, 72085 Le Mans, France
- Marjan Majdinasab
  - Institut des Molécules et Matériaux du Mans (IMMM UMR 6283 CNRS), Le Mans Université, Avenue Olivier Messiaen, CEDEX 9, 72085 Le Mans, France; Department of Food Science & Technology, School of Agriculture, Shiraz University, Shiraz, Iran
- Celia Arib
  - Institut des Molécules et Matériaux du Mans (IMMM UMR 6283 CNRS), Le Mans Université, Avenue Olivier Messiaen, CEDEX 9, 72085 Le Mans, France
- Yang Xiang
  - Department of Laboratory Medicine, Southwest Hospital, Army Medical University (Third Military Medical University), Chongqing 400038, China
- Weiling Fu
  - Department of Laboratory Medicine, Southwest Hospital, Army Medical University (Third Military Medical University), Chongqing 400038, China
- Marc Lamy de la Chapelle
  - Institut des Molécules et Matériaux du Mans (IMMM UMR 6283 CNRS), Le Mans Université, Avenue Olivier Messiaen, CEDEX 9, 72085 Le Mans, France; Department of Laboratory Medicine, Southwest Hospital, Army Medical University (Third Military Medical University), Chongqing 400038, China; Nanobiophotonics and Laser Microspectroscopy Center, Interdisciplinary Research Institute in Bio-Nano-Sciences, Babes-Bolyai University, Cluj-Napoca, Romania.
2
Yu T, Cheng L, Khalitov R, Olsson EB, Yang Z. Self-distillation improves self-supervised learning for DNA sequence inference. Neural Netw 2025; 183:106978. [PMID: 39667220 DOI: 10.1016/j.neunet.2024.106978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Revised: 10/28/2024] [Accepted: 11/26/2024] [Indexed: 12/14/2024]
Abstract
Self-supervised Learning (SSL) has been recognized as a method to enhance prediction accuracy in various downstream tasks. However, its efficacy for DNA sequences remains somewhat constrained. This limitation stems primarily from the fact that most existing SSL approaches in genomics focus on masked language modeling of individual sequences, neglecting the crucial aspect of encoding statistics across multiple sequences. To overcome this challenge, we introduce an innovative deep neural network model, which incorporates collaborative learning between a 'student' and a 'teacher' subnetwork. In this model, the student subnetwork employs masked learning on nucleotides and progressively adapts its parameters to the teacher subnetwork through an exponential moving average approach. Concurrently, both subnetworks engage in contrastive learning, deriving insights from two augmented representations of the input sequences. This self-distillation process enables our model to effectively assimilate both contextual information from individual sequences and distributional data across the sequence population. We validated our approach with preliminary pretraining using the human reference genome, followed by applying it to 20 downstream inference tasks. The empirical results from these experiments demonstrate that our novel method significantly boosts inference performance across the majority of these tasks. Our code is available at https://github.com/wiedersehne/FinDNA.
Affiliation(s)
- Tong Yu
  - Norwegian University of Science and Technology, Trondheim, Norway.
- Lei Cheng
  - Norwegian University of Science and Technology, Trondheim, Norway
- Ruslan Khalitov
  - Norwegian University of Science and Technology, Trondheim, Norway
- Erland B Olsson
  - Norwegian University of Science and Technology, Trondheim, Norway
- Zhirong Yang
  - Norwegian University of Science and Technology, Trondheim, Norway
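The exponential-moving-average coupling between the student and teacher subnetworks mentioned in the abstract of this entry can be illustrated with a minimal sketch. It assumes the common self-distillation convention in which the teacher's parameters track the student's; the abstract alone does not settle the update direction. The function name `ema_update`, the `decay` value, and the toy parameter lists are hypothetical and are not taken from the FinDNA code.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], decay: float = 0.99) -> List[float]:
    """Return new teacher parameters as an exponential moving average.

    Assumed rule: teacher <- decay * teacher + (1 - decay) * student.
    """
    if len(teacher) != len(student):
        raise ValueError("parameter vectors must have the same length")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy illustration: the teacher drifts towards a fixed student.
teacher = [0.0, 0.0]
student = [1.0, -2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
print(teacher)
```

A decay close to 1 makes the teacher change slowly, which is the usual reason for using an averaged copy as the distillation target.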
3
Arora P, Kumar S, Mukhopadhyay CS, Kaur S. Codon usage analysis in selected virulence genes of Staphylococcal species. Curr Genet 2025; 71:5. [PMID: 39853506 DOI: 10.1007/s00294-025-01308-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2024] [Revised: 12/25/2024] [Accepted: 01/03/2025] [Indexed: 01/26/2025]
Abstract
The Staphylococcus genus, composed of Gram-positive bacteria, includes several pathogenic species such as Staphylococcus aureus, S. epidermidis, S. haemolyticus, and S. saprophyticus, each implicated in a range of infections. This study investigates the codon usage patterns in key virulence genes, including Autolysin (alt), Elastin Binding protein (EbpS), Lipase, Thermonuclease, Intercellular Adhesion Protein (IcaR), and V8 Protease, across four Staphylococcus species. Using metrics such as the Effective Number of Codons (ENc), Relative Synonymous Codon Usage (RSCU), Codon Adaptation Index (CAI), alongside neutrality and parity plots, we explored the codon preferences and nucleotide composition biases. Our findings revealed a pronounced AT-rich codon preference, with AT-rich genomes likely aiding in energy-efficient translation and bacterial survival in host environments. These insights provide a deeper understanding of the evolutionary adaptations and translational efficiency mechanisms that contribute to the pathogenicity of Staphylococcus species. This knowledge could pave the way for novel therapeutic interventions targeting codon usage to disrupt virulence gene expression.
Affiliation(s)
- Pinky Arora
  - School of Bioengineering and Biosciences, Lovely Professional University, Jalandhar-Delhi G.T. Road, Phagwara, Punjab, 144411, India
- Shubham Kumar
  - School of Pharmaceutical Sciences, Lovely Professional University, Jalandhar-Delhi G.T. Road, Phagwara, Punjab, 144411, India
- Chandra Shekhar Mukhopadhyay
  - Department of Bioinformatics, College of Animal Biotechnology, Guru Angad Dev Veterinary and Animal Sciences University, Ferozepur G.T. Road, Ludhiana, Punjab, 141004, India
- Sandeep Kaur
  - Department of Medical Laboratory Sciences, Lovely Professional University, Phagwara, 144411, Punjab, India.
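One of the metrics named in the abstract of this entry, Relative Synonymous Codon Usage (RSCU), is the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid. The sketch below is an illustration only and not the authors' pipeline; it assumes the standard genetic code, and the function name `rscu` and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product
from typing import Dict

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the standard amino-acid assignments apply to the genes being analysed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds: str) -> Dict[str, float]:
    """Compute RSCU for every sense codon whose amino acid occurs in `cds`."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group synonymous codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    result: Dict[str, float] = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or total == 0:
            continue  # skip stop codons and amino acids absent from the sequence
        for codon in family:
            # observed count / (total / family size)
            result[codon] = counts[codon] * len(family) / total
    return result


values = rscu("GCTGCTGCAGCC")  # four alanine codons: GCT x2, GCA, GCC
print(values["GCT"], values["GCA"], values["GCG"])
```

An RSCU above 1 marks a codon used more often than expected under uniform synonymous usage, which is how an AT-rich preference of the kind reported in the abstract would show up.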
4
Zhang Z. Laws of Genome Nucleotide Composition. Genomics Proteomics Bioinformatics 2024; 22:qzae061. [PMID: 39213341 PMCID: PMC11514846 DOI: 10.1093/gpbjnl/qzae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2024] [Revised: 07/12/2024] [Accepted: 08/22/2024] [Indexed: 09/04/2024]
Affiliation(s)
- Zhang Zhang
  - National Genomics Data Center, China National Center for Bioinformation, Beijing 100101, China
  - Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
5
Yang H, Liu S, Chen S, Lu P, Huang J, Sun L, Liu H. Novel 4-chlorophenoxyacetate dioxygenase-mediated phenoxyalkanoic acid herbicides initial catabolism in Cupriavidus sp. DL-D2. J Hazard Mater 2024; 478:135427. [PMID: 39116741 DOI: 10.1016/j.jhazmat.2024.135427] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Revised: 08/01/2024] [Accepted: 08/02/2024] [Indexed: 08/10/2024]
Abstract
Microbial metabolism is an important driving force for the elimination of 4-chlorophenoxyacetic acid residues in the environment. The α-ketoglutarate-dependent dioxygenase (TfdA) or 2,4-D oxygenase (CadAB) catalyzes the cleavage of the aryl ether bond of 4-chlorophenoxyacetic acid to 4-chlorophenol, which is one of the important pathways for the initial metabolism of 4-chlorophenoxyacetic acid by microorganisms. However, strain Cupriavidus sp. DL-D2 could utilize 4-chlorophenoxyacetic acid but not 4-chlorophenol for growth. This scarcely studied degradation pathway may involve novel enzymes that have not yet been characterized. Here, a gene cluster (designated cpd) responsible for the catabolism of 4-chlorophenoxyacetic acid in strain DL-D2 was cloned and identified, and the dioxygenase CpdA/CpdB responsible for the initial degradation of 4-chlorophenoxyacetic acid was successfully expressed and shown to catalyze the conversion of 4-chlorophenoxyacetic acid to 4-chlorocatechol. An aromatic ring cleavage enzyme, CpdC, then converts 4-chlorocatechol into 3-chloromuconate. Substrate degradation experiments showed that CpdA/CpdB could also degrade 3-chlorophenoxyacetic acid and phenoxyacetic acid, and homologous cpd gene clusters were found widely in microbial genomes. Our findings revealed a novel degradation mechanism of 4-chlorophenoxyacetic acid at the molecular level.
Affiliation(s)
- Hao Yang
  - The Anhui Provincial Key Laboratory of Biodiversity Conservation and Ecological Security in the Yangtze River Basin, Anhui Normal University, Wuhu 241000, Anhui, PR China; Anhui Provincial Key Laboratory of Molecular Enzymology and Mechanism of Major Metabolic Diseases, College of Life Sciences, Anhui Normal University, Wuhu 241000, Anhui, PR China
- Shiyan Liu
  - The Anhui Provincial Key Laboratory of Biodiversity Conservation and Ecological Security in the Yangtze River Basin, Anhui Normal University, Wuhu 241000, Anhui, PR China; Anhui Provincial Key Laboratory of Molecular Enzymology and Mechanism of Major Metabolic Diseases, College of Life Sciences, Anhui Normal University, Wuhu 241000, Anhui, PR China
- Sitong Chen
  - The Anhui Provincial Key Laboratory of Biodiversity Conservation and Ecological Security in the Yangtze River Basin, Anhui Normal University, Wuhu 241000, Anhui, PR China; Anhui Provincial Key Laboratory of Molecular Enzymology and Mechanism of Major Metabolic Diseases, College of Life Sciences, Anhui Normal University, Wuhu 241000, Anhui, PR China
- Peng Lu
  - The Anhui Provincial Key Laboratory of Biodiversity Conservation and Ecological Security in the Yangtze River Basin, Anhui Normal University, Wuhu 241000, Anhui, PR China; Anhui Provincial Key Laboratory of Molecular Enzymology and Mechanism of Major Metabolic Diseases, College of Life Sciences, Anhui Normal University, Wuhu 241000, Anhui, PR China
- Junwei Huang
  - College of Resources and Environment, Anhui Agricultural University, Anhui Provincial Key Laboratory of Hazardous Factors and Risk Control of Agri-food Quality Safety, Hefei 230036, PR China
- Lina Sun
  - Eco-Environmental Protection Research Institute, Shanghai Academy of Agricultural Sciences, Shanghai 201403, PR China.
- Hongming Liu
  - The Anhui Provincial Key Laboratory of Biodiversity Conservation and Ecological Security in the Yangtze River Basin, Anhui Normal University, Wuhu 241000, Anhui, PR China; Anhui Provincial Key Laboratory of Molecular Enzymology and Mechanism of Major Metabolic Diseases, College of Life Sciences, Anhui Normal University, Wuhu 241000, Anhui, PR China.
6
Chantzi N, Mareboina M, Konnaris MA, Montgomery A, Patsakis M, Mouratidis I, Georgakopoulos-Soares I. The determinants of the rarity of nucleic and peptide short sequences in nature. NAR Genom Bioinform 2024; 6:lqae029. [PMID: 38584871 PMCID: PMC10993293 DOI: 10.1093/nargab/lqae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 02/21/2024] [Accepted: 03/18/2024] [Indexed: 04/09/2024] [Open Access]
Abstract
The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.
Affiliation(s)
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Maxwell A Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
  - Department of Statistics, Penn State University, University Park, PA, 16802, USA
  - Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Michail Patsakis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
  - Huck Institutes of the Life Sciences, Penn State University, University Park, PA, 16802, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, 17033, USA
7
Elsherbini AMA, Elkholy AH, Fadel YM, Goussarov G, Elshal AM, El-Hadidi M, Mysara M. Utilizing genomic signatures to gain insights into the dynamics of SARS-CoV-2 through Machine and Deep Learning techniques. BMC Bioinformatics 2024; 25:131. [PMID: 38539073 PMCID: PMC10967124 DOI: 10.1186/s12859-024-05648-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 01/10/2024] [Indexed: 11/11/2024] [Open Access]
Abstract
The global spread of the SARS-CoV-2 pandemic, originating in Wuhan, China, has had profound consequences on both health and the economy. Traditional alignment-based phylogenetic tree methods for tracking epidemic dynamics demand substantial computational power due to the growing number of sequenced strains. Consequently, there is a pressing need for an alignment-free approach to characterize these strains and monitor the dynamics of various variants. In this work, we introduce a swift and straightforward tool named GenoSig, implemented in C++. The tool exploits the Di and Tri nucleotide frequency signatures to delineate the taxonomic lineages of SARS-CoV-2 by employing diverse machine learning (ML) and deep learning (DL) models. Our approach achieved a tenfold cross-validation accuracy of 87.88% (± 0.013) for DL and 86.37% (± 0.0009) for Random Forest (RF) model, surpassing the performance of other ML models. Validation using an additional unexposed dataset yielded comparable results. Despite variations in architectures between DL and RF, it was observed that later clades, specifically GRA, GRY, and GK, exhibited superior performance compared to earlier clades G and GH. As for the continental origin of the virus, both DL and RF models exhibited lower performance than in predicting clades. However, both models demonstrated relatively higher accuracy for Europe, North America, and South America compared to other continents, with DL outperforming RF. Both models consistently demonstrated a preference for cytosine and guanine over adenine and thymine in both clade and continental analyses, in both Di and Tri nucleotide frequencies signatures. Our findings suggest that GenoSig provides a straightforward approach to address taxonomic, epidemiological, and biological inquiries, utilizing a reductive method applicable not only to SARS-CoV-2 but also to similar research questions in an alignment-free context.
Affiliation(s)
- Ahmed M A Elsherbini
  - Bioinformatics Group, Center for Informatics Science, School of Information Technology and Computer Science, Nile University, Giza, Egypt
- Amr Hassan Elkholy
  - Bioinformatics Group, Center for Informatics Science, School of Information Technology and Computer Science, Nile University, Giza, Egypt
- Youssef M Fadel
  - Bioinformatics Group, Center for Informatics Science, School of Information Technology and Computer Science, Nile University, Giza, Egypt
- Gleb Goussarov
  - Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
- Ahmed Mohamed Elshal
  - Bioinformatics Group, Center for Informatics Science, School of Information Technology and Computer Science, Nile University, Giza, Egypt
- Mohamed El-Hadidi
  - Bioinformatics Group, Center for Informatics Science, School of Information Technology and Computer Science, Nile University, Giza, Egypt
- Mohamed Mysara
  - Bioinformatics Group, Center for Informatics Science, School of Information Technology and Computer Science, Nile University, Giza, Egypt.
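The di- and trinucleotide frequency signatures described in the abstract of this entry can be pictured as a fixed-length feature vector computed without any alignment. The following is a minimal Python sketch of that idea, not the GenoSig implementation (which the abstract says is written in C++); the function names `kmer_frequencies` and `signature` and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_frequencies(seq: str, k: int) -> List[float]:
    """Relative frequencies of all 4**k k-mers over A, C, G, T, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (e.g. N) are excluded from the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def signature(seq: str) -> List[float]:
    """Concatenate dinucleotide (16) and trinucleotide (64) frequencies."""
    return kmer_frequencies(seq, 2) + kmer_frequencies(seq, 3)


sig = signature("ACGTACGT")
print(len(sig))
```

A vector of this kind could then be passed to any classifier, for example a random forest or a small neural network, as features for clade prediction.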
8
Sawada Y, Minei R, Tabata H, Ikemura T, Wada K, Wada Y, Nagata H, Iwasaki Y. Unsupervised AI reveals insect species-specific genome signatures. PeerJ 2024; 12:e17025. [PMID: 38464746 PMCID: PMC10924456 DOI: 10.7717/peerj.17025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 02/07/2024] [Indexed: 03/12/2024] [Open Access]
Abstract
Insects are a highly diverse phylogeny and possess a wide variety of traits, including the presence or absence of wings and metamorphosis. These diverse traits are of great interest for studying genome evolution, and numerous comparative genomic studies have examined a wide phylogenetic range of insects. Here, we analyzed 22 insects belonging to a wide phylogenetic range (Endopterygota, Paraneoptera, Polyneoptera, Palaeoptera, and other insects) by using a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions in their genomic fragments (100-kb or 1-Mb sequences), which is an unsupervised machine learning algorithm that can extract species-specific characteristics of the oligonucleotide compositions (genome signatures). The genome signature is of particular interest in terms of the mechanisms and biological significance that have caused the species-specific difference, and can be used as a powerful search needle to explore the various roles of genome sequences other than protein coding, and can be used to unveil mysteries hidden in the genome sequence. Since BLSOM is an unsupervised clustering method, the clustering of sequences was performed based on the oligonucleotide composition alone, without providing information about the species from which each fragment sequence was derived. Therefore, not only the interspecies separation, but also the intraspecies separation can be achieved. Here, we have revealed the specific genomic regions with oligonucleotide compositions distinct from the usual sequences of each insect genome, e.g., Mb-level structures found for a grasshopper Schistocerca americana. One aim of this study was to compare the genome characteristics of insects with those of vertebrates, especially humans, which are phylogenetically distant from insects. Recently, humans seem to be the "model organism" for which a large amount of information has been accumulated using a variety of cutting-edge and high-throughput technologies. 
Therefore, it is reasonable to use the abundant information from humans to study insect lineages. The specific regions of Mb length with distinct oligonucleotide compositions have also been previously observed in the human genome. These regions were enriched in transcription factor binding sites (TFBSs) and hypothesized to be involved in the three-dimensional arrangement of chromosomal DNA in interphase nuclei. The present study characterized the species-specific oligonucleotide compositions (i.e., genome signatures) in insect genomes and identified specific genomic regions with distinct oligonucleotide compositions.
Affiliation(s)
- Yui Sawada
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Ryuhei Minei
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Hiromasa Tabata
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Toshimichi Ikemura
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Kennosuke Wada
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Yoshiko Wada
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Hiroshi Nagata
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
- Yuki Iwasaki
  - Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Nagahama-shi, Tamura-cho, Japan
9
Lu M, Wan W, Li Y, Li H, Sun B, Yu K, Zhao J, Franzo G, Su S. Codon usage bias analysis of the spike protein of human coronavirus 229E and its host adaptability. Int J Biol Macromol 2023; 253:127319. [PMID: 37820917 DOI: 10.1016/j.ijbiomac.2023.127319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Revised: 09/28/2023] [Accepted: 10/06/2023] [Indexed: 10/13/2023]
Abstract
Human coronavirus 229E (HCoV-229E) is one of the known coronaviruses capable of infecting humans and causes mild respiratory symptoms. It is also considered to have a zoonotic source, originating from animals and being transmitted to humans. In this study, a comprehensive phylogenetic and codon usage analysis of the spike (S) gene of HCoV-229E was conducted. Utilizing phylogenetic analysis and principal component analysis, HCoV-229E was categorized into four distinct clusters, each demonstrating unique host affiliations. Furthermore, it was observed that the codon usage bias within the S gene of HCoV-229E is relatively low, primarily influenced by natural selection patterns, with contributions from mutation pressure and dinucleotide abundance. Comparative analysis involving the Codon Adaptation Index (CAI) and the Relative Codon Deoptimization Index (RCDI) revealed that the codon usage pattern of HCoV-229E mirrors that of camels more closely than that of alpacas or humans. The codon usage pattern of HCoV-229E examined here offers valuable insights for a more comprehensive understanding of viral features, history, and evolutionary trajectory.
Affiliation(s)
- Meng Lu
  - Shanghai Institute of Infectious Disease and Biosecurity, School of Public Health, Fudan University, 131 Dong'an Road, Shanghai 200032, People's Republic of China
- Wenbo Wan
  - Shanghai Institute of Infectious Disease and Biosecurity, School of Public Health, Fudan University, 131 Dong'an Road, Shanghai 200032, People's Republic of China
- Yuxing Li
  - Shanghai Institute of Infectious Disease and Biosecurity, School of Public Health, Fudan University, 131 Dong'an Road, Shanghai 200032, People's Republic of China
- Haipeng Li
  - Shanghai Institute of Infectious Disease and Biosecurity, School of Public Health, Fudan University, 131 Dong'an Road, Shanghai 200032, People's Republic of China
- Bowen Sun
  - Shanghai Institute of Infectious Disease and Biosecurity, School of Public Health, Fudan University, 131 Dong'an Road, Shanghai 200032, People's Republic of China
- Kang Yu
  - Shanghai Institute of Infectious Disease and Biosecurity, School of Public Health, Fudan University, 131 Dong'an Road, Shanghai 200032, People's Republic of China
- Jin Zhao
  - Shanghai Institute of Infectious Disease and Biosecurity, School of Public Health, Fudan University, 131 Dong'an Road, Shanghai 200032, People's Republic of China
- Giovanni Franzo
  - Department of Animal Medicine, Production and Health (MAPS), University of Padua, Viale dell'Università 16, Legnaro 35020, PD, Italy
- Shuo Su
  - Shanghai Institute of Infectious Disease and Biosecurity, School of Public Health, Fudan University, 131 Dong'an Road, Shanghai 200032, People's Republic of China.
10
Lu Y, Wang W, Liu H, Li Y, Yan G, Franzo G, Dai J, He WT. Mutation and codon bias analysis of the spike protein of Omicron, the recent variant of SARS-CoV-2. Int J Biol Macromol 2023; 250:126080. [PMID: 37536405 DOI: 10.1016/j.ijbiomac.2023.126080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Accepted: 07/29/2023] [Indexed: 08/05/2023]
Abstract
The Omicron variant of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a heavily mutated virus designated as a variant of concern. To investigate the codon usage pattern of this new variant, we performed mutation and codon bias analysis for Omicron as well as for its sub-lineages BA.1 and BA.2 and compared them with the original SARS-CoV-2 and the Delta variant sequences obtained in this study. Our results indicate that the sub-lineages BA.1 and BA.2 have up to 23 sites of difference on the spike protein, which have minimal impact on function. The Omicron variant and its sub-lineages have similar codon usage patterns, and A/U-ending codons appear to be preferred over G/C-ending codons. Omicron has a lower degree of codon usage bias in spite of evidence that natural selection, mutation pressure and dinucleotide abundance shape its codon usage bias, with natural selection being more significant on BA.2 than on the other sub-lineages of Omicron. The codon usage pattern of the Omicron variant that we explored provides valid information for a clearer understanding of Omicron and its sub-lineages, which could find application in vaccine development and optimization.
Affiliation(s)
- Yunbiao Lu
  - School of Pharmacy, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, People's Republic of China
- Weixiu Wang
  - School of Pharmacy, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, People's Republic of China
- Hao Liu
  - School of Pharmacy, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, People's Republic of China
- Yue Li
  - School of Pharmacy, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, People's Republic of China
- Ge Yan
  - School of Pharmacy, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, People's Republic of China
- Giovanni Franzo
  - Department of Animal Medicine, Production and Health (MAPS), University of Padua, Viale dell'Università 16, Legnaro 35020, PD, Italy
- Jianjun Dai
  - School of Pharmacy, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, People's Republic of China.
- Wan-Ting He
  - School of Pharmacy, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, People's Republic of China.
11
Cohen D. General Designs Reveal Distinct Codes in Protein-Coding and Non-Coding Human DNA. Genes (Basel) 2022; 13:1970. [PMID: 36360206 PMCID: PMC9690640 DOI: 10.3390/genes13111970] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/21/2022] [Revised: 10/19/2022] [Accepted: 10/22/2022] [Indexed: 08/27/2023] [Open Access]
Abstract
This study seeks to investigate distinct signatures and codes within different genomic sequence locations of the human genome. The promoter and other non-coding regions contain sites for the binding of biological particles, for processes such as transcription regulation. The specific rules and sequence codes that govern this remain poorly understood. To derive these (codes), the general designs of sequence are investigated. Genomic signatures are a powerful tool for assessing the general designs of sequence, and cross-comparing different genomic regions for their distinct sequence properties. Through these genomic signatures, the relative non-random properties of sequences are also assessed. Furthermore, a binary components analysis is carried out making use of information theory ideas, to study the RY (purine/pyrimidine), WS (weak/strong) and KM (keto/amino) signatures in the sequences. From this comparison, it is possible to identify the relative importance of these properties within the various protein-coding and non-coding genomic locations. The results show that coding DNA has a strongly non-random WS signature, which reflects the genetic code, and the hydrogen-bond base pairing of codon-anti-codon interactions. In contrast, non-coding locations, such as the promoter, contain a distinct genomic signature. A prominent feature throughout non-coding DNA is a highly non-random RY signature, which is very different in nature to coding DNA, and suggests a structural-based RY code. This marks progress towards deciphering the unknown code(s) in non-protein-coding DNA, and a further understanding of the coding DNA. Additionally, it unravels how DNA carries information. These findings have implications for the most fundamental principles of biology, including knowledge of gene regulation, development and disease.
Affiliation(s)
- Dana Cohen
- Ronin Institute, 127 Haddon Pl, Montclair, NJ 07043-2314, USA
|
12
|
Abstract
The human genome carries a vast amount of information within its DNA sequences. The chemical bases A, T, C, and G are the basic units of information content, that are arranged into patterns and codes. Expansive areas of the genome contain codes that are not yet well understood. To decipher these, mathematical and computational tools are applied here to study genomic signatures or general designs of sequences. A novel binary components analysis is devised and utilized. This seeks to isolate the physical and chemical properties of DNA bases, which reveals sequence design and function. Here, information theory tools break down the information content within DNA bases, in order to study them in isolation for their genomic signatures and non-random properties. In this way, the RY (purine/pyrimidine), WS (weak/strong), and KM (keto/amino) general designs are observed in the sequences. The results show that RY, KM, and WS components have a similar and stable overall profile across all human chromosomes. It reveals that the RY property of a sequence is most distant from randomness in the human genome with respect to the genomic signatures. This is true across all human chromosomes. It is concluded that there exists a widespread potential RY code, and furthermore, that this is likely a structural code. Ascertaining this feature of general design, and potential RY structural code has far-reaching implications. This is because it aids in the understanding of cell biology, growth, and development, as well as downstream in the study of human disease and potential drug design.
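The binary components idea in the abstract above can be illustrated with a minimal sketch. It assumes the standard IUPAC groupings (R = A/G, Y = C/T, W = A/T, S = C/G, K = G/T, M = A/C) and uses the Shannon entropy of k-mers in each two-letter recoding as a simple stand-in for distance from randomness. The function names and the choice of k are hypothetical, and this is not the author's actual procedure.

```python
from collections import Counter
from math import log2

# IUPAC two-letter groupings of the four bases (standard definitions).
BINARY_CODES = {
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},  # purine / pyrimidine
    "WS": {"A": "W", "T": "W", "C": "S", "G": "S"},  # weak / strong pairing
    "KM": {"G": "K", "T": "K", "A": "M", "C": "M"},  # keto / amino
}


def recode(seq, scheme):
    """Recode a DNA string into one binary alphabet, skipping non-ACGT symbols."""
    table = BINARY_CODES[scheme]
    return "".join(table[base] for base in seq.upper() if base in table)


def kmer_entropy(seq, k):
    """Shannon entropy (bits) of the overlapping k-mer distribution of seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def nonrandomness(seq, scheme, k=4):
    """Hypothetical score: k bits (uniform binary k-mers) minus observed entropy."""
    return k - kmer_entropy(recode(seq, scheme), k)
```

Comparing `nonrandomness(seq, "RY")`, `nonrandomness(seq, "WS")` and `nonrandomness(seq, "KM")` on the same region gives a rough ranking of which property departs most from a uniform random sequence; short sequences bias the entropy estimate downward, so this is only meaningful for long inputs.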
|
13
|
Iwasaki Y, Ikemura T, Wada K, Wada Y, Abe T. Comparative genomic analysis of the human genome and six bat genomes using unsupervised machine learning: Mb-level CpG and TFBS islands. BMC Genomics 2022; 23:497. [PMID: 35804296 PMCID: PMC9264310 DOI: 10.1186/s12864-022-08664-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/03/2022] [Accepted: 05/31/2022] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Emerging infectious disease-causing RNA viruses, such as the SARS-CoV-2 and Ebola viruses, are thought to rely on bats as natural reservoir hosts. Since these zoonotic viruses pose a great threat to humans, it is important to characterize the bat genome from multiple perspectives. Unsupervised machine learning methods for extracting novel information from big sequence data without prior knowledge or particular models are highly desirable for obtaining unexpected insights. We previously established a batch-learning self-organizing map (BLSOM) of the oligonucleotide composition that reveals novel genome characteristics from big sequence data. RESULTS In this study, using the oligonucleotide BLSOM, we conducted a comparative genomic study of humans and six bat species. BLSOM is an explainable-type machine learning algorithm that reveals the diagnostic oligonucleotides contributing to sequence clustering (self-organization). When unsupervised machine learning reveals unexpected and/or characteristic features, these features can be studied in more detail via the much simpler and more direct standard distribution map method. Based on this combined strategy, we identified the Mb-level enrichment of CG dinucleotide (Mb-level CpG islands) around the termini of bat long-scaffold sequences. In addition, a class of CG-containing oligonucleotides were enriched in the centromeric and pericentromeric regions of human chromosomes. Oligonucleotides longer than tetranucleotides often represent binding motifs for a wide variety of proteins (e.g., transcription factor binding sequences (TFBSs)). By analyzing the penta- and hexanucleotide composition, we observed the evident enrichment of a wide range of hexanucleotide TFBSs in centromeric and pericentromeric heterochromatin regions on all human chromosomes. 
CONCLUSION Function of transcription factors (TFs) beyond their known regulation of gene expression (e.g., TF-mediated looping interactions between two different genomic regions) has received wide attention. The Mb-level TFBS and CpG islands are thought to be involved in the large-scale nuclear organization, such as centromere and telomere clustering. TFBSs, which are enriched in centromeric and pericentromeric heterochromatin regions, are thought to play an important role in the formation of nuclear 3D structures. Our machine learning-based analysis will help us to understand the differential features of nuclear 3D structures in the human and bat genomes.
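The oligonucleotide composition that serves as input to a map such as the BLSOM can be sketched as a fixed-length frequency vector per sequence window. The following is a minimal illustration only: the function names, window size, step and k are hypothetical, and the published BLSOM pipeline may differ (for example in how complementary oligonucleotides are treated).

```python
from collections import Counter
from itertools import product


def oligo_composition(seq, k=4):
    """Return the frequencies of all 4**k oligonucleotides, in lexicographic ACGT order.

    k-mers containing symbols other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def window_compositions(seq, size, step, k=4):
    """Composition vectors for consecutive windows (hypothetical windowing scheme)."""
    return [oligo_composition(seq[s:s + size], k)
            for s in range(0, len(seq) - size + 1, step)]
```

Each window then becomes one 4^k-dimensional data point, which is the form of input that a self-organizing map or any other unsupervised clustering method can consume.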
Affiliation(s)
- Yuki Iwasaki
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Kennosuke Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan
- Takashi Abe
- Smart Information Systems, Faculty of Engineering, Niigata University, Niigata-ken, 950-2181, Japan
|
14
|
Iwasaki Y, Abe T, Wada K, Wada Y, Ikemura T. Unsupervised explainable AI for molecular evolutionary study of forty thousand SARS-CoV-2 genomes. BMC Microbiol 2022; 22:73. [PMID: 35272618 PMCID: PMC8907386 DOI: 10.1186/s12866-022-02484-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/30/2021] [Accepted: 02/28/2022] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Unsupervised AI (artificial intelligence) can obtain novel knowledge from big data without particular models or prior knowledge and is highly desirable for unveiling hidden features in big data. SARS-CoV-2 poses a serious threat to public health and one important issue in characterizing this fast-evolving virus is to elucidate various aspects of its genome sequence changes. We previously established unsupervised AI, a BLSOM (batch-learning SOM), which can analyze five million genomic sequences simultaneously. The present study applied the BLSOM to the oligonucleotide compositions of forty thousand SARS-CoV-2 genomes. RESULTS While only the oligonucleotide composition was given, the obtained clusters of genomes corresponded primarily to known main clades and internal divisions in the main clades. Since the BLSOM is explainable AI, it reveals which features of the oligonucleotide composition are responsible for clade clustering. Additionally, the BLSOM provided information concerning the special genomic region possibly undergoing RNA modifications. CONCLUSIONS The BLSOM has powerful image display capabilities and enables efficient knowledge discovery about viral evolutionary processes, and it can complement phylogenetic methods based on sequence alignment.
Affiliation(s)
- Yuki Iwasaki
- Nagahama Institute of Bio-Science and Technology, Shiga-ken, Nagahama, 526-0829, Japan
- Takashi Abe
- Faculty of Engineering, Niigata University, Niigata-ken, 950-2181, Japan
- Kennosuke Wada
- Nagahama Institute of Bio-Science and Technology, Shiga-ken, Nagahama, 526-0829, Japan
- Yoshiko Wada
- Nagahama Institute of Bio-Science and Technology, Shiga-ken, Nagahama, 526-0829, Japan
- Toshimichi Ikemura
- Nagahama Institute of Bio-Science and Technology, Shiga-ken, Nagahama, 526-0829, Japan; National Institute of Genetics, Mishima, Shizuoka-ken, 411-8540, Japan
|
15
|
Gull S, Minhas F. AMP0: Species-Specific Prediction of Anti-microbial Peptides Using Zero and Few Shot Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:275-283. [PMID: 32750857 DOI: 10.1109/tcbb.2020.2999399] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/11/2023]
Abstract
Evolution of drug-resistant microbial species is one of the major challenges to global health. Development of new antimicrobial treatments such as antimicrobial peptides needs to be accelerated to combat this threat. However, the discovery of novel antimicrobial peptides is hampered by low-throughput biochemical assays. Computational techniques can be used for rapid screening of promising antimicrobial peptide candidates prior to testing in the wet lab. The vast majority of existing antimicrobial peptide predictors are non-targeted in nature, i.e., they can predict whether a given peptide sequence is antimicrobial, but they are unable to predict whether the sequence can target a particular microbial species. In this work, we have used zero and few shot machine learning to develop a targeted antimicrobial peptide activity predictor called AMP0. The proposed predictor takes the sequence of a peptide and any N/C-termini modifications together with the genomic sequence of a microbial species to generate targeted predictions. Cross-validation results show that the proposed scheme is particularly effective for targeted antimicrobial prediction in comparison to existing approaches and can be used for screening potential antimicrobial peptides in a targeted manner with only a small number of training examples for novel species. AMP0 webserver is available at http://ampzero.pythonanywhere.com.
|
16
|
Time-Series Trend of Pandemic SARS-CoV-2 Variants Visualized Using Batch-Learning Self-Organizing Map for Oligonucleotide Compositions. Data Science Journal 2021. [DOI: 10.5334/dsj-2021-029] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/20/2022] Open
|
17
|
Khorsand P, Denti L, Bonizzoni P, Chikhi R, Hormozdiari F. Comparative genome analysis using sample-specific string detection in accurate long reads. Bioinformatics Advances 2021; 1:vbab005. [PMID: 36700094 PMCID: PMC9710709 DOI: 10.1093/bioadv/vbab005] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 01/28/2023]
Abstract
Motivation Comparative genome analysis of two or more whole-genome sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in populations, case-control analysis in common diseases and diagnosing rare disorders. With the current progress of accurate long-read sequencing technologies (e.g. circular consensus sequencing from PacBio sequencers), we can dive into studying repeat regions of the genome (e.g. segmental duplications) and hard-to-detect variants (e.g. complex structural variants). Results We propose a novel framework for comparative genome analysis through the discovery of strings that are specific to one genome ('sample-specific' strings). We have developed a novel, accurate and efficient computational method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach will give us the ability to perform comparative genome analysis without the need to map the reads and is not hindered by shortcomings of the reference genome and mapping algorithms. We show that the proposed approach is capable of accurately finding sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g. PacBio HiFi data). Availability and implementation Data, code and instructions for reproducing the results presented in this manuscript are publicly available at https://github.com/Parsoa/PingPong. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Luca Denti
- Department of Computational Biology, Institut Pasteur, Paris 75015, France
- Paola Bonizzoni (corresponding author)
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano, 20126, Italy
- Rayan Chikhi (corresponding author)
- Department of Computational Biology, Institut Pasteur, Paris 75015, France
- Fereydoun Hormozdiari (corresponding author)
- Genome Center, UC Davis, Davis, CA 95616, USA; UC Davis MIND Institute, Sacramento, CA 95817, USA; Department of Biochemistry and Molecular Medicine, Sacramento, UC Davis, Sacramento, CA 95817, USA
|
18
|
Franzo G. SARS-CoV-2 and other human coronavirus show genome patterns previously associated to reduced viral recognition and altered immune response. Sci Rep 2021; 11:10696. [PMID: 34021237 PMCID: PMC8139983 DOI: 10.1038/s41598-021-90278-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/15/2020] [Accepted: 05/10/2021] [Indexed: 12/23/2022] Open
Abstract
A new pandemic caused by the betacoronavirus SARS-CoV-2 originated in China in late 2019. Although often asymptomatic, a relevant percentage of affected people can develop severe pneumonia. Initial evidence suggests that dysregulation of the immune response could contribute to the pathogenesis, as previously demonstrated for SARS-CoV. The presence of genome composition features involved in delaying viral recognition is herein investigated for human coronaviruses (HCoVs), with a special emphasis on SARS-CoV-2. A broad collection of HCoVs polyprotein, envelope, matrix, nucleocapsid and spike coding sequences was downloaded and several statistics representative of genome composition and codon bias were investigated. A model able to evaluate and test the presence of a significant under- or over-representation of dinucleotide pairs while accounting for the underlying codon bias and protein sequence was also implemented. The study revealed the significant under-representation of CpG dinucleotide pair in all HCoVs, but especially in SARS-CoV and even more in SARS-CoV-2. The presence of forces acting to minimize CpG content was confirmed by relative synonymous codon usage pattern. Codons containing the CpG pair were severely under-represented, primarily in the polyprotein and spike coding sequences of SARS-CoV-2. Additionally, a significant under-representation of the TpA pair was observed in the N and S region of SARS-CoV and SARS-CoV-2. Increasing experimental evidence has proven that CpG and TpA are targeted by innate antiviral host defences, contributing both to RNA degradation and RIG-I mediated interferon production. The low content of these dinucleotides could contribute to a delayed interferon production, dysregulated immune response, higher viral replication and poor outcome.
Significantly, the RIG-I signalling pathway was proven to be defective in the elderly, suggesting a likely interaction between limited viral recognition and lower responsiveness in interferon production that could justify the higher disease severity and mortality in older patients.
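Under- or over-representation of a dinucleotide such as CpG is commonly summarized by the observed/expected odds ratio, rho_xy = f_xy / (f_x * f_y). The sketch below computes only this simple ratio; it does not account for codon bias or protein sequence as the model described in the abstract does, and the function name is hypothetical.

```python
from collections import Counter


def dinucleotide_odds_ratio(seq, pair="CG"):
    """Observed/expected ratio f_xy / (f_x * f_y) for one dinucleotide.

    Values well below 1 suggest under-representation, values well above 1
    over-representation. Returns NaN when either base is absent.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    f_x = mono[pair[0]] / n
    f_y = mono[pair[1]] / n
    if f_x == 0 or f_y == 0:
        return float("nan")
    f_xy = di[pair] / (n - 1)
    return f_xy / (f_x * f_y)
```

For example, `dinucleotide_odds_ratio(cds, "CG")` and `dinucleotide_odds_ratio(cds, "TA")` can be compared across coding sequences. A significance test would additionally require a null model, for instance shuffling synonymous codons so that the protein sequence is preserved.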
Affiliation(s)
- Giovanni Franzo
- Department of Animal Medicine, Production and Health (MAPS), University of Padua, Viale dell'Università 16, 35020, Legnaro, Padua, Italy.
|
19
|
Franzo G, Tucciarone CM, Legnardi M, Cecchinato M. Effect of genome composition and codon bias on infectious bronchitis virus evolution and adaptation to target tissues. BMC Genomics 2021; 22:244. [PMID: 33827429 PMCID: PMC8025453 DOI: 10.1186/s12864-021-07559-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 08/05/2020] [Accepted: 03/26/2021] [Indexed: 11/10/2022] Open
Abstract
Background Infectious bronchitis virus (IBV) is one of the most relevant viruses affecting the poultry industry, and several studies have investigated the factors involved in its biological cycle and evolution. However, very few of those studies focused on the effect of genome composition and the codon bias of different IBV proteins, despite the remarkable increase in available complete genomes. In the present study, all IBV complete genomes were downloaded (n = 383), and several statistics representative of genome composition and codon bias were calculated for each protein-coding sequence, including but not limited to, the nucleotide odds ratio, relative synonymous codon usage and effective number of codons. Additionally, viral codon usage was compared to host codon usage based on a collection of highly expressed genes in IBV target and nontarget tissues. Results The results obtained demonstrated a significant difference among structural, non-structural and accessory proteins, especially regarding dinucleotide composition, which appears under strong selective forces. In particular, some dinucleotide pairs, such as CpG, a probable target of the host innate immune response, are underrepresented in genes coding for pp1a, pp1ab, S and N. Although genome composition and dinucleotide bias appear to affect codon usage, additional selective forces may act directly on codon bias. Variability in relative synonymous codon usage and effective number of codons was found for different proteins, with structural proteins and polyproteins being more adapted to the codon bias of host target tissues. In contrast, accessory proteins had a more biased codon usage (i.e., lower number of preferred codons), which might contribute to the regulation of their expression level and timing throughout the cell cycle. Conclusions The present study confirms the existence of selective forces acting directly on the genome and not only indirectly through phenotype selection. 
This evidence might help in understanding IBV biology and in developing attenuated strains without affecting the protein phenotype and therefore immunogenicity. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07559-5.
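Relative synonymous codon usage (RSCU), one of the statistics named in the abstract above, is usually defined as the observed count of a codon divided by the mean count of all codons encoding the same amino acid. The following is a minimal sketch assuming the standard genetic code; the function name is hypothetical, and stop codons are simply treated as one more synonymous family.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """RSCU per codon for an in-frame coding sequence.

    Codons of amino acids that never occur in the sequence are omitted.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        for codon in family:
            # observed count divided by the family's mean count
            values[codon] = counts[codon] * len(family) / total
    return values
```

An RSCU of 1 means a codon is used as often as expected under uniform synonymous usage; CpG-containing codons with values well below 1 would be in line with the dinucleotide bias the abstract reports.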
Affiliation(s)
- Giovanni Franzo
- Microbiology and Infectious Diseases, Department of Animal Medicine, Production and Health (MAPS), University of Padua, Viale dell'Università 16 - 35020 Legnaro, Padua, Italy
- Claudia Maria Tucciarone
- Microbiology and Infectious Diseases, Department of Animal Medicine, Production and Health (MAPS), University of Padua, Viale dell'Università 16 - 35020 Legnaro, Padua, Italy
- Matteo Legnardi
- Microbiology and Infectious Diseases, Department of Animal Medicine, Production and Health (MAPS), University of Padua, Viale dell'Università 16 - 35020 Legnaro, Padua, Italy
- Mattia Cecchinato
- Microbiology and Infectious Diseases, Department of Animal Medicine, Production and Health (MAPS), University of Padua, Viale dell'Università 16 - 35020 Legnaro, Padua, Italy
|
20
|
Iwasaki Y, Abe T, Ikemura T. Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes. BMC Microbiol 2021; 21:89. [PMID: 33757449 PMCID: PMC7987243 DOI: 10.1186/s12866-021-02158-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/26/2020] [Accepted: 03/15/2021] [Indexed: 12/24/2022] Open
Abstract
Background When a virus that has grown in a nonhuman host starts an epidemic in the human population, human cells may not provide growth conditions ideal for the virus. Therefore, the invasion of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), which is usually prevalent in the bat population, into the human population is thought to have necessitated changes in the viral genome for efficient growth in the new environment. In the present study, to understand host-dependent changes in coronavirus genomes, we focused on the mono- and oligonucleotide compositions of SARS-CoV-2 genomes and investigated how these compositions changed time-dependently in the human cellular environment. We also compared the oligonucleotide compositions of SARS-CoV-2 and other coronaviruses prevalent in humans or bats to investigate the causes of changes in the host environment. Results Time-series analyses of changes in the nucleotide compositions of SARS-CoV-2 genomes revealed a group of mono- and oligonucleotides whose compositions changed in a common direction for all clades, even though viruses belonging to different clades should evolve independently. Interestingly, the compositions of these oligonucleotides changed towards those of coronaviruses that have been prevalent in humans for a long period and away from those of bat coronaviruses. Conclusions Clade-independent, time-dependent changes are thought to have biological significance and should relate to viral adaptation to a new host environment, providing important clues for understanding viral host adaptation mechanisms. Supplementary Information The online version contains supplementary material available at 10.1186/s12866-021-02158-6.
Affiliation(s)
- Yuki Iwasaki
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Shiga, Japan
- Takashi Abe
- Graduate School of Science and Technology, Niigata University, Niigata, Japan
- Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Shiga, Japan
|
21
|
Nelson WC, Tully BJ, Mobberley JM. Biases in genome reconstruction from metagenomic data. PeerJ 2020; 8:e10119. [PMID: 33194386 PMCID: PMC7605220 DOI: 10.7717/peerj.10119] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 04/25/2017] [Accepted: 09/16/2020] [Indexed: 01/24/2023] Open
Abstract
BACKGROUND Advances in sequencing, assembly, and assortment of contigs into species-specific bins have enabled the reconstruction of genomes from metagenomic data (MAGs). Though a powerful technique, it is difficult to determine whether assembly and binning techniques are accurate when applied to environmental metagenomes due to a lack of complete reference genome sequences against which to check the resulting MAGs. METHODS We compared MAGs derived from an enrichment culture containing ~20 organisms to complete genome sequences of 10 organisms isolated from the enrichment culture. Factors commonly considered in binning software, namely nucleotide composition and sequence repetitiveness, were calculated for both the correctly binned and not-binned regions. This direct comparison revealed biases in sequence characteristics and gene content in the not-binned regions. Additionally, the composition of three public data sets representing MAGs reconstructed from the Tara Oceans metagenomic data was compared to a set of representative genomes available through NCBI RefSeq to verify that the biases identified were observable in more complex data sets and using three contemporary binning software packages. RESULTS Repeat sequences were frequently not binned in the genome reconstruction processes, as were sequence regions with variant nucleotide composition. Genes encoded on the not-binned regions were strongly biased towards ribosomal RNAs, transfer RNAs, mobile element functions and genes of unknown function. Our results support genome reconstruction as a robust process and suggest that reconstructions determined to be >90% complete are likely to effectively represent organismal function; however, population-level genotypic heterogeneity in natural populations, such as uneven distribution of plasmids, can lead to incorrect inferences.
Affiliation(s)
- William C. Nelson
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA
- Benjamin J. Tully
- Department of Biological Sciences, Marine Environmental Biology Section, University of Southern California, Los Angeles, CA, USA
- Center for Dark Energy Biosphere Investigations, University of Southern California, Los Angeles, CA, USA
- Jennifer M. Mobberley
- Chemical and Biological Signature Science Group, Pacific Northwest National Laboratory, Richland, WA, USA
|
22
|
3-Hydroxypyridine Dehydrogenase HpdA Is Encoded by a Novel Four-Component Gene Cluster and Catalyzes the First Step of 3-Hydroxypyridine Catabolism in Ensifer adhaerens HP1. Appl Environ Microbiol 2020; 86:AEM.01313-20. [PMID: 32709720 DOI: 10.1128/aem.01313-20] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/02/2020] [Accepted: 07/14/2020] [Indexed: 11/20/2022] Open
Abstract
3-Hydroxypyridine (3HP) is an important natural pyridine derivative. Ensifer adhaerens HP1 can utilize 3HP as its sole source of carbon, nitrogen, and energy to grow, but the genes responsible for the degradation of 3HP remain unknown. In this study, we predicted that a gene cluster, designated 3hpd, might be responsible for the degradation of 3HP. The analysis showed that the initial hydroxylation of 3HP in E. adhaerens HP1 was catalyzed by a four-component dehydrogenase (HpdA1A2A3A4) and led to the formation of 2,5-dihydroxypyridine (2,5-DHP). In addition, the SRPBCC component in HpdA existed as a separate subunit, which is different from other SRPBCC-containing molybdohydroxylases acting on N-heterocyclic aromatic compounds. Moreover, the results demonstrated that the phosphoenolpyruvate (PEP)-utilizing protein and pyruvate-phosphate dikinase were involved in the HpdA activity, and the presence of the gene cluster 3hpd was discovered in the genomes of diverse microbial strains. Our findings provide a better understanding of the microbial degradation of pyridine derivatives in nature and indicate that further research on the origin of the discovered four-component dehydrogenase with a separate SRPBCC domain and the function of PEP-utilizing protein and pyruvate-phosphate dikinase might be of great significance. IMPORTANCE 3-Hydroxypyridine is an important building block for the synthesis of drugs, herbicides, and antibiotics. Although the microbial degradation of 3-hydroxypyridine has been studied for many years, the molecular mechanisms remain unclear. Here, we show that 3hpd is responsible for the catabolism of 3-hydroxypyridine. The 3hpd gene cluster was found to be widespread in Actinobacteria, Rubrobacteria, Thermoleophilia, and Alpha-, Beta-, and Gammaproteobacteria, and the genetic organization of the 3hpd gene clusters in these bacteria shows high diversity. Our findings provide new insight into the catabolism of 3-hydroxypyridine in bacteria.
|
23
|
Comparative analysis, distribution, and characterization of microsatellites in Orf virus genome. Sci Rep 2020; 10:13852. [PMID: 32807836 PMCID: PMC7431841 DOI: 10.1038/s41598-020-70634-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/01/2019] [Accepted: 07/01/2020] [Indexed: 11/09/2022] Open
Abstract
Genome-wide in-silico identification of microsatellites or simple sequence repeats (SSRs) in the Orf virus (ORFV), the causative agent of contagious ecthyma, has been carried out to investigate their type, distribution and potential role in genome evolution. We investigated eleven ORFV strains, which contained 1,036-1,181 microsatellites per strain. Further screening revealed the presence of 83-107 compound SSRs (cSSRs) per genome. Our analysis indicates the dinucleotide (76.9%) repeats to be the most abundant, followed by trinucleotide (17.7%), mononucleotide (4.9%), tetranucleotide (0.4%) and hexanucleotide (0.2%) repeats. The Relative Abundance (RA) and Relative Density (RD) of these SSRs varied between 7.6-8.4 and 53.0-59.5 bp/kb, respectively. For cSSRs, the RA and RD ranged from 0.6-0.8 and 12.1-17.0 bp/kb, respectively. Regression analysis showed that all parameters (the incidence of SSRs, RA, and RD) correlated significantly with the GC content. With genome size, however, only the incidence of SSRs correlated significantly. Nearly all cSSRs were composed of two microsatellites, which showed no bias towards a particular motif. Motif duplication patterns, such as (C)-x-(C), (TG)-x-(TG), (AT)-x-(AT) and (TC)-x-(TC), and self-complementary motifs, such as (GC)-x-(CG), (TC)-x-(AG), (GT)-x-(CA) and (TC)-x-(AG), were observed in the cSSRs. Finally, in-silico polymorphism was assessed, followed by in-vitro validation using PCR analysis and sequencing. The thirteen polymorphic SSR markers developed in this study were further characterized by mapping with the sequence present in the database. The results of the present study indicate that these SSRs could be a useful tool for identification, analysis of genetic diversity, and understanding the evolutionary status of the virus.
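The RA and RD figures quoted above can be made concrete with a small sketch. It assumes the commonly used definitions (RA = number of SSRs per kb of sequence, RD = total bp covered by SSRs per kb) and a hypothetical set of minimum repeat thresholds; the abstract does not state the detection tool or its settings, so the numbers produced here are not comparable with the reported ones.

```python
# Hypothetical minimum repeat counts per motif length (mono- to hexanucleotide).
MIN_REPEATS = {1: 6, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    k = len(motif)
    return all(motif != motif[:d] * (k // d) for d in range(1, k) if k % d == 0)


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Greedy left-to-right scan returning non-overlapping (start, end, motif) tuples."""
    seq = seq.upper()
    hits, i, n = [], 0, len(seq)
    while i < n:
        best = None
        for k, need in sorted(min_repeats.items()):
            motif = seq[i:i + k]
            if len(motif) < k or set(motif) - set("ACGT") or not is_primitive(motif):
                continue
            reps = 1
            while seq[i + reps * k:i + (reps + 1) * k] == motif:
                reps += 1
            # keep the longest tract that starts at this position
            if reps >= need and (best is None or reps * k > best[1] - best[0]):
                best = (i, i + reps * k, motif)
        if best:
            hits.append(best)
            i = best[1]
        else:
            i += 1
    return hits


def relative_abundance_and_density(seq, min_repeats=MIN_REPEATS):
    """Return (SSRs per kb, SSR bp per kb) under the assumed definitions."""
    hits = find_ssrs(seq, min_repeats)
    kb = len(seq) / 1000
    return len(hits) / kb, sum(end - start for start, end, _ in hits) / kb
```

Compound SSRs could then be derived by merging hits separated by fewer than a chosen number of bases; that distance is another tool-specific setting not given in the abstract.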
|
24
|
Wada K, Wada Y, Ikemura T. Time-series analyses of directional sequence changes in SARS-CoV-2 genomes and an efficient search method for candidates for advantageous mutations for growth in human cells. Gene 2020; 763S:100038. [PMID: 32835214 PMCID: PMC7409725 DOI: 10.1016/j.gene.2020.100038] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 06/15/2020] [Revised: 07/09/2020] [Accepted: 08/04/2020] [Indexed: 11/25/2022]
Abstract
We first conducted time-series analysis of mono- and dinucleotide composition for over 10,000 SARS-CoV-2 genomes, as well as over 1500 Zaire ebolavirus genomes, and found clear time-series changes in the compositions on a monthly basis, which should reflect viral adaptations for efficient growth in human cells. We next developed a sequence alignment-free method that extensively searches for advantageous mutations and ranks them by the level of increase in their intrapopulation frequency. Time-series analysis of occurrences of oligonucleotides of diverse lengths for SARS-CoV-2 genomes revealed seven distinctive mutations that rapidly expanded their intrapopulation frequency and are thought to be candidates of advantageous mutations for the efficient growth in human cells.
- Time-series change in mono- and dinucleotide compositions in SARS-CoV-2 genome
- The time-series change found for SARS-CoV-2 differed from that of Zaire ebolavirus.
- Sequence alignment-free method to search advantageous mutation in viral genomes
- Seven mutations of SARS-CoV-2 rapidly expanding their population frequency
- A method other than phylogenetic tree construction for viral evolutionary study
Collapse
Affiliation(s)
- Kennosuke Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken 526-0829, Japan
| | - Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken 526-0829, Japan
| | - Toshimichi Ikemura
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken 526-0829, Japan
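The alignment-free search described in the abstract above can be illustrated with a minimal sketch. It assumes each genome comes with a collection month, counts the fraction of genomes per month that contain each oligonucleotide (k-mer), and ranks k-mers by their gain in population frequency. The function names, the toy sequences and the choice of k = 5 are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_frequency_by_month(genomes, k):
    """genomes: iterable of (month, sequence) pairs.

    Returns {month: {kmer: fraction of that month's genomes containing it}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for month, seq in genomes:
        totals[month] += 1
        for km in kmers(seq.upper(), k):
            counts[month][km] += 1
    return {m: {km: c / totals[m] for km, c in counts[m].items()} for m in totals}


def rank_expanding_kmers(freq_by_month, first, last):
    """Rank k-mers by their increase in population frequency between two months."""
    all_kmers = set(freq_by_month[first]) | set(freq_by_month[last])
    gains = {
        km: freq_by_month[last].get(km, 0.0) - freq_by_month[first].get(km, 0.0)
        for km in all_kmers
    }
    # Largest gain first; ties are broken alphabetically for a stable order.
    return sorted(gains.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data (hypothetical): a single A-to-G change spreads between two months.
toy = [
    ("2020-01", "ACGTACGTAA"),
    ("2020-01", "ACGTACGTAA"),
    ("2020-03", "ACGTACGTAA"),
    ("2020-03", "ACGTGCGTAA"),
    ("2020-03", "ACGTGCGTAA"),
    ("2020-03", "ACGTGCGTAA"),
]
freqs = kmer_frequency_by_month(toy, k=5)
ranking = rank_expanding_kmers(freqs, "2020-01", "2020-03")
```

In this toy set, the k-mers that span the changed position rise to the top of the ranking, while k-mers unique to the original sequence fall to the bottom; no alignment is needed at any step.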
|
25
|
Abe T, Akazawa Y, Toyoda A, Niki H, Baba T. Batch-Learning Self-Organizing Map Identifies Horizontal Gene Transfer Candidates and Their Origins in Entire Genomes. Front Microbiol 2020; 11:1486. [PMID: 32719664 PMCID: PMC7350273 DOI: 10.3389/fmicb.2020.01486] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 02/05/2020] [Accepted: 06/08/2020] [Indexed: 02/05/2023] Open
Abstract
Horizontal gene transfer (HGT) has been widely suggested to play a critical role in the environmental adaptation of microbes; however, the number and origin of the genes in microbial genomes obtained through HGT remain unknown as the frequency of detected HGT events is generally underestimated, particularly in the absence of information on donor sequences. As an alternative to phylogeny-based methods that rely on sequence alignments, we have developed an alignment-free clustering method on the basis of an unsupervised neural network “Batch-Learning Self-Organizing Map (BLSOM)” in which sequence fragments are clustered based solely on oligonucleotide similarity without taxonomical information, to detect HGT candidates and their origin in entire genomes. By mapping the microbial genomic sequences on large-scale BLSOMs constructed with nearly all prokaryotic genomes, HGT candidates can be identified, and their origin assigned comprehensively, even for microbial genomes that exhibit high novelty. By focusing on two types of Alphaproteobacteria, specifically psychrotolerant Sphingomonas strains from an Antarctic lake, we detected HGT candidates using BLSOM and found higher proportions of HGT candidates from organisms belonging to Betaproteobacteria in the genomes of these two Antarctic strains compared with those of continental strains. Further, an origin difference was noted in the HGT candidates found in the two Antarctic strains. Although their origins were highly diversified, gene functions related to the cell wall or membrane biogenesis were shared among the HGT candidates. Moreover, analyses of amino acid frequency suggested that housekeeping genes and some HGT candidates of the Antarctic strains exhibited different characteristics to other continental strains. Lys, Ser, Thr, and Val were the amino acids found to be increased in the Antarctic strains, whereas Ala, Arg, Glu, and Leu were decreased. 
Our findings strongly suggest a low-temperature adaptation process for microbes that may have arisen convergently as an independent evolutionary strategy in each Antarctic strain. Hence, BLSOM analysis could serve as a powerful tool in not only detecting HGT candidates and their origins in entire genomes, but also in providing novel perspectives into the environmental adaptations of microbes.
Affiliation(s)
- Takashi Abe
- Department of Information Engineering, Faculty of Engineering, Niigata University, Niigata, Japan
| | - Yu Akazawa
- Department of Information Engineering, Faculty of Engineering, Niigata University, Niigata, Japan
| | - Atsushi Toyoda
- Comparative Genomics Laboratory, National Institute of Genetics, Mishima, Japan.,Advanced Genomics Center, National Institute of Genetics, Mishima, Japan
| | - Hironori Niki
- Microbial Physiology Laboratory, National Institute of Genetics, Mishima, Japan
| | - Tomoya Baba
- Advanced Genomics Center, National Institute of Genetics, Mishima, Japan.,Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Tokyo, Japan
|
26
|
Structures and stability of simple DNA repeats from bacteria. Biochem J 2020; 477:325-339. [PMID: 31967649 PMCID: PMC7015867 DOI: 10.1042/bcj20190703] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 09/20/2019] [Revised: 12/20/2019] [Accepted: 01/03/2020] [Indexed: 01/12/2023]
Abstract
DNA is a fundamentally important molecule for all cellular organisms due to its biological role as the store of hereditary, genetic information. On the one hand, genomic DNA is very stable, both in chemical and biological contexts, and this assists its genetic functions. On the other hand, it is also a dynamic molecule, and constant changes in its structure and sequence drive many biological processes, including adaptation and evolution of organisms. DNA genomes contain significant amounts of repetitive sequences, which have divergent functions in the complex processes that involve DNA, including replication, recombination, repair, and transcription. Through their involvement in these processes, repetitive DNA sequences influence the genetic instability and evolution of DNA molecules and they are located non-randomly in all genomes. Mechanisms that influence such genetic instability have been studied in many organisms, including within human genomes where they are linked to various human diseases. Here, we review our understanding of short, simple DNA repeats across a diverse range of bacteria, comparing the prevalence of repetitive DNA sequences in different genomes. We describe the range of DNA structures that have been observed in such repeats, focusing on their propensity to form local, non-B-DNA structures. Finally, we discuss the biological significance of such unusual DNA structures and relate this to studies where the impacts of DNA metabolism on genetic stability are linked to human diseases. Overall, we show that simple DNA repeats in bacteria serve as excellent and tractable experimental models for biochemical studies of their cellular functions and influences.
|
27
|
Goussarov G, Cleenwerck I, Mysara M, Leys N, Monsieurs P, Tahon G, Carlier A, Vandamme P, Van Houdt R. PaSiT: a novel approach based on short-oligonucleotide frequencies for efficient bacterial identification and typing. Bioinformatics 2020; 36:2337-2344. [PMID: 31899493 PMCID: PMC7178395 DOI: 10.1093/bioinformatics/btz964] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/19/2019] [Revised: 11/21/2019] [Accepted: 12/30/2019] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore not suitable for large-scale comparisons. Short oligonucleotide-based methods do offer a faster alternative but at the expense of accuracy. Here, we aim to address this shortcoming by providing a software that implements a novel method based on short-oligonucleotide frequencies to compute inter-genomic distances. RESULTS: Our tetranucleotide and hexanucleotide implementations, which were optimized based on a taxonomically well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short oligonucleotide-based method TETRA and average nucleotide identity, for identifying bacterial species and strains, respectively. Moreover, the lightweight nature of this method makes it applicable for large-scale analyses. AVAILABILITY AND IMPLEMENTATION: The method introduced here was implemented, together with other existing methods, in a dependency-free software written in C, GenDisCal, available as source code from https://github.com/LM-UGent/GenDisCal. The software supports multithreading and has been tested on Windows and Linux (CentOS). In addition, a Java-based graphical user interface that acts as a wrapper for the software is also available. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gleb Goussarov
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
| | - Ilse Cleenwerck
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
| | - Mohamed Mysara
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
| | - Natalie Leys
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
| | - Pieter Monsieurs
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
| | - Guillaume Tahon
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
| | - Aurélien Carlier
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
- LIPM, Université de Toulouse, INRAE, CNRS, Castanet-Tolosan, France
| | - Peter Vandamme
- Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium
| | - Rob Van Houdt
- Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
|
28
|
Yan F, Fang J, Cao J, Wei Y, Liu R, Wang L, Xie Z. Halomonas piezotolerans sp. nov., a multiple-stress-tolerant bacterium isolated from a deep-sea sediment sample of the New Britain Trench. Int J Syst Evol Microbiol 2020; 70:2560-2568. [PMID: 32129736 DOI: 10.1099/ijsem.0.004069] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/18/2022] Open
Abstract
A piezotolerant, H2O2-tolerant, heavy-metal-tolerant, slightly halophilic bacterium (strain NBT06E8T) was isolated from a deep-sea sediment sample collected from the New Britain Trench at a depth of 8900 m. The strain was aerobic, motile, Gram-stain-negative, rod-shaped, oxidase-positive and catalase-positive. Growth of the strain was observed at 4-45 °C (optimum, 30 °C), at pH 5-11 (optimum, pH 8-9) and in 0.5-21 % (w/v) NaCl (optimum, 3-7 %). The optimum pressure for growth was 0.1-30 MPa with tolerance up to 60 MPa. Under optimum growth conditions, the strain could tolerate 15 mM H2O2. Results of 16S rRNA gene sequence analysis showed that strain NBT06E8T is closely related to Halomonas aquamarina DSM 30161T (99.5 %), Halomonas meridiana DSM 5425T (99.43 %) and Halomonas axialensis Althf1T (99.35 %). The digital DNA-DNA hybridization values between strain NBT06E8T and the three related type strains, H. aquamarina, H. meridiana and H. axialensis, were 30.5±2.4 %, 30.7±2.5 % and 31.5±2.5 %, respectively. The average nucleotide identity values between strain NBT06E8T and the three related type strains were 86.26, 86.26 and 83.63 %, respectively. The major fatty acids were summed feature 8 (C18 : 1 ω7c and/or C18 : 1 ω6c) and C16 : 0. The predominant respiratory quinone detected was ubiquinone-9 (Q-9). Based on its phenotypic and phylogenetic characteristics, we conclude that strain NBT06E8T represents a novel species of the genus Halomonas, for which the name Halomonas piezotolerans sp. nov. is proposed (type strain NBT06E8T=MCCC 1K04228T=KCTC 72680T).
Affiliation(s)
- Fangfang Yan
- Shanghai Engineering Research Center of Hadal Science and Technology, College of Marine Sciences, Shanghai Ocean University, Shanghai 201306, PR China
| | - Jiasong Fang
- Department of Natural Sciences, Hawaii Pacific University, Honolulu, HI 96813, USA.,Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, PR China.,Shanghai Engineering Research Center of Hadal Science and Technology, College of Marine Sciences, Shanghai Ocean University, Shanghai 201306, PR China
| | - Junwei Cao
- National Engineering Research Center for Oceanic Fisheries, Shanghai Ocean University, Shanghai 201306, PR China.,Shanghai Engineering Research Center of Hadal Science and Technology, College of Marine Sciences, Shanghai Ocean University, Shanghai 201306, PR China
| | - Yuli Wei
- National Engineering Research Center for Oceanic Fisheries, Shanghai Ocean University, Shanghai 201306, PR China.,Shanghai Engineering Research Center of Hadal Science and Technology, College of Marine Sciences, Shanghai Ocean University, Shanghai 201306, PR China
| | - Rulong Liu
- National Engineering Research Center for Oceanic Fisheries, Shanghai Ocean University, Shanghai 201306, PR China.,Shanghai Engineering Research Center of Hadal Science and Technology, College of Marine Sciences, Shanghai Ocean University, Shanghai 201306, PR China
| | - Li Wang
- National Engineering Research Center for Oceanic Fisheries, Shanghai Ocean University, Shanghai 201306, PR China.,Shanghai Engineering Research Center of Hadal Science and Technology, College of Marine Sciences, Shanghai Ocean University, Shanghai 201306, PR China
| | - Zhe Xie
- Shanghai Engineering Research Center of Hadal Science and Technology, College of Marine Sciences, Shanghai Ocean University, Shanghai 201306, PR China.,Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, PR China
|
29
|
Zhou Y, Zhang W, Wu H, Huang K, Jin J. A high-resolution genomic composition-based method with the ability to distinguish similar bacterial organisms. BMC Genomics 2019; 20:754. [PMID: 31638897 PMCID: PMC6805505 DOI: 10.1186/s12864-019-6119-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/26/2019] [Accepted: 09/20/2019] [Indexed: 12/03/2022] Open
Abstract
Background: Genomic composition has been found to be species specific and is used to differentiate bacterial species. To date, almost no published composition-based approaches are able to distinguish between most closely related organisms, including intra-genus species and intra-species strains. Thus, it is necessary to develop a novel approach to address this problem. Results: Here, we initially determine that the “tetranucleotide-derived z-value Pearson correlation coefficient” (TETRA) approach is representative of other published statistical methods. Then, we devise a novel method called “Tetranucleotide-derived Z-value Manhattan Distance” (TZMD) and compare it with the TETRA approach. Our results show that TZMD reflects the maximal genome difference, while TETRA does not in most conditions, demonstrating in theory that TZMD provides improved resolution. Additionally, our analysis of real data shows that TZMD improves species differentiation and clearly differentiates similar organisms, including similar species belonging to the same genospecies, subspecies and intraspecific strains, most of which cannot be distinguished by TETRA. Furthermore, TZMD is able to determine clonal strains with the TZMD = 0 criterion, which intrinsically encompasses identical composition, high average nucleotide identity and high percentage of shared genomes. Conclusions: Our extensive assessment demonstrates that TZMD has high resolution. This study is the first to propose a composition-based method for differentiating bacteria at the strain level and to demonstrate that composition is also strain specific. TZMD is a powerful tool and the first easy-to-use approach for differentiating clonal and non-clonal strains. Therefore, as the first composition-based algorithm for strain typing, TZMD will facilitate bacterial studies in the future.
Affiliation(s)
- Yizhuang Zhou
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.,Peking-Tsinghua Center for Life Science, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, People's Republic of China
| | - Wenting Zhang
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
| | - Huixian Wu
- China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.,Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
| | - Kai Huang
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.,China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.,Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
| | - Junfei Jin
- Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.,China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China.,Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
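The contrast between TETRA (Pearson correlation of tetranucleotide z-values) and TZMD (Manhattan distance of the same z-values) drawn in the abstract above can be shown with a small sketch. It takes the z-value vectors as given; real vectors would have 256 entries, one per tetranucleotide, whereas the four-element toy vectors here are hypothetical. The unnormalized sum of absolute differences is an assumption, since the abstract does not say whether TZMD is scaled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def manhattan(x, y):
    """Sum of absolute element-wise differences (assumed unnormalized)."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical z-value vectors: genome B has every z-value doubled.
z_a = [1.0, -2.0, 0.5, 3.0]
z_b = [2.0 * v for v in z_a]

r = pearson(z_a, z_b)          # correlation ignores the change in scale
d = manhattan(z_a, z_b)        # distance registers every element-wise gap
d_same = manhattan(z_a, z_a)   # identical composition gives distance 0
```

Because Pearson correlation is invariant to scaling, the two toy genomes look identical under a TETRA-style score, while the Manhattan distance is clearly non-zero. A distance of exactly 0 only occurs for identical z-value vectors, which is in line with the TZMD = 0 criterion for clonal strains mentioned in the abstract.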
|
30
|
Chiodi A, Comandatore F, Sassera D, Petroni G, Bandi C, Brilli M. SeqDeχ: A Sequence Deconvolution Tool for Genome Separation of Endosymbionts From Mixed Sequencing Samples. Front Genet 2019; 10:853. [PMID: 31608107 PMCID: PMC6761303 DOI: 10.3389/fgene.2019.00853] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2019] [Accepted: 08/15/2019] [Indexed: 12/04/2022] Open
Abstract
In recent years, the advent of NGS technology has made genome sequencing much cheaper than in the past; the high parallelization capability and the possibility to sequence more than one organism at once have opened the door to processing whole symbiotic consortia. However, this approach needs the development of specific bioinformatics tools able to analyze these data. In this work, we describe SeqDex, a tool that starts from a preliminary assembly obtained from sequencing a mixture of DNA from different organisms, to identify the contigs coming from one organism of interest. SeqDex is a fully automated machine learning–based tool exploiting partial taxonomic affiliations and compositional analysis to predict the taxonomic affiliations of contigs in an assembly. In the literature, there are few methods able to deconvolve host–symbiont datasets, and most of them heavily rely on user curation and are therefore time consuming. The problem has strong similarities with metagenomic studies, where mixed samples are sequenced and the bioinformatics challenge is trying to separate contigs on the basis of their source organism; however, in symbiotic systems, additional information can be exploited to improve the output. To assess the ability of SeqDex to deconvolve host–symbiont datasets, we compared it to state-of-the-art methods for metagenomic binning and for host–symbiont deconvolution on three study cases. The results point out the good performance of the presented tool, which, in addition to the ease of use and customization potential, makes SeqDex a useful tool for rapid identification of endosymbiont sequences.
Affiliation(s)
- Alice Chiodi
- Department of Earth and Environmental Sciences, University of Pavia, Pavia, Italy.,Department of Biosciences, University of Milan, Milan, Italy
| | - Francesco Comandatore
- Pediatric Clinical Research Center "Romeo ed Enrica Invernizzi", University of Milan, Milan, Italy.,Department of Biomedical and Clinical Sciences "L. Sacco", University of Milan, Milan, Italy
| | - Davide Sassera
- Department of Biology and Biotechnology, University of Pavia, Pavia, Italy
| | | | - Claudio Bandi
- Department of Biosciences, University of Milan, Milan, Italy.,Pediatric Clinical Research Center "Romeo ed Enrica Invernizzi", University of Milan, Milan, Italy
| | - Matteo Brilli
- Department of Biosciences, University of Milan, Milan, Italy.,Pediatric Clinical Research Center "Romeo ed Enrica Invernizzi", University of Milan, Milan, Italy
|
31
|
Flores-Uribe J, Philosof A, Sharon I, Fridman S, Larom S, Béjà O. A novel uncultured marine cyanophage lineage with lysogenic potential linked to a putative marine Synechococcus 'relic' prophage. Environ Microbiol Rep 2019; 11:598-604. [PMID: 31125500 DOI: 10.1111/1758-2229.12773] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 10/21/2018] [Accepted: 05/23/2019] [Indexed: 06/09/2023]
Abstract
Marine cyanobacteria are important contributors to primary production in the ocean and their viruses (cyanophages) affect the ocean microbial communities. Despite reports of lysogeny in marine cyanobacteria, a genome sequence of such temperate cyanophages remains unknown, although genomic analyses indicate potential for lysogeny in certain marine cyanophages. Using assemblies from Red Sea and Tara Oceans metagenomes, we recovered genomes of a novel uncultured marine cyanophage lineage, which contain, in addition to common cyanophage genes, a phycobilisome degradation protein NblA, an integrase and a split DNA polymerase. The DNA polymerase forms a monophyletic clade with a DNA polymerase from a genomic island in Synechococcus WH8016. The island contains a relic prophage that does not resemble any previously reported cyanophage but shares several genes with the newly identified cyanophages reported here. Metagenomic recruitment indicates that the novel cyanophages are widespread, albeit at low abundance. Here, we describe a novel potentially lysogenic cyanophage family, their abundance and distribution in the marine environment.
Affiliation(s)
- José Flores-Uribe
- Faculty of Biology, Technion - Israel Institute of Technology, Haifa, 32000, Israel
| | - Alon Philosof
- Faculty of Biology, Technion - Israel Institute of Technology, Haifa, 32000, Israel
- Department of Geological and Planetary Sciences, California Institute of Technology, Pasadena, CA, 91106, USA
| | - Itai Sharon
- Migal Galilee Research Institute, Kiryat Shmona, 11016, Israel
- Tel Hai College, Upper Galilee, 12210, Israel
| | - Svetlana Fridman
- Faculty of Biology, Technion - Israel Institute of Technology, Haifa, 32000, Israel
| | - Shirley Larom
- Faculty of Biology, Technion - Israel Institute of Technology, Haifa, 32000, Israel
| | - Oded Béjà
- Faculty of Biology, Technion - Israel Institute of Technology, Haifa, 32000, Israel
|
32
|
Genetic evolution and codon usage analysis of NKX-2.5 gene governing heart development in some mammals. Genomics 2019; 112:1319-1329. [PMID: 31377427 DOI: 10.1016/j.ygeno.2019.07.023] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/03/2019] [Revised: 07/26/2019] [Accepted: 07/31/2019] [Indexed: 11/21/2022]
Abstract
The NKX-2.5 gene is responsible for cardiac development and its targeted disruption arrests cardiac development at the linear heart tube stage. Bioinformatic analysis was employed to investigate the codon usage pattern and dN/dS of the mammalian NKX-2.5 gene. The relative synonymous codon usage analysis revealed variation in codon usage, and two synonymous codons, namely ATA (Ile) and GTA (Val), were absent in the NKX-2.5 gene across selected mammalian species, suggesting that these two codons were possibly selected against during evolution. Parity rule 2 analysis of two- and four-fold amino acids showed CT bias whereas six-fold amino acids revealed GA bias. Neutrality analysis suggests that selection played a prominent role while mutation had a minor role. The dN/dS analysis suggests synonymous substitution played a significant role and it negatively correlated with p-distance of the gene. Purifying natural selection played a dominant role in the genetic evolution of the NKX-2.5 gene in mammals.
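The relative synonymous codon usage (RSCU) analysis mentioned in the abstract above can be sketched as follows. The sketch uses the usual definition, in which a codon's observed count is divided by the count expected if all synonymous codons for that amino acid were used equally, together with the standard genetic code. The function name and the toy coding sequence are hypothetical and are not data from the paper.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for every amino acid observed in the coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group synonymous codons by amino acid, skipping stop codons.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    result = {}
    for fam in families.values():
        total = sum(counts[c] for c in fam)
        if total == 0:
            continue  # amino acid not present in this sequence
        for c in fam:
            # observed count / (family total / family size)
            result[c] = counts[c] * len(fam) / total
    return result


# Toy coding sequence (hypothetical): Ile and Val codons only, no ATA or GTA.
values = rscu("ATTATTATCGTGGTGGTTGTC")
```

An RSCU of 0 for ATA and GTA in this toy sequence corresponds to the kind of codon absence the abstract reports; values above 1 mark codons used more often than expected under equal usage.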
|
33
|
Delfino CM, Cerrudo CS, Biglione M, Oubiña JR, Ghiringhelli PD, Mathet VL. A comprehensive bioinformatic analysis of hepatitis D virus full-length genomes. J Viral Hepat 2018; 25:860-869. [PMID: 29406571 DOI: 10.1111/jvh.12876] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 09/20/2017] [Accepted: 01/02/2018] [Indexed: 12/15/2022]
Abstract
In association with hepatitis B virus (HBV), hepatitis delta virus (HDV) is a subviral agent that may promote severe acute and chronic forms of liver disease. Based on the percentage of nucleotide identity of the genome, HDV was initially classified into three genotypes. However, since 2006, the original classification has been further expanded into eight clades/genotypes. The intergenotype divergence may be as high as 35%-40% over the entire RNA genome, whereas sequence heterogeneity among the isolates of a given genotype is <20%; furthermore, HDV recombinants have been clearly demonstrated. The genetic diversity of HDV is related to the geographic origin of the isolates. This study shows the first comprehensive bioinformatic analysis of the complete available set of HDV sequences, using both nucleotide and protein phylogenies (based on an evolutionary model selection, gamma distribution estimation, tree inference and phylogenetic distance estimation), protein composition analysis and comparison (based on the presence of invariant residues, molecular signatures, amino acid frequencies and mono- and di-amino acid compositional distances), as well as amino acid changes in sequence evolution. Taking into account the congruent and consistent results of both nucleotide and amino acid analyses of GenBank available sequences (recorded as of January, 2017), we propose that the eight hepatitis D virus genotypes may be grouped into three large genogroups fully supported by their shared characteristics.
Affiliation(s)
- C M Delfino
- Consejo Nacional de Investigaciones Científicas y Tecnológicas (CONICET) - Universidad de Buenos Aires (UBA), Instituto de Investigaciones en Microbiología y Parasitología Médica, (IMPAM), Ciudad Autónoma de Buenos Aires, Argentina
| | - C S Cerrudo
- Departamento de Ciencia y Tecnología, Laboratorio de Ingeniería Genética y Biología Celular y Molecular - Área Virosis de Insectos (LIGBCM-AVI), Instituto de Microbiología Básica y Aplicada (IMBA), Universidad Nacional de Quilmes, Bernal, Provincia de Buenos Aires, Argentina
| | - M Biglione
- Consejo Nacional de Investigaciones Científicas y Tecnológicas (CONICET) - Universidad de Buenos Aires (UBA), Instituto de Investigaciones Biomédicas en Retrovirus y SIDA (INBIRS), Ciudad Autónoma de Buenos Aires, Argentina
| | - J R Oubiña
- Consejo Nacional de Investigaciones Científicas y Tecnológicas (CONICET) - Universidad de Buenos Aires (UBA), Instituto de Investigaciones en Microbiología y Parasitología Médica, (IMPAM), Ciudad Autónoma de Buenos Aires, Argentina
| | - P D Ghiringhelli
- Departamento de Ciencia y Tecnología, Laboratorio de Ingeniería Genética y Biología Celular y Molecular - Área Virosis de Insectos (LIGBCM-AVI), Instituto de Microbiología Básica y Aplicada (IMBA), Universidad Nacional de Quilmes, Bernal, Provincia de Buenos Aires, Argentina
| | - V L Mathet
- Consejo Nacional de Investigaciones Científicas y Tecnológicas (CONICET) - Universidad de Buenos Aires (UBA), Instituto de Investigaciones en Microbiología y Parasitología Médica, (IMPAM), Ciudad Autónoma de Buenos Aires, Argentina
|
34
|
Franzo G, Segales J, Tucciarone CM, Cecchinato M, Drigo M. The analysis of genome composition and codon bias reveals distinctive patterns between avian and mammalian circoviruses which suggest a potential recombinant origin for Porcine circovirus 3. PLoS One 2018; 13:e0199950. [PMID: 29958294 PMCID: PMC6025852 DOI: 10.1371/journal.pone.0199950] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 04/27/2018] [Accepted: 06/15/2018] [Indexed: 01/30/2023] Open
Abstract
Members of the genus Circovirus are host-specific viruses, which are totally dependent on cell machinery for their replication. Consequently, certain mimicry of the host genome features is expected to maximize cellular replicative system exploitation and minimize the recognition by the innate immune system. In the present study, the analysis of several genome composition and codon bias parameters of circoviruses infecting avian and mammalian species demonstrated the presence of quite distinctive patterns between the two groups. Remarkably, a higher deviation from the expected values based only on mutational patterns was observed for mammalian circoviruses both at dinucleotide and codon levels. Accordingly, a stronger selective pressure was estimated to shape the genome of mammalian circoviruses, particularly in the Cap encoding gene, compared to avian circoviruses. These differences could be attributed to different physiological and immunological features of the two host classes and suggest a trade-off between a tendency to optimize the capsid protein translation while minimizing the recognition of the genome and the transcript molecules. Interestingly, the recently identified Porcine circovirus 3 (PCV-3) had an intermediate pattern in terms of genome composition and codon bias. Particularly, its Rep gene appeared closely related to other mammalian circoviruses (especially bat circoviruses) while the Cap gene more closely resembled avian circoviruses. This evidence, coupled with the high selective forces apparently modelling the PCV-3 Cap gene composition, suggests a potential recombinant origin, followed or preceded by a host jump, of this virus.
Affiliation(s)
- Giovanni Franzo
- Department of Animal Medicine, Production and Health (MAPS), University of Padua, Legnaro, Padua, Italy
| | - Joaquim Segales
- Departament de Sanitat i Anatomia Animals, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain
- UAB, Centre de Recerca en Sanitat Animal (CReSA, IRTA- UAB), Campus de la Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain
| | - Claudia Maria Tucciarone
- Department of Animal Medicine, Production and Health (MAPS), University of Padua, Legnaro, Padua, Italy
| | - Mattia Cecchinato
- Department of Animal Medicine, Production and Health (MAPS), University of Padua, Legnaro, Padua, Italy
| | - Michele Drigo
- Department of Animal Medicine, Production and Health (MAPS), University of Padua, Legnaro, Padua, Italy
|
35
|
Abstract
The genus Lactobacillus encompasses a diversity of species that occur widely in nature and encode a plethora of metabolic pathways reflecting their adaptation to various ecological niches, including humans, animals, plants and food products. Accordingly, their functional attributes have been exploited industrially and several strains are commonly formulated as probiotics or starter cultures in the food industry. Although divergent evolutionary processes have yielded the acquisition and evolution of specialized functionalities, all Lactobacillus species share a small set of core metabolic properties, including the glycolysis pathway. Thus, the sequences of glycolytic enzymes afford a means to establish phylogenetic groups with the potential to discern species that are too closely related from a 16S rRNA standpoint. Here, we identified and extracted glycolysis enzyme sequences from 52 species, and carried out individual and concatenated phylogenetic analyses. We show that a glycolysis-based phylogenetic tree can robustly segregate lactobacilli into distinct clusters and discern very closely related species. We also compare and contrast evolutionary patterns with genome-wide features and transcriptomic patterns, reflecting genomic drift trends. Overall, results suggest that glycolytic enzymes provide valuable phylogenetic insights and may constitute practical targets for evolutionary studies.
Affiliation(s)
- Katelyn Brandt
- Genomic Sciences Graduate Program, North Carolina State University, Raleigh, NC 27695, USA.,Department of Food, Bioprocessing and Nutrition Sciences, North Carolina State University, Raleigh, NC 27695, USA
| | - Rodolphe Barrangou
- Genomic Sciences Graduate Program, North Carolina State University, Raleigh, NC 27695, USA.,Department of Food, Bioprocessing and Nutrition Sciences, North Carolina State University, Raleigh, NC 27695, USA
|
36
|
Dissimilar substitution rates between two strands of DNA influence codon usage pattern in some human genes. Gene 2018; 645:179-187. [PMID: 29229516 DOI: 10.1016/j.gene.2017.12.011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/10/2017] [Revised: 12/05/2017] [Accepted: 12/07/2017] [Indexed: 11/23/2022]
Abstract
We describe the codon usage of some important human genes and their expression potential in E. coli. By comparing the results of various codon usage parameters, we deciphered the effects due to selection and to mutational pressure. The variation in GC3s explains a significant proportion of the variation in codon usage patterns. The codons CGC, CGG, CTG and GCG showed a strong positive correlation with GC3, which suggested that codon usage had been influenced by GC bias. We also found that ACC (Thr, RSCU = 1.77), GCC (Ala, RSCU = 1.67), CCC (Pro, RSCU = 1.54) and TCC (Ser, RSCU = 1.47) were frequently used, which signified that C was common at the 2nd and 3rd codon positions. Correspondence analysis revealed that the F1 axis had a significant correlation with various GC contents, suggesting that compositional properties under mutation pressure might affect codon usage bias. Nc-GC3 plot analysis suggested that both mutation pressure and natural selection might affect the codon usage bias, which is also supported by neutrality plot analysis. The dinucleotides CT, TG and AG were significantly over-represented and CG, TA, AT, TT and GT were under-represented due to the high rate of spontaneous mutation resulting from cytosine deamination.
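The RSCU values quoted in this abstract follow the usual definition of relative synonymous codon usage: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The block below is a minimal sketch of that calculation, not the authors' pipeline; the function name `rscu`, the `FAMILIES` table (restricted to three four-fold families for brevity) and the toy sequence are illustrative assumptions.

```python
from collections import Counter

# Illustrative subset of synonymous codon families (standard genetic code).
# Only three four-fold degenerate amino acids are listed to keep the sketch short.
FAMILIES = {
    "Thr": ["ACT", "ACC", "ACA", "ACG"],
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
    "Pro": ["CCT", "CCC", "CCA", "CCG"],
}


def rscu(cds: str) -> dict[str, float]:
    """Return RSCU for every codon of the families above that can be scored.

    RSCU = observed codon count / (family total / number of synonymous codons).
    Families that do not occur in the sequence are skipped (RSCU undefined).
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values: dict[str, float] = {}
    for family in FAMILIES.values():
        total = sum(counts[codon] for codon in family)
        if total == 0:
            continue
        expected = total / len(family)
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


if __name__ == "__main__":
    # Toy in-frame sequence: three ACC codons and one ACT codon.
    print(rscu("ACCACCACCACT"))
```

An RSCU above 1 means the codon is used more often than expected under equal synonymous usage, which is how the values reported for ACC, GCC, CCC and TCC should be read.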
|
37
|
Quandt EM, Traverse CC, Ochman H. Local genic base composition impacts protein production and cellular fitness. PeerJ 2018; 6:e4286. [PMID: 29362699 PMCID: PMC5774297 DOI: 10.7717/peerj.4286] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/18/2017] [Accepted: 01/01/2018] [Indexed: 01/25/2023] Open
Abstract
The maintenance of a G + C content that is higher than the mutational input to a genome provides support for the view that selection serves to increase G + C contents in bacteria. Recent experimental evidence from Escherichia coli demonstrated that selection for increasing G + C content operates at the level of translation, but the precise mechanism by which this occurs is unknown. To determine the substrate of selection, we asked whether selection on G + C content acts across all sites within a gene or is confined to particular genic regions or nucleotide positions. We systematically altered the G + C contents of the GFP gene and assayed its effects on the fitness of strains harboring each variant. Fitness differences were attributable to the base compositional variation in the terminal portion of the gene, suggesting a connection to the folding of a specific protein feature. Variants containing sequence features that are thought to result in rapid translation, such as low G + C content and high levels of codon adaptation, displayed highly reduced growth rates. Taken together, our results show that purifying selection acting against A and T mutations most likely results from their tendency to increase the rate of translation, which can perturb the dynamics of protein folding.
Affiliation(s)
- Erik M Quandt
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, United States of America
| | - Charles C Traverse
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, United States of America
| | - Howard Ochman
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, United States of America
|
38
|
Gatherer D. Genome Signatures, Self-Organizing Maps and Higher Order Phylogenies: A Parametric Analysis. Evol Bioinform Online 2017. [DOI: 10.1177/117693430700300001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/15/2022] Open
Abstract
Genome signatures are data vectors derived from the compositional statistics of DNA. The self-organizing map (SOM) is a neural network method for the conceptualisation of relationships within complex data, such as genome signatures. The various parameters of the SOM training phase are investigated for their effect on the accuracy of the resulting output map. It is concluded that larger SOMs, as well as taking longer to train, are less sensitive in phylogenetic classification of unknown DNA sequences. However, where a classification can be made, a larger SOM is more accurate. Increasing the number of iterations in the training phase of the SOM only slightly increases accuracy, without improving sensitivity. The optimal length of the DNA sequence k-mer from which the genome signature should be derived is 4 or 5, but shorter values are almost as effective. In general, these results indicate that small, rapidly trained SOMs are generally as good as larger, longer trained ones for the analysis of genome signatures. These results may also be more generally applicable to the use of SOMs for other complex data sets, such as microarray data.
Affiliation(s)
- Derek Gatherer
- MRC Virology Unit, Institute of Virology. Church Street, Glasgow G11 5JR, UK
|
39
|
Agarwal M, Bhowmick K, Shah K, Krishnamachari A, Dhar SK. Identification and characterization of ARS-like sequences as putative origin(s) of replication in human malaria parasite Plasmodium falciparum. FEBS J 2017. [PMID: 28644560 DOI: 10.1111/febs.14150] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 01/23/2023]
Abstract
DNA replication is a fundamental process in genome maintenance, and initiates from several genomic sites (origins) in eukaryotes. In Saccharomyces cerevisiae, conserved sequences known as autonomously replicating sequences (ARSs) provide a landing pad for the origin recognition complex (ORC), leading to replication initiation. Although origins from higher eukaryotes share some common sequence features, the definitive genomic organization of these sites remains elusive. The human malaria parasite Plasmodium falciparum undergoes multiple rounds of DNA replication; therefore, control of initiation events is crucial to ensure proper replication. However, the sites of DNA replication initiation and the mechanism by which replication is initiated are poorly understood. Here, we have identified and characterized putative origins in P. falciparum by bioinformatics analyses and experimental approaches. An autocorrelation measure method was initially used to search for regions with marked fluctuation (dips) in the chromosome, which we hypothesized might contain potential origins. Indeed, S. cerevisiae ARS consensus sequences were found in dip regions. Several of these P. falciparum sequences were validated with chromatin immunoprecipitation-quantitative PCR, nascent strand abundance and a plasmid stability assay. Subsequently, the same sequences were used in yeast to confirm their potential as origins in vivo. Our results identify the presence of functional ARSs in P. falciparum and provide meaningful insights into replication origins in these deadly parasites. These data could be useful in designing transgenic vectors with improved stability for transfection in P. falciparum.
Affiliation(s)
- Meetu Agarwal
- Special Centre for Molecular Medicine, Jawaharlal Nehru University, New Delhi, India
| | - Krishanu Bhowmick
- Special Centre for Molecular Medicine, Jawaharlal Nehru University, New Delhi, India
| | - Kushal Shah
- Department of Electrical Engineering, Indian Institute of Technology, New Delhi, India
| | - Suman Kumar Dhar
- Special Centre for Molecular Medicine, Jawaharlal Nehru University, New Delhi, India
|
40
|
Wang H, Zhi XY, Qiu J, Shi L, Lu Z. Characterization of a Novel Nicotine Degradation Gene Cluster ndp in Sphingomonas melonis TY and Its Evolutionary Analysis. Front Microbiol 2017; 8:337. [PMID: 28337179 PMCID: PMC5343071 DOI: 10.3389/fmicb.2017.00337] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 08/30/2016] [Accepted: 02/17/2017] [Indexed: 11/13/2022] Open
Abstract
Sphingomonas melonis TY utilizes nicotine as a sole source of carbon, nitrogen, and energy through a variant of the pyridine and pyrrolidine pathways (VPP). A novel 31-kb nicotine-degrading gene cluster, ndp, in strain TY exhibited a genetic organization different from that of the vpp cluster in Ochrobactrum rhizosphaerae SJY1 and Agrobacterium tumefaciens S33. Genes in vpp were separated by a 20-kb interval sequence, while genes in ndp were localized together. Half of the homologous genes were at different loci in ndp and vpp. Moreover, ndp contained a gene encoding a putative transporter of nicotine or another critical metabolite. Among the putative nicotine degradation-related genes, the nicotine hydroxylase, 6-hydroxy-L-nicotine oxidase, 6-hydroxypseudooxynicotine oxidase, and 6-hydroxy-3-succinyl-pyridine monooxygenase responsible for catalyzing the transformation of nicotine to 2,5-dihydropyridine in the initial four steps of the VPP were characterized. Hydroxylation at C6 of the pyridine ring and dehydrogenation at the C2–C3 bond of the pyrrolidine ring were the key reactions common to the VPP, pyrrolidine and pyridine pathways. In addition, the VPP and the pyrrolidine pathway shared the same latter part of the metabolic pathway. After analysis of metabolic genes in the pyridine, pyrrolidine, and VPP pathways, we found that both the evolutionary features and the metabolic mechanisms of the VPP were more similar to those of the pyrrolidine pathway. The linked ndpHFEG genes shared by the VPP and pyrrolidine pathways indicated that these two pathways might share the same origin, although variants were observed in some bacteria. Using comprehensive homolog searches and phylogenetic tree construction, we speculate that the pyridine pathway is distributed in Gram-positive bacteria and that the VPP and pyrrolidine pathways are distributed in Gram-negative bacteria.
Affiliation(s)
- Haixia Wang
- Institute of Microbiology, College of Life Sciences, Zhejiang University Hangzhou, China
| | - Xiao-Yang Zhi
- Yunnan Institute of Microbiology, School of Life Sciences, Yunnan University Kunming, China
| | - Jiguo Qiu
- Department of Microbiology, College of Life Sciences, Nanjing Agricultural University Nanjing, China
| | - Longxiang Shi
- Institution of System Engineering, College of Computer Science and Technology, Zhejiang University Hangzhou, China
| | - Zhenmei Lu
- Institute of Microbiology, College of Life Sciences, Zhejiang University Hangzhou, China
|
41
|
Bohlin J, Eldholm V, Pettersson JHO, Brynildsrud O, Snipen L. The nucleotide composition of microbial genomes indicates differential patterns of selection on core and accessory genomes. BMC Genomics 2017; 18:151. [PMID: 28187704 PMCID: PMC5303225 DOI: 10.1186/s12864-017-3543-7] [Citation(s) in RCA: 49] [Impact Index Per Article: 6.1] [Received: 12/09/2016] [Accepted: 02/02/2017] [Indexed: 12/02/2022] Open
Abstract
Background: The core genome consists of genes shared by the vast majority of a species and is therefore assumed to have been subjected to substantially stronger purifying selection than the more mobile elements of the genome, also known as the accessory genome. Here we examine intragenic base composition differences in core genomes and corresponding accessory genomes in 36 species, represented by the genomes of 731 bacterial strains, to assess the impact of selective forces on base composition in microbes. We also explore, in turn, how these results compare with findings for whole genome intragenic regions. Results: We found that GC content in coding regions is significantly higher in core genomes than in accessory genomes and whole genomes. Likewise, GC content variation within coding regions was significantly lower in core genomes than in accessory genomes and whole genomes. Relative entropy in coding regions, measured as the difference between observed and expected trinucleotide frequencies estimated from mononucleotide frequencies, was significantly higher in the core genomes than in accessory and whole genomes. Relative entropy was positively associated with coding region GC content within the accessory genomes, but not within the corresponding coding regions of core or whole genomes. Conclusion: The higher intragenic GC content and relative entropy, as well as the lower GC content variation, observed in the core genomes are most likely associated with selective constraints. It is unclear whether the positive association between GC content and relative entropy in the more mobile accessory genomes constitutes a signature of selection or of selectively neutral processes.
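The relative entropy measure described in this abstract compares observed trinucleotide frequencies with those expected from mononucleotide frequencies alone. The sketch below assumes one common formulation, the Kullback-Leibler divergence in bits over overlapping trinucleotides; the abstract does not give the exact formula, so the function name `relative_entropy`, the overlapping-window counting and the log base are assumptions rather than the authors' implementation.

```python
import math
from collections import Counter


def relative_entropy(seq: str) -> float:
    """Kullback-Leibler divergence (bits) between observed trinucleotide
    frequencies and the frequencies expected from base composition alone.

    Assumes `seq` contains only A, C, G and T; overlapping windows are used.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 3:
        raise ValueError("sequence must contain at least three bases")
    mono = Counter(seq)                                   # base counts
    tri = Counter(seq[i:i + 3] for i in range(n - 2))     # trinucleotide counts
    windows = n - 2
    divergence = 0.0
    for word, count in tri.items():
        observed = count / windows
        # Expected frequency if the three positions were independent draws.
        expected = math.prod(mono[base] / n for base in word)
        divergence += observed * math.log2(observed / expected)
    return divergence


if __name__ == "__main__":
    print(relative_entropy("ACGACGACGACG"))  # strongly ordered, well above 0
    print(relative_entropy("AAAAAA"))        # fully predicted by composition
```

A value of zero means the trinucleotide usage is fully explained by base composition; larger values indicate more ordering beyond composition, which is the sense in which the core genomes are reported to score higher.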
Affiliation(s)
- Jon Bohlin
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway.
| | - Vegard Eldholm
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
| | - John H O Pettersson
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
| | - Ola Brynildsrud
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
| | - Lars Snipen
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, 1430, Ås, Norway
|
42
|
Statistical Methods for Identifying Sequence Motifs Affecting Point Mutations. Genetics 2016; 205:843-856. [PMID: 27974498 DOI: 10.1534/genetics.116.195677] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 09/14/2016] [Accepted: 12/01/2016] [Indexed: 11/18/2022] Open
Abstract
Mutation processes differ between types of point mutation, genomic locations, cells, and biological species. For some point mutations, specific neighboring bases are known to be mechanistically influential. Beyond these cases, numerous questions remain unresolved, including: what are the sequence motifs that affect point mutations? How large are the motifs? Are they strand symmetric? And, do they vary between samples? We present new log-linear models that allow explicit examination of these questions, along with sequence logo style visualization to enable identifying specific motifs. We demonstrate the performance of these methods by analyzing mutation processes in human germline and malignant melanoma. We recapitulate the known CpG effect, and identify novel motifs, including a highly significant motif associated with A→G mutations. We show that major effects of neighbors on germline mutation lie within [Formula: see text] of the mutating base. Models are also presented for contrasting the entire mutation spectra (the distribution of the different point mutations). We show the spectra vary significantly between autosomes and X-chromosome, with a difference in T→C transition dominating. Analyses of malignant melanoma confirmed reported characteristic features of this cancer, including statistically significant strand asymmetry, and markedly different neighboring influences. The methods we present are made freely available as a Python library https://bitbucket.org/pycogent3/mutationmotif.
|
43
|
Directional and reoccurring sequence change in zoonotic RNA virus genomes visualized by time-series word count. Sci Rep 2016; 6:36197. [PMID: 27808119 PMCID: PMC5093548 DOI: 10.1038/srep36197] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 07/04/2016] [Accepted: 10/11/2016] [Indexed: 12/18/2022] Open
Abstract
Ebolavirus, MERS coronavirus and influenza virus are zoonotic RNA viruses, which mutate very rapidly. Viral growth depends on many host factors, but human cells may not provide the ideal growth conditions for viruses invading from nonhuman hosts. The present time-series analyses of short and long oligonucleotide compositions in these genomes showed directional changes in their composition after invasion from a nonhuman host, which are thought to recur after future invasions. In the recent West Africa Ebola outbreak, directional time-series changes in a wide range of oligonucleotides were observed in common for three geographic areas, and the directional changes were observed also for the recent MERS coronavirus epidemics starting in the Middle East. In addition, common directional changes in human influenza A viruses were observed for three subtypes, whose epidemics started independently. Long oligonucleotides that showed an evident directional change observed in common for the three subtypes corresponded to some of influenza A siRNAs, whose activities have been experimentally proven. Predicting directional and reoccurring changes in oligonucleotide composition should become important for designing diagnostic RT-PCR primers and therapeutic oligonucleotides with long effectiveness.
|
44
|
Karamichalis R, Kari L, Konstantinidis S, Kopecki S, Solis-Reyes S. Additive methods for genomic signatures. BMC Bioinformatics 2016; 17:313. [PMID: 27549194 PMCID: PMC4994249 DOI: 10.1186/s12859-016-1157-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 05/13/2016] [Accepted: 07/19/2016] [Indexed: 01/09/2023] Open
Abstract
Background: Studies exploring the potential of Chaos Game Representations (CGR) of genomic sequences to act as “genomic signatures” (to be species- and genome-specific) showed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be very different. While the hypothesis that CGRs of mitochondrial DNA sequences can act as genomic signatures was validated for a snapshot of all sequenced mitochondrial genomes available in the NCBI GenBank sequence database, to our knowledge no such extensive analysis of CGRs of nuclear DNA sequences exists to date. Results: We analyzed an extensive dataset, totalling 1.45 gigabase pairs, of nuclear/nucleoid genomic sequences (nDNA) from 42 different organisms, spanning all major kingdoms of life. Our computational experiments indicate that CGR signatures of nDNA of two different origins cannot always be differentiated, especially if they originate from closely-related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii. To address this issue, we propose the general concept of additive DNA signature of a set (collection) of DNA sequences. One particular instance, the composite DNA signature, combines information from nDNA fragments and organellar (mitochondrial, chloroplast, or plasmid) genomes. We demonstrate that, in this dataset, composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of CGR signatures of nDNA failed or was inconclusive. Another instance, the assembled DNA signature, combines information from many short DNA subfragments (e.g., 100 basepairs) of a given DNA fragment, to produce its signature. We show that an assembled DNA signature has the same distinguishing power as a conventionally computed CGR signature, while using shorter contiguous sequences and potentially less sequence information. Conclusions: Our results suggest that, while CGR signatures of nDNA cannot always play the role of genomic signatures, composite and assembled DNA signatures (separately or in combination) could potentially be used instead. Such additive signatures could be used, e.g., with raw unassembled next-generation sequencing (NGS) read data, when high-quality sequencing data is not available, or to complement information obtained by other methods of species identification or classification.
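For readers unfamiliar with Chaos Game Representation, the construction can be sketched in a few lines: each base moves the current point halfway toward the corner of the unit square assigned to that base, and counting points per grid cell gives a k-mer frequency picture. The corner assignment below (A, C, G, T at the bottom-left, top-left, top-right and bottom-right corners) is one widely used convention and is an assumption here, as are the function names `cgr_points` and `cgr_grid`; the additive, composite and assembled signatures proposed in the paper are not reproduced.

```python
# Assumed corner convention for the unit square (not confirmed by the abstract).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str) -> list[tuple[float, float]]:
    """Return the CGR trajectory of `seq`, starting from the square's centre.

    Assumes `seq` contains only A, C, G and T.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        corner_x, corner_y = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


def cgr_grid(seq: str, k: int) -> list[list[int]]:
    """Count CGR points in a 2^k x 2^k grid; grid[row][col], row 0 at y = 0."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)  # clamp guards against rounding to 1.0
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    print(cgr_points("AG"))
    print(cgr_grid("ACGTACGT", 1))
```

Two sequences can then be compared by a distance between their grids, which is the general idea behind using CGR patterns as signatures.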
Affiliation(s)
- Rallis Karamichalis
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
| | - Lila Kari
- School of Computing Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada.,Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada.
| | - Stavros Konstantinidis
- Department of Mathematics and Computing Science, Saint Mary's University, Halifax NS, Canada
| | - Steffen Kopecki
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada.,Department of Mathematics and Computing Science, Saint Mary's University, Halifax NS, Canada
| | - Stephen Solis-Reyes
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
|
45
|
Satapathy SS, Powdel BR, Buragohain AK, Ray SK. Discrepancy among the synonymous codons with respect to their selection as optimal codon in bacteria. DNA Res 2016; 23:441-449. [PMID: 27426467 PMCID: PMC5066170 DOI: 10.1093/dnares/dsw027] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 01/06/2016] [Accepted: 05/19/2016] [Indexed: 01/05/2023] Open
Abstract
The different triplets encoding the same amino acid, termed synonymous codons, are not equally abundant in a genome. Factors such as G + C% and tRNA are known to influence their abundance. However, the order of the nucleotides in each codon per se might be another factor affecting its abundance. Of the synonymous codons for a specific amino acid, those used preferentially in highly expressed genes are referred to as the 'optimal codons' (OCs). In this study, we compared the OCs of the 18 amino acids in 221 species of bacteria. We observed an amino acid-specific influence on the selection of OCs. Phylogeny also influences the choice of OCs for some amino acids, such as Glu, Gln, Lys and Leu. The phenomenon of codon bias is also supported by comparative studies of the abundance values of synonymous codons with the same G + C. It is likely that the order of the nucleotides in the triplet codon is also involved in codon usage bias in organisms.
Affiliation(s)
| | - Bhesh Raj Powdel
- Department of Statistics, Darrang College, Tezpur 784001, Assam, India
| | - Alak Kumar Buragohain
- Department of Molecular Biology and Biotechnology, Tezpur University, Napaam, Tezpur 784028, Assam, India.,Office of the Vice-Chancellor, Dibrugarh University, Dibrugarh 786004, Assam, India
| | - Suvendra Kumar Ray
- Department of Molecular Biology and Biotechnology, Tezpur University, Napaam, Tezpur 784028, Assam, India
|
46
|
Michoud G, Jebbar M. High hydrostatic pressure adaptive strategies in an obligate piezophile Pyrococcus yayanosii. Sci Rep 2016; 6:27289. [PMID: 27250364 PMCID: PMC4890121 DOI: 10.1038/srep27289] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 09/08/2015] [Accepted: 05/14/2016] [Indexed: 02/06/2023] Open
Abstract
Pyrococcus yayanosii CH1, as the first and only obligate piezophilic hyperthermophilic microorganism discovered to date, extends the physical and chemical limits of life on Earth. It was isolated from the Ashadze hydrothermal vent at 4,100 m depth. Multi-omics analyses were performed to study the mechanisms used by the cell to cope with high hydrostatic pressure variations. In silico analyses showed that the P. yayanosii genome is highly adapted to its harsh environment, with a loss of aromatic amino acid biosynthesis pathways and high constitutive expression of energy metabolism genes compared with other, non-obligate piezophilic Pyrococcus species. Differential proteomics and transcriptomics analyses identified key hydrostatic pressure-responsive genes involved in translation, chemotaxis, energy metabolism (hydrogenases and formate metabolism) and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) sequences with their CRISPR-associated (Cas) proteins.
Affiliation(s)
- Grégoire Michoud
- Univ Brest, CNRS, Ifremer, UMR 6197-Laboratoire de Microbiologie des Environnements Extrêmes (LM2E), Institut Universitaire Européen de la Mer (IUEM), rue Dumont d'Urville, 29 280 Plouzané, France
| | - Mohamed Jebbar
- Univ Brest, CNRS, Ifremer, UMR 6197-Laboratoire de Microbiologie des Environnements Extrêmes (LM2E), Institut Universitaire Européen de la Mer (IUEM), rue Dumont d'Urville, 29 280 Plouzané, France
|
47
|
Wada Y, Iwasaki Y, Abe T, Wada K, Tooyama I, Ikemura T. CG-containing oligonucleotides and transcription factor-binding motifs are enriched in human pericentric regions. Genes Genet Syst 2016; 90:43-53. [PMID: 26119665 DOI: 10.1266/ggs.90.43] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/23/2022] Open
Abstract
Unsupervised data mining capable of extracting a wide range of information from big sequence data without prior knowledge or particular models is highly desirable in an era of big data accumulation for research on genes, genomes and genetic systems. By handling oligonucleotide compositions in genomic sequences as high-dimensional data, we have previously modified the conventional SOM (self-organizing map) for genome informatics and established BLSOM for oligonucleotide composition, which can analyze more than ten million sequences simultaneously and is thus suitable for big data analyses. Oligonucleotides often represent motif sequences responsible for sequence-specific binding of proteins such as transcription factors. The distribution of such functionally important oligonucleotides is probably biased in genomic sequences, and may differ among genomic regions. When constructing BLSOMs to analyze pentanucleotide composition in 50-kb sequences derived from the human genome in this study, we found that BLSOMs did not classify human sequences according to chromosome but revealed several specific zones, which are enriched for a class of CG-containing pentanucleotides; these zones are composed primarily of sequences derived from pericentric regions. The biological significance of enrichment of these pentanucleotides in pericentric regions is discussed in connection with cell type- and stage-dependent formation of the condensed heterochromatin in the chromocenter, which is formed through association of pericentric regions of multiple chromosomes.
Affiliation(s)
- Yoshiko Wada
- Department of Bioscience, Nagahama Institute of Bio-Science and Technology
|
48
|
Sheshukova EV, Shindyapina AV, Komarova TV, Dorokhov YL. “Matreshka” genes with alternative reading frames. RUSS J GENET+ 2016. [DOI: 10.1134/s1022795416020149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
49
|
Gonthier P, Sillo F, Lagostina E, Roccotelli A, Cacciola OS, Stenlid J, Garbelotto M. Selection processes in simple sequence repeats suggest a correlation with their genomic location: insights from a fungal model system. BMC Genomics 2015; 16:1107. [PMID: 26714466 PMCID: PMC4696308 DOI: 10.1186/s12864-015-2274-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 09/10/2015] [Accepted: 12/03/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Adaptive processes shape the evolution of genomes and the diverse functions of different genomic regions are likely to have an impact on the trajectory and outcome of this evolution. The main underlying hypothesis of this study is that the evolution of Simple Sequence Repeats (SSRs) is correlated with the evolution of the genomic region in which they are located, resulting in differences of motif size, number of repeats, and levels of polymorphisms. These differences should be clearly detectable when analyzing the frequency and type of SSRs within the genome of a species, when studying populations within a species, and when comparing closely related sister taxa. By coupling a genome-wide SSR survey in the genome of the plant pathogenic fungus Heterobasidion irregulare with an analysis of intra- and interspecific variability of 39 SSR markers in five populations of the two sibling species H. irregulare and H. annosum, we investigated mechanisms of evolution of SSRs. RESULTS: Results showed a clear dominance of trinucleotide repeats and selection against other repeat unit sizes, i.e. di- and tetranucleotides, both in regions inside Open Reading Frames (ORFs) and upstream of the 5' untranslated region (5'UTR). Locus-by-locus AMOVA showed that SSRs both inside ORFs and upstream of the 5'UTR were more conserved within species than SSRs in other genomic regions, suggesting that their evolution is constrained by the functions of the regions they are in. Principal coordinates analysis (PCoA) indicated that even though SSRs inside ORFs were less polymorphic than those in intergenic regions, they were more powerful in differentiating species. These findings indicate that SSR evolution undergoes a directional selection pressure comparable to that of the ORFs they interrupt and to that of regions involved in regulatory functions. CONCLUSIONS: Our work links the variation and the type of SSRs with regions upstream of the 5'UTR, putatively harbouring regulatory elements, and shows that the evolution of SSRs might be affected by their location in the genome. Additionally, this study provides a first glimpse of a possible molecular basis for fast adaptation to the environment mediated by SSRs.
Affiliation(s)
- Paolo Gonthier
- Department of Agricultural, Forest and Food Sciences, University of Torino, 10095, Grugliasco, Italy.
- Fabiano Sillo
- Department of Agricultural, Forest and Food Sciences, University of Torino, 10095, Grugliasco, Italy.
- Elisa Lagostina
- Department of Environmental Sciences, Policy and Management, University of California at Berkeley, CA, 94720, Berkeley, USA; Department of Earth and Environmental Sciences, University of Pavia, 27100, Pavia, Italy.
- Angela Roccotelli
- Department of Environmental Sciences, Policy and Management, University of California at Berkeley, CA, 94720, Berkeley, USA; Department of Agriculture, Mediterranean University of Reggio Calabria, 89122, Reggio Calabria, Italy.
- Olga Santa Cacciola
- Department of Agriculture, Food and Environment, University of Catania, 95123, Catania, Italy.
- Jan Stenlid
- Department of Forest Mycology and Plant Pathology, Swedish University of Agricultural Sciences, 75007, Uppsala, Sweden.
- Matteo Garbelotto
- Department of Environmental Sciences, Policy and Management, University of California at Berkeley, CA, 94720, Berkeley, USA.
|
50
|
Herrera S, Reyes-Herrera PH, Shank TM. Predicting RAD-seq Marker Numbers across the Eukaryotic Tree of Life. Genome Biol Evol 2015; 7:3207-25. [PMID: 26537225 PMCID: PMC4700943 DOI: 10.1093/gbe/evv210] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Indexed: 01/02/2023]
Abstract
High-throughput sequencing of reduced representation libraries obtained through digestion with restriction enzymes, generically known as restriction site associated DNA sequencing (RAD-seq), is a common strategy to generate genome-wide genotypic and sequence data from eukaryotes. A critical design element of any RAD-seq study is knowledge of the approximate number of genetic markers that can be obtained for a taxon using different restriction enzymes, as this number determines the scope of a project, and ultimately defines its success. This number can only be directly determined if a reference genome sequence is available, or it can be estimated if the genome size and restriction recognition sequence probabilities are known. However, both scenarios are uncommon for nonmodel species. Here, we performed systematic in silico surveys of recognition sequences for diverse and commonly used type II restriction enzymes across the eukaryotic tree of life. Our observations reveal that recognition sequence frequencies for a given restriction enzyme are strikingly variable among broad eukaryotic taxonomic groups, being largely determined by phylogenetic relatedness. We demonstrate that genome sizes can be predicted from cleavage frequency data obtained with restriction enzymes targeting "neutral" elements. Models based on genomic compositions are also effective tools to accurately calculate probabilities of recognition sequences across taxa, and can be applied to species for which reduced representation data are available (including transcriptomes and neutral RAD-seq data sets). The analytical pipeline developed in this study, PredRAD (https://github.com/phrh/PredRAD), and the resulting databases constitute valuable resources that will help guide the design of any study using RAD-seq or related methods.
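The two routes to a site count mentioned in the abstract (direct counting in a reference sequence, or estimation from genome size and recognition sequence probability) can be sketched in a few lines of Python. This is a hedged illustration, not the PredRAD code: the function names are hypothetical, and the probability uses the simplest composition model, in which bases are independent and depend only on GC content. The abstract does not say which composition models PredRAD implements.

```python
import re


def count_sites(seq, site):
    """Count occurrences of a recognition sequence in a known sequence.

    A lookahead is used so that overlapping occurrences are also counted.
    """
    return len(re.findall("(?=%s)" % re.escape(site.upper()), seq.upper()))


def site_probability(site, gc_content):
    """Probability of the site at one position under a GC-content model.

    Assumption: bases are independent, with P(G) = P(C) = gc_content / 2
    and P(A) = P(T) = (1 - gc_content) / 2.
    """
    p_gc = gc_content / 2.0
    p_at = (1.0 - gc_content) / 2.0
    prob = 1.0
    for base in site.upper():
        prob *= p_gc if base in "GC" else p_at
    return prob


def expected_sites(genome_size, site, gc_content):
    """Expected number of sites: candidate positions times site probability."""
    positions = genome_size - len(site) + 1
    return positions * site_probability(site, gc_content)


if __name__ == "__main__":
    # Direct count in a toy sequence containing two EcoRI sites (GAATTC).
    print(count_sites("AAGAATTCTTGAATTCGG", "GAATTC"))  # 2
    # Estimate for a hypothetical 100 Mb genome with 40% GC content.
    print(expected_sites(100_000_000, "GAATTC", 0.40))
```

Comparing such a model-based estimate with the observed count for a sequenced genome is one way to see how far a simple composition model departs from reality for a given enzyme and taxon.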
Affiliation(s)
- Santiago Herrera
- Biology Department, Woods Hole Oceanographic Institution; Biology Department, Massachusetts Institute of Technology
|