1
de la Fuente R, Díaz-Villanueva W, Arnau V, Moya A. Genomic Signature in Evolutionary Biology: A Review. BIOLOGY 2023; 12:biology12020322. [PMID: 36829597 PMCID: PMC9953303 DOI: 10.3390/biology12020322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Revised: 02/11/2023] [Accepted: 02/13/2023] [Indexed: 02/19/2023]
Abstract
Organisms are unique physical entities in which information is stored and continuously processed. The digital nature of DNA sequences enables the construction of a dynamic information reservoir. However, the distinction between the hardware and software components in the information flow is crucial to identify the mechanisms generating specific genomic signatures. In this work, we perform a bibliometric analysis to identify the different purposes of looking for particular patterns in DNA sequences associated with a given phenotype. This study has enabled us to make a conceptual breakdown of the genomic signature and differentiate the leading applications. On the one hand, it refers to gene expression profiling associated with a biological function, which may be shared across taxa. This signature is the focus of study in precision medicine. On the other hand, it also refers to characteristic patterns in species-specific DNA sequences. This interpretation plays a key role in comparative genomics, identifying evolutionary relationships. Looking at the relevant studies in our bibliographic database, we highlight the main factors causing heterogeneities in genome composition and how they can be quantified. All these findings lead us to reformulate some questions relevant to evolutionary biology.
Affiliation(s)
- Rebeca de la Fuente
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
  - Correspondence:
- Wladimiro Díaz-Villanueva
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Vicente Arnau
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Andrés Moya
  - Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
  - Foundation for the Promotion of Sanitary and Biomedical Research of the Valencian Community (FISABIO), 46020 Valencia, Spain
  - CIBER in Epidemiology and Public Health (CIBEResp), 28029 Madrid, Spain
2
Bohlin J, Eldholm V, Pettersson JHO, Brynildsrud O, Snipen L. The nucleotide composition of microbial genomes indicates differential patterns of selection on core and accessory genomes. BMC Genomics 2017; 18:151. [PMID: 28187704 PMCID: PMC5303225 DOI: 10.1186/s12864-017-3543-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 12/09/2016] [Accepted: 02/02/2017] [Indexed: 12/02/2022] Open
Abstract
Background: The core genome consists of genes shared by the vast majority of a species and is therefore assumed to have been subjected to substantially stronger purifying selection than the more mobile elements of the genome, also known as the accessory genome. Here we examine intragenic base composition differences in core genomes and corresponding accessory genomes in 36 species, represented by the genomes of 731 bacterial strains, to assess the impact of selective forces on base composition in microbes. We also explore, in turn, how these results compare with findings for whole genome intragenic regions. Results: We found that GC content in coding regions is significantly higher in core genomes than accessory genomes and whole genomes. Likewise, GC content variation within coding regions was significantly lower in core genomes than in accessory genomes and whole genomes. Relative entropy in coding regions, measured as the difference between observed and expected trinucleotide frequencies estimated from mononucleotide frequencies, was significantly higher in the core genomes than in accessory and whole genomes. Relative entropy was positively associated with coding region GC content within the accessory genomes, but not within the corresponding coding regions of core or whole genomes. Conclusion: The higher intragenic GC content and relative entropy, as well as the lower GC content variation, observed in the core genomes are most likely associated with selective constraints. It is unclear whether the positive association between GC content and relative entropy in the more mobile accessory genomes constitutes signatures of selection or selectively neutral processes.
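The relative entropy described in this abstract can be illustrated with a minimal Python sketch. It assumes a Kullback-Leibler divergence between observed trinucleotide frequencies and the frequencies expected from the product of mononucleotide frequencies, counted over overlapping windows of a single strand. The function name `relative_entropy_trinucleotides` is hypothetical, and the abstract does not state whether the authors count overlapping words or in-frame codons, so this is not necessarily their exact procedure.

```python
from collections import Counter
from math import log2

BASES = "ACGT"

def relative_entropy_trinucleotides(seq: str) -> float:
    """KL divergence (bits) of observed trinucleotide frequencies from the
    frequencies expected under independent mononucleotide frequencies.
    Assumption: overlapping trinucleotides on one strand; words containing
    characters other than A, C, G, T are skipped."""
    seq = seq.upper()
    mono = Counter(c for c in seq if c in BASES)
    n = sum(mono.values())
    if n == 0:
        return 0.0
    p = {b: mono[b] / n for b in BASES}

    tri = Counter(
        seq[i:i + 3]
        for i in range(len(seq) - 2)
        if all(c in BASES for c in seq[i:i + 3])
    )
    total = sum(tri.values())
    if total == 0:
        return 0.0

    divergence = 0.0
    for word, count in tri.items():
        observed = count / total
        expected = p[word[0]] * p[word[1]] * p[word[2]]
        divergence += observed * log2(observed / expected)
    return divergence

if __name__ == "__main__":
    # A strongly periodic sequence uses only 4 of the 64 possible words,
    # so its trinucleotide usage is far from the mononucleotide expectation.
    print(relative_entropy_trinucleotides("ACGT" * 50))
```

A value near zero means the trinucleotide usage is what the base composition alone would predict; larger values indicate stronger word-usage bias.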
Affiliation(s)
- Jon Bohlin
  - Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- Vegard Eldholm
  - Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- John H O Pettersson
  - Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- Ola Brynildsrud
  - Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- Lars Snipen
  - Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, 1430, Ås, Norway
3
Homology-independent metrics for comparative genomics. Comput Struct Biotechnol J 2015; 13:352-7. [PMID: 26029354 PMCID: PMC4446528 DOI: 10.1016/j.csbj.2015.04.005] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 02/04/2015] [Revised: 04/06/2015] [Accepted: 04/18/2015] [Indexed: 11/24/2022] Open
Abstract
A mainstream procedure to analyze the wealth of genomic data available nowadays is the detection of homologous regions shared across genomes, followed by the extraction of biological information from the patterns of conservation and variation observed in such regions. Although of pivotal importance, comparative genomic procedures that rely on homology inference are obviously not applicable if no homologous regions are detectable. This fact excludes a considerable portion of “genomic dark matter” with no significant similarity (and, consequently, no inferred homology) to any other known sequence from several downstream comparative genomic methods. In this review we compile several sequence metrics that do not rely on homology inference and can be used to compare nucleotide sequences and extract biologically meaningful information from them. These metrics comprise several compositional parameters calculated from sequence data alone, such as GC content, dinucleotide odds ratio, and several codon bias metrics. They also share other interesting properties, such as pervasiveness (patterns persist on smaller scales) and phylogenetic signal. We also cite examples where these homology-independent metrics have been successfully applied to support several bioinformatics challenges, such as taxonomic classification of biological sequences without homology inference. They were also used to detect higher-order patterns of interactions in biological systems, ranging from detecting coevolutionary trends between the genomes of viruses and their hosts to characterization of gene pools of entire microbial communities. We argue that, if correctly understood and applied, homology-independent metrics can add important layers of biological information in comparative genomic studies without prior homology inference.
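As an illustration of one metric named in this abstract, the dinucleotide odds ratio is commonly written as ρ_xy = f_xy / (f_x · f_y), the observed dinucleotide frequency divided by the product of its component base frequencies. The Python sketch below is a minimal single-strand version under that assumption; the function name `dinucleotide_odds_ratios` is hypothetical, and the strand-symmetrized variant (computed over the sequence joined with its reverse complement) is deliberately left out.

```python
from collections import Counter

BASES = "ACGT"

def dinucleotide_odds_ratios(seq: str) -> dict:
    """Return rho[xy] = f_xy / (f_x * f_y) for each dinucleotide xy.
    Assumption: overlapping dinucleotides on a single strand; pairs whose
    component bases never occur are omitted rather than divided by zero."""
    seq = seq.upper()
    mono = Counter(c for c in seq if c in BASES)
    n = sum(mono.values())
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    m = sum(di.values())
    if n == 0 or m == 0:
        return {}

    rho = {}
    for x in BASES:
        for y in BASES:
            fx, fy = mono[x] / n, mono[y] / n
            if fx > 0 and fy > 0:
                rho[x + y] = (di[x + y] / m) / (fx * fy)
    return rho

if __name__ == "__main__":
    ratios = dinucleotide_odds_ratios("ACGT" * 50)
    # Values well above 1 mark over-represented pairs, values near 0
    # mark pairs that are avoided relative to base composition.
    for pair in ("AC", "AA"):
        print(pair, round(ratios[pair], 3))
```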
4
Necessary relations for nucleotide frequencies. J Theor Biol 2015; 374:179-82. [PMID: 25843217 DOI: 10.1016/j.jtbi.2015.03.025] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/11/2014] [Revised: 02/01/2015] [Accepted: 03/21/2015] [Indexed: 11/21/2022]
Abstract
Genome composition analysis of di-, tri- and tetra-nucleotide frequencies is known to be evolutionarily informative, and useful in metagenomic studies, where binning of raw sequence data is often an important first step. Patterns appearing in genome composition analysis may be due to evolutionary processes or purely mathematical relations. For example, the total number of dinucleotides in a sequence is equal to the sum of the individual totals of the sixteen types of dinucleotide, and this is entirely independent of any assumptions made regarding mutation or selection, or indeed any physical or chemical process. Before any statistical analysis can be attempted, a knowledge of all necessary mathematical relations is required. I show that 25% of di-, tri- and tetra-nucleotide frequencies can be written as simple sums and differences of the remainder. The vast majority of organisms have circular genomes, for which these relations are exact and necessary. In the case of linear molecules, the absolute error is very nearly zero, and does not grow with contiguous sequence length. As a result of the new, necessary relations presented here, the foundations of the statistical analysis of di-, tri- and tetra-nucleotide frequencies, and k-mer analysis in general, need to be revisited.
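One elementary relation of the kind this abstract refers to can be checked directly: in a circular sequence every base is followed by exactly one base and preceded by exactly one base, so for each base x the counts satisfy Σ_y N(xy) = Σ_y N(yx) = N(x). The Python sketch below verifies this for a toy circular sequence. The helper names are hypothetical, the input is assumed to contain only A, C, G and T, and the sketch does not attempt to reproduce the author's full set of relations for tri- and tetra-nucleotides.

```python
from collections import Counter

BASES = "ACGT"

def circular_counts(seq: str, k: int) -> Counter:
    """Count k-mers in a circular sequence by wrapping the first k-1
    characters around to the end (assumes len(seq) >= k)."""
    extended = seq + seq[:k - 1]
    return Counter(extended[i:i + k] for i in range(len(seq)))

def check_marginal_relations(seq: str) -> bool:
    """True if, for every base x, the dinucleotides starting with x and
    the dinucleotides ending with x both sum to the count of x."""
    mono = circular_counts(seq, 1)
    di = circular_counts(seq, 2)
    for x in BASES:
        starts_with_x = sum(di[x + y] for y in BASES)
        ends_with_x = sum(di[y + x] for y in BASES)
        if not (starts_with_x == ends_with_x == mono[x]):
            return False
    return True

if __name__ == "__main__":
    print(check_marginal_relations("ACGTTGCAAC"))
```

Because the identity holds for any circular sequence regardless of mutation or selection, a deviation from it can only come from how the k-mers were counted, not from biology.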
5
[Current status of theoretical studies on essential genes in microbes]. YI CHUAN = HEREDITAS 2012; 34:420-30. [PMID: 22522159 DOI: 10.3724/sp.j.1005.2012.00420] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/31/2022]
Abstract
Essential genes are indispensable for the survival of an organism under optimal conditions. The study of essential genes has recently become a hot topic in microbiology, genomics, and bioinformatics. This paper describes the experiments that have determined essential genes in some microbes and reviews theoretical research on essential genes. The main topics are the comparison of essential and non-essential genes based on evolutionary conservation and sequence composition, the in silico prediction of essential genes, and the analysis of the chromosomal distribution of essential genes. Finally, related progress is summarized and open problems are pointed out.
6
Bosi E, Fani R, Fondi M. The mosaicism of plasmids revealed by atypical genes detection and analysis. BMC Genomics 2011; 12:403. [PMID: 21824433 PMCID: PMC3166947 DOI: 10.1186/1471-2164-12-403] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 03/17/2011] [Accepted: 08/08/2011] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND: From an evolutionary viewpoint, prokaryotic genomes are extremely plastic and dynamic, since large amounts of genetic material are continuously added and/or lost through promiscuous gene exchange. In this picture, plasmids play a key role, since they can be transferred between different cells and, through genetic rearrangement(s), undergo gene(s) load, leading, in turn, to the appearance of important metabolic innovations that might be relevant for cell life. Despite their central position in bacterial evolution, a massive analysis of newly acquired functional blocks [likely the result of horizontal gene transfer (HGT) events] residing on plasmids is still missing. RESULTS: We have developed a computational, composition-based pipeline to scan almost 2000 plasmids for genes that differ significantly from their hosting molecule. Plasmid atypical genes (PAGs) were about 6% of the total plasmid ORFs and, on average, each plasmid possessed 4.4 atypical genes. Nevertheless, conjugative plasmids were shown to possess a higher number of atypical genes than that found in non-mobilizable plasmids, providing strong support for the central role suggested for conjugative plasmids in the context of HGT. Part of the retrieved PAGs are organized into (mainly short) clusters and are involved in important biological processes (detoxification, antibiotic resistance, virulence), revealing the importance of HGT in the spreading of metabolic pathways within the whole microbial community. Lastly, our analysis revealed that PAGs mainly derive from other plasmids (rather than coming from phages and/or chromosomes), suggesting that plasmid-plasmid DNA exchange might be the primary source of metabolic innovations in this class of mobile genetic elements. CONCLUSIONS: In this work we have performed the first large-scale analysis of atypical genes that reside on plasmid molecules to date. Our findings on PAG function, organization, distribution and spreading reveal the importance of plasmid-mediated HGT within the complex bacterial evolutionary network and in the dissemination of important biological traits.
Affiliation(s)
- Emanuele Bosi
  - Lab. of Microbial and Molecular Evolution, Dept. of Evolutionary Biology, University of Florence, Italy
7
Yu JF, Sun X. Reannotation of protein-coding genes based on an improved graphical representation of DNA sequence. J Comput Chem 2010; 31:2126-35. [PMID: 20175214 DOI: 10.1002/jcc.21500] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Indexed: 11/06/2022]
Abstract
Over-annotation of protein-coding genes is a common phenomenon in microbial genomes. The genome of Amsacta moorei entomopoxvirus (AmEPV) is a typical case, because more than 63% of its annotated ORFs are hypothetical. In this article, we propose an improved graphical representation, the I-TN (improved curve based on trinucleotides) curve, which allows direct inspection of the composition and distribution of codons and of asymmetric gene structure. This improved graphical representation can also provide convenient tools for genome analysis. From this representation, 18 variables are derived as numerical descriptors that represent the specific features of protein-coding genes quantitatively, and with them we reannotate the protein-coding genes in several viral genomes. Using the parameters trained on the experimentally validated genes, all of the 30 experimentally validated genes and 63 putative genes in the AmEPV genome are recognized correctly as protein coding; the accuracy of the present method is 100% in both self-test and cross-validation. Twenty-eight annotated hypothetical genes are predicted as noncoding, so the number of reannotated protein-coding genes in AmEPV should be 266 instead of the 294 reported in the original annotation. When the present method, trained on AmEPV, is applied directly to other entomopoxvirus genomes, such as Melanoplus sanguinipes entomopoxvirus (MsEPV), all of the 123 annotated function-known and putative genes are recognized correctly as protein coding, and 17 hypothetical genes are recognized as noncoding. The present method could also be extended to other genomes, with or without adaptation of the training sets, with high accuracy.
Affiliation(s)
- Jia-Feng Yu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
8
Bohlin J, Snipen L, Hardy SP, Kristoffersen AB, Lagesen K, Dønsvik T, Skjerve E, Ussery DW. Analysis of intra-genomic GC content homogeneity within prokaryotes. BMC Genomics 2010; 11:464. [PMID: 20691090 PMCID: PMC3091660 DOI: 10.1186/1471-2164-11-464] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.5] [Received: 01/08/2010] [Accepted: 08/06/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Bacterial genomes possess varying GC content (total guanines (Gs) and cytosines (Cs) per total of the four bases within the genome), but within a given genome, GC content can vary locally along the chromosome, with some regions significantly more or less GC rich than on average. We have examined how the GC content varies within microbial genomes to assess whether this property can be associated with certain biological functions related to the organism's environment and phylogeny. We utilize a new quantity, GCVAR, the intra-genomic GC content variability with respect to the average GC content of the total genome. A low GCVAR indicates intra-genomic GC homogeneity and a high GCVAR indicates heterogeneity. RESULTS: The regression analyses indicated that GCVAR was significantly associated with domain (i.e. archaea or bacteria), phylum, and oxygen requirement. GCVAR was significantly higher among anaerobes than among both aerobic and facultative microbes. Although an association has previously been found between mean genomic GC content and oxygen requirement, our analysis suggests that no such association exists when phylogenetic bias is accounted for. A significant association between GCVAR and mean GC content was also found but appears to be non-linear and varies greatly among phyla. CONCLUSIONS: Our findings show that GCVAR is linked with oxygen requirement, while mean genomic GC content is not. We therefore suggest that GCVAR should be used as a complement to mean GC content.
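The idea of measuring GC variability relative to the genome-wide average can be sketched in Python as follows. The abstract does not give the exact GCVAR formula, so this sketch assumes a simple stand-in: the mean squared deviation of GC content in non-overlapping windows from the GC content of the whole sequence. The function names and the default window size are hypothetical choices, not the authors' definition.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the A, C, G, T characters of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def gc_variability(seq: str, window: int = 100) -> float:
    """Mean squared deviation of per-window GC content from the GC content
    of the whole sequence. Assumption: non-overlapping windows; a trailing
    partial window is ignored."""
    genome_gc = gc_fraction(seq)
    windows = [
        seq[i:i + window]
        for i in range(0, len(seq) - window + 1, window)
    ]
    if not windows:
        raise ValueError("sequence is shorter than one window")
    return sum((gc_fraction(w) - genome_gc) ** 2 for w in windows) / len(windows)

if __name__ == "__main__":
    homogeneous = "ACGT" * 100            # every window has 50% GC
    heterogeneous = "G" * 100 + "A" * 100  # one GC-rich and one AT-rich window
    print(gc_variability(homogeneous), gc_variability(heterogeneous))
```

Under this stand-in, a value of zero corresponds to the homogeneous case described in the abstract, and larger values to increasing heterogeneity.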
Affiliation(s)
- Jon Bohlin
  - Norwegian School of Veterinary Science, Department of Food Safety and Infection Biology, Ullevålsveien 72, P.O. Box 8146 Dep, NO-0033 Oslo, Norway
9
Examination of genome homogeneity in prokaryotes using genomic signatures. PLoS One 2009; 4:e8113. [PMID: 19956556 PMCID: PMC2781299 DOI: 10.1371/journal.pone.0008113] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Received: 07/31/2009] [Accepted: 11/05/2009] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND: DNA word frequencies, normalized for genomic AT content, are remarkably stable within prokaryotic genomes and are therefore said to reflect a "genomic signature." The genomic signatures can be used to phylogenetically classify organisms from arbitrarily sampled DNA. Genomic signatures can also be used to search for horizontally transferred DNA or DNA regions subjected to special selection forces. Thus, the stability of the genomic signature can be used as a measure of genomic homogeneity. The factors associated with the stability of the genomic signatures are not known, and this motivated us to investigate further. We analyzed the intra-genomic variance of genomic signatures based on AT content normalization (0th order Markov model) as well as genomic signatures normalized by smaller DNA words (1st and 2nd order Markov models) for 636 sequenced prokaryotic genomes. Regression models were fitted, with intra-genomic signature variance as the response variable, to a set of factors representing genomic properties such as genomic AT content, genome size, habitat, phylum, oxygen requirement, optimal growth temperature and oligonucleotide usage variance (OUV, a measure of oligonucleotide usage bias), measured as the variance between genomic tetranucleotide frequencies and Markov chain approximated tetranucleotide frequencies, as predictors. PRINCIPAL FINDINGS: Regression analysis revealed that OUV was the most important factor (p<0.001) determining intra-genomic homogeneity as measured using genomic signatures. This means that the less random the oligonucleotide usage is in the sense of higher OUV, the more homogeneous the genome is in terms of the genomic signature. The other factors influencing variance in the genomic signature (p<0.001) were genomic AT content, phylum and oxygen requirement. CONCLUSIONS: Genomic homogeneity in prokaryotes is intimately linked to genomic GC content, oligonucleotide usage bias (OUV) and aerobiosis, while oligonucleotide usage bias (OUV) is associated with genomic GC content, aerobiosis and habitat.