1
|
Duality Between the Local Score of One Sequence and Constrained Hidden Markov Model. Methodol Comput Appl Probab 2022. [DOI: 10.1007/s11009-021-09856-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
2
|
Algama M, Keith JM. Investigating genomic structure using changept: A Bayesian segmentation model. Comput Struct Biotechnol J 2014; 10:107-15. [PMID: 25349679 PMCID: PMC4204429 DOI: 10.1016/j.csbj.2014.08.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Abstract
Genomes are composed of a wide variety of elements with distinct roles and characteristics. Some of these elements are well-characterised functional components such as protein-coding exons. Other elements play regulatory or structural roles, encode functional non-protein-coding RNAs, or perform some other function yet to be characterised. Still others may have no functional importance, though they may nevertheless be of interest to biologists. One technique for investigating the composition of genomes is to segment sequences into compositionally homogenous blocks. This technique, known as 'sequence segmentation' or 'change-point analysis', is used to identify patterns of variation across genomes such as GC-rich and GC-poor regions, coding and non-coding regions, slowly evolving and rapidly evolving regions and many other types of variation. In this mini-review we outline many of the genome segmentation methods currently available and then focus on a Bayesian DNA segmentation algorithm, with examples of its various applications.
Collapse
Affiliation(s)
- Manjula Algama
- School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
| | - Jonathan M Keith
- School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
| |
Collapse
|
3
|
Lassalle F, Campillo T, Vial L, Baude J, Costechareyre D, Chapulliot D, Shams M, Abrouk D, Lavire C, Oger-Desfeux C, Hommais F, Guéguen L, Daubin V, Muller D, Nesme X. Genomic species are ecological species as revealed by comparative genomics in Agrobacterium tumefaciens. Genome Biol Evol 2011; 3:762-81. [PMID: 21795751 PMCID: PMC3163468 DOI: 10.1093/gbe/evr070] [Citation(s) in RCA: 88] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
The definition of bacterial species is based on genomic similarities, giving rise to the operational concept of genomic species, but the reasons of the occurrence of differentiated genomic species remain largely unknown. We used the Agrobacterium tumefaciens species complex and particularly the genomic species presently called genomovar G8, which includes the sequenced strain C58, to test the hypothesis of genomic species having specific ecological adaptations possibly involved in the speciation process. We analyzed the gene repertoire specific to G8 to identify potential adaptive genes. By hybridizing 25 strains of A. tumefaciens on DNA microarrays spanning the C58 genome, we highlighted the presence and absence of genes homologous to C58 in the taxon. We found 196 genes specific to genomovar G8 that were mostly clustered into seven genomic islands on the C58 genome—one on the circular chromosome and six on the linear chromosome—suggesting higher plasticity and a major adaptive role of the latter. Clusters encoded putative functional units, four of which had been verified experimentally. The combination of G8-specific functions defines a hypothetical species primary niche for G8 related to commensal interaction with a host plant. This supports that the G8 ancestor was able to exploit a new ecological niche, maybe initiating ecological isolation and thus speciation. Searching genomic data for synapomorphic traits is a powerful way to describe bacterial species. This procedure allowed us to find such phenotypic traits specific to genomovar G8 and thus propose a Latin binomial, Agrobacterium fabrum, for this bona fide genomic species.
Collapse
Affiliation(s)
- Florent Lassalle
- Université de Lyon, Université Lyon 1, CNRS, INRA, Laboratoire Ecologie Microbienne Lyon, UMR 5557, USC 1193, Villeurbanne, France
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
4
|
Elhaik E, Graur D, Josić K, Landan G. Identifying compositionally homogeneous and nonhomogeneous domains within the human genome using a novel segmentation algorithm. Nucleic Acids Res 2010; 38:e158. [PMID: 20571085 PMCID: PMC2926622 DOI: 10.1093/nar/gkq532] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022] Open
Abstract
It has been suggested that the mammalian genome is composed mainly of long compositionally homogeneous domains. Such domains are frequently identified using recursive segmentation algorithms based on the Jensen–Shannon divergence. However, a common difficulty with such methods is deciding when to halt the recursive partitioning and what criteria to use in deciding whether a detected boundary between two segments is real or not. We demonstrate that commonly used halting criteria are intrinsically biased, and propose IsoPlotter, a parameter-free segmentation algorithm that overcomes such biases by using a simple dynamic halting criterion and tests the homogeneity of the inferred domains. IsoPlotter was compared with an alternative segmentation algorithm, DJS, using two sets of simulated genomic sequences. Our results show that IsoPlotter was able to infer both long and short compositionally homogeneous domains with low GC content dispersion, whereas DJS failed to identify short compositionally homogeneous domains and sequences with low compositional dispersion. By segmenting the human genome with IsoPlotter, we found that one-third of the genome is composed of compositionally nonhomogeneous domains and the remaining is a mixture of many short compositionally homogeneous domains and relatively few long ones.
Collapse
Affiliation(s)
- Eran Elhaik
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
| | | | | | | |
Collapse
|
5
|
Elhaik E, Graur D, Josic K. Comparative testing of DNA segmentation algorithms using benchmark simulations. Mol Biol Evol 2009; 27:1015-24. [PMID: 20018981 DOI: 10.1093/molbev/msp307] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Numerous segmentation methods for the detection of compositionally homogeneous domains within genomic sequences have been proposed. Unfortunately, these methods yield inconsistent results. Here, we present a benchmark consisting of two sets of simulated genomic sequences for testing the performances of segmentation algorithms. Sequences in the first set are composed of fixed-sized homogeneous domains, distinct in their between-domain guanine and cytosine (GC) content variability. The sequences in the second set are composed of a mosaic of many short domains and a few long ones, distinguished by sharp GC content boundaries between neighboring domains. We use these sets to test the performance of seven segmentation algorithms in the literature. Our results show that recursive segmentation algorithms based on the Jensen-Shannon divergence outperform all other algorithms. However, even these algorithms perform poorly in certain instances because of the arbitrary choice of a segmentation-stopping criterion.
Collapse
Affiliation(s)
- Eran Elhaik
- Department of Biology & Biochemistry, University of Houston, TX, USA.
| | | | | |
Collapse
|
6
|
Boussau B, Guéguen L, Gouy M. A mixture model and a hidden markov model to simultaneously detect recombination breakpoints and reconstruct phylogenies. Evol Bioinform Online 2009; 5:67-79. [PMID: 19812727 PMCID: PMC2747125 DOI: 10.4137/ebo.s2242] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Homologous recombination is a pervasive biological process that affects sequences in all living organisms and viruses. In the presence of recombination, the evolutionary history of an alignment of homologous sequences cannot be properly depicted by a single bifurcating tree: some sites have evolved along a specific phylogenetic tree, others have followed another path. Methods available to analyse recombination in sequences usually involve an analysis of the alignment through sliding-windows, or are particularly demanding in computational resources, and are often limited to nucleotide sequences. In this article, we propose and implement a Mixture Model on trees and a phylogenetic Hidden Markov Model to reveal recombination breakpoints while searching for the various evolutionary histories that are present in an alignment known to have undergone homologous recombination. These models are sufficiently efficient to be applied to dozens of sequences on a single desktop computer, and can handle equivalently nucleotide or protein sequences. We estimate their accuracy on simulated sequences and test them on real data.
Collapse
Affiliation(s)
- Bastien Boussau
- Université de Lyon, université Lyon 1, CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, 43 boulevard du 11 novembre 1918, Villeurbanne F-69622, France.
| | | | | |
Collapse
|
7
|
Abstract
BACKGROUND One of the most striking features of mammalian and birds chromosomes is the variation in the guanine-cytosine (GC) content that occurs over scales of hundreds of kilobases to megabases; this is known as the "isochore" structure. Among other vertebrates the presence of isochores depends upon the taxon; isochore are clearly present in Crocodiles and turtles but fish genome seems very homogeneous on GC content. This has suggested a unique isochore origin after the divergence between Sarcopterygii and Actinopterygii, but before that between Sauropsida and mammals. However during more than 30 years of analysis, isochore characteristics have been studied and many important biological properties have been associated with the isochore structure of human genomes. For instance, the genes are more compact and their density is highest in GC rich isochores. RESULTS This paper shows in teleost fish genomes the existence of "GC segmentation" sharing some of the characteristics of isochores although teleost fish genomes presenting a particular homogeneity in CG content. The entire genomes of T nigroviridis and D rerio are now available, and this has made it possible to check whether a mosaic structure associated with isochore properties can be found in these fishes. In this study, hidden Markov models were trained on fish genes (T nigroviridis and D rerio) which were classified by using the isochore class of their human orthologous. A clear segmentation of these genomes was detected. CONCLUSION The GC content is an excellent indicator of isochores in heterogeneous genomes as mammals. The segmentation we obtained were well correlated with GC content and other properties associated to GC content such as gene density, the number of exons per gene and the length of introns. Therefore, the GC content is the main property that allows the detection of isochore but more biological properties have to be taken into account. This method allows detecting isochores in homogeneous genomes.
Collapse
|
8
|
Boussau B, Guéguen L, Gouy M. Accounting for horizontal gene transfers explains conflicting hypotheses regarding the position of aquificales in the phylogeny of Bacteria. BMC Evol Biol 2008; 8:272. [PMID: 18834516 PMCID: PMC2584045 DOI: 10.1186/1471-2148-8-272] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2008] [Accepted: 10/03/2008] [Indexed: 01/09/2023] Open
Abstract
Background Despite a large agreement between ribosomal RNA and concatenated protein phylogenies, the phylogenetic tree of the bacterial domain remains uncertain in its deepest nodes. For instance, the position of the hyperthermophilic Aquificales is debated, as their commonly observed position close to Thermotogales may proceed from horizontal gene transfers, long branch attraction or compositional biases, and may not represent vertical descent. Indeed, another view, based on the analysis of rare genomic changes, places Aquificales close to epsilon-Proteobacteria. Results To get a whole genome view of Aquifex relationships, all trees containing sequences from Aquifex in the HOGENOM database were surveyed. This study revealed that Aquifex is most often found as a neighbour to Thermotogales. Moreover, informational genes, which appeared to be less often transferred to the Aquifex lineage than non-informational genes, most often placed Aquificales close to Thermotogales. To ensure these results did not come from long branch attraction or compositional artefacts, a subset of carefully chosen proteins from a wide range of bacterial species was selected for further scrutiny. Among these genes, two phylogenetic hypotheses were found to be significantly more likely than the others: the most likely hypothesis placed Aquificales as a neighbour to Thermotogales, and the second one with epsilon-Proteobacteria. We characterized the genes that supported each of these two hypotheses, and found that differences in rates of evolution or in amino-acid compositions could not explain the presence of two incongruent phylogenetic signals in the alignment. Instead, evidence for a large Horizontal Gene Transfer between Aquificales and epsilon-Proteobacteria was found. Conclusion Methods based on concatenated informational proteins and methods based on character cladistics led to different conclusions regarding the position of Aquificales because this lineage has undergone many horizontal gene transfers. However, if a tree of vertical descent can be reconstructed for Bacteria, our results suggest Aquificales should be placed close to Thermotogales.
Collapse
Affiliation(s)
- Bastien Boussau
- Université de Lyon; Université Lyon 1; CNRS; INRIA; Laboratoire de Biométrie et Biologie Evolutive, 43 boulevard du 11 novembre 1918, Villeurbanne F-69622, France.
| | | | | |
Collapse
|
9
|
Aston JAD, Martin DEK. Distributions associated with general runs and patterns in hidden Markov models. Ann Appl Stat 2007. [DOI: 10.1214/07-aoas125] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
10
|
Haiminen N, Mannila H. Discovering isochores by least-squares optimal segmentation. Gene 2007; 394:53-60. [PMID: 17389148 DOI: 10.1016/j.gene.2007.01.028] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2006] [Revised: 01/16/2007] [Accepted: 01/22/2007] [Indexed: 10/23/2022]
Abstract
The isochore structure of a genome is observable by variation in the G+C (guanine and cytosine) content within and between the chromosomes. Describing the isochore structure of vertebrate genomes is a challenging task, and many computational methods have been developed and applied to it. Here we apply a well-known least-squares optimal segmentation algorithm to isochore discovery. The algorithm finds the best division of the sequence into k pieces, such that the segments are internally as homogeneous as possible. We show how this simple segmentation method can be applied to isochore discovery using as input the G+C content of sliding windows on the sequence. To evaluate the performance of this segmentation technique on isochore detection, we present results from segmenting previously studied isochore regions of the human genome. Detailed results on the MHC locus, on parts of chromosomes 21 and 22, and on a 100 Mb region from chromosome 1 are similar to previously suggested isochore structures. We also give results on segmenting all 22 autosomal human chromosomes. An advantage of this technique is that oversegmentation of G+C rich regions can generally be avoided. This is because the technique concentrates on greater global, instead of smaller local, differences in the sequence composition. The effect is further emphasized by a log-transformation of the data that lowers the high variance that is observed in G+C rich regions. We conclude that the least-squares optimal segmentation method is computationally efficient and yields results close to previous biologically motivated isochore structures.
Collapse
MESH Headings
- Algorithms
- Chromosomes, Human/genetics
- Chromosomes, Human, Pair 1/genetics
- Chromosomes, Human, Pair 21/genetics
- Chromosomes, Human, Pair 22/genetics
- Chromosomes, Human, Pair 6/genetics
- GC Rich Sequence
- Genome, Human
- Genomics/statistics & numerical data
- Humans
- Isochores/chemistry
- Isochores/genetics
- Least-Squares Analysis
- Major Histocompatibility Complex
Collapse
Affiliation(s)
- Niina Haiminen
- HIIT Basic Research Unit, Department of Computer Science, University of Helsinki, Finland.
| | | |
Collapse
|
11
|
Haiminen N, Mannila H, Terzi E. Comparing segmentations by applying randomization techniques. BMC Bioinformatics 2007; 8:171. [PMID: 17521423 PMCID: PMC1904250 DOI: 10.1186/1471-2105-8-171] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2007] [Accepted: 05/23/2007] [Indexed: 11/25/2022] Open
Abstract
Background There exist many segmentation techniques for genomic sequences, and the segmentations can also be based on many different biological features. We show how to evaluate and compare the quality of segmentations obtained by different techniques and alternative biological features. Results We apply randomization techniques for evaluating the quality of a given segmentation. Our example applications include isochore detection and the discovery of coding-noncoding structure. We obtain segmentations of relevant sequences by applying different techniques, and use alternative features to segment on. We show that some of the obtained segmentations are very similar to the underlying true segmentations, and this similarity is statistically significant. For some other segmentations, we show that equally good results are likely to appear by chance. Conclusion We introduce a framework for evaluating segmentation quality, and demonstrate its use on two examples of segmental genomic structures. We transform the process of quality evaluation from simply viewing the segmentations, to obtaining p-values denoting significance of segmentation similarity.
Collapse
Affiliation(s)
- Niina Haiminen
- HIIT Basic Research Unit, Department of Computer Science, P.O.Box 68, FI-00014 University of Helsinki, Finland
| | - Heikki Mannila
- HIIT Basic Research Unit, Department of Computer Science, P.O.Box 68, FI-00014 University of Helsinki, Finland
- Laboratory of Computer and Information Science, Helsinki University of Technology, FI-02015 TKK, Finland
| | - Evimaria Terzi
- HIIT Basic Research Unit, Department of Computer Science, P.O.Box 68, FI-00014 University of Helsinki, Finland
| |
Collapse
|
12
|
Melodelima C, Gautier C, Piau D. A markovian approach for the prediction of mouse isochores. J Math Biol 2007; 55:353-64. [PMID: 17486342 DOI: 10.1007/s00285-007-0087-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2006] [Revised: 03/01/2007] [Indexed: 10/23/2022]
Abstract
Hidden Markov models (HMMs) are effective tools to detect series of statistically homogeneous structures, but they are not well suited to analyse complex structures. For example, the duration of stay in a state of a HMM must follow a geometric law. Numerous other methodological difficulties are encountered when using HMMs to segregate genes from transposons or retroviruses, or to determine the isochore classes of genes. The aim of this paper is to analyse these methodological difficulties, and to suggest new tools for the exploration of genome data. We show that HMMs can be used to analyse complex gene structures with bell-shaped length distribution by using convolution of geometric distributions. Thus, we have introduced macros-states to model the distributions of the lengths of the regions. Our study shows that simple HMM could be used to model the isochore organisation of the mouse genome. This potential use of markovian models to help in data exploration has been underestimated until now.
Collapse
Affiliation(s)
- Christelle Melodelima
- UMR 5558 CNRS Biométrie et Biologie Evolutive, Université Claude Bernard Lyon 1, 43 boulevard du 11 Novembre 1818, 69622 Villeurbanne Cedex, France.
| | | | | |
Collapse
|