1
Sheinman M, Arndt PF, Massip F. Modeling the mosaic structure of bacterial genomes to infer their evolutionary history. Proc Natl Acad Sci U S A 2024; 121:e2313367121. [PMID: 38517978 PMCID: PMC10990148 DOI: 10.1073/pnas.2313367121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 01/30/2024] [Indexed: 03/24/2024] Open
Abstract
The chronology and phylogeny of bacterial evolution are difficult to reconstruct due to a scarce fossil record. The analysis of bacterial genomes remains challenging because of large sequence divergence, the plasticity of bacterial genomes due to frequent gene loss, horizontal gene transfer, and differences in selective pressure from one locus to another. Therefore, taking advantage of the rich and rapidly accumulating genomic data requires accurate modeling of genome evolution. An important technical consideration is that loci with high effective mutation rates may diverge beyond the detection limit of the alignment algorithms used, biasing the genome-wide divergence estimates toward smaller divergences. In this article, we propose a novel method to gain insight into bacterial evolution based on statistical properties of genome comparisons. We find that the length distribution of sequence matches is shaped by the effective mutation rates of different loci, by the horizontal transfers, and by the aligner sensitivity. Based on these inputs, we build a model and show that it accounts for the empirically observed distributions, taking the Enterobacteriaceae family as an example. Our method allows one to distinguish segments of vertical and horizontal origins and to estimate the time divergence and exchange rate between any pair of taxa from genome-wide alignments. Based on the estimated time divergences, we construct a time-calibrated phylogenetic tree to demonstrate the accuracy of the method.
Affiliation(s)
- Michael Sheinman
  - Institute for Advanced Studies, Sevastopol State University, Sevastopol 299053, Crimea
- Peter F. Arndt
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin 12163, Germany
- Florian Massip
  - Department U900, Centre for Computational Biology, Mines Paris, PSL University, Paris 75006, France
  - Department U900, Institut Curie, Université Paris Sciences et Lettres, Paris 75005, France
  - INSERM, U900, Paris 75005, France
2
Gao K, Miller J. Primary orthologs from local sequence context. BMC Bioinformatics 2020; 21:48. [PMID: 32028880 PMCID: PMC7006074 DOI: 10.1186/s12859-020-3384-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/13/2019] [Accepted: 01/22/2020] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND The evolutionary history of genes serves as a cornerstone of contemporary biology. Most conserved sequences in mammalian genomes don't code for proteins, yielding a need to infer evolutionary history of sequences irrespective of what kind of functional element they may encode. Thus, sequence-, as opposed to gene-, centric modes of inferring paths of sequence evolution are increasingly relevant. Customarily, homologous sequences derived from the same direct ancestor, whose ancestral position in two genomes is usually conserved, are termed "primary" (or "positional") orthologs. Methods based solely on similarity don't reliably distinguish primary orthologs from other homologs; for this, genomic context is often essential. Context-dependent identification of orthologs traditionally relies on genomic context over length scales characteristic of conserved gene order or whole-genome sequence alignment, and can be computationally intensive. RESULTS We demonstrate that short-range sequence context, as short as a single "maximal" match, distinguishes primary orthologs from other homologs across whole genomes. On mammalian whole genomes not preprocessed by repeat-masker, potential orthologs are extracted by genome intersection as "non-nested maximal matches": maximal matches that are not nested into other maximal matches. It emerges that on both nucleotide and gene scales, non-nested maximal matches recapitulate primary or positional orthologs with high precision and high recall, while the corresponding computation consumes less than one thirtieth of the computation time required by commonly applied whole-genome alignment methods. In regions of genomes that would be masked by repeat-masker, non-nested maximal matches recover orthologs that are inaccessible to Lastz net alignment, for which repeat-masking is a prerequisite. mmRBHs, reciprocal best hits of genes containing non-nested maximal matches, yield novel putative orthologs, e.g. around 1000 pairs of genes for human-chimpanzee. CONCLUSIONS We describe an intersection-based method that requires neither repeat-masking nor alignment to infer evolutionary history of sequences based on short-range genomic sequence context. Ortholog identification based on non-nested maximal matches is parameter-free, and less computationally intensive than many alignment-based methods. It is especially suitable for genome-wide identification of orthologs, and may be applicable to unassembled genomes. We are agnostic as to the reasons for its effectiveness, which may reflect local variation of mean mutation rate.
Affiliation(s)
- Kun Gao
  - School of Science, Southwest University of Science and Technology, 59 Qinglong Road, Mianyang, Sichuan Province, 621010, People's Republic of China
- Jonathan Miller
  - Physics and Biology Unit, Okinawa Institute of Science and Technology Graduate University, 1919-1 Tancha, Onna-son, Kunigami-gun, Okinawa, 904-0495, Japan
3
Polychronopoulos D, King JWD, Nash AJ, Tan G, Lenhard B. Conserved non-coding elements: developmental gene regulation meets genome organization. Nucleic Acids Res 2018; 45:12611-12624. [PMID: 29121339 PMCID: PMC5728398 DOI: 10.1093/nar/gkx1074] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Received: 07/26/2017] [Accepted: 10/24/2017] [Indexed: 12/20/2022] Open
Abstract
Comparative genomics has revealed a class of non-protein-coding genomic sequences that display an extraordinary degree of conservation between two or more organisms, regularly exceeding that found within protein-coding exons. These elements, collectively referred to as conserved non-coding elements (CNEs), are non-randomly distributed across chromosomes and tend to cluster in the vicinity of genes with regulatory roles in multicellular development and differentiation. CNEs are organized into functional ensembles called genomic regulatory blocks–dense clusters of elements that collectively coordinate the expression of shared target genes, and whose span in many cases coincides with topologically associated domains. CNEs display sequence properties that set them apart from other sequences under constraint, and have recently been proposed as useful markers for the reconstruction of the evolutionary history of organisms. Disruption of several of these elements is known to contribute to diseases linked with development, and cancer. The emergence, evolutionary dynamics and functions of CNEs still remain poorly understood, and new approaches are required to enable comprehensive CNE identification and characterization. Here, we review current knowledge and identify challenges that need to be tackled to resolve the impasse in understanding extreme non-coding conservation.
Affiliation(s)
- Dimitris Polychronopoulos
  - Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK
- James W D King
  - Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK
- Alexander J Nash
  - Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK
- Ge Tan
  - Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK
- Boris Lenhard
  - Computational Regulatory Genomics Group, MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Campus, Du Cane Road, London W12 0NN, UK
  - Sars International Centre for Marine Molecular Biology, University of Bergen, Thormøhlensgate 55, N-5008 Bergen, Norway
4
Comparing the Statistical Fate of Paralogous and Orthologous Sequences. Genetics 2016; 204:475-482. [PMID: 27474728 DOI: 10.1534/genetics.116.193912] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/01/2016] [Accepted: 07/26/2016] [Indexed: 02/01/2023] Open
Abstract
For several decades, sequence alignment has been a widely used tool in bioinformatics. For instance, finding homologous sequences with a known function in large databases is used to get insight into the function of nonannotated genomic regions. Very efficient tools like BLAST have been developed to identify and rank possible homologous sequences. To estimate the significance of the homology, the ranking of alignment scores takes a background model for random sequences into account. Using this model we can estimate the probability to find two exactly matching subsequences by chance in two unrelated sequences. For two homologous sequences, the corresponding probability is much higher, which allows us to identify them. Here we focus on the distribution of lengths of exact sequence matches between protein-coding regions of pairs of evolutionarily distant genomes. We show that this distribution exhibits a power-law tail with an exponent [Formula: see text] Developing a simple model of sequence evolution by substitutions and segmental duplications, we show analytically and computationally that paralogous and orthologous gene pairs contribute differently to this distribution. Our model explains the differences observed in the comparison of coding and noncoding parts of genomes, thus providing a better understanding of statistical properties of genomic sequences and their evolution.
5
Gao K, Miller J. Human-chimpanzee alignment: ortholog exponentials and paralog power laws. Comput Biol Chem 2014; 53 Pt A:59-70. [PMID: 25443749 DOI: 10.1016/j.compbiolchem.2014.08.010] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Accepted: 07/11/2014] [Indexed: 11/27/2022]
Abstract
Genomic subsequences conserved between closely related species such as human and chimpanzee exhibit an exponential length distribution, in contrast to the algebraic length distribution observed for sequences shared between distantly related genomes. We find that the former exponential can be further decomposed into an exponential component primarily composed of orthologous sequences, and a truncated algebraic component primarily composed of paralogous sequences.
Affiliation(s)
- Kun Gao
  - Physics and Biology Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
- Jonathan Miller
  - Physics and Biology Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
6
7
Massip F, Sheinman M, Schbath S, Arndt PF. How evolution of genomes is reflected in exact DNA sequence match statistics. Mol Biol Evol 2014; 32:524-35. [PMID: 25398628 PMCID: PMC4298173 DOI: 10.1093/molbev/msu313] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 12/27/2022] Open
Abstract
Genome evolution is shaped by a multitude of mutational processes, including point mutations, insertions, and deletions of DNA sequences, as well as segmental duplications. These mutational processes can leave distinctive qualitative marks in the statistical features of genomic DNA sequences. One such feature is the match length distribution (MLD) of exactly matching sequence segments within an individual genome or between the genomes of related species. These have been observed to exhibit characteristic power law decays in many species. Here, we show that simple dynamical models consisting solely of duplication and mutation processes can already explain the characteristic features of MLDs observed in genomic sequences. Surprisingly, we find that these features are largely insensitive to details of the underlying mutational processes and do not necessarily rely on the action of natural selection. Our results demonstrate how analyzing statistical features of DNA sequences can help us reveal and quantify the different mutational processes that underlie genome evolution.
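To make the notion of a match length distribution concrete, here is a minimal Python sketch that counts maximal exact matches between two short sequences by length. It is only an illustration, not the authors' implementation: the function name `match_length_distribution` and the `min_len` parameter are hypothetical, the quadratic dynamic program is practical only for toy inputs (genome-scale analyses typically rely on suffix-based indexes), and reverse-complement matches are ignored.

```python
from collections import Counter


def match_length_distribution(a: str, b: str, min_len: int = 1) -> Counter:
    """Count maximal exact matches between a and b, keyed by match length.

    Naive O(len(a) * len(b)) dynamic program; hypothetical helper for illustration.
    """
    counts = Counter()
    # prev[j + 1] holds the length of the exact match ending at a[i - 1], b[j]
    prev = [0] * (len(b) + 1)
    for i in range(len(a)):
        cur = [0] * (len(b) + 1)
        for j in range(len(b)):
            if a[i] == b[j]:
                cur[j + 1] = prev[j] + 1
                # The run is maximal if it cannot be extended to the right
                at_end = i + 1 == len(a) or j + 1 == len(b)
                if at_end or a[i + 1] != b[j + 1]:
                    if cur[j + 1] >= min_len:
                        counts[cur[j + 1]] += 1
        prev = cur
    return counts


mld = match_length_distribution("ACGT", "ACGA")
print(sorted(mld.items()))  # [(1, 1), (3, 1)]
```

Plotting such counts against length on double logarithmic axes is the usual way to look for the power-law decay the abstract refers to.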
Affiliation(s)
- Florian Massip
  - Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
  - UR1077, Unite Mathematiques Informatique et Genome, INRA, domaine de Vilvert, Jouy-en-Josas, France
- Michael Sheinman
  - Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
- Sophie Schbath
  - UR1077, Unite Mathematiques Informatique et Genome, INRA, domaine de Vilvert, Jouy-en-Josas, France
- Peter F Arndt
  - Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
8
Provata A, Nicolis C, Nicolis G. Complexity measures for the evolutionary categorization of organisms. Comput Biol Chem 2014; 53 Pt A:5-14. [PMID: 25216557 DOI: 10.1016/j.compbiolchem.2014.08.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Accepted: 07/11/2014] [Indexed: 01/17/2023]
Abstract
Complexity measures are used to compare the genomic characteristics of five organisms belonging to distinct classes spanning the evolutionary tree: higher eukaryotes, amoebae, unicellular eukaryotes and bacteria. The comparisons are undertaken using the full four-letter alphabet and the coarse grained two-letter alphabets AG-CT and AT-CG. We show that the conditional probability matrix for the four-letter and AT-CG alphabet is markedly asymmetric in eukaryotes while it is nearly symmetric in bacterial genomes. Spatial asymmetry is revealed in the four-letter alphabet, signifying that the probability fluxes are nonvanishing and thus the reading sense of a sequence is irreversible for all organisms. Calculations of the block entropy and excess entropy demonstrate that the human genome accommodates better all possible block configurations, especially for long blocks. With respect to point-to-point details and to spatial arrangement of blocks the exit distance distributions from a particular letter demonstrate long distance characteristics in the eukaryotic sequences for all three alphabets, while the bacterial (prokaryotic) genomes deviate indicating short range characteristics. Overall, the conditional probability, the fluxes, the block entropy content and the exit distance distributions can be used as markers, discriminating between eukaryotic and prokaryotic DNA, allowing in many cases to discern details related to finer classes. In all cases the reduction from four letters to two masks some important statistical and spatial properties, with the AT-CG alphabet having higher ability of discrimination than the AG-CT one. In particular, the AT-CG alphabet reduction accentuates the CpG related properties (conditional probabilities w32, long ranged exit distance distribution for A and T nucleotides), but masks sequence asymmetry and irreversibility in all examined organisms.
Affiliation(s)
- A Provata
  - Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15310 Athens, Greece
- C Nicolis
  - Institut Royal Météorologique de Belgique, 3 Avenue Circulaire, 1180 Bruxelles, Belgium
- G Nicolis
  - Interdisciplinary Center for Nonlinear Phenomena and Complex Systems, Université Libre de Bruxelles, Campus Plaine, C.P. 231, 1050 Bruxelles, Belgium
9
Li W, Freudenberg J. Characterizing regions in the human genome unmappable by next-generation-sequencing at the read length of 1000 bases. Comput Biol Chem 2014; 53 Pt A:108-17. [PMID: 25241312 DOI: 10.1016/j.compbiolchem.2014.08.015] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Accepted: 07/11/2014] [Indexed: 12/31/2022]
Abstract
Repetitive and redundant regions of a genome are particularly problematic for mapping sequencing reads. In the present paper, we compile a list of the unmappable regions in the human genome based on the following definition: hypothetical reads with length 1 kb which cannot be uniquely mapped with zero-mismatch alignment for the described regions, considering both the forward and reverse strand. The respective collection of unmappable regions covers 0.77% of the sequence of human autosomes and 8.25% of the sex chromosomes in the reference genome GRCh37/hg19 (overall 1.23%). Not surprisingly, our unmappable regions overlap greatly with segmental duplication, transposable elements, and structural variants. About 99.8% of bases in our unmappable regions are part of either segmental duplication or transposable elements and 98.3% overlap structural variant annotations. Notably, some of these regions overlap units with important biological functions, including 4% of protein-coding genes. In contrast, these regions have zero intersection with the ultraconserved elements, very low overlap with microRNAs, tRNAs, pseudogenes, CpG islands, tandem repeats, microsatellites, sensitive non-coding regions, and the mapping blacklist regions from the ENCODE project.
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, NY 11030, USA
- Jan Freudenberg
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, NY 11030, USA
10
Polychronopoulos D, Sellis D, Almirantis Y. Conserved noncoding elements follow power-law-like distributions in several genomes as a result of genome dynamics. PLoS One 2014; 9:e95437. [PMID: 24787386 PMCID: PMC4008492 DOI: 10.1371/journal.pone.0095437] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 11/20/2013] [Accepted: 03/26/2014] [Indexed: 12/31/2022] Open
Abstract
Conserved, ultraconserved and other classes of constrained elements (collectively referred to as CNEs here), identified by comparative genomics in a wide variety of genomes, are non-randomly distributed across chromosomes. These elements are defined using various degrees of conservation between organisms and several thresholds of minimal length. We here investigate the chromosomal distribution of CNEs by studying the statistical properties of distances between consecutive CNEs. We find widespread power-law-like distributions, i.e. linearity in double logarithmic scale, in the inter-CNE distances, a feature which is connected with fractality and self-similarity. Given that CNEs are often found to be spatially associated with genes, especially with those that regulate developmental processes, we verify by appropriate gene masking that a power-law-like pattern emerges irrespective of whether elements found close to or inside genes are excluded or not. An evolutionary model is put forward for the understanding of these findings that includes segmental or whole genome duplication events and eliminations (loss) of most of the duplicated CNEs. Simulations reproduce the main features of the observed size distributions. Power-law-like patterns in the genomic distributions of CNEs are in accordance with current knowledge about their evolutionary history in several genomes.
Affiliation(s)
- Dimitris Polychronopoulos
  - Institute of Biosciences and Applications, National Center for Scientific Research "Demokritos", Athens, Greece
  - Department of Biochemistry and Molecular Biology, Faculty of Biology, National and Kapodistrian University of Athens, Athens, Greece
- Diamantis Sellis
  - Department of Biology, Stanford University, Stanford, California, United States of America
- Yannis Almirantis
  - Institute of Biosciences and Applications, National Center for Scientific Research "Demokritos", Athens, Greece
11
Carbone A. Information measure for long-range correlated sequences: the case of the 24 human chromosomes. Sci Rep 2014; 3:2721. [PMID: 24056670 PMCID: PMC3779848 DOI: 10.1038/srep02721] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 03/25/2013] [Accepted: 09/04/2013] [Indexed: 01/14/2023] Open
Abstract
A new approach to estimate the Shannon entropy of a long-range correlated sequence is proposed. The entropy is written as the sum of two terms corresponding respectively to power-law (ordered) and exponentially (disordered) distributed blocks (clusters). The approach is illustrated on the 24 human chromosome sequences by taking the nucleotide composition as the relevant information to be encoded/decoded. Interestingly, the nucleotide composition of the ordered clusters is found, on the average, comparable to the one of the whole analyzed sequence, while that of the disordered clusters fluctuates. From the information theory standpoint, this means that the power-law correlated clusters carry the same information of the whole analysed sequence. Furthermore, the fluctuations of the nucleotide composition of the disordered clusters are linked to relevant biological properties, such as segmental duplications and gene density.
Affiliation(s)
- A Carbone
  - Politecnico di Torino, Italy
  - ISC-CNR, Unità Università 'La Sapienza' di Roma, Italy
  - ETH Zurich, Switzerland
12
Li W, Freudenberg J, Miramontes P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics 2014; 15:2. [PMID: 24386976 PMCID: PMC3927684 DOI: 10.1186/1471-2105-15-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 08/25/2013] [Accepted: 12/17/2013] [Indexed: 11/10/2022] Open
Abstract
Background The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high throughput-sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp. Results We observe that the proportion of non-singletons k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values for k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications. Conclusion Read mappability as measured by the proportion of singletons increases steadily up to the length scale around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read lengths much slower than linear. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
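The quantity tracked in this abstract, the proportion of non-singleton k-mers as a function of k, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than the paper's pipeline: the helper name `non_singleton_fraction` is hypothetical, the fraction is taken over k-mer positions on a single strand, and the whole table is held in memory, which would not scale to a full human reference.

```python
from collections import Counter


def non_singleton_fraction(seq: str, k: int) -> float:
    """Fraction of k-mer positions in seq whose k-mer occurs more than once.

    Hypothetical helper; single strand only, exact matches only.
    """
    n = len(seq) - k + 1
    if n <= 0:
        raise ValueError("k must not exceed the sequence length")
    counts = Counter(seq[i:i + k] for i in range(n))
    # Every occurrence of a repeated k-mer counts as a non-singleton position
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n


print(non_singleton_fraction("ACGTACGT", 4))  # 0.4 (ACGT occurs twice among 5 positions)
print(non_singleton_fraction("ACGTACGT", 5))  # 0.0 (all four 5-mers are unique)
```

Evaluating the function over a range of k values gives the kind of decay curve to which the abstract fits piecewise power laws.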
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, USA
13
Taillefer E, Miller J. Exhaustive computation of exact duplications via super and non-nested local maximal repeats. J Bioinform Comput Biol 2013; 12:1350018. [PMID: 24467757 DOI: 10.1142/s0219720013500182] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/05/2023]
Abstract
We propose and implement a method to obtain all duplicated sequences (repeats) from a chromosome or whole genome. Unlike existing approaches, our method makes it possible to simultaneously identify and classify repeats into super, local, and non-nested local maximal repeats. Computational verification demonstrates that maximal repeats for a genome of several gigabases can be identified in a reasonable time, enabling us to identify these maximal repeats for any sequenced genome. The algorithm used for the identification relies on the enhanced suffix array data structure to achieve practical space and time efficiency, to identify and classify the maximal repeats, and to perform further post-processing on the identified duplicated sequences. The simplicity and effectiveness of the implementation make the method readily extendible to more sophisticated computations. Maxmers can be exhaustively accounted for in a few minutes for genome sequences a dozen megabases in length and in less than a day or two for genome sequences a few gigabases in length. One application of duplicated sequence identification is the study of duplicated sequence length distributions, which are found to exhibit a persistent power-law behavior at large lengths. Variation of the estimated exponents of this power law is studied among different species and successive assembly release versions of the same species. This makes the characterization of the power-law regime of sequenced genomes, via maximal repeat identification and classification, an important task for the derivation of models that would help us to elucidate sequence duplication and genome evolution.
Affiliation(s)
- Eddy Taillefer
  - Physics and Biology Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna-son, Kunigami-gun 904-0412, Japan
14
Koroteev MV, Miller J. Scale-free duplication dynamics: a model for ultraduplication. Phys Rev E Stat Nonlin Soft Matter Phys 2011; 84:061919. [PMID: 22304128 DOI: 10.1103/physreve.84.061919] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 08/17/2010] [Revised: 07/04/2011] [Indexed: 05/31/2023]
Abstract
Empirical studies of the genome-wide length distribution of duplicated sequences have revealed an algebraic tail common to nearly all clades. The decay of the tail is often well approximated by a single exponent that takes values within a limited range. We propose and study here scale-free duplication dynamics, a class of model for genome sequence evolution that generates the observed shapes of this distribution. A transition between self-similar and non-self-similar regimes is exhibited. Our model accounts plausibly for the observed form of the algebraic tail, which is not produced by standard models for generating long-range sequence correlations.
Affiliation(s)
- M V Koroteev
  - Physics and Biology Unit, Okinawa Institute of Science and Technology, Suzaki 12-22, Uruma, Okinawa 904-2234, Japan
15
Algebraic distribution of segmental duplication lengths in whole-genome sequence self-alignments. PLoS One 2011; 6:e18464. [PMID: 21779315 PMCID: PMC3136455 DOI: 10.1371/journal.pone.0018464] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 09/17/2010] [Accepted: 03/08/2011] [Indexed: 01/25/2023] Open
Abstract
Distributions of duplicated sequences from genome self-alignment are characterized, including forward and backward alignments in bacteria and eukaryotes. A Markovian process without auto-correlation should generate an exponential distribution expected from local effects of point mutation and selection on localised function; however, the observed distributions show substantial deviation from exponential form – they are roughly algebraic instead – suggesting a novel kind of long-distance correlation that must be non-local in origin.
16
Gu P, Reid JG, Gao X, Shaw CA, Creighton C, Tran PL, Zhou X, Drabek RB, Steffen DL, Hoang DM, Weiss MK, Naghavi AO, El-daye J, Khan MF, Legge GB, Wheeler DA, Gibbs RA, Miller JN, Cooney AJ, Gunaratne PH. Novel microRNA candidates and miRNA-mRNA pairs in embryonic stem (ES) cells. PLoS One 2008; 3:e2548. [PMID: 18648548 PMCID: PMC2481296 DOI: 10.1371/journal.pone.0002548] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Received: 10/03/2007] [Accepted: 05/22/2008] [Indexed: 01/24/2023] Open
Abstract
BACKGROUND MicroRNAs (miRNAs: a class of short non-coding RNAs) are emerging as important agents of post-transcriptional gene regulation and integral components of gene networks. MiRNAs have been strongly linked to stem cells, which have a remarkable dual role in development. They can either continuously replenish themselves (self-renewal), or differentiate into cells that execute a limited number of specific actions (pluripotence). METHODOLOGY/PRINCIPAL FINDINGS In order to identify novel miRNAs from narrow windows of development we carried out an in silico search for micro-conserved elements (MCE) in adult tissue progenitor transcript sequences. A plethora of previously unknown miRNA candidates were revealed including 545 small RNAs that are enriched in embryonic stem (ES) cells over adult cells. Approximately 20% of these novel candidates are down-regulated in ES (Dicer(-/-)) ES cells that are impaired in miRNA maturation. The ES-enriched miRNA candidates exhibit distinct and opposite expression trends from mmu-mirs (an abundant class in adult tissues) during retinoic acid (RA)-induced ES cell differentiation. Significant perturbation of trends is found in both miRNAs and novel candidates in ES (GCNF(-/-)) cells, which display loss of repression of pluripotence genes upon differentiation. CONCLUSION/SIGNIFICANCE Combining expression profile information with miRNA target prediction, we identified miRNA-mRNA pairs that correlate with ES cell pluripotence and differentiation. Perturbation of these pairs in the ES (GCNF(-/-)) mutant suggests a role for miRNAs in the core regulatory networks underlying ES cell self-renewal, pluripotence and differentiation.
Affiliation(s)
- Peili Gu
  - Department of Molecular & Cellular Biology, Baylor College of Medicine, Houston, Texas, United States of America
  - Department of Cancer Genetics, M.D. Anderson Cancer Center, University of Texas, Houston, Texas, United States of America
- Jeffrey G. Reid
  - Department of Chemistry, University of Houston, Houston, Texas, United States of America
  - Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
  - W. M. Keck Center for Interdisciplinary Bioscience Training, Houston, Texas, United States of America
- Xiaolian Gao
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
  - Department of Chemistry, University of Houston, Houston, Texas, United States of America
- Chad A. Shaw
  - Department of Molecular & Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Chad Creighton
  - Duncan Cancer Center, Baylor College of Medicine, Houston, Texas, United States of America
- Peter L. Tran
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
- Rafal B. Drabek
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
  - Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
- David L. Steffen
  - Department of Molecular & Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
  - Bioinformatics Research Center, Baylor College of Medicine, Houston, Texas, United States of America
- David M. Hoang
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
- Michelle K. Weiss
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
- Arash O. Naghavi
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
- Jad El-daye
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
- Mahjabeen F. Khan
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
- Glen B. Legge
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
- David A. Wheeler
  - Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
- Richard A. Gibbs
  - Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
- Jonathan N. Miller
  - Department of Biochemistry & Molecular Biology, Baylor College of Medicine, Houston, Texas, United States of America
  - Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
- Austin J. Cooney
  - Department of Molecular & Cellular Biology, Baylor College of Medicine, Houston, Texas, United States of America
- Preethi H. Gunaratne
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
  - Department of Pathology, Baylor College of Medicine, Houston, Texas, United States of America
  - Department of Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America
|
17
|
Abstract
Adaptations that result from exercise require a multitude of changes at the level of gene expression. Many mechanisms regulate these changes, and they can act at various points in the pathways that affect gene expression. The completion of the human genome sequence, along with the genomes of related species, has provided an enormous amount of information to help dissect and understand these pathways. High-throughput methods, such as DNA microarrays, were the first to take advantage of this wealth of information. A new generation of microarrays has now taken the next step in revealing the mechanisms controlling gene expression: the regulation of gene expression can now be profiled in a high-throughput fashion. However, the potential of this technology has yet to be fully realized in the exercise physiology community. This review highlights some of the latest advances in microarrays and briefly discusses some potential applications to the field of exercise physiology.
Affiliation(s)
- Carl Virtanen
- Microarray Centre, University Health Network, MaRS Centre, Toronto Discovery Tower, 101 College St., Toronto, ON M5G 1L7, Canada
|
18
|
Winkler DA. Network models in drug discovery and regenerative medicine. Biotechnology Annual Review 2008; 14:143-70. [PMID: 18606362 DOI: 10.1016/s1387-2656(08)00005-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/02/2023]
Abstract
Network motifs and modelling paradigms are attracting increasing attention as modelling tools in drug design and development, and in regenerative medicine. There is a gradual but inexorable convergence between these hitherto disparate disciplines. This review summarizes some very recent work in these areas, leading to an understanding of the complementary roles networks play and the factors driving this convergence: network paradigms can be excellent ways of modelling and understanding drug molecules and their action; an understanding of the robustness and vulnerabilities of biological targets may improve the efficacy of drug design and discovery; drug design has an increasingly large role to play in directing stem cell properties; stem cell regulatory networks can be modelled in useful ways using network models at a reasonable level of scale; and the network tools of drug design are also very useful for the design of biomaterials used in regenerative medicine.
Affiliation(s)
- David A Winkler
- CSIRO Molecular and Health Technologies, Clayton 3168, Australia.
|