1
Burks DJ, Azad RK. Mapping Strengths and Weaknesses of Different Clustering Approaches to Deciphering Bacterial Chimerism. OMICS: A Journal of Integrative Biology 2022; 26:422-439. [PMID: 35925817 DOI: 10.1089/omi.2022.0062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Bacterial genomes are chimeras of DNA of different ancestries. Deconstructing chimeric genomes is central to understanding the evolutionary trajectories of their disparate components and thus the organisms as a whole in the light of their evolutionary contexts. Of specific interest is to delineate and quantify native (vertically inherited) and alien (horizontally acquired) components of bacterial genomes and also specify genomic fractions that represent different donor sources. An agglomerative clustering procedure that prioritizes grouping of proximal similar genomic segments has previously been invoked for this purpose in conjunction with a recursive segmentation procedure. Surprisingly, however, the relative strengths and weaknesses of different clustering approaches to deciphering bacterial chimerism have not yet been investigated, despite the need to robustly interpret tens of thousands of completely sequenced bacterial genomes and nearly complete genome assemblies available in the public databases. To bridge this knowledge gap and develop more robust approaches, we assessed different clustering methods, including segment order-based (proximal) clustering, hierarchical clustering, affinity propagation clustering, and a novel network clustering approach on chimeric genomes modeled after bacterial genomes representing a broad spectrum of compositional complexity. Although segment order-based clustering and network clustering compared favorably with the other approaches in discriminating between native and alien DNA at genome-optimized settings, network clustering did consistently better than other methods at parametric settings optimized on all test genomes together. Segment order-based clustering and hierarchical clustering outperformed other methods in alien DNA identification while preserving donor identity in the genomes. Our study highlights the strengths and weaknesses of different approaches and suggests how these can be leveraged to achieve a more robust deconstruction of bacterial chimerism.
Affiliation(s)
- David J Burks
- Department of Biological Sciences, BioDiscovery Institute, University of North Texas, Denton, Texas, USA
- Rajeev K Azad
- Department of Biological Sciences, BioDiscovery Institute, University of North Texas, Denton, Texas, USA
- Department of Mathematics, University of North Texas, Denton, Texas, USA
2
Maderazo D, Flegg JA, Algama M, Ramialison M, Keith J. Detection and identification of cis-regulatory elements using change-point and classification algorithms. BMC Genomics 2022; 23:78. [PMID: 35078412 PMCID: PMC8790847 DOI: 10.1186/s12864-021-08190-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2021] [Accepted: 11/19/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Transcriptional regulation is primarily mediated by the binding of factors to non-coding regions in DNA. Identification of these binding regions enhances understanding of tissue formation and potentially facilitates the development of gene therapies. However, successful identification of binding regions is made difficult by the lack of a universal biological code for their characterisation. RESULTS: We extend an alignment-based method, changept, and identify clusters of biological significance, through ontology and de novo motif analysis. Further, we apply a Bayesian method to estimate and combine binary classifiers on the clusters we identify to produce a better performing composite. CONCLUSIONS: The analysis we describe provides a computational method for identification of conserved binding sites in the human genome and facilitates an alternative interrogation of combinations of existing data sets with alignment data.
Affiliation(s)
- Dominic Maderazo
- School of Mathematics and Statistics, The University of Melbourne, Melbourne, 3010, VIC, Australia.
- Jennifer A Flegg
- School of Mathematics and Statistics, The University of Melbourne, Melbourne, 3010, VIC, Australia
- Manjula Algama
- School of Mathematics, Monash University, Melbourne, 3800, VIC, Australia
- Mirana Ramialison
- Australian Regenerative Medicine Institute, Monash University, Melbourne, 3800, VIC, Australia
- Jonathan Keith
- School of Mathematics, Monash University, Melbourne, 3800, VIC, Australia
3
Sadia F, Boyd S, Keith JM. Bayesian change-point modeling with segmented ARMA model. PLoS One 2019; 13:e0208927. [PMID: 30596668 PMCID: PMC6312324 DOI: 10.1371/journal.pone.0208927] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/28/2018] [Accepted: 11/26/2018] [Indexed: 11/18/2022] Open
Abstract
Time series segmentation aims to identify segment boundary points in a time series, and to determine the dynamical properties corresponding to each segment. To segment time series data, this article presents a Bayesian change-point model in which the data within segments follows an autoregressive moving average (ARMA) model. A prior distribution is defined for the number of change-points, their positions, segment means and error terms. To quantify uncertainty about the location of change-points, the resulting posterior probability distributions are sampled using the Generalized Gibbs sampler Markov chain Monte Carlo technique. This methodology is illustrated by applying it to simulated data and to real data known as the well-log time series data. This well-log data records the measurements of nuclear magnetic response of underground rocks during the drilling of a well. Our approach has high sensitivity, and detects a larger number of change-points than have been identified by comparable methods in the existing literature.
Affiliation(s)
- Farhana Sadia
- School of Mathematical Sciences, Monash University, Clayton, VIC, Australia
- Sarah Boyd
- Faculty of Information Technology, Monash University, Clayton, VIC, Australia
- Jonathan M. Keith
- School of Mathematical Sciences, Monash University, Clayton, VIC, Australia
4
Abstract
Many biological sequences have a segmental structure that can provide valuable clues to their content, structure, and function. The program changept is a tool for investigating the segmental structure of a sequence, and can also be applied to multiple sequences in parallel to identify a common segmental structure, thus providing a method for integrating multiple data types to identify functional elements in genomes. In the previous edition of this book, a command line interface for changept was described. Here we present a graphical user interface for this package, called changeptGUI. This interface also includes tools for pre- and post-processing of data and results to facilitate investigation of the number and characteristics of segment classes.
5
Algama M, Tasker E, Williams C, Parslow AC, Bryson-Richardson RJ, Keith JM. Genome-wide identification of conserved intronic non-coding sequences using a Bayesian segmentation approach. BMC Genomics 2017; 18:259. [PMID: 28347272 PMCID: PMC5369223 DOI: 10.1186/s12864-017-3645-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 11/08/2016] [Accepted: 03/18/2017] [Indexed: 11/17/2022] Open
Abstract
Background: Computational identification of non-coding RNAs (ncRNAs) is a challenging problem. We describe a genome-wide analysis using Bayesian segmentation to identify intronic elements highly conserved between three evolutionarily distant vertebrate species: human, mouse and zebrafish. We investigate the extent to which these elements include ncRNAs (or conserved domains of ncRNAs) and regulatory sequences. Results: We identified 655 deeply conserved intronic sequences in a genome-wide analysis. We also performed a pathway-focussed analysis on genes involved in muscle development, detecting 27 intronic elements, of which 22 were not detected in the genome-wide analysis. At least 87% of the genome-wide and 70% of the pathway-focussed elements have existing annotations indicative of conserved RNA secondary structure. The expression of 26 of the pathway-focussed elements was examined using RT-PCR, providing confirmation that they include expressed ncRNAs. Consistent with previous studies, these elements are significantly over-represented in the introns of transcription factors. Conclusions: This study demonstrates a novel, highly effective, Bayesian approach to identifying conserved non-coding sequences. Our results complement previous findings that these sequences are enriched in transcription factors. However, in contrast to previous studies which suggest the majority of conserved sequences are regulatory factor binding sites, the majority of conserved sequences identified using our approach contain evidence of conserved RNA secondary structures, and our laboratory results suggest most are expressed. Functional roles at DNA and RNA levels are not mutually exclusive, and many of our elements possess evidence of both. Moreover, ncRNAs play roles in transcriptional and post-transcriptional regulation, and this may contribute to the over-representation of these elements in introns of transcription factors. We attribute the higher sensitivity of the pathway-focussed analysis compared to the genome-wide analysis to improved alignment quality, suggesting that enhanced genomic alignments may reveal many more conserved intronic sequences. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-3645-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Manjula Algama
- School of Mathematical Sciences, Monash University, Melbourne, VIC, 3800, Australia
- Edward Tasker
- School of Mathematical Sciences, Monash University, Melbourne, VIC, 3800, Australia
- Caitlin Williams
- School of Biological Sciences, Monash University, Melbourne, VIC, 3800, Australia
- Adam C Parslow
- School of Biological Sciences, Monash University, Melbourne, VIC, 3800, Australia
- Jonathan M Keith
- School of Mathematical Sciences, Monash University, Melbourne, VIC, 3800, Australia
6
Hybrid algorithms for multiple change-point detection in biological sequences. Advances in Experimental Medicine and Biology 2015; 823:41-61. [PMID: 25381101 DOI: 10.1007/978-3-319-10984-8_3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 02/10/2023]
Abstract
Array comparative genomic hybridization (aCGH) is one of the techniques that can be used to detect copy number variations in DNA sequences in high resolution. It has been identified that abrupt changes in the human genome play a vital role in the progression and development of many complex diseases. In this study we propose two distinct hybrid algorithms that combine efficient sequential change-point detection procedures (the Shiryaev-Roberts procedure and the cumulative sum control chart (CUSUM) procedure) with the Cross-Entropy method, which is an evolutionary stochastic optimization technique to estimate both the number of change-points and their corresponding locations in aCGH data. The proposed hybrid algorithms are applied to both artificially generated data and real aCGH experimental data to illustrate their usefulness. Our results show that the proposed methodologies are effective in detecting multiple change-points in biological sequences of continuous measurements.
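As a rough illustration of the CUSUM procedure named in this abstract, the sketch below implements only the textbook one-sided (upward) CUSUM detector on a sequence of continuous measurements. It is not the authors' hybrid algorithm: the Cross-Entropy optimisation step and the Shiryaev-Roberts procedure are omitted, and the function name and the parameters `target_mean`, `slack` and `threshold` are hypothetical choices made for this example.

```python
from typing import Optional, Sequence


def cusum_first_alarm(
    xs: Sequence[float],
    target_mean: float,
    slack: float,
    threshold: float,
) -> Optional[int]:
    """Return the index at which the upward CUSUM statistic first exceeds
    `threshold`, or None if it never does.

    The statistic accumulates deviations above `target_mean + slack`
    and is reset to zero whenever it would become negative.
    """
    s = 0.0
    for i, x in enumerate(xs):
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return i
    return None


# Toy signal (assumed data): the mean jumps from 0 to 1 at index 20.
signal = [0.0] * 20 + [1.0] * 20
alarm = cusum_first_alarm(signal, target_mean=0.0, slack=0.25, threshold=2.0)
flat = cusum_first_alarm([0.0] * 40, target_mean=0.0, slack=0.25, threshold=2.0)
```

With these settings the statistic stays at zero before the jump and grows by 0.75 per observation afterwards, so the alarm is raised a few observations after the true change; a lower threshold shortens that delay at the cost of more false alarms.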
7
Discovery of putative small non-coding RNAs from the obligate intracellular bacterium Wolbachia pipientis. PLoS One 2015; 10:e0118595. [PMID: 25739023 PMCID: PMC4349823 DOI: 10.1371/journal.pone.0118595] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 09/12/2014] [Accepted: 01/21/2015] [Indexed: 12/18/2022] Open
Abstract
Wolbachia pipientis is an endosymbiotic bacterium that induces a wide range of effects in its insect hosts, including manipulation of reproduction and protection against pathogens. Little is known of the molecular mechanisms underlying the insect-Wolbachia interaction, though it is likely to be mediated via the secretion of proteins or other factors. There is an increasing amount of evidence that bacteria regulate many cellular processes, including secretion of virulence factors, using small non-coding RNAs (sRNAs), but sRNAs have not previously been described from Wolbachia. We have used two independent approaches, one based on comparative genomics and the other using RNA-Seq data generated for gene expression studies, to identify candidate sRNAs in Wolbachia. We experimentally characterized the expression of one of these candidates in four Wolbachia strains, and showed that it is differentially regulated in different host tissues and sexes. Given the roles played by sRNAs in other host-associated bacteria, the conservation of the candidate sRNAs between different Wolbachia strains, and the sex- and tissue-specific differential regulation we have identified, we hypothesise that sRNAs may play a significant role in the biology of Wolbachia, and in particular in its interactions with its host.
8
Algama M, Keith JM. Investigating genomic structure using changept: A Bayesian segmentation model. Comput Struct Biotechnol J 2014; 10:107-15. [PMID: 25349679 PMCID: PMC4204429 DOI: 10.1016/j.csbj.2014.08.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/06/2023] Open
Abstract
Genomes are composed of a wide variety of elements with distinct roles and characteristics. Some of these elements are well-characterised functional components such as protein-coding exons. Other elements play regulatory or structural roles, encode functional non-protein-coding RNAs, or perform some other function yet to be characterised. Still others may have no functional importance, though they may nevertheless be of interest to biologists. One technique for investigating the composition of genomes is to segment sequences into compositionally homogenous blocks. This technique, known as 'sequence segmentation' or 'change-point analysis', is used to identify patterns of variation across genomes such as GC-rich and GC-poor regions, coding and non-coding regions, slowly evolving and rapidly evolving regions and many other types of variation. In this mini-review we outline many of the genome segmentation methods currently available and then focus on a Bayesian DNA segmentation algorithm, with examples of its various applications.
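To make the idea of segmenting a sequence into compositionally homogeneous blocks concrete, here is a minimal sketch that finds the single best split of a DNA string into a GC-poor and a GC-rich block by maximum likelihood. This is a simplified stand-in chosen for illustration, not the Bayesian changept algorithm the review focuses on; the function names are hypothetical.

```python
import math
from typing import Optional, Tuple


def block_log_lik(gc: int, n: int) -> float:
    """Bernoulli log-likelihood of a block with `gc` G/C bases out of `n`,
    evaluated at the block's own maximum-likelihood GC fraction."""
    if n == 0 or gc == 0 or gc == n:
        return 0.0
    p = gc / n
    return gc * math.log(p) + (n - gc) * math.log(1.0 - p)


def best_split(seq: str) -> Tuple[Optional[int], float]:
    """Return (position, gain): the split point that most improves the
    log-likelihood over a single homogeneous block, and that improvement.
    Position is None when no split improves the fit."""
    is_gc = [1 if base in "GC" else 0 for base in seq.upper()]
    n, total = len(is_gc), sum(is_gc)
    whole = block_log_lik(total, n)
    best_pos, best_gain = None, 0.0
    left = 0
    for k in range(1, n):
        left += is_gc[k - 1]
        gain = block_log_lik(left, k) + block_log_lik(total - left, n - k) - whole
        if gain > best_gain:
            best_pos, best_gain = k, gain
    return best_pos, best_gain


# Toy sequence (assumed data): 20 AT-only bases followed by 20 GC-only bases.
pos, gain = best_split("AT" * 10 + "GC" * 10)
```

Applying the same split recursively to each resulting block, with a stopping rule on the gain, would give a simple multi-segment procedure; a Bayesian method instead samples the number and positions of change-points.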
Affiliation(s)
- Manjula Algama
- School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
- Jonathan M Keith
- School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
9
Algama M, Oldmeadow C, Tasker E, Mengersen K, Keith JM. Drosophila 3' UTRs are more complex than protein-coding sequences. PLoS One 2014; 9:e97336. [PMID: 24824035 PMCID: PMC4019593 DOI: 10.1371/journal.pone.0097336] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 12/11/2013] [Accepted: 04/18/2014] [Indexed: 01/03/2023] Open
Abstract
The 3′ UTRs of eukaryotic genes participate in a variety of post-transcriptional (and some transcriptional) regulatory interactions. Some of these interactions are well characterised, but an undetermined number remain to be discovered. While some regulatory sequences in 3′ UTRs may be conserved over long evolutionary time scales, others may have only ephemeral functional significance as regulatory profiles respond to changing selective pressures. Here we propose a sensitive segmentation methodology for investigating patterns of composition and conservation in 3′ UTRs based on comparison of closely related species. We describe encodings of pairwise and three-way alignments integrating information about conservation, GC content and transition/transversion ratios and apply the method to three closely related Drosophila species: D. melanogaster, D. simulans and D. yakuba. Incorporating multiple data types greatly increased the number of segment classes identified compared to similar methods based on conservation or GC content alone. We propose that the number of segments and number of types of segment identified by the method can be used as proxies for functional complexity. Our main finding is that the number of segments and segment classes identified in 3′ UTRs is greater than in the same length of protein-coding sequence, suggesting greater functional complexity in 3′ UTRs. There is thus a need for sustained and extensive efforts by bioinformaticians to delineate functional elements in this important genomic fraction. C code, data and results are available upon request.
Affiliation(s)
- Manjula Algama
- School of Mathematical Sciences, Monash University, Clayton, Victoria, Australia
- Christopher Oldmeadow
- School of Medicine and Public Health, University of Newcastle, Newcastle, New South Wales, Australia
- Edward Tasker
- School of Mathematical Sciences, Monash University, Clayton, Victoria, Australia
- Kerrie Mengersen
- School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Jonathan M. Keith
- School of Mathematical Sciences, Monash University, Clayton, Victoria, Australia
10
Futschik A, Hotz T, Munk A, Sieling H. Multiscale DNA partitioning: statistical evidence for segments. Bioinformatics 2014; 30:2255-62. [PMID: 24753487 DOI: 10.1093/bioinformatics/btu180] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: DNA segmentation, i.e. the partitioning of DNA in compositionally homogeneous segments, is a basic task in bioinformatics. Different algorithms have been proposed for various partitioning criteria such as Guanine/Cytosine (GC) content, local ancestry in population genetics or copy number variation. A critical component of any such method is the choice of an appropriate number of segments. Some methods use model selection criteria and do not provide a suitable error control. Other methods that are based on simulating a statistic under a null model provide suitable error control only if the correct null model is chosen. RESULTS: Here, we focus on partitioning with respect to GC content and propose a new approach that provides statistical error control: as in statistical hypothesis testing, it guarantees with a user-specified probability [Formula: see text] that the number of identified segments does not exceed the number of actually present segments. The method is based on a statistical multiscale criterion, rendering this as a segmentation method that searches segments of any length (on all scales) simultaneously. It is also accurate in localizing segments: under benchmark scenarios, our approach leads to a segmentation that is more accurate than the approaches discussed in the comparative review of Elhaik et al. In our real data examples, we find segments that often correspond well to features taken from standard University of California at Santa Cruz (UCSC) genome annotation tracks. AVAILABILITY AND IMPLEMENTATION: Our method is implemented in function smuceR of the R-package stepR available at http://www.stochastik.math.uni-goettingen.de/smuce.
Affiliation(s)
- Andreas Futschik
- Department of Applied Statistics, JK University Linz, A-4040 Linz, Austria, Institute of Mathematics, Technische Universität Ilmenau, D-98693 Ilmenau, Germany, Institute for Mathematical Stochastics and Felix Bernstein Institute for Mathematical Statistics in Biosciences, Georgia Augusta University of Goettingen and Max Planck Institute for Biophysical Chemistry, D-37077 Goettingen, Germany
- Thomas Hotz
- Department of Applied Statistics, JK University Linz, A-4040 Linz, Austria, Institute of Mathematics, Technische Universität Ilmenau, D-98693 Ilmenau, Germany, Institute for Mathematical Stochastics and Felix Bernstein Institute for Mathematical Statistics in Biosciences, Georgia Augusta University of Goettingen and Max Planck Institute for Biophysical Chemistry, D-37077 Goettingen, Germany
- Axel Munk
- Department of Applied Statistics, JK University Linz, A-4040 Linz, Austria, Institute of Mathematics, Technische Universität Ilmenau, D-98693 Ilmenau, Germany, Institute for Mathematical Stochastics and Felix Bernstein Institute for Mathematical Statistics in Biosciences, Georgia Augusta University of Goettingen and Max Planck Institute for Biophysical Chemistry, D-37077 Goettingen, Germany
- Hannes Sieling
- Department of Applied Statistics, JK University Linz, A-4040 Linz, Austria, Institute of Mathematics, Technische Universität Ilmenau, D-98693 Ilmenau, Germany, Institute for Mathematical Stochastics and Felix Bernstein Institute for Mathematical Statistics in Biosciences, Georgia Augusta University of Goettingen and Max Planck Institute for Biophysical Chemistry, D-37077 Goettingen, Germany
11
Abstract
Since the emergence of high-throughput genome sequencing platforms and more recently the next-generation platforms, the genome databases are growing at an astronomical rate. Tremendous efforts have been invested in recent years in understanding intriguing complexities beneath the vast ocean of genomic data. This is apparent in the spurt of computational methods for interpreting these data in the past few years. Genomic data interpretation is notoriously difficult, partly owing to the inherent heterogeneities appearing at different scales. Methods developed to interpret these data often suffer from their inability to adequately measure the underlying heterogeneities and thus lead to confounding results. Here, we present an information entropy-based approach that unravels the distinctive patterns underlying genomic data efficiently and thus is applicable in addressing a variety of biological problems. We show the robustness and consistency of the proposed methodology in addressing three different biological problems of significance—identification of alien DNAs in bacterial genomes, detection of structural variants in cancer cell lines and alignment-free genome comparison.
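The abstract does not say which entropy measure is used, so the following is only an assumed illustration of how an information entropy-based comparison of two sequence segments can work. It computes the length-weighted Jensen-Shannon divergence between the nucleotide compositions of two segments; the choice of this divergence and the function names are assumptions made for this sketch.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the symbol frequencies in a non-empty string."""
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def jensen_shannon(a: str, b: str) -> float:
    """Length-weighted Jensen-Shannon divergence between two non-empty segments:
    entropy of the pooled segment minus the weighted entropies of the parts."""
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    return (
        shannon_entropy(a + b)
        - (n_a / n) * shannon_entropy(a)
        - (n_b / n) * shannon_entropy(b)
    )


# Toy segments (assumed data).
same = jensen_shannon("ACGT" * 5, "ACGT" * 5)       # identical composition
diff = jensen_shannon("AAAAAAAAAA", "CCCCCCCCCC")   # disjoint composition
```

The divergence is zero when the two segments have the same composition and reaches one bit when two equal-length segments share no symbols, which is what makes it usable as a score for deciding whether adjacent regions differ in origin or structure.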
Affiliation(s)
- Rajeev K Azad
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA.
12
Felicioli C, Marangoni R. BpMatch: an efficient algorithm for a segmental analysis of genomic sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1120-1127. [PMID: 22350206 DOI: 10.1109/tcbb.2012.30] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/31/2023]
Abstract
Here, we propose BpMatch: an algorithm that, working on a suitably modified suffix-tree data structure, is able to compute, in a fast and efficient way, the coverage of a source sequence S on a target sequence T, by taking into account direct and reverse segments, possibly overlapping. Using BpMatch, the operator defines a priori the minimum length l of a segment and the minimum number of occurrences minRep, so that only segments longer than l and having a number of occurrences greater than minRep are considered to be significant. BpMatch outputs the significant segments found and the computed segment-based distance. In the worst case, assuming the alphabet dimension d is a constant, the time required by BpMatch to calculate the coverage is O(l²n). On average, by setting l ≥ 2 log_d(n), the time required to calculate the coverage is only O(n). BpMatch, thanks to the minRep parameter, can also be used to perform a self-covering: to cover a sequence using segments coming from itself, by avoiding the trivial solution of having a single segment coincident with the whole sequence. The result of the self-covering approach is a spectral representation of the repeats contained in the sequence. BpMatch is freely available on: www.sourceforge.net/projects/bpmatch.
13
Boyd SE, Nair B, Ng SW, Keith JM, Orian JM. Computational characterization of 3' splice variants in the GFAP isoform family. PLoS One 2012; 7:e33565. [PMID: 22479412 PMCID: PMC3316583 DOI: 10.1371/journal.pone.0033565] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 11/16/2011] [Accepted: 02/16/2012] [Indexed: 12/26/2022] Open
Abstract
Glial fibrillary acidic protein (GFAP) is an intermediate filament (IF) protein specific to central nervous system (CNS) astrocytes. It has been the subject of intense interest due to its association with neurodegenerative diseases, and because of growing evidence that IF proteins not only modulate cellular structure, but also cellular function. Moreover, GFAP has a family of splicing isoforms apparently more complex than that of other CNS IF proteins, consistent with it possessing a range of functional and structural roles. The gene consists of 9 exons, and to date all isoforms associated with 3' end splicing have been identified from modifications within intron 7, resulting in the generation of exon 7a (GFAPδ/ε) and 7b (GFAPκ). To better understand the nature and functional significance of variation in this region, we used a Bayesian multiple change-point approach to identify conserved regions. This is the first successful application of this method to a single gene--it has previously only been used in whole-genome analyses. We identified several highly or moderately conserved regions throughout the intron 7/7a/7b regions, including untranslated regions and regulatory features, consistent with the biology of GFAP. Several putative unconfirmed features were also identified, including a possible new isoform. We then integrated multiple computational analyses on both the DNA and protein sequences from the mouse, rat and human, showing that the major isoform, GFAPα, has highly conserved structure and features across the three species, whereas the minor isoforms GFAPδ/ε and GFAPκ have low conservation of structure and features at the distal 3' end, both relative to each other and relative to GFAPα. The overall picture suggests distinct and tightly regulated functions for the 3' end isoforms, consistent with complex astrocyte biology. The results illustrate a computational approach for characterising splicing isoform families, using both DNA and protein sequences.
Affiliation(s)
- Sarah E. Boyd
- School of Mathematical Sciences, Monash University, Clayton, Victoria, Australia
- Betina Nair
- Department of Biochemistry, La Trobe University, Bundoora, Victoria, Australia
- Sze Woei Ng
- Department of Biochemistry, La Trobe University, Bundoora, Victoria, Australia
- Jonathan M. Keith
- School of Mathematical Sciences, Monash University, Clayton, Victoria, Australia
- Jacqueline M. Orian
- Department of Biochemistry, La Trobe University, Bundoora, Victoria, Australia
14
Oldmeadow C, Keith JM. Model selection in Bayesian segmentation of multiple DNA alignments. Bioinformatics 2011; 27:604-10. [PMID: 21208984 DOI: 10.1093/bioinformatics/btq716] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION: The analysis of multiple sequence alignments is allowing researchers to glean valuable insights into evolution, as well as identify genomic regions that may be functional, or discover novel classes of functional elements. Understanding the distribution of conservation levels that constitutes the evolutionary landscape is crucial to distinguishing functional regions from non-functional. Recent evidence suggests that a binary classification of evolutionary rates is inappropriate for this purpose and finds only highly conserved functional elements. Given that the distribution of evolutionary rates is multi-modal, determining the number of modes is of paramount concern. Through simulation, we evaluate the performance of a number of information criterion approaches derived from MCMC simulations in determining the dimension of a model. RESULTS: We utilize a deviance information criterion (DIC) approximation that is more robust than the approximations from other information criteria, and show our information criteria approximations do not produce superfluous modes when estimating conservation distributions under a variety of circumstances. We analyse the distribution of conservation for a multiple alignment comprising four primate species and mouse, and repeat this on two additional multiple alignments of similar species. We find evidence of six distinct classes of evolutionary rates that appear to be robust to the species used. AVAILABILITY: Source code and data are available at http://dl.dropbox.com/u/477240/changept.zip.
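For orientation, the standard textbook definition of the deviance information criterion is sketched below. The abstract refers to an approximation of the DIC whose exact form is not given here, so this block shows only the usual definition; the symbols $y$ (data), $\theta$ (model parameters) and $\bar{\theta}$ (posterior mean of the parameters) are notation introduced for this example.

```latex
\begin{align*}
D(\theta) &= -2 \log p(y \mid \theta)
  && \text{(deviance)} \\
\bar{D} &= \mathbb{E}_{\theta \mid y}\!\left[ D(\theta) \right]
  && \text{(posterior mean deviance, estimated by averaging over MCMC samples)} \\
p_D &= \bar{D} - D(\bar{\theta})
  && \text{(effective number of parameters)} \\
\mathrm{DIC} &= \bar{D} + p_D = D(\bar{\theta}) + 2\,p_D
\end{align*}
```

When candidate models with different numbers of rate classes are compared, the one with the smaller DIC is preferred, which trades goodness of fit ($\bar{D}$) against model complexity ($p_D$).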
Affiliation(s)
- Christopher Oldmeadow
- Centre for Clinical Epidemiology and Biostatistics, University of Newcastle, NSW, Victoria, Australia.
15
Stochastic models for large interacting systems and related correlation inequalities. Proc Natl Acad Sci U S A 2010; 107:16413-9. [PMID: 20826441 DOI: 10.1073/pnas.1011270107] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022] Open
Abstract
A very large and active part of probability theory is concerned with the formulation and analysis of models for the evolution of large systems arising in the sciences, including physics and biology. These models have in their description randomness in the evolution rules, and interactions among various parts of the system. This article describes some of the main models in this area, as well as some of the major results about their behavior that have been obtained during the past 40 years. An important technique in this area, as well as in related parts of physics, is the use of correlation inequalities. These express positive or negative dependence between random quantities related to the model. In some types of models, the underlying dependence is positive, whereas in others it is negative. We give particular attention to these issues, and to applications of these inequalities. Among the applications are central limit theorems that give convergence to a Gaussian distribution.
|
16
|
Oldmeadow C, Mengersen K, Mattick JS, Keith JM. Multiple evolutionary rate classes in animal genome evolution. Mol Biol Evol 2009; 27:942-53. [PMID: 19955480 DOI: 10.1093/molbev/msp299] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 12/19/2022] Open
Abstract
The proportion of functional sequence in the human genome is currently a subject of debate. The most widely accepted figure is that approximately 5% is under purifying selection. In Drosophila, estimates are an order of magnitude higher, though this corresponds to a similar quantity of sequence. These estimates depend on the difference between the distribution of genomewide evolutionary rates and that observed in a subset of sequences presumed to be neutrally evolving. Motivated by the widening gap between these estimates and experimental evidence of genome function, especially in mammals, we developed a sensitive technique for evaluating such distributions and found that they are much more complex than previously apparent. We found strong evidence for at least nine well-resolved evolutionary rate classes in an alignment of four Drosophila species and at least seven classes in an alignment of four mammals, including human. We also identified at least three rate classes in human ancestral repeats. By positing that the largest of these ancestral repeat classes is neutrally evolving, we estimate that the proportion of nonneutrally evolving sequence is 30% of human ancestral repeats and 45% of the aligned portion of the genome. However, we also question whether any of the classes represent neutrally evolving sequences and argue that a plausible alternative is that they reflect variable structure-function constraints operating throughout the genomes of complex organisms.
Affiliation(s)
- Christopher Oldmeadow
- School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
|
17
|
Zhou Q, Wong WH. Reconstructing the energy landscape of a distribution from Monte Carlo samples. Ann Appl Stat 2008. [DOI: 10.1214/08-aoas196] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
|
18
|
Keith JM, Adams P, Stephen S, Mattick JS. Delineating slowly and rapidly evolving fractions of the Drosophila genome. J Comput Biol 2008; 15:407-30. [PMID: 18435570 DOI: 10.1089/cmb.2007.0173] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 01/23/2023] Open
Abstract
Evolutionary conservation is an important indicator of function and a major component of bioinformatic methods to identify non-protein-coding genes. We present a new Bayesian method for segmenting pairwise alignments of eukaryotic genomes while simultaneously classifying segments into slowly and rapidly evolving fractions. We also describe an information criterion similar to the Akaike Information Criterion (AIC) for determining the number of classes. Working with pairwise alignments enables detection of differences in conservation patterns among closely related species. We analyzed three whole-genome and three partial-genome pairwise alignments among eight Drosophila species. Three distinct classes of conservation level were detected. Sequences comprising the most slowly evolving component were consistent across a range of species pairs, and constituted approximately 62-66% of the D. melanogaster genome. Almost all (>90%) of the aligned protein-coding sequence is in this fraction, suggesting much of it (comprising the majority of the Drosophila genome, including approximately 56% of non-protein-coding sequences) is functional. The size and content of the most rapidly evolving component was species dependent, and varied from 1.6% to 4.8%. This fraction is also enriched for protein-coding sequence (while containing significant amounts of non-protein-coding sequence), suggesting it is under positive selection. We also classified segments according to conservation and GC content simultaneously. This analysis identified numerous sub-classes of those identified on the basis of conservation alone, but was nevertheless consistent with that classification. Software, data, and results available at www.maths.qut.edu.au/-keithj/. Genomic segments comprising the conservation classes available in BED format.
Affiliation(s)
- Jonathan M Keith
- School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia.
|
19
|
Abstract
Whole-genome comparisons among mammalian and other eukaryotic organisms have revealed that they contain large quantities of conserved non-protein-coding sequence. Although some of the functions of this non-coding DNA have been identified, there remains a large quantity of conserved genomic sequence that is of no known function. Moreover, the task of delineating the conserved sequences is non-trivial, particularly when some sequences are conserved in only a small number of lineages. Sequence segmentation is a statistical technique for identifying putative functional elements in genomes based on atypical sequence characteristics, such as conservation levels relative to other genomes, GC content, SNP frequency, and potentially many others. The publicly available program changept and associated programs use Bayesian multiple change-point analysis to delineate classes of genomic segments with similar characteristics, potentially representing new classes of non-coding RNAs (contact web site: http://silmaril.math.sci.qut.edu.au/~keith/).
|