1
Rosebrock D, Vingron M, Arndt PF. Modeling gene expression cascades during cell state transitions. iScience 2024; 27:109386. [PMID: 38500834 PMCID: PMC10946328 DOI: 10.1016/j.isci.2024.109386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 12/14/2023] [Accepted: 02/27/2024] [Indexed: 03/20/2024] Open
Abstract
During cellular processes such as differentiation or response to external stimuli, cells exhibit dynamic changes in their gene expression profiles. Single-cell RNA sequencing (scRNA-seq) can be used to investigate these dynamic changes. To this end, cells are typically ordered along a pseudotemporal trajectory which recapitulates the progression of cells as they transition from one cell state to another. We infer transcriptional dynamics by modeling the gene expression profiles in pseudotemporally ordered cells using a Bayesian inference approach. This enables ordering genes along transcriptional cascades, estimating differences in the timing of gene expression dynamics, and deducing regulatory gene interactions. Here, we apply this approach to scRNA-seq datasets derived from mouse embryonic forebrain and pancreas samples. This analysis demonstrates the utility of the method to derive the ordering of gene dynamics and regulatory relationships critical for proper cellular differentiation and maturation across a variety of developmental contexts.
Affiliation(s)
- Daniel Rosebrock
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Martin Vingron
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Peter F. Arndt
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany

2
Sheinman M, Arndt PF, Massip F. Modeling the mosaic structure of bacterial genomes to infer their evolutionary history. Proc Natl Acad Sci U S A 2024; 121:e2313367121. [PMID: 38517978 PMCID: PMC10990148 DOI: 10.1073/pnas.2313367121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 01/30/2024] [Indexed: 03/24/2024] Open
Abstract
The chronology and phylogeny of bacterial evolution are difficult to reconstruct due to a scarce fossil record. The analysis of bacterial genomes remains challenging because of large sequence divergence, the plasticity of bacterial genomes due to frequent gene loss, horizontal gene transfer, and differences in selective pressure from one locus to another. Therefore, taking advantage of the rich and rapidly accumulating genomic data requires accurate modeling of genome evolution. An important technical consideration is that loci with high effective mutation rates may diverge beyond the detection limit of the alignment algorithms used, biasing the genome-wide divergence estimates toward smaller divergences. In this article, we propose a novel method to gain insight into bacterial evolution based on statistical properties of genome comparisons. We find that the length distribution of sequence matches is shaped by the effective mutation rates of different loci, by the horizontal transfers, and by the aligner sensitivity. Based on these inputs, we build a model and show that it accounts for the empirically observed distributions, taking the Enterobacteriaceae family as an example. Our method makes it possible to distinguish segments of vertical and horizontal origins and to estimate the time divergence and exchange rate between any pair of taxa from genome-wide alignments. Based on the estimated time divergences, we construct a time-calibrated phylogenetic tree to demonstrate the accuracy of the method.
Affiliation(s)
- Michael Sheinman
  - Institute for Advanced Studies, Sevastopol State University, Sevastopol 299053, Crimea
- Peter F. Arndt
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin 12163, Germany
- Florian Massip
  - Department U900, Centre for Computational Biology, Mines Paris, PSL University, Paris 75006, France
  - Department U900, Institut Curie, Université Paris Sciences et Lettres, Paris 75005, France
  - INSERM, U900, Paris 75005, France

3
Rosebrock D, Arora S, Mutukula N, Volkman R, Gralinska E, Balaskas A, Aragonés Hernández A, Buschow R, Brändl B, Müller FJ, Arndt PF, Vingron M, Elkabetz Y. Enhanced cortical neural stem cell identity through short SMAD and WNT inhibition in human cerebral organoids facilitates emergence of outer radial glial cells. Nat Cell Biol 2022; 24:981-995. [PMID: 35697781 PMCID: PMC9203281 DOI: 10.1038/s41556-022-00929-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 08/13/2020] [Accepted: 04/28/2022] [Indexed: 12/11/2022]
Abstract
Cerebral organoids exhibit broad regional heterogeneity accompanied by limited cortical cellular diversity despite the tremendous upsurge in derivation methods, suggesting inadequate patterning of early neural stem cells (NSCs). Here we show that a short and early Dual SMAD and WNT inhibition course is necessary and sufficient to establish robust and lasting cortical organoid NSC identity, efficiently suppressing non-cortical NSC fates, while other widely used methods are inconsistent in their cortical NSC-specification capacity. Accordingly, this method selectively enriches for outer radial glia NSCs, which cyto-architecturally demarcate well-defined outer sub-ventricular-like regions propagating from superiorly radially organized, apical cortical rosette NSCs. Finally, this method culminates in the emergence of molecularly distinct deep and upper cortical layer neurons, and reliably uncovers cortex-specific microcephaly defects. Thus, a short SMAD and WNT inhibition is critical for establishing a rich cortical cell repertoire that enables mirroring of fundamental molecular and cyto-architectural features of cortical development and meaningful disease modelling. Rosebrock, Arora et al. report a method to overcome limited cortical cellular diversity in human organoids, thus mirroring fundamental features of cortical development and offering a basis for organoid-based disease modelling.
Affiliation(s)
- Daniel Rosebrock
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin, Germany; Department of Computational Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany; Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Sneha Arora
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin, Germany; Department of Cell and Developmental Biology, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel; Institute of Biology, Department of Biology, Chemistry, and Pharmacy, Freie Universität Berlin, Berlin, Germany
- Naresh Mutukula
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin, Germany; Department of Cell and Developmental Biology, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel; Institute of Chemistry and Biochemistry, Department of Biology, Chemistry and Pharmacy, Freie Universität Berlin, Berlin, Germany
- Rotem Volkman
  - Department of Cell and Developmental Biology, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel
- Elzbieta Gralinska
  - Department of Computational Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany; Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Anastasios Balaskas
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin, Germany; Institute of Chemistry and Biochemistry, Department of Biology, Chemistry and Pharmacy, Freie Universität Berlin, Berlin, Germany
- Amèlia Aragonés Hernández
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin, Germany; Institute of Biology, Department of Biology, Chemistry, and Pharmacy, Freie Universität Berlin, Berlin, Germany
- René Buschow
  - Microscopy and Cryo-Electron Microscopy, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Björn Brändl
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin, Germany; Department of Psychiatry and Psychotherapy, University Hospital Schleswig Holstein, Kiel, Germany
- Franz-Josef Müller
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin, Germany; Department of Psychiatry and Psychotherapy, University Hospital Schleswig Holstein, Kiel, Germany
- Peter F Arndt
  - Department of Computational Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Martin Vingron
  - Department of Computational Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Yechiel Elkabetz
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin, Germany; Department of Cell and Developmental Biology, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel

4
Abstract
Background: Segmental duplications (SDs) are long DNA sequences that are repeated in a genome and have high sequence identity. In contrast to repetitive elements they are often unique and only sometimes have multiple copies in a genome. There are several well-studied mechanisms responsible for segmental duplications: non-allelic homologous recombination, non-homologous end joining and replication slippage. Such duplications play an important role in evolution; however, we do not have a full understanding of the dynamic properties of the duplication process. Results: We study segmental duplications through a graph representation where nodes represent genomic regions and edges represent duplications between them. The resulting network (the SD network) is quite complex and has distinct features which allow us to make inferences about the evolution of segmental duplications. We propose a network growth model that explains features of the SD network, thus giving us insights into the dynamics of segmental duplications in the human genome. Based on our analysis of genomes of other species, the network growth model seems to be applicable to multiple mammalian genomes. Conclusions: Our analysis suggests that duplication rates of genomic loci grow linearly with the number of copies of a duplicated region. Several scenarios explaining such preferential duplication rates were suggested. Supplementary Information: The online version contains supplementary material available at (10.1186/s12864-021-07789-7).
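The conclusion that duplication rates grow linearly with copy number can be illustrated with a toy simulation. The sketch below is not the network growth model of the paper; it only assumes that each duplication event picks its source region with probability proportional to that region's current copy number. The function name `simulate_duplications` and all parameter values are hypothetical.

```python
import random
from collections import Counter

def simulate_duplications(n_regions, n_events, seed=0):
    """Toy preferential-duplication process (illustrative assumption,
    not the published model): every event duplicates a region chosen
    with probability proportional to its current copy number."""
    rng = random.Random(seed)
    copies = [1] * n_regions        # copy number of each region family
    # The pool lists every existing copy once, so a uniform draw from it
    # is the same as a draw weighted by copy number.
    pool = list(range(n_regions))
    for _ in range(n_events):
        family = rng.choice(pool)
        copies[family] += 1
        pool.append(family)
    return copies

copies = simulate_duplications(n_regions=1000, n_events=2000)
histogram = Counter(copies)         # copy number -> number of families
```

Under this assumption a few families accumulate many copies while most stay at low copy number, which is the qualitative signature of a rate that scales with the number of existing copies.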
Affiliation(s)
- Eldar T Abdullaev
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63/73, Berlin, 14195, Germany
- Iren R Umarova
  - Faculty of Computational Mathematics and Cybernetics, Moscow State University, Leninskiye Gory 1-52, Moscow, 119991, Russia
- Peter F Arndt
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63/73, Berlin, 14195, Germany

5
Sheinman M, Arkhipova K, Arndt PF, Dutilh BE, Hermsen R, Massip F. Identical sequences found in distant genomes reveal frequent horizontal transfer across the bacterial domain. eLife 2021; 10:62719. [PMID: 34121661 PMCID: PMC8270642 DOI: 10.7554/elife.62719] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 09/03/2020] [Accepted: 06/13/2021] [Indexed: 12/19/2022] Open
Abstract
Horizontal gene transfer (HGT) is an essential force in microbial evolution. Despite detailed studies on a variety of systems, a global picture of HGT in the microbial world is still missing. Here, we exploit the fact that HGT creates long identical DNA sequences in the genomes of distant species, which can be found efficiently using alignment-free methods. Our pairwise analysis of 93,481 bacterial genomes identified 138,273 HGT events. We developed a model to explain their statistical properties as well as estimate the transfer rate between pairs of taxa. This reveals that long-distance HGT is frequent: our results indicate that HGT between species from different phyla has occurred in at least 8% of the species. Finally, our results confirm that the function of sequences strongly impacts their transfer rate, which varies by more than three orders of magnitude between different functional categories. Overall, we provide a comprehensive view of HGT, illuminating a fundamental process driving bacterial evolution.
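The idea of finding long identical stretches without computing an alignment can be sketched with a k-mer index. This is a minimal illustration, not the pipeline used in the study, which processed tens of thousands of genomes with dedicated tools. The function name `exact_matches`, the toy sequences and the length threshold are all assumptions.

```python
def exact_matches(a, b, min_len):
    """Return maximal exact matches between strings a and b as
    (start_in_a, start_in_b, length), keeping only matches of at
    least min_len characters. Illustrative sketch only."""
    # Index every min_len-mer of a by its start positions.
    index = {}
    for i in range(len(a) - min_len + 1):
        index.setdefault(a[i:i + min_len], []).append(i)
    matches = []
    for j in range(len(b) - min_len + 1):
        for i in index.get(b[j:j + min_len], ()):
            # A seed that can be extended to the left lies inside a
            # match that is reported from its leftmost seed.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = min_len
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i, j, length))
    return matches

# Hypothetical toy sequences sharing the 8-mer "ACGTTGCA".
genome_a = "GGGGACGTTGCACCCC"
genome_b = "TTTACGTTGCATTT"
hits = exact_matches(genome_a, genome_b, min_len=6)
```

In divergent genomes, long exact matches are unlikely to survive by vertical descent, which is why their presence between distant taxa is informative about transfer.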
Affiliation(s)
- Michael Sheinman
  - Theoretical Biology and Bioinformatics, Biology Department, Utrecht University, Utrecht, Netherlands; Division of Molecular Carcinogenesis, the Netherlands Cancer Institute, Amsterdam, Netherlands
- Ksenia Arkhipova
  - Theoretical Biology and Bioinformatics, Biology Department, Utrecht University, Utrecht, Netherlands
- Peter F Arndt
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
- Bas E Dutilh
  - Theoretical Biology and Bioinformatics, Biology Department, Utrecht University, Utrecht, Netherlands
- Rutger Hermsen
  - Theoretical Biology and Bioinformatics, Biology Department, Utrecht University, Utrecht, Netherlands
- Florian Massip
  - Berlin Institute for Medical Systems Biology, Max Delbrück Center, Berlin, Germany; Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR 5558, Villeurbanne, France

6
Kubler K, Karlic R, Haradhvala NJ, Ha K, Kim J, Kuzman M, Jiao W, Gakkhar S, Mouw KW, Braunstein LZ, Elemento O, Biankin AV, Rooman I, Miller M, Nogiec CD, Curry E, Mino-Kenudson M, Ellisen LW, Brown R, Gusev A, Tomasetti C, Kim HG, Lee H, Vlahovicek K, Sawyers C, Hoadley KA, Cuppen E, Koren A, Arndt PF, Louis DN, Stein L, Foulkes WD, Polak P, Getz G. Abstract 2727: The premalignant state captured in the landscape of somatic mutations can reveal the cancer cell-of-origin. Cancer Res 2019. [DOI: 10.1158/1538-7445.am2019-2727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Despite increasing knowledge of tumorigenesis, the identity of the cancer cell-of-origin, i.e. the normal cell type that acquired the cancer-initiating event, remains largely unknown. Our approach of identifying the cell-of-origin is based on two observations: (1) the chromatin structure is cell-specific; and (2) the density of somatic mutations along the genome is associated with the regional profile of chromatin modifications.
We have previously developed a method that quantifies the ability to predict the mutational distribution along the cancer genome from the profile of epigenetic modifications in different normal cell types. Here we present the largest application of our method using 2,550 whole genomes representing 32 distinct cancer types. To identify the cell-of-origin, we determined the correlation between the observed density of mutations along the genome and the predicted values based on chromatin modifications from 104 different normal tissue types. The normal cell type that showed the strongest correlation with a specific cancer mutational landscape was the candidate cell-of-origin.
We found that in almost all cancer types the cell-of-origin can be characterized solely from DNA sequences. Interestingly, we found that the fallopian tube was the best match for high-grade serous ovarian cancer, providing independent evidence that this is the cancer’s site of origin. For breast cancer we found that the four distinct subtypes best-matched cells from the luminal cell lineage: basal-like breast cancer likely originates from luminal progenitors, whereas all other subtypes from luminal mature cells. This association holds true even when accounting for different alterations in the homologous recombination repair pathway, suggesting that subtypes are more determined by the cell-of-origin than the specific DNA repair defect. In addition, we found that we could identify the cell-of-origin using metastatic samples – a finding that may help in difficult clinical diagnoses. Moreover, we demonstrate that cancer drivers, both germline risk alleles and somatically mutated drivers, reside in active chromatin regions in the respective cell-of-origin.
Taken together, our findings indicate that many of the somatic mutations accumulated while the cells maintained a chromatin structure similar to the cell-of-origin (likely occurring prior to transformation). Therefore, this historical record, captured in the DNA, can be used to identify the often elusive cancer cell-of-origin. Our approach can ultimately help better understand the potential of particular normal cell types to transform and initiate cancer, as well as the association of the cell-of-origin with tumor subtypes and sensitivity to treatment.
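The selection step described above, choosing the normal cell type whose chromatin-based prediction correlates best with the observed mutation density, can be sketched as follows. This is only an illustration of that final comparison: the prediction step itself is not shown, the function names are hypothetical, and the numbers are made up rather than taken from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def best_matching_cell_type(observed, predicted_by_cell_type):
    """Pick the cell type whose predicted mutation density correlates
    most strongly with the observed density (hypothetical helper)."""
    scores = {name: pearson(observed, predicted)
              for name, predicted in predicted_by_cell_type.items()}
    return max(scores, key=scores.get), scores

# Made-up mutation counts per genomic window, for illustration only.
observed = [12.0, 30.0, 8.0, 22.0, 15.0]
predicted = {
    "cell_type_A": [10.0, 28.0, 9.0, 20.0, 16.0],
    "cell_type_B": [25.0, 11.0, 27.0, 14.0, 19.0],
}
best, scores = best_matching_cell_type(observed, predicted)
```

Here `best` is the candidate cell-of-origin under the stated assumption that a higher correlation means a better match.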
Citation Format: Kirsten Kubler, Rosa Karlic, Nicholas J. Haradhvala, Kyungsik Ha, Jaegil Kim, Maja Kuzman, Wei Jiao, Sitanshu Gakkhar, Kent W. Mouw, Lior Z. Braunstein, Olivier Elemento, Andrew V. Biankin, Ilse Rooman, Mendy Miller, Christopher D. Nogiec, Edward Curry, Mari Mino-Kenudson, Leif W. Ellisen, Robert Brown, Alexander Gusev, Cristian Tomasetti, Hong-Gee Kim, Hwajin Lee, Kristian Vlahovicek, Charles Sawyers, Katherine A. Hoadley, Edwin Cuppen, Amnon Koren, Peter F. Arndt, David N. Louis, Lincoln Stein, William D. Foulkes, Paz Polak, Gad Getz. The premalignant state captured in the landscape of somatic mutations can reveal the cancer cell-of-origin [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 2727.
Affiliation(s)
- Kyungsik Ha
  - Seoul National University, Republic of Korea
- Jaegil Kim
  - The Broad Institute of MIT and Harvard, Cambridge, MA
- Wei Jiao
  - Ontario Institute for Cancer Research, Ontario, Canada
- Sitanshu Gakkhar
  - Canada’s Michael Smith Genome Sciences Centre, British Columbia, Canada
- Kent W. Mouw
  - Brigham & Women’s Hospital and Dana Farber Cancer Institute, MA
- Mendy Miller
  - The Broad Institute of MIT and Harvard, Cambridge, MA
- Alexander Gusev
  - Brigham and Women’s Hospital & Dana Farber Cancer Institute, MA
- Hwajin Lee
  - Seoul National University, Republic of Korea
- Lincoln Stein
  - Ontario Institute for Cancer Research, Ontario, Canada
- Gad Getz
  - The Broad Institute of MIT and Harvard, Cambridge, MA

7
Gopal RK, Kübler K, Calvo SE, Polak P, Livitz D, Rosebrock D, Sadow PM, Campbell B, Donovan SE, Amin S, Gigliotti BJ, Grabarek Z, Hess JM, Stewart C, Braunstein LZ, Arndt PF, Mordecai S, Shih AR, Chaves F, Zhan T, Lubitz CC, Kim J, Iafrate AJ, Wirth L, Parangi S, Leshchiner I, Daniels GH, Mootha VK, Dias-Santagata D, Getz G, McFadden DG. Widespread Chromosomal Losses and Mitochondrial DNA Alterations as Genetic Drivers in Hürthle Cell Carcinoma. Cancer Cell 2018; 34:242-255.e5. [PMID: 30107175 PMCID: PMC6121811 DOI: 10.1016/j.ccell.2018.06.013] [Citation(s) in RCA: 155] [Impact Index Per Article: 25.8] [Received: 12/31/2017] [Revised: 03/30/2018] [Accepted: 06/27/2018] [Indexed: 12/24/2022]
Abstract
Hürthle cell carcinoma of the thyroid (HCC) is a form of thyroid cancer recalcitrant to radioiodine therapy that exhibits an accumulation of mitochondria. We performed whole-exome sequencing on a cohort of primary, recurrent, and metastatic tumors, and identified recurrent mutations in DAXX, TP53, NRAS, NF1, CDKN1A, ARHGAP35, and the TERT promoter. Parallel analysis of mtDNA revealed recurrent homoplasmic mutations in subunits of complex I of the electron transport chain. Analysis of DNA copy-number alterations uncovered widespread loss of chromosomes culminating in near-haploid chromosomal content in a large fraction of HCC, which was maintained during metastatic spread. This work uncovers a distinct molecular origin of HCC compared with other thyroid malignancies.
Affiliation(s)
- Raj K Gopal
  - Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Molecular Biology, Massachusetts General Hospital, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Howard Hughes Medical Institute, Chevy Chase, MD, USA; Harvard Medical School, Boston, MA 02115, USA
- Kirsten Kübler
  - Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard Medical School, Boston, MA 02115, USA
- Sarah E Calvo
  - Department of Molecular Biology, Massachusetts General Hospital, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Paz Polak
  - Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard Medical School, Boston, MA 02115, USA
- Dimitri Livitz
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Peter M Sadow
  - Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA 02114, USA; Harvard Medical School, Boston, MA 02115, USA
- Braidie Campbell
  - Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA
- Samuel E Donovan
  - Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA
- Salma Amin
  - Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Surgery, Massachusetts General Hospital, Boston, MA 02114, USA
- Zenon Grabarek
  - Department of Molecular Biology, Massachusetts General Hospital, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Julian M Hess
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Chip Stewart
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Peter F Arndt
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Scott Mordecai
  - Department of Pathology, Massachusetts General Hospital, Boston, MA 02114, USA
- Angela R Shih
  - Department of Pathology, Massachusetts General Hospital, Boston, MA 02114, USA; Harvard Medical School, Boston, MA 02115, USA
- Frances Chaves
  - Department of Pathology, Massachusetts General Hospital, Boston, MA 02114, USA
- Tiannan Zhan
  - Institute for Technology Assessment, Massachusetts General Hospital, Boston, MA 02114, USA
- Carrie C Lubitz
  - Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Surgery, Massachusetts General Hospital, Boston, MA 02114, USA; Institute for Technology Assessment, Massachusetts General Hospital, Boston, MA 02114, USA; Harvard Medical School, Boston, MA 02115, USA
- Jiwoong Kim
  - Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- A John Iafrate
  - Department of Pathology, Massachusetts General Hospital, Boston, MA 02114, USA; Harvard Medical School, Boston, MA 02115, USA
- Lori Wirth
  - Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA; Harvard Medical School, Boston, MA 02115, USA
- Sareh Parangi
  - Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Surgery, Massachusetts General Hospital, Boston, MA 02114, USA; Harvard Medical School, Boston, MA 02115, USA
- Gilbert H Daniels
  - Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA; Thyroid Unit, Massachusetts General Hospital, Boston, MA 02114, USA; Harvard Medical School, Boston, MA 02115, USA
- Vamsi K Mootha
  - Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Molecular Biology, Massachusetts General Hospital, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Howard Hughes Medical Institute, Chevy Chase, MD, USA; Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
- Dora Dias-Santagata
  - Department of Pathology, Massachusetts General Hospital, Boston, MA 02114, USA; Harvard Medical School, Boston, MA 02115, USA
- Gad Getz
  - Cancer Center, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard Medical School, Boston, MA 02115, USA
- David G McFadden
  - Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Thyroid Unit, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Internal Medicine, Division of Endocrinology, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA; Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA

8
Smith TCA, Arndt PF, Eyre-Walker A. Large scale variation in the rate of germ-line de novo mutation, base composition, divergence and diversity in humans. PLoS Genet 2018; 14:e1007254. [PMID: 29590096 PMCID: PMC5891062 DOI: 10.1371/journal.pgen.1007254] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Received: 03/07/2017] [Revised: 04/09/2018] [Accepted: 02/13/2018] [Indexed: 01/17/2023] Open
Abstract
It has long been suspected that the rate of mutation varies across the human genome at a large scale based on the divergence between humans and other species. However, it is now possible to directly investigate this question using the large number of de novo mutations (DNMs) that have been discovered in humans through the sequencing of trios. We investigate a number of questions pertaining to the distribution of mutations using more than 130,000 DNMs from three large datasets. We demonstrate that the amount and pattern of variation differ between datasets at the 1MB and 100KB scales, probably as a consequence of differences in sequencing technology and processing. In particular, datasets show different patterns of correlation to genomic variables such as replication time. Nevertheless, there are many commonalities between datasets, which likely represent true patterns. We show that there is variation in the mutation rate at the 100KB, 1MB and 10MB scales that cannot be explained by variation at smaller scales; however, the level of this variation is modest at large scales: at the 1MB scale we infer that ~90% of regions have a mutation rate within 50% of the mean. Different types of mutation show similar levels of variation and appear to vary in concert, which suggests the pattern of mutation is relatively constant across the genome. We demonstrate that variation in the mutation rate does not generate large-scale variation in GC-content, and hence that mutation bias does not maintain the isochore structure of the human genome. We find that genomic features explain less than 40% of the explainable variance in the rate of DNM. As expected, the rate of divergence between species is correlated to the rate of DNM. However, the correlations are weaker than expected if all the variation in divergence was due to variation in the mutation rate. We provide evidence that this is due to the effect of biased gene conversion on the probability that a mutation will become fixed. In contrast to divergence, we find that most of the variation in diversity can be explained by variation in the mutation rate. Finally, we show that the correlation between divergence and DNM density declines as increasingly divergent species are considered.
Affiliation(s)
- Peter F. Arndt
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
- Adam Eyre-Walker
  - School of Life Sciences, University of Sussex, Brighton, United Kingdom

9
Kuruoglu EE, Arndt PF. The information capacity of the genetic code: Is the natural code optimal? J Theor Biol 2017; 419:227-237. [PMID: 28163008 DOI: 10.1016/j.jtbi.2017.01.046] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 11/30/2016] [Revised: 01/25/2017] [Accepted: 01/31/2017] [Indexed: 10/20/2022]
Abstract
We envision the molecular evolution process as an information transfer process and provide a quantitative measure for information preservation in terms of the channel capacity according to the channel coding theorem of Shannon. We calculate information capacities of DNA on the nucleotide (for non-coding DNA) and the amino acid (for coding DNA) level using various substitution models. We extend our results on coding DNA to a discussion about the optimality of the natural codon-amino acid code. We provide the results of an adaptive search algorithm in the code domain and demonstrate the existence of a large number of genetic codes with higher information capacity. Our results support the hypothesis of an ancient extension from a 2-nucleotide codon to the current 3-nucleotide codon code to encode the various amino acids.
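To make the notion of channel capacity concrete, the sketch below treats a nucleotide substitution matrix as a discrete memoryless channel and estimates its capacity with the standard Blahut-Arimoto iteration. It is an illustration under stated assumptions, not the authors' code: the function name `channel_capacity_bits` is hypothetical, and the Jukes-Cantor-style matrix with substitution probability `eps` is an arbitrary example rather than a model or value from the paper.

```python
import math

def channel_capacity_bits(P, tol=1e-12, max_iter=10_000):
    """Blahut-Arimoto estimate of the capacity, in bits, of a discrete
    memoryless channel with transition matrix P[x][y] = Pr(y | x)."""
    n_in, n_out = len(P), len(P[0])
    p = [1.0 / n_in] * n_in          # start from a uniform input distribution
    lower = 0.0
    for _ in range(max_iter):
        # Output distribution induced by the current input distribution.
        q = [sum(p[x] * P[x][y] for x in range(n_in)) for y in range(n_out)]
        # D[x] = exp(KL(P[x] || q)), computed in nats.
        D = [math.exp(sum(P[x][y] * math.log(P[x][y] / q[y])
                          for y in range(n_out) if P[x][y] > 0))
             for x in range(n_in)]
        total = sum(p[x] * D[x] for x in range(n_in))
        lower, upper = math.log(total), math.log(max(D))
        if upper - lower < tol:      # lower and upper bounds have met
            break
        p = [p[x] * D[x] / total for x in range(n_in)]
    return lower / math.log(2)

# Hypothetical symmetric channel: each nucleotide changes into each of
# the three others with probability eps and is kept otherwise.
eps = 0.05
jc = [[1 - 3 * eps if i == j else eps for j in range(4)] for i in range(4)]
capacity = channel_capacity_bits(jc)
```

For a symmetric channel such as this one the result agrees with the closed form of 2 bits minus the entropy of one row of the matrix; the iteration is mainly useful for asymmetric substitution models where no closed form is available.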
Affiliation(s)
- Ercan E Kuruoglu
  - Institute of Information Science and Technologies, "A. Faedo", CNR, via G Moruzzi 1, 56124 Pisa, Italy
- Peter F Arndt
  - Max Planck Institute for Molecular Genetics, Department of Computational Molecular Biology, Ihnestr. 63/73, 14195 Berlin, Germany

10
Imkeller K, Arndt PF, Wardemann H, Busse CE. sciReptor: analysis of single-cell level immunoglobulin repertoires. BMC Bioinformatics 2016; 17:67. [PMID: 26847109 PMCID: PMC4743164 DOI: 10.1186/s12859-016-0920-1] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 09/04/2015] [Accepted: 01/29/2016] [Indexed: 11/10/2022] Open
Abstract
Background: The sequencing of immunoglobulin (Ig) transcripts from single B cells yields essential information about Ig heavy:light chain pairing, which is lost in conventional bulk sequencing experiments. The previously limited throughput of single-cell approaches has recently been overcome by the introduction of multiple next-generation sequencing (NGS)-based platforms. Furthermore, single-cell techniques allow the assignment of additional data types (e.g. cell surface marker expression), which are crucial for biological interpretation. However, the currently available computational tools are not designed to handle single-cell data and do not provide integral solutions for linking of sequence data to other biological data. Results: Here we introduce sciReptor, a flexible toolkit for the processing and analysis of antigen receptor repertoire sequencing data at single-cell level. The software combines bioinformatics tools for immunoglobulin sequence annotation with a relational database, where raw data and analysis results are stored and linked. sciReptor supports attribution of additional data categories such as cell surface marker expression or immunological metadata. Furthermore, it comprises a quality control module as well as basic repertoire visualization tools. Conclusion: sciReptor is a flexible framework for standardized sequence analysis of antigen receptor repertoires on single-cell level. The relational database allows easy data sharing and downstream analyses as well as immediate comparisons between different data sets.
Affiliation(s)
- Katharina Imkeller
  - Division of B Cell Immunology, German Cancer Research Center, Feld 280, Heidelberg, 69120, Germany.
- Peter F Arndt
  - Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, Berlin, 14195, Germany.
- Hedda Wardemann
  - Division of B Cell Immunology, German Cancer Research Center, Feld 280, Heidelberg, 69120, Germany.
- Christian E Busse
  - Division of B Cell Immunology, German Cancer Research Center, Feld 280, Heidelberg, 69120, Germany.
11
Glémin S, Arndt PF, Messer PW, Petrov D, Galtier N, Duret L. Quantification of GC-biased gene conversion in the human genome. Genome Res 2015; 25:1215-28. [PMID: 25995268 PMCID: PMC4510005 DOI: 10.1101/gr.185488.114] [Citation(s) in RCA: 108] [Impact Index Per Article: 12.0] [Received: 10/08/2014] [Accepted: 05/18/2015] [Indexed: 11/25/2022]
Abstract
Much evidence indicates that GC-biased gene conversion (gBGC) has a major impact on the evolution of mammalian genomes. However, a detailed quantification of the process is still lacking. The strength of gBGC can be measured from the analysis of derived allele frequency spectra (DAF), but this approach is sensitive to a number of confounding factors. In particular, we show by simulations that the inference is pervasively affected by polymorphism polarization errors and by spatial heterogeneity in gBGC strength. We propose a new general method to quantify gBGC from DAF spectra, incorporating polarization errors, taking spatial heterogeneity into account, and jointly estimating mutation bias. Applying it to human polymorphism data from the 1000 Genomes Project, we show that the strength of gBGC does not differ between hypermutable CpG sites and non-CpG sites, suggesting that in humans gBGC is not caused by the base-excision repair machinery. Genome-wide, the intensity of gBGC is in the nearly neutral area. However, given that recombination occurs primarily within recombination hotspots, 1%–2% of the human genome is subject to strong gBGC. On average, gBGC is stronger in African than in non-African populations, reflecting differences in effective population sizes. However, due to more heterogeneous recombination landscapes, the fraction of the genome affected by strong gBGC is larger in non-African than in African populations. Given that the location of recombination hotspots evolves very rapidly, our analysis predicts that, in the long term, a large fraction of the genome is affected by short episodes of strong gBGC.
Affiliation(s)
- Sylvain Glémin
  - Institut des Sciences de l'Evolution (ISEM - UMR 5554 Université de Montpellier-CNRS-IRD-EPHE), 34095 Montpellier, France
  - Department of Ecology and Genetics, Evolutionary Biology Centre, Uppsala University, SE-752 36 Uppsala, Sweden
- Peter F Arndt
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Philipp W Messer
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA
- Dmitri Petrov
  - Department of Biology, Stanford University, Stanford, California 94305-5020, USA
- Nicolas Galtier
  - Institut des Sciences de l'Evolution (ISEM - UMR 5554 Université de Montpellier-CNRS-IRD-EPHE), 34095 Montpellier, France
- Laurent Duret
  - Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Lyon 1, 69622 Villeurbanne, France
12
Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, van Duijn CM, Swertz M, Wijmenga C, van Ommen G, Slagboom PE, Boomsma DI, Ye K, Guryev V, Arndt PF, Kloosterman WP, de Bakker PIW, Sunyaev SR. Genome-wide patterns and properties of de novo mutations in humans. Nat Genet 2015; 47:822-826. [PMID: 25985141 PMCID: PMC4485564 DOI: 10.1038/ng.3292] [Citation(s) in RCA: 247] [Impact Index Per Article: 27.4] [Received: 10/06/2014] [Accepted: 04/07/2015] [Indexed: 12/12/2022]
Abstract
Mutations create variation in the population, fuel evolution, and cause genetic diseases. Current knowledge about de novo mutations is incomplete and mostly indirect. Here, we analyze 11,020 de novo mutations from whole genomes of 250 families. We show that de novo mutations in offspring of older fathers are not only more numerous but also occur more frequently in early-replicating, genic regions. Functional regions exhibit higher mutation rates due to CpG dinucleotides and reveal signatures of transcription-coupled repair, while mutation clusters with a unique signature point to a novel mutational mechanism. Mutation and recombination rates independently associate with nucleotide diversity, and regional variation in human-chimpanzee divergence is only partly explained by mutation rate heterogeneity. Finally, we provide a genome-wide mutation rate map for medical and population genetics applications. Our results reveal novel insights and refine long-standing hypotheses about human mutagenesis.
Affiliation(s)
- Laurent C Francioli
  - Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Paz P Polak
  - Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Amnon Koren
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
- Androniki Menelaou
  - Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Sung Chun
  - Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Ivo Renkens
  - Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Morris Swertz
  - University of Groningen, University Medical Center Groningen, Department of Genetics, Groningen, The Netherlands
  - University of Groningen, University Medical Center Groningen, Genomics Coordination Center, Groningen, The Netherlands
- Cisca Wijmenga
  - University of Groningen, University Medical Center Groningen, Department of Genetics, Groningen, The Netherlands
  - University of Groningen, University Medical Center Groningen, Genomics Coordination Center, Groningen, The Netherlands
- Gertjan van Ommen
  - Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- P Eline Slagboom
  - Section of Molecular Epidemiology, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
- Dorret I Boomsma
  - Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
- Kai Ye
  - Section of Molecular Epidemiology, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
  - The Genome Institute, Washington University, St. Louis, MO, USA
- Victor Guryev
  - European Research Institute for the Biology of Ageing, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Peter F Arndt
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Wigard P Kloosterman
  - Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
- Paul I W de Bakker
  - Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
  - Department of Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
- Shamil R Sunyaev
  - Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
13
Abstract
A Yule tree is the result of a branching process with constant birth and death rates. Such a process serves as an instructive null model of many empirical systems, for instance, the evolution of species leading to a phylogenetic tree. However, often in phylogeny the only available information is the pairwise distances between a small fraction of extant species representing the leaves of the tree. In this article we study statistical properties of the pairwise distances in a Yule tree. Using a method based on a recursion, we derive an exact, analytic and compact formula for the expected number of pairs separated by a certain time distance. This number turns out to follow an increasing exponential function. This property of a Yule tree can serve as a simple test for empirical data to be well described by a Yule process. We further use this recursive method to calculate the expected number of the n-most closely related pairs of leaves and the number of cherries separated by a certain time distance. To make our results more useful for realistic scenarios, we explicitly take into account that the leaves of a tree may be incompletely sampled and derive a criterion for poorly sampled phylogenies. We show that our result can account for empirical data, using two families of bird species.
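The quantity studied here, the set of pairwise time distances between leaves, can be explored numerically with a small simulation. The sketch below is not the recursion derived in the article. It assumes the simplest case, a pure-birth process with a single rate and no death or incomplete sampling, and all function and parameter names are hypothetical.

```python
import random


def simulate_yule_pairwise_distances(birth_rate, total_time, seed=0):
    """Simulate a pure-birth tree for `total_time` and return the number of
    leaves together with a dict mapping each leaf pair (i, j), i < j, to its
    time distance 2 * (total_time - time at which the two lineages split)."""
    rng = random.Random(seed)
    # split_time[i][j] is the time at which lineages i and j diverged.
    split_time = [[None]]
    n, t = 1, 0.0
    while True:
        # With n lineages, the next birth occurs after an Exp(n * rate) wait.
        t += rng.expovariate(n * birth_rate)
        if t >= total_time:
            break
        parent = rng.randrange(n)
        # The new lineage inherits the parent's divergence times and
        # diverges from the parent itself at the current time t.
        new_row = split_time[parent][:]
        new_row[parent] = t
        for j in range(n):
            split_time[j].append(new_row[j])
        new_row.append(None)
        split_time.append(new_row)
        n += 1
    distances = {
        (i, j): 2.0 * (total_time - split_time[i][j])
        for i in range(n) for j in range(i + 1, n)
    }
    return n, distances


if __name__ == "__main__":
    n_leaves, dist = simulate_yule_pairwise_distances(1.0, 3.0, seed=1)
    print(n_leaves, len(dist))
```

Binning the returned distances over many seeds gives an empirical histogram that can be compared against the exponential growth described in the abstract.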
Affiliation(s)
- Michael Sheinman
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
- Florian Massip
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
  - INRA, UR1077 Unite Mathematique Informatique et Genome, Jouy-en-Josas, France
- Peter F. Arndt
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
14
Abstract
The positive-regulatory domain containing nine gene, PRDM9, which strongly associates with the location of recombination events in several vertebrates, is inferred to be inactive in the dog genome. Here, we address several questions regarding the control of recombination and its influence on genome evolution in dogs. First, we address whether the association between CpG islands (CGIs) and recombination hotspots is generated by lack of methylation, GC-biased gene conversion (gBGC), or both. Using a genome-wide dog single nucleotide polymorphism data set and comparisons of the dog genome with related species, we show that recombination-associated CGIs have low CpG mutation rates, and that CpG mutation rate is negatively correlated with recombination rate genome wide, indicating that nonmethylation attracts the recombination machinery. We next use a neighbor-dependent model of nucleotide substitution to disentangle the effects of CpG mutability and gBGC and analyze the effects that loss of PRDM9 has on these rates. We infer that methylation patterns have been stable during canid genome evolution, but that dog CGIs have experienced a drastic increase in substitution rate due to gBGC, consistent with increased levels of recombination in these regions. We also show that gBGC is likely to have generated many new CGIs in the dog genome, but these mostly occur away from genes, whereas the number of CGIs in gene promoter regions has not increased greatly in recent evolutionary history. Recombination has a major impact on the distribution of CGIs that are detected in the dog genome due to the interaction between methylation and gBGC. The results indicate that germline methylation patterns are the main determinant of recombination rates in the absence of PRDM9.
Affiliation(s)
- Jonas Berglund
  - Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Sweden
- Javier Quilez
  - Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Sweden
- Peter F Arndt
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Matthew T Webster
  - Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Sweden
15
Abstract
Genome evolution is shaped by a multitude of mutational processes, including point mutations, insertions, and deletions of DNA sequences, as well as segmental duplications. These mutational processes can leave distinctive qualitative marks in the statistical features of genomic DNA sequences. One such feature is the match length distribution (MLD) of exactly matching sequence segments within an individual genome or between the genomes of related species. These have been observed to exhibit characteristic power law decays in many species. Here, we show that simple dynamical models consisting solely of duplication and mutation processes can already explain the characteristic features of MLDs observed in genomic sequences. Surprisingly, we find that these features are largely insensitive to details of the underlying mutational processes and do not necessarily rely on the action of natural selection. Our results demonstrate how analyzing statistical features of DNA sequences can help us reveal and quantify the different mutational processes that underlie genome evolution.
Affiliation(s)
- Florian Massip
  - Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
  - UR1077, Unite Mathematiques Informatique et Genome, INRA, domaine de Vilvert, Jouy-en-Josas, France
- Michael Sheinman
  - Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
- Sophie Schbath
  - UR1077, Unite Mathematiques Informatique et Genome, INRA, domaine de Vilvert, Jouy-en-Josas, France
- Peter F Arndt
  - Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
16
Ebert G, Steininger A, Weißmann R, Boldt V, Lind-Thomsen A, Grune J, Badelt S, Heßler M, Peiser M, Hitzler M, Jensen LR, Müller I, Hu H, Arndt PF, Kuss AW, Tebel K, Ullmann R. Distribution of segmental duplications in the context of higher order chromatin organisation of human chromosome 7. BMC Genomics 2014; 15:537. [PMID: 24973960 PMCID: PMC4092221 DOI: 10.1186/1471-2164-15-537] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/09/2013] [Accepted: 06/17/2014] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND: Segmental duplications (SDs) are not evenly distributed along chromosomes. The reasons for this biased susceptibility to SD insertion are poorly understood. Accumulation of SDs is associated with increased genomic instability, which can lead to structural variants and genomic disorders such as the Williams-Beuren syndrome. Despite these adverse effects, SDs have become fixed in the human genome. Focusing on chromosome 7, which is particularly rich in interstitial SDs, we have investigated the distribution of SDs in the context of evolution and the three dimensional organisation of the chromosome in order to gain insights into the mutual relationship of SDs and chromatin topology. RESULTS: Intrachromosomal SDs preferentially accumulate in those segments of chromosome 7 that are homologous to marmoset chromosome 2. Although this formerly compact segment has been re-distributed to three different sites during primate evolution, we can show by means of public data on long distance chromatin interactions that these three intervals, and consequently the paralogous SDs mapping to them, have retained their spatial proximity in the nucleus. Focusing on SD clusters implicated in the aetiology of the Williams-Beuren syndrome locus, we demonstrate by cross-species comparison that these SDs have inserted at the borders of a topological domain and that they flank regions with distinct DNA conformation. CONCLUSIONS: Our study suggests a link of nuclear architecture and the propagation of SDs across chromosome 7, either by promoting regional SD insertion or by contributing to the establishment of higher order chromatin organisation themselves. The latter could compensate for the high risk of structural rearrangements and thus may have contributed to their evolutionary fixation in the human genome.
Affiliation(s)
- Reinhard Ullmann
  - Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany.
17
Busse CE, Czogiel I, Braun P, Arndt PF, Wardemann H. Single-cell based high-throughput sequencing of full-length immunoglobulin heavy and light chain genes. Eur J Immunol 2013; 44:597-603. [PMID: 24114719 DOI: 10.1002/eji.201343917] [Citation(s) in RCA: 92] [Impact Index Per Article: 8.4] [Received: 07/21/2013] [Revised: 08/27/2013] [Accepted: 09/19/2013] [Indexed: 11/09/2022]
Abstract
Single-cell PCR and sequencing of full-length Ig heavy (Igh) and Igk and Igl light chain genes is a powerful tool to measure the diversity of antibody repertoires and allows the functional assessment of B-cell responses through direct Ig gene cloning and the generation of recombinant mAbs. However, the current methodology is not high-throughput compatible. Here we developed a two-dimensional bar-coded primer matrix to combine Igh and Igk/Igl chain gene single-cell PCR with next-generation sequencing for the parallel analysis of the antibody repertoire of over 46 000 individual B cells. Our approach provides full-length Igh and corresponding Igk/Igl chain gene-sequence information and permits the accurate correction of sequencing errors by consensus building. The use of indexed cell sorting for the isolation of single B cells enables the integration of flow cytometry and Ig gene sequence information. The strategy is fully compatible with established protocols for direct antibody gene cloning and expression and therefore advances over previously described high-throughput approaches to assess antibody repertoires at the single-cell level.
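The idea of correcting sequencing errors by consensus building can be illustrated with a per-position majority vote over reads that carry the same cell barcode. The abstract does not describe the actual procedure, so this is only a sketch under strong assumptions: the reads are already aligned and of equal length, and the function name is hypothetical.

```python
from collections import Counter


def consensus_sequence(reads):
    """Return the per-position majority base of equal-length, pre-aligned
    reads. Ties resolve to the base seen first at that position."""
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be aligned to the same length")
    # zip(*reads) yields one column of bases per sequence position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


if __name__ == "__main__":
    # Two reads each carry one isolated error; the vote removes both.
    print(consensus_sequence(["ACGT", "ACGA", "TCGT"]))
```

Isolated errors are outvoted as long as most reads agree at each position, which is why deeper coverage per cell makes the correction more reliable.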
Affiliation(s)
- Christian E Busse
  - Research Group Molecular Immunology, Max Planck Institute for Infection Biology, Berlin, Germany
18
Abstract
Meiotic recombination is known to influence GC-content evolution in large regions of mammalian genomes by favoring the fixation of G and C alleles and increasing the rate of A/T to G/C substitutions. This process is known as GC-biased gene conversion (gBGC). Until recently, genome-wide measures of fine-scale recombination activity were unavailable in mice. Additionally, comparative studies focusing on mouse were limited as the closest organism with its genome fully sequenced was rat. Here, we make use of the recent mapping of double strand breaks (DSBs), the first step of meiotic recombination, in the mouse genome and of the sequencing of closely related mouse subspecies to analyze the fine-scale evolutionary signature of meiotic recombination on GC-content evolution in recombination hotspots, short regions that undergo extreme rates of recombination. We measure substitution rates around DSB hotspots and observe that gBGC affects a very short region (≈ 1 kbp in length) around these hotspots. Furthermore, we can infer that the locations of hotspots evolved rapidly during mouse evolution.
Affiliation(s)
- Yves Clément
  - Montpellier SupAgro, Unité Mixte de Recherche 1334, Amélioration Génétique et Adaptation des Plantes Méditerranéennes et Tropicales, Montpellier, France
19
Massip F, Arndt PF. Neutral evolution of duplicated DNA: an evolutionary stick-breaking process causes scale-invariant behavior. Phys Rev Lett 2013; 110:148101. [PMID: 25167038 DOI: 10.1103/physrevlett.110.148101] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 11/20/2012] [Indexed: 06/03/2023]
Abstract
Recently, an enrichment of identical matching sequences has been found in many eukaryotic genomes. Their length distribution exhibits a power law tail, raising the question of what evolutionary mechanism or functional constraints would be able to shape this distribution. Here we introduce a simple and evolutionarily neutral model, which involves only point mutations and segmental duplications, and produces the same statistical features as observed for genomic data. Further, we extend a mathematical model for random stick breaking to analytically show that the exponent of the power law tail is -3 and universal as it does not depend on the microscopic details of the model.
Affiliation(s)
- Florian Massip
  - Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Peter F Arndt
  - Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
20
Abstract
The genomes of many vertebrates show a characteristic heterogeneous distribution of GC content, the so-called GC isochore structure. The origin of isochores has been explained via the mechanism of GC-biased gene conversion (gBGC). However, although the isochore structure is declining in many mammalian genomes, the heterogeneity in GC content is being reinforced in the avian genome. Despite this discrepancy, which remains unexplained, examinations of individual substitution frequencies in mammals and birds are both consistent with the gBGC model of isochore evolution. On the other hand, a negative correlation between substitution and recombination rate found in the chicken genome is inconsistent with the gBGC model. It should therefore be important to consider along with gBGC other consequences of recombination on the origin and fate of mutations, as well as to account for relationships between recombination rate and other genomic features. We therefore developed an analytical model to describe the substitution patterns found in the chicken genome, and further investigated the relationships between substitution patterns and several genomic features in a rigorous statistical framework. Our analysis indicates that GC content itself, either directly or indirectly via interrelations to other genomic features, has an impact on the substitution pattern. Further, we suggest that this phenomenon is particularly visible in avian genomes due to their unusually low rate of chromosomal evolution. Because of this, interrelations between GC content and other genomic features are being reinforced, and are as such more pronounced in avian genomes as compared with other vertebrate genomes with a less stable karyotype.
Affiliation(s)
- Carina F Mugal
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
21
Schütze T, Wilhelm B, Greiner N, Braun H, Peter F, Mörl M, Erdmann VA, Lehrach H, Konthur Z, Menger M, Arndt PF, Glökler J. Probing the SELEX process with next-generation sequencing. PLoS One 2011; 6:e29604. [PMID: 22242135 PMCID: PMC3248438 DOI: 10.1371/journal.pone.0029604] [Citation(s) in RCA: 147] [Impact Index Per Article: 11.3] [Received: 10/14/2011] [Accepted: 12/01/2011] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND: SELEX is an iterative process in which highly diverse synthetic nucleic acid libraries are selected over many rounds to finally identify aptamers with desired properties. However, little is understood as to how binders are enriched during the selection course. Next-generation sequencing offers the opportunity to open the black box and observe a large part of the population dynamics during the selection process. METHODOLOGY: We have performed a semi-automated SELEX procedure on the model target streptavidin starting with a synthetic DNA oligonucleotide library and compared results obtained by the conventional analysis via cloning and Sanger sequencing with next-generation sequencing. In order to follow the population dynamics during the selection, pools from all selection rounds were barcoded and sequenced in parallel. CONCLUSIONS: High affinity aptamers can be readily identified simply by copy number enrichment in the first selection rounds. Based on our results, we suggest a new selection scheme that avoids a high number of iterative selection rounds while reducing time, PCR bias, and artifacts.
Affiliation(s)
- Tatjana Schütze
  - Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
  - Institute for Chemistry/Biochemistry, Free University Berlin, Berlin, Germany
- Barbara Wilhelm
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Nicole Greiner
  - Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
  - Alacris Theranostics GmbH, Berlin, Germany
- Hannsjörg Braun
  - Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
  - Alacris Theranostics GmbH, Berlin, Germany
- Franziska Peter
  - Institute of Biochemistry, Universität Leipzig, Leipzig, Germany
- Mario Mörl
  - Institute of Biochemistry, Universität Leipzig, Leipzig, Germany
- Volker A. Erdmann
  - Institute for Chemistry/Biochemistry, Free University Berlin, Berlin, Germany
- Hans Lehrach
  - Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Zoltán Konthur
  - Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Marcus Menger
  - RiNA RNA-Netzwerk Technologien GmbH, Berlin, Germany
- Peter F. Arndt
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Jörn Glökler
  - Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, Germany
  - Alacris Theranostics GmbH, Berlin, Germany
22
Zemojtel T, Kielbasa SM, Arndt PF, Behrens S, Bourque G, Vingron M. CpG deamination creates transcription factor-binding sites with high efficiency. Genome Biol Evol 2011; 3:1304-11. [PMID: 22016335 PMCID: PMC3228489 DOI: 10.1093/gbe/evr107] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Indexed: 12/11/2022] Open
Abstract
The formation of new transcription factor–binding sites (TFBSs) has a major impact on the evolution of gene regulatory networks. Clearly, single nucleotide mutations arising within genomic DNA can lead to the creation of TFBSs. Are molecular processes inducing single nucleotide mutations contributing equally to the creation of TFBSs? In the human genome, a spontaneous deamination of methylated cytosine in the context of CpG dinucleotides results in the creation of thymine (C → T), and this mutation has the highest rate among all base substitutions. CpG deamination has been ascribed a role in silencing of transposons and induction of variation in regional methylation. We have previously shown that CpG deamination created thousands of p53-binding sites within genomic sequences of Alu transposons. Interestingly, we have defined a ∼30 bp region in Alu sequence, which, depending on a pattern of CpG deamination, can be converted to functional p53-, PAX-6-, and Myc-binding sites. Here, we have studied single nucleotide mutational events leading to creation of TFBSs in promoters of human genes and in genomic regions bound by such key transcription factors as Oct4, NANOG, and c-Myc. We document that CpG deamination events can create TFBSs with much higher efficiency than other types of mutational events. Our findings add a new role to CpG methylation: We propose that deamination of methylated CpGs constitutes one of the evolutionary forces acting on mutational trajectories of TFBS formation, contributing to variability in gene regulation.
Affiliation(s)
- Tomasz Zemojtel
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
23
Cusack BP, Arndt PF, Duret L, Roest Crollius H. Preventing dangerous nonsense: selection for robustness to transcriptional error in human genes. PLoS Genet 2011; 7:e1002276. [PMID: 22022272 PMCID: PMC3192821 DOI: 10.1371/journal.pgen.1002276] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Received: 05/03/2011] [Accepted: 07/12/2011] [Indexed: 11/19/2022] Open
Abstract
Nonsense Mediated Decay (NMD) degrades transcripts that contain a premature STOP codon resulting from mistranscription or missplicing. However NMD's surveillance of gene expression varies in efficiency both among and within human genes. Previous work has shown that the intron content of human genes is influenced by missplicing events invisible to NMD. Given the high rate of transcriptional errors in eukaryotes, we hypothesized that natural selection has promoted a dual strategy of “prevention and cure” to alleviate the problem of nonsense transcriptional errors. A prediction of this hypothesis is that NMD's inefficiency should leave a signature of “transcriptional robustness” in human gene sequences that reduces the frequency of nonsense transcriptional errors. For human genes we determined the usage of “fragile” codons, prone to mistranscription into STOP codons, relative to the usage of “robust” codons that do not generate nonsense errors. We observe that single-exon genes have evolved to become robust to mistranscription, because they show a significant tendency to avoid fragile codons relative to robust codons when compared to multi-exon genes. A similar depletion is evident in last exons of multi-exon genes. Histone genes are particularly depleted of fragile codons and thus highly robust to transcriptional errors. Finally, the protein products of single-exon genes show a strong tendency to avoid those amino acids that can only be encoded using fragile codons. Each of these observations can be attributed to NMD deficiency. Thus, in the human genome, wherever the “cure” for nonsense (i.e. NMD) is inefficient, there is increased reliance on the strategy of nonsense “prevention” (i.e. transcriptional robustness). This study shows that human genes are exposed to the deleterious influence of transcriptional errors. 
Moreover, it suggests that gene expression errors are an underestimated phenomenon, in molecular evolution in general and in selection for genomic robustness in particular. In biological systems mistakes are made constantly because the cellular machinery is complex and error-prone. Mistakes are made in copying DNA for transmission to offspring (“genetic mutations”) but are much more frequent in the day-to-day task of gene expression. Genetic mutations are heritable and therefore have long been the almost exclusive focus of evolutionary genetics research. In contrast, gene expression errors are not inherited and have tended to be disregarded in evolutionary studies. Here we show how human genes have evolved a mechanism to reduce the occurrence of a specific type of gene expression error—transcriptional errors that create premature STOP codons (so-called “nonsense errors”). Nonsense errors are potentially highly toxic for the cell, so natural selection has evolved a strategy called Nonsense Mediated Decay (NMD) to “cure” such errors. However this cure is inefficient. Here we describe how a preventative strategy of “transcriptional robustness” has evolved to decrease the frequency of nonsense errors. Moreover, these “prevention and cure” strategies are used interchangeably—the most transcriptionally robust genes are those for which NMD is most inefficient. Our work implies that gene expression errors play an important role as supporting actors to genetic mutations in molecular evolution, particularly in the evolution of robustness.
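The notion of "fragile" versus "robust" codons can be illustrated with a minimal Python sketch. It assumes that a fragile codon is any sense codon that a single nucleotide substitution can turn into a STOP codon under the standard genetic code; the study's exact definition (for instance, whether particular error types are weighted) may differ, and the function names here are hypothetical.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet
BASES = "ACGT"


def single_substitution_neighbors(codon):
    """Yield every codon that differs from `codon` at exactly one position."""
    for pos, base in product(range(3), BASES):
        if base != codon[pos]:
            yield codon[:pos] + base + codon[pos + 1:]


def fragile_codons():
    """Sense codons that one substitution can convert into a STOP codon.

    Assumption: this single-substitution rule stands in for the paper's
    definition of "fragile"; all remaining sense codons count as "robust".
    """
    sense = {"".join(c) for c in product(BASES, repeat=3)} - STOP_CODONS
    return {
        c for c in sense
        if any(n in STOP_CODONS for n in single_substitution_neighbors(c))
    }


def fragile_fraction(cds):
    """Fraction of sense codons in an in-frame coding sequence that are fragile."""
    fragile = fragile_codons()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    codons = [c for c in codons if c not in STOP_CODONS]
    if not codons:
        raise ValueError("no sense codons found")
    return sum(c in fragile for c in codons) / len(codons)
```

Comparing `fragile_fraction` between gene classes (for example single-exon versus multi-exon genes) would mirror the kind of comparison described in the abstract, although a real analysis would also need to control for amino acid composition and base content.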
Affiliation(s)
- Brian P Cusack
- Max Planck Institute for Molecular Genetics, Department of Computational Molecular Biology, Berlin, Germany.
|
24
|
Abstract
There are large-scale variations of the GC-content along mammalian chromosomes that have been called isochore structures. Primates and rodents have different isochore structures, which suggests that these lineages exhibit different modes of GC-content evolution. It has been shown that, in the human lineage, GC-biased gene conversion (gBGC), a neutral process associated with meiotic recombination, acts on GC-content evolution by influencing A or T to G or C substitution rates. We computed genome-wide substitution patterns in the mouse lineage from multiple alignments and compared them with substitution patterns in the human lineage. We found that in the mouse lineage, gBGC is active but weaker than in the human lineage and that male-specific recombination better predicts GC-content evolution than female-specific recombination. Furthermore, we were able to show that G or C to A or T substitution rates are predicted by a combination of different factors in both lineages. A or T to G or C substitution rates are most strongly predicted by meiotic recombination in the human lineage but by CpG odds ratio (the observed CpG frequency normalized by the expected CpG frequency) in the mouse lineage, suggesting that substitution patterns are under different influences in primates and rodents.
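The CpG odds ratio mentioned above (observed CpG frequency normalized by the expected CpG frequency) can be sketched as follows. This assumes the common convention that the expected frequency is the product of the C and G mononucleotide frequencies; the function name is hypothetical and the paper may use a different windowing or normalization.

```python
def cpg_odds_ratio(seq):
    """Observed CpG dinucleotide frequency divided by f(C) * f(G).

    Assumption: the expected CpG frequency is the product of the single-base
    frequencies of C and G in the same sequence.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    f_c = seq.count("C") / n
    f_g = seq.count("G") / n
    if f_c == 0 or f_g == 0:
        raise ValueError("odds ratio is undefined without both C and G")
    # Count CpG over all n - 1 overlapping dinucleotide positions.
    observed = sum(seq[i:i + 2] == "CG" for i in range(n - 1)) / (n - 1)
    return observed / (f_c * f_g)
```

Values below 1 indicate CpG depletion relative to base composition, which is the typical situation in bulk mammalian DNA; values near or above 1 are characteristic of CpG-rich regions.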
Affiliation(s)
- Yves Clément
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
25
|
Polak P, Querfurth R, Arndt PF. The evolution of transcription-associated biases of mutations across vertebrates. BMC Evol Biol 2010; 10:187. [PMID: 20565875 PMCID: PMC2927911 DOI: 10.1186/1471-2148-10-187] [Received: 07/07/2009] [Accepted: 06/18/2010]
Abstract
Background: The interplay between transcription and mutational processes can lead to particular mutation patterns in transcribed regions of the genome. Transcription introduces several biases in mutational patterns; in particular, it invokes strand-specific mutations. In order to understand the forces that have shaped transcripts during evolution, one has to study mutation patterns associated with transcription across animals. Results: Using multiple alignments of related species, we estimated the regional single-nucleotide substitution patterns along genes in four vertebrate taxa: primates, rodents, laurasiatheria and bony fishes. Our analysis is focused on intronic and intergenic regions and reveals differences in the patterns of substitution asymmetries between mammals and fishes. In mammals, the levels of asymmetries are stronger for genes starting within CpG islands than for genes lacking this property. In contrast to all other species analyzed, we found a mutational pressure in dog and stickleback that promotes an increase of GC content in the proximity of transcriptional start sites. Conclusions: We propose that the asymmetric patterns in transcribed regions are the results of transcription-associated mutagenic processes and transcription-coupled repair, which both seem to evolve in a taxon-related manner. We also discuss alternative mechanisms that can generate strand biases and involve error-prone DNA polymerases and reverse transcription. A localized increase of the GC content near the transcription start site is a signature of biased gene conversion (BGC) that occurs during recombination and heteroduplex formation. Since dog and stickleback are known to be subject to rapid adaptations due to population bottlenecks and breeding, we further hypothesize that an increase in recombination rates near gene starts has been part of an adaptive process.
Affiliation(s)
- Paz Polak
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
26
|
Schütze T, Arndt PF, Menger M, Wochner A, Vingron M, Erdmann VA, Lehrach H, Kaps C, Glökler J. A calibrated diversity assay for nucleic acid libraries using DiStRO--a Diversity Standard of Random Oligonucleotides. Nucleic Acids Res 2009; 38:e23. [PMID: 19965765 PMCID: PMC2831324 DOI: 10.1093/nar/gkp1108]
Abstract
We have determined diversities exceeding 10^12 different sequences in an annealing and melting assay using synthetic randomized oligonucleotides as a standard. For such high diversities, the annealing kinetics differ from those observed for low diversities, favouring the remelting curve after annealing as the best indicator of complexity. Direct comparisons of nucleic acid pools obtained from an aptamer selection demonstrate that even highly complex populations can be evaluated by using DiStRO, without the need for complicated calculations.
Affiliation(s)
- Tatjana Schütze
- Institute for Chemistry/Biochemistry, Free University Berlin, Thielallee 63, Berlin, Germany
|
27
|
Abstract
In the human genome, CpG islands (CGIs), which are GC- and CpG-rich sequences, are associated with transcription start sites (TSSs); in addition, there is evidence that CGIs harbor origins of bidirectional replication (OBRs) and are preferred sites for heteroduplex formation during recombination. Transcription, replication, and recombination processes are known to induce specific mutational patterns in various genomes, and therefore, these patterns are expected to be found around CGIs. We use triple alignments of human, chimp, and macaque to compute the rates of nucleotide substitutions in up to 1 Mbp long intergenic regions on both sides of CGIs. Our analysis revealed that around a CGI there is an asymmetry between complementary substitution rates that is similar to the one found around the OBR in bacteria. We hypothesize that these asymmetries are induced by differences in the replication of the leading and lagging strand and that a significant number of CGIs overlap OBRs. Within CGIs, we observed a mutational signature of GC-biased gene conversion that is associated with recombination. We suggest that recombination has played a major role in the creation of CGIs.
Affiliation(s)
- Paz Polak
- Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
28
|
Singh ND, Arndt PF, Clark AG, Aquadro CF. Strong evidence for lineage and sequence specificity of substitution rates and patterns in Drosophila. Mol Biol Evol 2009; 26:1591-605. [PMID: 19351792 DOI: 10.1093/molbev/msp071]
Abstract
Rates of single nucleotide substitution in Drosophila are highly variable within the genome, and several examples illustrate that evolutionary rates differ among Drosophila species as well. Here, we use a maximum likelihood method to quantify lineage-specific substitutional patterns and apply this method to 4-fold degenerate synonymous sites and introns from more than 8,000 genes aligned in the Drosophila melanogaster group. We find that within species, different classes of sequence evolve at different rates, with long introns evolving most slowly and short introns evolving most rapidly. Relative rates of individual single nucleotide substitutions vary approximately 3-fold among lineages, yielding patterns of substitution that are comparatively less GC-biased in the melanogaster species complex relative to Drosophila yakuba and Drosophila erecta. These results are consistent with a model coupling a mutational shift toward reduced GC content, or a shift in mutation-selection balance, in the D. melanogaster species complex, with variation in selective constraint among different classes of DNA sequence. Finally, base composition of coding and intronic sequences is not at equilibrium with respect to substitutional patterns, which primarily reflects the slow rate of the substitutional process. These results thus support the view that mutational and/or selective processes are labile on an evolutionary timescale and that if the process is indeed selection driven, then the distribution of selective constraint is variable across the genome.
Affiliation(s)
- Nadia D Singh
- Department of Molecular Biology and Genetics, Cornell University.
|
29
|
Zemojtel T, Kielbasa SM, Arndt PF, Chung HR, Vingron M. Methylation and deamination of CpGs generate p53-binding sites on a genomic scale. Trends Genet 2008; 25:63-6. [PMID: 19101055 DOI: 10.1016/j.tig.2008.11.005] [Received: 09/29/2008] [Revised: 11/19/2008] [Accepted: 11/20/2008]
Abstract
The formation of transcription-factor-binding sites is an important evolutionary process. Here, we show that methylation and deamination of CpG dinucleotides generate in vivo p53-binding sites in numerous Alu elements and in non-repetitive DNA in a species-specific manner. In light of this, we propose that the deamination of methylated CpGs constitutes a universal mechanism for de novo generation of various transcription-factor-binding sites in Alus.
Affiliation(s)
- Tomasz Zemojtel
- Department of Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Ihnestrasse 73, D-14195 Berlin, Germany.
|
30
|
Kübler K, Arndt PF, Wardelmann E, Landwehr C, Krebs D, Kuhn W, van der Ven K. Genetic alterations of HLA-class II in ovarian cancer. Int J Cancer 2008; 123:1350-6. [PMID: 18561316 DOI: 10.1002/ijc.23624]
Abstract
The immune system controls tumor formation through identification and elimination of cellular alterations. Consequently, cancer development in immune competent hosts depends on strategies to evade the immune system. Modulation of tumor antigen-specific immune responses by aberrant expression of HLA-class I and II molecules is well documented in a variety of carcinomas including ovarian cancer. To date, little data are available about molecular mechanisms responsible for altered HLA-class II phenotypes in tumors. In our sample of 10 Caucasian patients with ovarian carcinoma, a semiquantitative analysis was performed for HLA-class II loci DRB1 and DQB1 in malignant and normal ovarian tissue. Gene amplifications were identified in 62.5% of analyzed alleles and deletions in 17.5%, demonstrating that genomic aberrations of 6p21.3 are common and that copy number gain is more frequent than loss. Moreover, amplifications are most pronounced in advanced-stage tumors. To evaluate genotype-phenotype relation, immunohistochemical analyses were performed and revealed de novo expression of HLA-class II in 30% of tumors with an inverse association between antigen level and HLA copy number. It remains to be elucidated whether the profound changes of the latter quantities are the result of the host's immunological self-defense, indicate the presence of an oncogene located within the MHC-complex or merely reflect the increasing loss of differentiation of the tumor tissue.
Affiliation(s)
- Kirsten Kübler
- Department of Obstetrics and Gynecology, University of Bonn, Sigmund Freud Strasse 25, 53127 Bonn, Germany
|
31
|
Abstract
Markov models describing the evolution of the nucleotide substitution process, widely used in phylogeny reconstruction, usually assume the hypotheses of stationarity and time reversibility. Although these models give meaningful results when applied to biological data, it is not clear if the 2 assumptions mentioned above hold and, if not, how much sequence evolution processes deviate from them. To this aim, we introduce 2 sets of indices that can be calculated from the nucleotide distribution and the substitution rates. The stationarity indices (STIs) can be used to test the validity of the equilibrium assumption. The irreversibility indices (IRIs) are derived from the Kolmogorov cycle conditions for time reversibility and quantify the degree of nontime reversibility of a process. We have computed STIs and IRIs for the evolutionary process of 2 lineages, Drosophila simulans and Homo sapiens. In the latter case, we use a modified form of the indices that takes into account the CpG decay process. In both cases, we find statistically significant deviations from the ideal case of a process that has reached stationarity and is time reversible.
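The Kolmogorov cycle condition that underlies the irreversibility indices can be checked numerically: a continuous-time Markov process is time reversible only if, for every closed cycle of states, the product of rates in one direction equals the product in the opposite direction. The sketch below is a plain illustration of that criterion for a 4-state nucleotide model. It is not the IRIs defined in the paper, and the function names are hypothetical.

```python
from itertools import permutations
from math import prod


def cycle_imbalances(rates):
    """Forward-minus-backward rate products for every cycle of length >= 3.

    `rates[i][j]` is the substitution rate from state i to state j
    (diagonal entries are ignored). All imbalances vanish for a
    time-reversible process.
    """
    n = len(rates)
    imbalances = {}
    for length in range(3, n + 1):
        for cycle in permutations(range(n), length):
            if cycle[0] != min(cycle):
                continue  # skip rotations of a cycle already visited
            steps = list(zip(cycle, cycle[1:] + cycle[:1]))
            forward = prod(rates[i][j] for i, j in steps)
            backward = prod(rates[j][i] for i, j in steps)
            imbalances[cycle] = forward - backward
    return imbalances


def max_cycle_imbalance(rates):
    """Largest absolute violation of the Kolmogorov cycle condition."""
    return max(abs(v) for v in cycle_imbalances(rates).values())
```

A rate matrix of the general time-reversible form, `rates[i][j] = s[i][j] * pi[j]` with symmetric `s`, gives an imbalance of zero up to rounding, whereas making a single rate asymmetric (for example, doubling only the 0 to 1 rate in an otherwise uniform matrix) produces a nonzero value.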
Affiliation(s)
- Federico Squartini
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
|
32
|
Abstract
Unraveling the evolutionary forces responsible for variations of neutral substitution patterns among taxa or along genomes is a major issue for detecting selection within sequences. Mammalian genomes show large-scale regional variations of GC-content (the isochores), but the substitution processes at the origin of this structure are poorly understood. We analyzed the pattern of neutral substitutions in 1 Gb of primate non-coding regions. We show that the GC-content toward which sequences are evolving is strongly negatively correlated to the distance to telomeres and positively correlated to the rate of crossovers (R^2 = 47%). This demonstrates that recombination has a major impact on substitution patterns in human, driving the evolution of GC-content. The evolution of GC-content correlates much more strongly with male than with female crossover rate, which rules out selectionist models for the evolution of isochores. This effect of recombination is most probably a consequence of the neutral process of biased gene conversion (BGC) occurring within recombination hotspots. We show that the predictions of this model fit very well with the observed substitution patterns in the human genome. This model notably explains the positive correlation between substitution rate and recombination rate. Theoretical calculations indicate that variations in population size or density in recombination hotspots can have a very strong impact on the evolution of base composition. Furthermore, recombination hotspots can create strong substitution hotspots. This molecular drive affects both coding and non-coding regions. We therefore conclude that along with mutation, selection and drift, BGC is one of the major factors driving genome evolution. Our results also shed light on variations in the rate of crossover relative to non-crossover events, along chromosomes and according to sex, and also on the conservation of hotspot density between human and chimp.
Mammalian genomes show a very strong heterogeneity of base composition along chromosomes (the so-called isochores). The functional significance of these peculiar genomic landscapes is highly debated: do isochores confer some selective advantage, or are they simply the by-product of neutral evolutionary processes? To resolve this issue, we analyzed the pattern of substitution in the human genome by comparison with chimpanzee and macaque. We show that the evolution of base composition (GC-content) is essentially determined by the rate of recombination. This effect appears to be much stronger in male than in female germline, which rules out selective explanations for the evolution of isochores. We show that this impact of recombination is most probably a consequence of the process of biased gene conversion (BGC). This neutral process mimics the action of selection and can induce strong substitution hotspots within recombination hotspots, sometimes leading to the fixation of deleterious mutations. BGC appears to be one of the major factors driving genome evolution. It is therefore essential to take this process into account if we want to be able to interpret genome sequences.
Affiliation(s)
- Laurent Duret
- Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, UMR 5558, Villeurbanne, France
- Peter F. Arndt
- Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
|
33
|
Arndt PF, Vingron M. The Otto Warburg International Summer School and Workshop on Networks and Regulation. BMC Bioinformatics 2007. [PMCID: PMC1995547 DOI: 10.1186/1471-2105-8-s6-s1]
|
34
|
de la Chaux N, Messer PW, Arndt PF. DNA indels in coding regions reveal selective constraints on protein evolution in the human lineage. BMC Evol Biol 2007; 7:191. [PMID: 17935613 PMCID: PMC2151769 DOI: 10.1186/1471-2148-7-191] [Received: 03/04/2007] [Accepted: 10/12/2007]
Abstract
BACKGROUND: Insertions and deletions of DNA segments (indels) are, together with substitutions, the major mutational processes that generate genetic variation. Here we focus on recent DNA insertions and deletions in protein coding regions of the human genome to investigate selective constraints on indels in protein evolution. RESULTS: Frequencies of inserted and deleted amino acids differ from background amino acid frequencies in the human proteome. Small amino acids are overrepresented, while hydrophobic, aliphatic and aromatic amino acids are strongly suppressed. Indels are found to be preferentially located in protein regions that do not form important structural domains. Amino acid insertion and deletion rates in genes associated with elementary biochemical reactions (e.g., catalytic activity, ligase activity, electron transport, or catabolic process) are lower compared to those in other genes and are therefore subject to stronger purifying selection. CONCLUSION: Our analysis indicates that indels in human protein coding regions are subject to distinct levels of selective pressure with regard to their structural impact on the amino acid sequence, as well as to general properties of the genes they are located in. These findings confirm that many commonly accepted characteristics of selective constraints for substitutions are also valid for amino acid insertions and deletions.
Affiliation(s)
- Nicole de la Chaux
- Department for Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany.
|
35
|
Abstract
Long-range correlations in genomic base composition are a ubiquitous statistical feature among many eukaryotic genomes. In this article, these correlations are shown to substantially influence the statistics of sequence alignment scores. Using a Gaussian approximation to model the correlated score landscape, we calculate the corrections to the scale parameter lambda of the extreme value distribution of alignment scores. Our approximate analytic results are supported by a detailed numerical study based on a simple algorithm to efficiently generate long-range correlated random sequences. We find both the mean and the exponential tail of the score distribution for long-range correlated sequences to be substantially shifted compared to random sequences with independent nucleotides. The significance of measured alignment scores will therefore change upon incorporation of the correlations in the null model. We discuss the magnitude of this effect in a biological context.
|
36
|
Abstract
Nucleotide substitutions, insertions, and deletions constitute the principal molecular mechanisms generating genetic variation on small length scales. In contrast to substitutions, the nature of short DNA insertions and deletions (indels) is far less understood. With the recent availability of whole-genome multiple alignments between human and other primates, detailed investigations on indel characteristics and origin have come within reach. Here, we show that the majority of short (1-100 bp) DNA insertions in the human lineage are tandem duplications of directly adjacent sequence segments with conserved polarity. Indels in microsatellites comprise only a small fraction. The underlying molecular processes generating indels do not necessarily rely on the presence of preexisting duplicates, as would be expected for unequal crossing over, as well as replication slippage. Instead, our findings point toward a mechanism that preferentially occurs in the male germline and is not recombination-mediated. Surprisingly, nonframeshifting tandem duplications and deletions in coding regions still occur at approximately 50% of their genomic background rates. As is already well established in the context of gene and segmental duplications, our results demonstrate that duplications are also likely to constitute the predominant process for rapid generation of new genetic material and function on smaller scales.
|
37
|
Arndt PF. Reconstruction of ancestral nucleotide sequences and estimation of substitution frequencies in a star phylogeny. Gene 2006; 390:75-83. [PMID: 17223282 DOI: 10.1016/j.gene.2006.11.022] [Received: 06/13/2006] [Revised: 11/15/2006] [Accepted: 11/15/2006]
Abstract
Maximum likelihood phylogeny reconstruction methods are widely used in uncovering and assessing the evolutionary history and relationships of natural systems. However, several simplifying assumptions commonly made in this analysis limit the explanatory power of the results obtained. We present an algorithm that performs the phylogenetic analysis without making the common assumptions for sequence data from at least three leaf nodes in a star phylogeny. In particular, the underlying nucleotide substitution model does not have to be reversible and may include neighbor-dependent processes like the CpG methylation deamination process (CpG-effect). The base compositions of the sequences at the external nodes and that of the ancestral sequence may differ from each other, and they do not have to be stationary state distributions of the corresponding substitution model. The algorithm is able to reconstruct the ancestral base composition and accurately estimate substitution frequencies in the branches of the star phylogeny. Extensive tests on simulated data validate the very favorable performance of the algorithm. As an application we present the analysis of aligned genomic sequences from human, mouse, and dog. Different substitution patterns can be observed in the three lineages.
Affiliation(s)
- Peter F Arndt
- Max Planck Institute for Molecular Genetics, Ihnestr. 63, 14195 Berlin, Germany.
|
38
|
Singh ND, Arndt PF, Petrov DA. Minor shift in background substitutional patterns in the Drosophila saltans and willistoni lineages is insufficient to explain GC content of coding sequences. BMC Biol 2006; 4:37. [PMID: 17049096 PMCID: PMC1626080 DOI: 10.1186/1741-7007-4-37] [Received: 05/08/2006] [Accepted: 10/18/2006]
Abstract
Background: Several lines of evidence suggest that codon usage in the Drosophila saltans and D. willistoni lineages has shifted towards a less frequent use of GC-ending codons. Introns in these lineages show a parallel shift toward a lower GC content. These patterns have been alternatively ascribed to either a shift in mutational patterns or changes in the definition of preferred and unpreferred codons in these lineages. Results and discussion: To gain additional insight into this question, we quantified background substitutional patterns in the saltans/willistoni group using inactive copies of a novel, Q-like retrotransposable element. We demonstrate that the pattern of background substitutions in the saltans/willistoni lineage has shifted to a significant degree, primarily due to changes in mutational biases. These differences predict a lower equilibrium GC content in the genomes of the saltans/willistoni species compared with that in the D. melanogaster species group. The magnitude of the difference can readily account for changes in intronic GC content, but it appears insufficient to explain changes in codon usage within the saltans/willistoni lineage. Conclusion: We suggest that the observed changes in codon usage in the saltans/willistoni clade reflect either lineage-specific changes in the definitions of preferred and unpreferred codons, or a weaker selective pressure on codon bias in this lineage.
Affiliation(s)
- Nadia D Singh
- Department of Biological Sciences, Stanford University, 371 Serra Mall, Stanford, CA 94305, USA
- Peter F Arndt
- Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Dmitri A Petrov
- Department of Biological Sciences, Stanford University, 371 Serra Mall, Stanford, CA 94305, USA
|
39
|
Abstract
CorGen is a web server that measures long-range correlations in the base composition of DNA and generates random sequences with the same correlation parameters. Long-range correlations are characterized by a power-law decay of the auto correlation function of the GC-content. The widespread presence of such correlations in eukaryotic genomes calls for their incorporation into accurate null models of eukaryotic DNA in computational biology. For example, the score statistics of sequence alignment and the performance of motif finding algorithms are significantly affected by the presence of genomic long-range correlations. We use an expansion-randomization dynamics to efficiently generate the correlated random sequences. The server is available at http://corgen.molgen.mpg.de.
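The quantity CorGen measures, the autocorrelation function of the GC-content, can be estimated directly from a sequence. The following is a minimal sketch assuming the plain estimator C(k) = <x_i x_{i+k}> - <x>^2, where x_i is 1 for G or C and 0 otherwise. It is not the server's implementation, and the function name is hypothetical.

```python
def gc_autocorrelation(seq, max_lag):
    """Estimate C(k) = <x_i * x_{i+k}> - <x>^2 for k = 1..max_lag.

    x_i is 1.0 where the base is G or C and 0.0 otherwise. The result is a
    list whose element k - 1 holds the estimate for lag k.
    """
    x = [1.0 if base in "GC" else 0.0 for base in seq.upper()]
    n = len(x)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    mean = sum(x) / n
    return [
        sum(x[i] * x[i + k] for i in range(n - k)) / (n - k) - mean * mean
        for k in range(1, max_lag + 1)
    ]
```

For a long-range correlated sequence, plotting C(k) against k on log-log axes would show an approximately straight line whose slope gives the power-law exponent. On real genomes this direct O(n * max_lag) loop is slow, so an FFT-based estimate would usually be preferred.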
Affiliation(s)
- Philipp W Messer
- Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany.
|
40
|
Lipatov M, Arndt PF, Hwa T, Petrov DA. A Novel Method Distinguishes Between Mutation Rates and Fixation Biases in Patterns of Single-Nucleotide Substitution. J Mol Evol 2006. [DOI: 10.1007/s00239-006-7207-5]
|
41
|
Roepcke S, Zhi D, Vingron M, Arndt PF. Identification of highly specific localized sequence motifs in human ribosomal protein gene promoters. Gene 2006; 365:48-56. [PMID: 16343812 DOI: 10.1016/j.gene.2005.09.033] [Received: 03/24/2005] [Revised: 07/22/2005] [Accepted: 09/27/2005]
Abstract
For ribosomal protein (RP) genes the start of transcription is rigidly controlled to maintain the 5'-TOP signal on the messenger RNA. The responsible regulatory mechanism is not yet fully understood. Careful comparative analysis of their proximal promoter sequences reveals common characteristics and thus provides clues to the underlying mechanism. We have extracted the proximal promoters of the 80 human cytosolic ribosomal protein genes together with the orthologous mouse sequences. After annotating the set with transcription factor binding sites based on the available literature, we searched for over-represented sequence motifs. We uncovered a novel motif that is localized at a fixed distance downstream to the transcription start. 31 out of the 80 promoters contain the motif in the same orientation around position +62 (standard deviation 6). A second evolutionary conserved and palindromic motif is found 13 times in the RP promoter set, 9 instances of which are located upstream around position -40. In addition, we see a characteristic profile of the GC-content and of the CpG dinucleotide frequencies. Our results support a model for the transcription of ribosomal protein genes in which the maintenance of the accurate start of transcription is provided by specific transcription factors. Such a factor binds the target DNA at a fixed location relative to the TSS, and possibly interacts directly with the basal transcription machinery.
Affiliation(s)
- Stefan Roepcke
- Max Planck Institute for Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany.
|
42
|
Abstract
The development of cancer is a multistep process that is characterized by the accumulation of genetic alterations in cells and changed cellular interactions with the surrounding healthy tissues. The human immune system is believed to be intrinsically involved in this process. The correlation of certain human leukocyte antigen (HLA)-class I and II haplotypes with tumorigenesis is documented in a variety of tumors. However, few data exist on the possible association of specific HLA-class II alleles or haplotypes with ovarian cancer. In our sample of 52 Caucasian patients with primary ovarian carcinoma and 239 female healthy local controls, we observed a significantly increased incidence of the HLA-class II haplotypes DRB1*0301 - DQA1*0501 - DQB1*0201 (p < 0.001) and DRB1*1001 - DQA1*0101 - DQB1*0501 (p < 0.001) in the patients. Our data suggest that HLA-class II loci or individual HLA-class II haplotypes may be involved in the pathogenesis of ovarian cancer.
Affiliation(s)
- Kirsten Kübler
- Department of Obstetrics and Gynecology, University of Bonn, Sigmund Freud Strasse 25, 53127 Bonn, Germany
|
43
|
Lipatov M, Arndt PF, Hwa T, Petrov DA. A Novel Method Distinguishes Between Mutation Rates and Fixation Biases in Patterns of Single-Nucleotide Substitution. J Mol Evol 2005; 62:168-75. [PMID: 16362483 DOI: 10.1007/s00239-005-0207-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 07/06/2004] [Accepted: 06/20/2005] [Indexed: 10/25/2022]
Abstract
Analysis of the genome-wide patterns of single-nucleotide substitution reveals that the human GC content structure is out of equilibrium. The substitutions are decreasing the overall GC content (GC), at the same time making its range narrower. Investigation of single-nucleotide polymorphisms (SNPs) revealed that presently the decrease in GC content is due to a uniform mutational preference for A:T pairs, while its projected range is due to a variability in the fixation preference for G:C pairs. However, it is important to determine whether lessons learned about evolutionary processes operating at the present time (as reflected in the SNP data) can be extended back into the evolutionary past. We describe here a new approach to this problem that utilizes the juxtaposition of forward and reverse substitution rates to determine the relative importance of variability in mutation rates and fixation probabilities in shaping long-term substitutional patterns. We use this approach to demonstrate that the forces shaping GC content structure over the recent past (since the appearance of the SNPs) extend all the way back to the mammalian radiation approximately 90 million years ago. In addition, we find a small but significant effect that has not been detected in the SNP data: relatively high rates of C:G→A:T germline mutation in low-GC regions of the genome.
Affiliation(s)
- Mikhail Lipatov
- Department of Biological Sciences, Stanford University, 371 Serra Mall, Stanford, CA 94305, USA.
|
44
|
Arndt PF, Hwa T, Petrov DA. Substantial regional variation in substitution rates in the human genome: importance of GC content, gene density, and telomere-specific effects. J Mol Evol 2005; 60:748-63. [PMID: 15959677 DOI: 10.1007/s00239-004-0222-5] [Citation(s) in RCA: 79] [Impact Index Per Article: 4.2] [Received: 07/20/2004] [Accepted: 12/30/2004] [Indexed: 01/08/2023]
Abstract
This study presents the first global, 1-Mbp-level analysis of patterns of nucleotide substitutions along the human lineage. The study is based on the analysis of a large number of repetitive elements deposited into the human genome since the mammalian radiation, yielding a number of results that would have been difficult to obtain using the more conventional comparative method of analysis. This analysis revealed substantial and consistent variability of rates of substitution, with the variability ranging up to twofold among different regions. The rates of substitutions of C or G nucleotides with A or T nucleotides vary much more sharply than the reverse rates, suggesting that much of that variation is due to differences in mutation rates rather than in the probabilities of fixation of C/G vs. A/T nucleotides across the genome. For all types of substitution we observe substantially more hotspots than coldspots, with hotspots showing substantial clustering over tens of Mbp. Our analysis revealed that GC content of surrounding sequences is the best predictor of the rates of substitution. The pattern of substitution appears very different near telomeres compared to the rest of the genome and cannot be explained by the genome-wide correlations of the substitution rates with GC content or exon density. The telomere pattern of substitution is consistent with natural selection or biased gene conversion acting to increase the GC content of the sequences that are within 10-15 Mbp of the telomere.
Affiliation(s)
- Peter F Arndt
- Max Planck Institute for Molecular Genetics, Ihnestr. 73, Berlin 14195, Germany.
|
45
|
Abstract
We study a minimal model for genome evolution whose elementary processes are single site mutation, duplication and deletion of sequence regions, and insertion of random segments. These processes are found to generate long-range correlations in the composition of letters as long as the sequence length is growing; i.e., the combined rates of duplications and insertions are higher than the deletion rate. For constant sequence length, on the other hand, all initial correlations decay exponentially. These results are obtained analytically and by simulations. They are compared with the long-range correlations observed in genomic DNA, and the implications for genome evolution are discussed.
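The four elementary processes named in this abstract can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not the authors' implementation: the function names `evolve` and `same_letter_correlation`, the fixed segment length `seg_len`, the per-step event probabilities, and the particular correlation measure are all hypothetical choices made for illustration.

```python
import random

ALPHABET = "ACGT"

def evolve(seq, steps, p_mut, p_dup, p_del, p_ins, seg_len=1, seed=0):
    """Apply `steps` random elementary events to a sequence of letters.

    Each step draws one process with the given probabilities (must sum to 1):
    point mutation, duplication of a segment, deletion of a segment,
    or insertion of a random segment. All parameters are illustrative.
    """
    if abs(p_mut + p_dup + p_del + p_ins - 1.0) > 1e-9:
        raise ValueError("event probabilities must sum to 1")
    rng = random.Random(seed)
    seq = list(seq)
    for _ in range(steps):
        if not seq:
            break  # nothing left to act on
        pos = rng.randrange(len(seq))
        event = rng.choices(["mut", "dup", "del", "ins"],
                            weights=[p_mut, p_dup, p_del, p_ins])[0]
        if event == "mut":
            seq[pos] = rng.choice(ALPHABET)
        elif event == "dup":
            # Copy the segment starting at pos and insert the copy in place.
            seq[pos:pos] = seq[pos:pos + seg_len]
        elif event == "del":
            del seq[pos:pos + seg_len]
        else:
            seq[pos:pos] = [rng.choice(ALPHABET) for _ in range(seg_len)]
    return seq

def same_letter_correlation(seq, r):
    """P(same letter at distance r) minus the value expected by chance.

    This is one simple composition correlation; it is an assumed measure,
    not necessarily the one used in the paper.
    """
    n = len(seq) - r
    if n <= 0:
        raise ValueError("distance r must be smaller than the sequence length")
    match = sum(seq[i] == seq[i + r] for i in range(n)) / n
    chance = sum((seq.count(a) / len(seq)) ** 2 for a in ALPHABET)
    return match - chance

# Growing regime: duplications plus insertions outweigh deletions.
grown = evolve("ACGT" * 25, steps=5000,
               p_mut=0.5, p_dup=0.3, p_del=0.1, p_ins=0.1)
correlations = [same_letter_correlation(grown, r) for r in (1, 10, 100)]
```

Comparing `correlations` between a growing regime like the one above and a regime where the deletion probability balances duplication plus insertion is one way to explore the contrast the abstract describes; a single short run is noisy, so averaging over many seeds would be needed before drawing any conclusion.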
Affiliation(s)
- Philipp W Messer
- Institute for Theoretical Physics, University of Cologne, Köln, Germany
|
46
|
Webster MT, Smith NGC, Hultin-Rosenberg L, Arndt PF, Ellegren H. Male-driven biased gene conversion governs the evolution of base composition in human alu repeats. Mol Biol Evol 2005; 22:1468-74. [PMID: 15772377 DOI: 10.1093/molbev/msi136] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.2] [Indexed: 01/04/2023] Open
Abstract
Regional biases in substitution pattern are likely to be responsible for the large-scale variation in base composition observed in vertebrate genomes. However, the evolutionary forces responsible for these biases are still not clearly defined. In order to study the processes of mutation and fixation across the entire human genome, we analyzed patterns of substitution in Alu repeats since their insertion. We also studied patterns of human polymorphism within the repeats. There is a highly significant effect of recombination rate on the pattern of substitution, whereas no such effect is seen on the pattern of polymorphism. These results suggest that regional biases in substitution are caused by biased gene conversion, a process that increases the probability of fixation of mutations that increase GC content. Furthermore, the strongest correlate of substitution patterns is found to be male recombination rates rather than female or sex-averaged recombination rates. This indicates that in addition to sexual dimorphism in recombination rates, the sexes also differ in the relative rates of crossover and gene conversion.
Affiliation(s)
- Matthew T Webster
- Department of Evolution, Genomics and Systematics, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden.
|
47
|
Abstract
MOTIVATION: Neighbor-dependent substitution processes generate specific patterns of dinucleotide frequencies in the genomes of most organisms. The CpG-methylation-deamination process (the CpG effect), for example, is a prominent process in vertebrates. Such processes, often with unknown mechanistic origins, need to be incorporated into realistic models of nucleotide substitutions. RESULTS: Based on a general framework of nucleotide substitutions, we developed a method that is able to identify the most relevant neighbor-dependent substitution processes, estimate their relative frequencies, and judge whether they are important enough to be included in the model. Starting from a model for neighbor-independent nucleotide substitution, we successively added neighbor-dependent substitution processes in the order of their ability to increase the likelihood of the model describing given data. The analysis of neighbor-dependent nucleotide substitutions based on repetitive elements found in the genomes of human, zebrafish and fruit fly is presented. AVAILABILITY: A web server to perform the presented analysis is freely available at: http://evogen.molgen.mpg.de/server/substitution-analysis
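The stepwise procedure described in this abstract, adding processes one at a time in the order of their likelihood gain, has the shape of a greedy forward selection. The sketch below shows only that generic pattern and is not the authors' code: `forward_select`, the `min_gain` stopping threshold, the process labels, and the additive toy likelihood are hypothetical stand-ins, since the abstract does not specify how the importance of a process is judged.

```python
def forward_select(candidates, log_likelihood, min_gain):
    """Greedily add candidate processes by log-likelihood improvement.

    `log_likelihood` maps a frozenset of process names to the maximized
    log-likelihood of the model containing them. Selection stops when the
    best remaining candidate improves the fit by less than `min_gain`
    (an assumed stopping rule). Returns a list of (name, gain) pairs.
    """
    chosen = []          # process names in the order they were added
    history = []         # (name, gain) pairs
    current = log_likelihood(frozenset())  # neighbor-independent base model
    remaining = set(candidates)
    while remaining:
        best, best_ll = None, current
        for name in sorted(remaining):  # sorted for reproducible tie-breaking
            ll = log_likelihood(frozenset(chosen) | {name})
            if ll > best_ll:
                best, best_ll = name, ll
        if best is None or best_ll - current < min_gain:
            break
        chosen.append(best)
        history.append((best, best_ll - current))
        current = best_ll
        remaining.remove(best)
    return history

# Toy stand-in for a fitted model: each process adds a fixed gain.
# Labels and numbers are made up purely for illustration.
toy_gains = {"CpG>TpG": 50.0, "CpG>CpA": 40.0, "ApT>ApC": 0.5}

def toy_log_likelihood(processes):
    return -1000.0 + sum(toy_gains[p] for p in processes)

ranking = forward_select(toy_gains, toy_log_likelihood, min_gain=2.0)
```

In a real analysis each call to `log_likelihood` would refit all rate parameters for the enlarged model, so gains are generally not additive and the ranking has to be recomputed at every step, as the loop above does.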
Affiliation(s)
- Peter F Arndt
- Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany.
|
48
|
Dieterich C, Grossmann S, Tanzer A, Röpcke S, Arndt PF, Stadler PF, Vingron M. Comparative promoter region analysis powered by CORG. BMC Genomics 2005; 6:24. [PMID: 15723697 PMCID: PMC555765 DOI: 10.1186/1471-2164-6-24] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 10/26/2004] [Accepted: 02/21/2005] [Indexed: 11/10/2022] Open
Abstract
Background: Promoters are key players in gene regulation. They receive signals from various sources (e.g. cell surface receptors) and control the level of transcription initiation, which largely determines gene expression. In vertebrates, transcription start sites and surrounding regulatory elements are often poorly defined. To support promoter analysis, we present CORG, a framework for studying upstream regions including untranslated exons (5' UTR). Description: The automated annotation of promoter regions integrates information of two kinds. First, statistically significant cross-species conservation within upstream regions of orthologous genes is detected. Pairwise as well as multiple sequence comparisons are computed. Second, binding site descriptions (position-weight matrices) are employed to predict conserved regulatory elements with a novel approach. Assembled EST sequences and verified transcription start sites are incorporated to distinguish exonic from other sequences. As of now, we have included 5 species in our analysis pipeline (human, mouse, rat, fugu and zebrafish). We characterized promoter regions of 16,127 groups of orthologous genes. All data are presented in an intuitive way via our web site. Users are free to export data for single genes or access larger data sets via our DAS server. The benefits of our framework are illustrated in the context of phylogenetic profiling of transcription factor binding sites and detection of microRNAs close to transcription start sites of our gene set. Conclusion: The CORG platform is a versatile tool to support analyses of gene regulation in vertebrate promoter regions. Applications for CORG cover a broad range from studying evolution of DNA binding sites and promoter constitution to the discovery of new regulatory sequence elements (e.g. microRNAs and binding sites).
Affiliation(s)
- Christoph Dieterich
- Computational Molecular Biology Department, Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
| | - Steffen Grossmann
- Computational Molecular Biology Department, Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
| | - Andrea Tanzer
- Institute for Theoretical Chemistry and Structural Biology, University of Vienna, Währingerstrasse 17, A-1090 Wien, Austria
- Bioinformatics Group, Department of Computer Science, University of Leipzig, Kreuzstraße 7b, D-04103 Leipzig, Germany
| | - Stefan Röpcke
- Computational Molecular Biology Department, Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
| | - Peter F Arndt
- Computational Molecular Biology Department, Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
| | - Peter F Stadler
- Institute for Theoretical Chemistry and Structural Biology, University of Vienna, Währingerstrasse 17, A-1090 Wien, Austria
- Bioinformatics Group, Department of Computer Science, University of Leipzig, Kreuzstraße 7b, D-04103 Leipzig, Germany
| | - Martin Vingron
- Computational Molecular Biology Department, Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
|
49
|
Abstract
MOTIVATION: Substantial regional variations of substitutional processes have recently been reported from human/mouse comparisons. However, several features including the C+G dependence and the CpG-based transition effect remain obscure. RESULTS: Utilizing the vast number of transposable elements in the human genome, we performed a detailed analysis of the substitutional and insertion/deletion patterns along the human lineage in a regional and time-resolved fashion. We observed a drastic increase in the CpG-based transition frequency at about the time of the mammalian radiation. We also observed clear regional biases of substitution patterns, most notably a bias to enrich the C+G content toward the telomeres. AVAILABILITY: The programs used are available upon request from the authors.
Affiliation(s)
- Peter F Arndt
- Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany.
|
50
|
Abstract
Mutation is the underlying force that provides the variation upon which evolutionary forces can act. It is important to understand how mutation rates vary within genomes and how the probabilities of fixation of new mutations vary as well. If substitutional processes across the genome are heterogeneous, then examining patterns of coding sequence evolution without taking these underlying variations into account may be misleading. Here we present the first rigorous test of substitution rate heterogeneity in the Drosophila melanogaster genome using almost 1500 nonfunctional fragments of the transposable element DNAREP1_DM. Not only do our analyses suggest that substitutional patterns in heterochromatic and euchromatic sequences are different, but also they provide support in favor of a recombination-associated substitutional bias toward G and C in this species. The magnitude of this bias is entirely sufficient to explain recombination-associated patterns of codon usage on the autosomes of the D. melanogaster genome. We also document a bias toward lower GC content in the pattern of small insertions and deletions (indels). In addition, the GC content of noncoding DNA in Drosophila is higher than would be predicted on the basis of the pattern of nucleotide substitutions and small indels. However, we argue that the fast turnover of noncoding sequences in Drosophila makes it difficult to assess the importance of the GC biases in nucleotide substitutions and small indels in shaping the base composition of noncoding sequences.
Affiliation(s)
- Nadia D Singh
- Department of Biological Sciences, Stanford University, Stanford, California 94305-5020, USA.
|