1
|
Jiao Y, Nigam D, Barry K, Daum C, Yoshinaga Y, Lipzen A, Khan A, Parasa SP, Wei S, Lu Z, Tello-Ruiz MK, Dhiman P, Burow G, Hayes C, Chen J, Brandizzi F, Mortimer J, Ware D, Xin Z. A large sequenced mutant library - valuable reverse genetic resource that covers 98% of sorghum genes. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2024; 117:1543-1557. [PMID: 38100514 DOI: 10.1111/tpj.16582] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/16/2023] [Revised: 09/08/2023] [Accepted: 11/29/2023] [Indexed: 12/17/2023]
Abstract
Mutant populations are crucial for functional genomics and discovering novel traits for crop breeding. Sorghum, a drought and heat-tolerant C4 species, requires a vast, large-scale, annotated, and sequenced mutant resource to enhance crop improvement through functional genomics research. Here, we report a sorghum large-scale sequenced mutant population with 9.5 million ethyl methane sulfonate (EMS)-induced mutations that covered 98% of sorghum's annotated genes using inbred line BTx623. Remarkably, a total of 610 320 mutations within the promoter and enhancer regions of 18 000 and 11 790 genes, respectively, can be leveraged for novel research of cis-regulatory elements. A comparison of the distribution of mutations in the large-scale mutant library and sorghum association panel (SAP) provides insights into the influence of selection. EMS-induced mutations appeared to be random across different regions of the genome without significant enrichment in different sections of a gene, including the 5' UTR, gene body, and 3'-UTR. In contrast, there were low variation density in the coding and UTR regions in the SAP. Based on the Ka /Ks value, the mutant library (~1) experienced little selection, unlike the SAP (0.40), which has been strongly selected through breeding. All mutation data are publicly searchable through SorbMutDB (https://www.depts.ttu.edu/igcast/sorbmutdb.php) and SorghumBase (https://sorghumbase.org/). This current large-scale sequence-indexed sorghum mutant population is a crucial resource that enriched the sorghum gene pool with novel diversity and a highly valuable tool for the Poaceae family, that will advance plant biology research and crop breeding.
Collapse
Affiliation(s)
- Yinping Jiao
- Department of Plant and Soil Science, Institute of Genomics for Crop Abiotic Stress Tolerance, Texas Tech University, Lubbock, Texas, 79409, USA
| | - Deepti Nigam
- Department of Plant and Soil Science, Institute of Genomics for Crop Abiotic Stress Tolerance, Texas Tech University, Lubbock, Texas, 79409, USA
| | - Kerrie Barry
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, 94720, USA
| | - Chris Daum
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, 94720, USA
| | - Yuko Yoshinaga
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, 94720, USA
| | - Anna Lipzen
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, 94720, USA
| | - Adil Khan
- Department of Plant and Soil Science, Institute of Genomics for Crop Abiotic Stress Tolerance, Texas Tech University, Lubbock, Texas, 79409, USA
| | - Sai-Praneeth Parasa
- Department of Plant and Soil Science, Institute of Genomics for Crop Abiotic Stress Tolerance, Texas Tech University, Lubbock, Texas, 79409, USA
| | - Sharon Wei
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 11724, USA
| | - Zhenyuan Lu
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 11724, USA
| | | | - Pallavi Dhiman
- Department of Plant and Soil Science, Institute of Genomics for Crop Abiotic Stress Tolerance, Texas Tech University, Lubbock, Texas, 79409, USA
| | - Gloria Burow
- Plant Stress and Germplasm Development Unit, Crop Systems Research Laboratory, USDA-ARS, 3810, 4th Street, Lubbock, Texas, 79424, USA
| | - Chad Hayes
- Plant Stress and Germplasm Development Unit, Crop Systems Research Laboratory, USDA-ARS, 3810, 4th Street, Lubbock, Texas, 79424, USA
| | - Junping Chen
- Plant Stress and Germplasm Development Unit, Crop Systems Research Laboratory, USDA-ARS, 3810, 4th Street, Lubbock, Texas, 79424, USA
| | - Federica Brandizzi
- MSU-DOE Plant Research Lab, Michigan State University, East Lansing, Michigan, USA
- Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, Michigan, USA
- Department of Plant Biology, Michigan State University, East Lansing, Michigan, USA
| | - Jenny Mortimer
- Joint BioEnergy Institute, Emeryville, California, 94608, USA
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California, 94720, USA
- School of Agriculture, Food and Wine, Waite Research Institute, Waite Research Precinct, University of Adelaide, Glen Osmond, South Australia, 5064, Australia
| | - Doreen Ware
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 11724, USA
- USDA-ARS NAA Robert W. Holley Center for Agriculture and Health, Agricultural Research Service, Ithaca, New York, 14853, USA
| | - Zhanguo Xin
- Plant Stress and Germplasm Development Unit, Crop Systems Research Laboratory, USDA-ARS, 3810, 4th Street, Lubbock, Texas, 79424, USA
| |
Collapse
|
2
|
Genome-Wide Prediction of Transcription Start Sites in Conifers. Int J Mol Sci 2022; 23:ijms23031735. [PMID: 35163661 PMCID: PMC8836283 DOI: 10.3390/ijms23031735] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2021] [Revised: 01/30/2022] [Accepted: 02/01/2022] [Indexed: 02/04/2023] Open
Abstract
The identification of promoters is an essential step in the genome annotation process, providing a framework for gene regulatory networks and their role in transcription regulation. Despite considerable advances in the high-throughput determination of transcription start sites (TSSs) and transcription factor binding sites (TFBSs), experimental methods are still time-consuming and expensive. Instead, several computational approaches have been developed to provide fast and reliable means for predicting the location of TSSs and regulatory motifs on a genome-wide scale. Numerous studies have been carried out on the regulatory elements of mammalian genomes, but plant promoters, especially in gymnosperms, have been left out of the limelight and, therefore, have been poorly investigated. The aim of this study was to enhance and expand the existing genome annotations using computational approaches for genome-wide prediction of TSSs in the four conifer species: loblolly pine, white spruce, Norway spruce, and Siberian larch. Our pipeline will be useful for TSS predictions in other genomes, especially for draft assemblies, where reliable TSS predictions are not usually available. We also explored some of the features of the nucleotide composition of the predicted promoters and compared the GC properties of conifer genes with model monocot and dicot plants. Here, we demonstrate that even incomplete genome assemblies and partial annotations can be a reliable starting point for TSS annotation. The results of the TSS prediction in four conifer species have been deposited in the Persephone genome browser, which allows smooth visualization and is optimized for large data sets. This work provides the initial basis for future experimental validation and the study of the regulatory regions to understand gene regulation in gymnosperms.
Collapse
|
3
|
Ryasik A, Orlov M, Zykova E, Ermak T, Sorokin A. Bacterial promoter prediction: Selection of dynamic and static physical properties of DNA for reliable sequence classification. J Bioinform Comput Biol 2018; 16:1840003. [DOI: 10.1142/s0219720018400036] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Predicting promoter activity of DNA fragment is an important task for computational biology. Approaches using physical properties of DNA to predict bacterial promoters have recently gained a lot of attention. To select an adequate set of physical properties for training a classifier, various characteristics of DNA molecule should be taken into consideration. Here, we present a systematic approach that allows us to select less correlated properties for classification by means of both correlation and cophenetic coefficients as well as concordance matrices. To prove this concept, we have developed the first classifier that uses not only sequence and static physical properties of DNA fragment, but also dynamic properties of DNA open states. Therefore, the best performing models with accuracy values up to 90% for all types of sequences were obtained. Furthermore, we have demonstrated that the classifier can serve as a reliable tool enabling promoter DNA fragments to be distinguished from promoter islands despite the similarity of their nucleotide sequences.
Collapse
Affiliation(s)
- Artem Ryasik
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
| | - Mikhail Orlov
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
| | - Evgenia Zykova
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
- Department of Applied Research Informatization, State Institute of Information Technologies and Telecommunications (SIIT&T Informika), per. Brusov 21 st.2, Moscow, 125009, Russia
| | - Timofei Ermak
- Laboratory of Molecular Genetics Systems, Institute of Cytology and Genetics, pr. Akademika Lavrentyeva 10, Novosibirsk 630090, Russia
| | - Anatoly Sorokin
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
| |
Collapse
|
4
|
Triska M, Solovyev V, Baranova A, Kel A, Tatarinova TV. Nucleotide patterns aiding in prediction of eukaryotic promoters. PLoS One 2017; 12:e0187243. [PMID: 29141011 PMCID: PMC5687710 DOI: 10.1371/journal.pone.0187243] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2017] [Accepted: 09/05/2017] [Indexed: 01/09/2023] Open
Abstract
Computational analysis of promoters is hindered by the complexity of their architecture. In less studied genomes with complex organization, false positive promoter predictions are common. Accurate identification of transcription start sites and core promoter regions remains an unsolved problem. In this paper, we present a comprehensive analysis of genomic features associated with promoters and show that probabilistic integrative algorithms-driven models allow accurate classification of DNA sequence into “promoters” and “non-promoters” even in absence of the full-length cDNA sequences. These models may be built upon the maps of the distributions of sequence polymorphisms, RNA sequencing reads on genomic DNA, methylated nucleotides, transcription factor binding sites, as well as relative frequencies of nucleotides and their combinations. Positional clustering of binding sites shows that the cells of Oryza sativa utilize three distinct classes of transcription factors: those that bind preferentially to the [-500,0] region (188 “promoter-specific” transcription factors), those that bind preferentially to the [0,500] region (282 “5′ UTR-specific” TFs), and 207 of the “promiscuous” transcription factors with little or no location preference with respect to TSS. For the most informative motifs, their positional preferences are conserved between dicots and monocots.
Collapse
Affiliation(s)
- Martin Triska
- Children’s Hospital Los Angeles, University of Southern California, Los Angeles, CA, United States of America
- Faculty of Advanced Technology, University of South Wales, Pontypridd, Wales, United Kingdom
| | | | - Ancha Baranova
- School of Systems Biology, George Mason University, Fairfax, VA, United States of America
- Research Centre for Medical Genetics, Moscow, Russia
| | - Alexander Kel
- geneXplain GmbH, Wolfenbuettel, Germany
- Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia
| | - Tatiana V. Tatarinova
- School of Systems Biology, George Mason University, Fairfax, VA, United States of America
- Department of Biology, Division of Natural Sciences, University of La Verne, La Verne, CA, United States of America
- Bioinformatics Center, AA Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia
- Vavilov’s Institute for General Genetics, Moscow, Russia, Moscow, Russia
- * E-mail:
| |
Collapse
|
5
|
Evolution of Brain Active Gene Promoters in Human Lineage Towards the Increased Plasticity of Gene Regulation. Mol Neurobiol 2017; 55:1871-1904. [PMID: 28233272 DOI: 10.1007/s12035-017-0427-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2016] [Accepted: 01/26/2017] [Indexed: 01/31/2023]
Abstract
Adaptability to a variety of environmental conditions is a prominent feature of Homo sapiens. We hypothesize that this feature can be explained by evolutionary changes in gene promoters active in the brain prefrontal cortex leading to a more flexible gene regulation network. The genotype-dependent range of gene expression can be broader in humans than in other higher primates. Thus, we searched for specific signatures of evolutionary changes in promoter architectures of multiple hominid genes, including the genes active in human cortical neurons that may indicate an increase of variability of gene expression rather than just changes in the level of expression, such as downregulation or upregulation of the genes. We performed a whole-genome search for genetic-based alterations that may impact gene regulation "flexibility" in a process of hominids evolution, such as (i) CpG dinucleotide content, (ii) predicted nucleosome-DNA dissociation constant, and (iii) predicted affinities for TATA-binding protein (TBP) in gene promoters. We tested all putative promoter regions across the human genome and especially gene promoters in active chromatin state in neurons of prefrontal cortex, the brain region critical for abstract thinking and social and behavioral adaptation. Our data imply that the origin of modern man has been associated with an increase of flexibility of promoter-driven gene regulation in brain. In contrast, after splitting from the ancestral lineages of H. sapiens, the evolution of ape species is characterized by reduced flexibility of gene promoter functioning, underlying reduced variability of the gene expression.
Collapse
|
6
|
Chan KL, Rosli R, Tatarinova TV, Hogan M, Firdaus-Raih M, Low ETL. Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data. BMC Bioinformatics 2017; 18:1426. [PMID: 28466793 PMCID: PMC5333190 DOI: 10.1186/s12859-016-1426-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Gene prediction is one of the most important steps in the genome annotation process. A large number of software tools and pipelines developed by various computing techniques are available for gene prediction. However, these systems have yet to accurately predict all or even most of the protein-coding regions. Furthermore, none of the currently available gene-finders has a universal Hidden Markov Model (HMM) that can perform gene prediction for all organisms equally well in an automatic fashion. RESULTS We present an automated gene prediction pipeline, Seqping that uses self-training HMM models and transcriptomic data. The pipeline processes the genome and transcriptome sequences of the target species using GlimmerHMM, SNAP, and AUGUSTUS pipelines, followed by MAKER2 program to combine predictions from the three tools in association with the transcriptomic evidence. Seqping generates species-specific HMMs that are able to offer unbiased gene predictions. The pipeline was evaluated using the Oryza sativa and Arabidopsis thaliana genomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the pipeline was able to identify at least 95% of BUSCO's plantae dataset. Our evaluation shows that Seqping was able to generate better gene predictions compared to three HMM-based programs (MAKER2, GlimmerHMM and AUGUSTUS) using their respective available HMMs. Seqping had the highest accuracy in rice (0.5648 for CDS, 0.4468 for exon, and 0.6695 nucleotide structure) and A. thaliana (0.5808 for CDS, 0.5955 for exon, and 0.8839 nucleotide structure). CONCLUSIONS Seqping provides researchers a seamless pipeline to train species-specific HMMs and predict genes in newly sequenced or less-studied genomes. We conclude that the Seqping pipeline predictions are more accurate than gene predictions using the other three approaches with the default or available HMMs.
Collapse
Affiliation(s)
- Kuang-Lim Chan
- Advanced Biotechnology and Breeding Center, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor Malaysia
- Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor Malaysia
| | - Rozana Rosli
- Advanced Biotechnology and Breeding Center, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor Malaysia
| | - Tatiana V. Tatarinova
- Center for Personalized Medicine and Spatial Sciences Institute, University of Southern California, Los Angeles, CA USA
| | - Michael Hogan
- Orion Genomics, 4041 Forest Park Avenue, St. Louis, MO 63108 USA
| | - Mohd Firdaus-Raih
- Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor Malaysia
| | - Eng-Ti Leslie Low
- Advanced Biotechnology and Breeding Center, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor Malaysia
| |
Collapse
|
7
|
Tatarinova TV, Chekalin E, Nikolsky Y, Bruskin S, Chebotarov D, McNally KL, Alexandrov N. Nucleotide diversity analysis highlights functionally important genomic regions. Sci Rep 2016; 6:35730. [PMID: 27774999 PMCID: PMC5075931 DOI: 10.1038/srep35730] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2016] [Accepted: 09/30/2016] [Indexed: 12/15/2022] Open
Abstract
We analyzed functionality and relative distribution of genetic variants across the complete Oryza sativa genome, using the 40 million single nucleotide polymorphisms (SNPs) dataset from the 3,000 Rice Genomes Project (http://snp-seek.irri.org), the largest and highest density SNP collection for any higher plant. We have shown that the DNA-binding transcription factors (TFs) are the most conserved group of genes, whereas kinases and membrane-localized transporters are the most variable ones. TFs may be conserved because they belong to some of the most connected regulatory hubs that modulate transcription of vast downstream gene networks, whereas signaling kinases and transporters need to adapt rapidly to changing environmental conditions. In general, the observed profound patterns of nucleotide variability reveal functionally important genomic regions. As expected, nucleotide diversity is much higher in intergenic regions than within gene bodies (regions spanning gene models), and protein-coding sequences are more conserved than untranslated gene regions. We have observed a sharp decline in nucleotide diversity that begins at about 250 nucleotides upstream of the transcription start and reaches minimal diversity exactly at the transcription start. We found the transcription termination sites to have remarkably symmetrical patterns of SNP density, implying presence of functional sites near transcription termination. Also, nucleotide diversity was significantly lower near 3′ UTRs, the area rich with regulatory regions.
Collapse
Affiliation(s)
- Tatiana V Tatarinova
- Center for Personalized Medicine and Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA.,Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russian Federation
| | | | - Yuri Nikolsky
- Vavilov Institute of General Genetics, Moscow, Russia.,F1 Genomics, San Diego, CA, USA.,School of Systems Biology, George Mason University, VA, USA
| | | | - Dmitry Chebotarov
- International Rice Research Institute, Los Baños, Laguna 4031, Philippines
| | - Kenneth L McNally
- International Rice Research Institute, Los Baños, Laguna 4031, Philippines
| | | |
Collapse
|
8
|
Lis M, Walther D. The orientation of transcription factor binding site motifs in gene promoter regions: does it matter? BMC Genomics 2016; 17:185. [PMID: 26939991 PMCID: PMC4778318 DOI: 10.1186/s12864-016-2549-x] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2015] [Accepted: 02/27/2016] [Indexed: 12/23/2022] Open
Abstract
Background Gene expression is to large degree regulated by the specific binding of protein transcription factors to cis-regulatory transcription factor binding sites in gene promoter regions. Despite the identification of hundreds of binding site sequence motifs, the question as to whether motif orientation matters with regard to the gene expression regulation of the respective downstream genes appears surprisingly underinvestigated. Results We pursued a statistical approach by probing 293 reported non-palindromic transcription factor binding site and ten core promoter motifs in Arabidopsis thaliana for evidence of any relevance of motif orientation based on mapping statistics and effects on the co-regulation of gene expression of the respective downstream genes. Although positional intervals closer to the transcription start site (TSS) were found with increased frequencies of motifs exhibiting orientation preference, a corresponding effect with regard to gene expression regulation as evidenced by increased co-expression of genes harboring the favored orientation in their upstream sequence could not be established. Furthermore, we identified an intrinsic orientational asymmetry of sequence regions close to the TSS as the likely source of the identified motif orientation preferences. By contrast, motif presence irrespective of orientation was found associated with pronounced effects on gene expression co-regulation validating the pursued approach. Inspecting motif pairs revealed statistically preferred orientational arrangements, but no consistent effect with regard to arrangement-dependent gene expression regulation was evident. Conclusions Our results suggest that for the motifs considered here, either no specific orientation rendering them functional across all their instances exists with orientational requirements instead depending on gene-locus specific additional factors, or that the binding orientation of transcription factors may generally not be relevant, but rather the event of binding itself. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2549-x) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Monika Lis
- Max Planck Institute for Molecular Plant Physiology, Am Mühlenberg 1, 14476, Potsdam-Golm, Germany.
| | - Dirk Walther
- Max Planck Institute for Molecular Plant Physiology, Am Mühlenberg 1, 14476, Potsdam-Golm, Germany.
| |
Collapse
|
9
|
Bourras S, Rouxel T, Meyer M. Agrobacterium tumefaciens Gene Transfer: How a Plant Pathogen Hacks the Nuclei of Plant and Nonplant Organisms. PHYTOPATHOLOGY 2015; 105:1288-1301. [PMID: 26151736 DOI: 10.1094/phyto-12-14-0380-rvw] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
Agrobacterium species are soilborne gram-negative bacteria exhibiting predominantly a saprophytic lifestyle. Only a few of these species are capable of parasitic growth on plants, causing either hairy root or crown gall diseases. The core of the infection strategy of pathogenic Agrobacteria is a genetic transformation of the host cell, via stable integration into the host genome of a DNA fragment called T-DNA. This genetic transformation results in oncogenic reprogramming of the host to the benefit of the pathogen. This unique ability of interkingdom DNA transfer was largely used as a tool for genetic engineering. Thus, the artificial host range of Agrobacterium is continuously expanding and includes plant and nonplant organisms. The increasing availability of genomic tools encouraged genome-wide surveys of T-DNA tagged libraries, and the pattern of T-DNA integration in eukaryotic genomes was studied. Therefore, data have been collected in numerous laboratories to attain a better understanding of T-DNA integration mechanisms and potential biases. This review focuses on the intranuclear mechanisms necessary for proper targeting and stable expression of Agrobacterium oncogenic T-DNA in the host cell. More specifically, the role of genome features and the putative involvement of host's transcriptional machinery in relation to the T-DNA integration and effects on gene expression are discussed. Also, the mechanisms underlying T-DNA integration into specific genome compartments is reviewed, and a theoretical model for T-DNA intranuclear targeting is presented.
Collapse
Affiliation(s)
- Salim Bourras
- First, second, and third authors: INRA, UMR 1290 INRA-AgroParisTech BIOGER, Avenue Lucien Brétignières, BP 01, F-78850 Thiverval-Grignon, France
| | - Thierry Rouxel
- First, second, and third authors: INRA, UMR 1290 INRA-AgroParisTech BIOGER, Avenue Lucien Brétignières, BP 01, F-78850 Thiverval-Grignon, France
| | - Michel Meyer
- First, second, and third authors: INRA, UMR 1290 INRA-AgroParisTech BIOGER, Avenue Lucien Brétignières, BP 01, F-78850 Thiverval-Grignon, France
| |
Collapse
|
10
|
Tatarinova T, Elhaik E, Pellegrini M. Cross-species analysis of genic GC3 content and DNA methylation patterns. Genome Biol Evol 2013; 5:1443-56. [PMID: 23833164 PMCID: PMC3762193 DOI: 10.1093/gbe/evt103] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open
Abstract
The GC content in the third codon position (GC3) exhibits a unimodal distribution in many plant and animal genomes. Interestingly, grasses and homeotherm vertebrates exhibit a unique bimodal distribution. High GC3 was previously found to be associated with variable expression, higher frequency of upstream TATA boxes, and an increase of GC3 from 5′ to 3′. Moreover, GC3-rich genes are predominant in certain gene classes and are enriched in CpG dinucleotides that are potential targets for methylation. Based on the GC3 bimodal distribution we hypothesize that GC3 has a regulatory role involving methylation and gene expression. To test that hypothesis, we selected diverse taxa (rice, thale cress, bee, and human) that varied in the modality of their GC3 distribution and tested the association between GC3, DNA methylation, and gene expression. We examine the relationship between cytosine methylation levels and GC3, gene expression, genome signature, gene length, and other gene compositional features. We find a strong negative correlation (Pearson’s correlation coefficient r = −0.67, P value < 0.0001) between GC3 and genic CpG methylation. The comparison between 5′-3′ gradients of CG3-skew and genic methylation for the taxa in the study suggests interplay between gene-body methylation and transcription-coupled cytosine deamination effect. Compositional features are correlated with methylation levels of genes in rice, thale cress, human, bee, and fruit fly (which acts as an unmethylated control). These patterns allow us to generate evolutionary hypotheses about the relationships between GC3 and methylation and how these affect expression patterns. Specifically, we propose that the opposite effects of methylation and compositional gradients along coding regions of GC3-poor and GC3-rich genes are the products of several competing processes.
Collapse
Affiliation(s)
- Tatiana Tatarinova
- Laboratory of Applied Pharmacokinetics and Bioinformatics, University of Southern California.
| | | | | |
Collapse
|
11
|
Eukaryotic genomes may exhibit up to 10 generic classes of gene promoters. BMC Genomics 2012; 13:512. [PMID: 23020586 PMCID: PMC3549790 DOI: 10.1186/1471-2164-13-512] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2012] [Accepted: 09/13/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The main function of gene promoters appears to be the integration of different gene products in their biological pathways in order to maintain homeostasis. Generally, promoters have been classified in two major classes, namely TATA and CpG. Nevertheless, many genes using the same combinatorial formation of transcription factors have different gene expression patterns. Accordingly, we tried to ask ourselves some fundamental questions: Why certain genes have an overall predisposition for higher gene expression levels than others? What causes such a predisposition? Is there a structural relationship of these sequences in different tissues? Is there a strong phylogenetic relationship between promoters of closely related species? RESULTS In order to gain valuable insights into different promoter regions, we obtained a series of image-based patterns which allowed us to identify 10 generic classes of promoters. A comprehensive analysis was undertaken for promoter sequences from Arabidopsis thaliana, Drosophila melanogaster, Homo sapiens and Oryza sativa, and a more extensive analysis of tissue-specific promoters in humans. We observed a clear preference for these species to use certain classes of promoters for specific biological processes. Moreover, in humans, we found that different tissues use distinct classes of promoters, reflecting an emerging promoter network. Depending on the tissue type, comparisons made between these classes of promoters reveal a complementarity between their patterns whereas some other classes of promoters have been observed to occur in competition. Furthermore, we also noticed the existence of some transitional states between these classes of promoters that may explain certain evolutionary mechanisms, which suggest a possible predisposition for specific levels of gene expression and perhaps for a different number of factors responsible for triggering gene expression. Our conclusions are based on comprehensive data from three different databases and a new computer model whose core is using Kappa index of coincidence. CONCLUSIONS To fully understand the connections between gene promoters and gene expression, we analyzed thousands of promoter sequences using our Kappa Index of Coincidence method and a specialized Optical Character Recognition (OCR) neural network. Under our criteria, 10 classes of promoters were detected. In addition, the existence of "transitional" promoters suggests that there is an evolutionary weighted continuum between classes, depending perhaps upon changes in their gene products.
Collapse
|
12
|
Arakawa K, Tomita M. Measures of compositional strand bias related to replication machinery and its applications. Curr Genomics 2012; 13:4-15. [PMID: 22942671 PMCID: PMC3269016 DOI: 10.2174/138920212799034749] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2011] [Revised: 09/10/2011] [Accepted: 09/20/2011] [Indexed: 11/22/2022] Open
Abstract
The compositional asymmetry of complementary bases in nucleotide sequences implies the existence of a mutational or selectional bias in the two strands of the DNA duplex, which is commonly shaped by strand-specific mechanisms in transcription or replication. Such strand bias in genomes, frequently visualized by GC skew graphs, is used for the computational prediction of transcription start sites and replication origins, as well as for comparative evolutionary genomics studies. The use of measures of compositional strand bias in order to quantify the degree of strand asymmetry is crucial, as it is the basis for determining the applicability of compositional analysis and comparing the strength of the mutational bias in different biological machineries in various species. Here, we review the measures of strand bias that have been proposed to date, including the ∆GC skew, the B1 index, the predictability score of linear discriminant analysis for gene orientation, the signal-to-noise ratio of the oligonucleotide bias, and the GC skew index. These measures have been predominantly designed for and applied to the analysis of replication-related mutational processes in prokaryotes, but we also give research examples in eukaryotes.
Collapse
Affiliation(s)
- Kazuharu Arakawa
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
| | | |
Collapse
|
13
|
Incidence of genome structure, DNA asymmetry, and cell physiology on T-DNA integration in chromosomes of the phytopathogenic fungus Leptosphaeria maculans. G3-GENES GENOMES GENETICS 2012; 2:891-904. [PMID: 22908038 PMCID: PMC3411245 DOI: 10.1534/g3.112.002048] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/21/2012] [Accepted: 06/07/2012] [Indexed: 11/18/2022]
Abstract
The ever-increasing generation of sequence data is accompanied by unsatisfactory functional annotation, and complex genomes, such as those of plants and filamentous fungi, show a large number of genes with no predicted or known function. For functional annotation of unknown or hypothetical genes, the production of collections of mutants using Agrobacterium tumefaciens–mediated transformation (ATMT) associated with genotyping and phenotyping has gained wide acceptance. ATMT is also widely used to identify pathogenicity determinants in pathogenic fungi. A systematic analysis of T-DNA borders was performed in an ATMT-mutagenized collection of the phytopathogenic fungus Leptosphaeria maculans to evaluate the features of T-DNA integration in its particular transposable element-rich compartmentalized genome. A total of 318 T-DNA tags were recovered and analyzed for biases in chromosome and genic compartments, existence of CG/AT skews at the insertion site, and occurrence of microhomologies between the T-DNA left border (LB) and the target sequence. Functional annotation of targeted genes was done using the Gene Ontology annotation. The T-DNA integration mainly targeted gene-rich, transcriptionally active regions, and it favored biological processes consistent with the physiological status of a germinating spore. T-DNA integration was strongly biased toward regulatory regions, and mainly promoters. Consistent with the T-DNA intranuclear-targeting model, the density of T-DNA insertion correlated with CG skew near the transcription initiation site. The existence of microhomologies between promoter sequences and the T-DNA LB flanking sequence was also consistent with T-DNA integration to host DNA mediated by homologous recombination based on the microhomology-mediated end-joining pathway.
Collapse
|
14
|
McLean MA, Tirosh I. Opposite GC skews at the 5' and 3' ends of genes in unicellular fungi. BMC Genomics 2011; 12:638. [PMID: 22208287 PMCID: PMC3315797 DOI: 10.1186/1471-2164-12-638] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2011] [Accepted: 12/30/2011] [Indexed: 11/24/2022] Open
Abstract
Background GC-skews have previously been linked to transcription in some eukaryotes. They have been associated with transcription start sites, with the coding strand G-biased in mammals and C-biased in fungi and invertebrates. Results We show a consistent and highly significant pattern of GC-skew within genes of almost all unicellular fungi. The pattern of GC-skew is asymmetrical: the coding strand of genes is typically C-biased at the 5' ends but G-biased at the 3' ends, with intermediate skews at the middle of genes. Thus, the initiation, elongation, and termination phases of transcription are associated with different skews. This pattern influences the encoded proteins by generating differential usage of amino acids at the 5' and 3' ends of genes. These biases also affect fourfold-degenerate positions and extend into promoters and 3' UTRs, indicating that skews cannot be accounted by selection for protein function or translation. Conclusions We propose two explanations, the mutational pressure hypothesis, and the adaptive hypothesis. The mutational pressure hypothesis is that different co-factors bind to RNA pol II at different phases of transcription, producing different mutational regimes. The adaptive hypothesis is that cytidine triphosphate deficiency may lead to C-avoidance at the 3' ends of transcripts to control the flow of RNA pol II molecules and reduce their frequency of collisions.
Collapse
Affiliation(s)
- Malcolm A McLean
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel.
| | | |
Collapse
|
15
|
Synonymous Codon Usage, GC3, and Evolutionary Patterns Across Plastomes of Three Pooid Model Species: Emerging Grass Genome Models for Monocots. Mol Biotechnol 2011; 49:116-28. [DOI: 10.1007/s12033-011-9383-9] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
16
|
Medvedeva YA, Kulakovskii IV, Oparina NY, Favorov AV, Makeev VY. The GC skew near Pol II start sites and its association with SP1-binding site variants. Biophysics (Nagoya-shi) 2010. [DOI: 10.1134/s0006350910060023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
|
17
|
Identification of TATA and TATA-less promoters in plant genomes by integrating diversity measure, GC-Skew and DNA geometric flexibility. Genomics 2010; 97:112-20. [PMID: 21112384 DOI: 10.1016/j.ygeno.2010.11.002] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2010] [Revised: 11/05/2010] [Accepted: 11/12/2010] [Indexed: 11/20/2022]
Abstract
Accurate identification of core promoters is important for gaining more insight about the understanding of the eukaryotic transcription regulation. In this study, the authors focused on the biologically realistic promoter prediction of plant genomes. By analyzing the correlative conservation, GC-compositional bias and specific structural patterns of TATA and TATA-less promoters in PlantPromDB, a hybrid multi-feature approach based on support vector machine (SVM) for predicting the two types of promoters were developed by integrating local word content, GC-Skew and DNA geometric flexibility. Compared with the TSSP-TCM program on the same test dataset, better prediction results were obtained. Especially for the TATA-less promoter, the accuracy is 10% higher than the result of TSSP-TCM program. The good performance of the hybrid promoters and the experimental data also indicate that our method has the ability to locate the promoter region of the plant genome.
Collapse
|
18
|
Palidwor GA, Perkins TJ, Xia X. A general model of codon bias due to GC mutational bias. PLoS One 2010; 5:e13431. [PMID: 21048949 PMCID: PMC2965080 DOI: 10.1371/journal.pone.0013431] [Citation(s) in RCA: 122] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2010] [Accepted: 09/10/2010] [Indexed: 12/04/2022] Open
Abstract
Background In spite of extensive research on the effect of mutation and selection on codon usage, a general model of codon usage bias due to mutational bias has been lacking. Because most amino acids allow synonymous GC content changing substitutions in the third codon position, the overall GC bias of a genome or genomic region is highly correlated with GC3, a measure of third position GC content. For individual amino acids as well, G/C ending codons usage generally increases with increasing GC bias and decreases with increasing AT bias. Arginine and leucine, amino acids that allow GC-changing synonymous substitutions in the first and third codon positions, have codons which may be expected to show different usage patterns. Principal Findings In analyzing codon usage bias in hundreds of prokaryotic and plant genomes and in human genes, we find that two G-ending codons, AGG (arginine) and TTG (leucine), unlike all other G/C-ending codons, show overall usage that decreases with increasing GC bias, contrary to the usual expectation that G/C-ending codon usage should increase with increasing genomic GC bias. Moreover, the usage of some codons appears nonlinear, even nonmonotone, as a function of GC bias. To explain these observations, we propose a continuous-time Markov chain model of GC-biased synonymous substitution. This model correctly predicts the qualitative usage patterns of all codons, including nonlinear codon usage in isoleucine, arginine and leucine. The model accounts for 72%, 64% and 52% of the observed variability of codon usage in prokaryotes, plants and human respectively. When codons are grouped based on common GC content, 87%, 80% and 68% of the variation in usage is explained for prokaryotes, plants and human respectively. Conclusions The model clarifies the sometimes-counterintuitive effects that GC mutational bias can have on codon usage, quantifies the influence of GC mutational bias and provides a natural null model relative to which other influences on codon bias may be measured.
Collapse
|
19
|
Troukhan M, Tatarinova T, Bouck J, Flavell RB, Alexandrov NN. Genome-wide discovery of cis-elements in promoter sequences using gene expression. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2010; 13:139-51. [PMID: 19231992 DOI: 10.1089/omi.2008.0034] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
The availability of complete or nearly complete genome sequences, a large number of 5' expressed sequence tags, and significant public expression data allow for a more accurate identification of cis-elements regulating gene expression. We have implemented a global approach that takes advantage of available expression data, genomic sequences, and transcript information to predict cis-elements associated with specific expression patterns. The key components of our approach are: (1) precise identification of transcription start sites, (2) specific locations of cis-elements relative to the transcription start site, and (3) assessment of statistical significance for all sequence motifs. By applying our method to promoters of Arabidopsis thaliana and Mus musculus, we have identified motifs that affect gene expression under specific environmental conditions or in certain tissues. We also found that the presence of the TATA box is associated with increased variability of gene expression. Strong correlation between our results and experimentally determined motifs shows that the method is capable of predicting new functionally important cis-elements in promoter sequences.
Collapse
Affiliation(s)
- Maxim Troukhan
- Ceres, Inc. 1535 Rancho Conejo Road, Thousand Oaks, CA 91310, USA
| | | | | | | | | |
Collapse
|
20
|
Tatarinova TV, Alexandrov NN, Bouck JB, Feldmann KA. GC3 biology in corn, rice, sorghum and other grasses. BMC Genomics 2010; 11:308. [PMID: 20470436 PMCID: PMC2895627 DOI: 10.1186/1471-2164-11-308] [Citation(s) in RCA: 105] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2009] [Accepted: 05/16/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The third, or wobble, position in a codon provides a high degree of possible degeneracy and is an elegant fault-tolerance mechanism. Nucleotide biases between organisms at the wobble position have been documented and correlated with the abundances of the complementary tRNAs. We and others have noticed a bias for cytosine and guanine at the third position in a subset of transcripts within a single organism. The bias is present in some plant species and warm-blooded vertebrates but not in all plants, or in invertebrates or cold-blooded vertebrates. RESULTS Here we demonstrate that in certain organisms the amount of GC at the wobble position (GC3) can be used to distinguish two classes of genes. We highlight the following features of genes with high GC3 content: they (1) provide more targets for methylation, (2) exhibit more variable expression, (3) more frequently possess upstream TATA boxes, (4) are predominant in certain classes of genes (e.g., stress responsive genes) and (5) have a GC3 content that increases from 5'to 3'. These observations led us to formulate a hypothesis to explain GC3 bimodality in grasses. CONCLUSIONS Our findings suggest that high levels of GC3 typify a class of genes whose expression is regulated through DNA methylation or are a legacy of accelerated evolution through gene conversion. We discuss the three most probable explanations for GC3 bimodality: biased gene conversion, transcriptional and translational advantage and gene methylation.
Collapse
Affiliation(s)
- Tatiana V Tatarinova
- Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA.
| | | | | | | |
Collapse
|
21
|
Civán P, Svec M. Genome-wide analysis of rice (Oryza sativa L. subsp. japonica) TATA box and Y Patch promoter elements. Genome 2009; 52:294-7. [PMID: 19234558 DOI: 10.1139/g09-001] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
The TATA box is one of the best characterized transcription factor binding sites. However, it is not a ubiquitous element of core promoters, and other sequence motifs such as Y Patches seem to play a major role in plants. Here, we present a first genome-wide computational analysis of the TATA box and Y Patch distribution in rice (Oryza sativa L. subsp. japonica) promoter sequences. Utilizing a probabilistic sequence model, we ascertain that only approximately 19% of rice genes possess the TATA box, but approximately 50% contain one or more Y Patches in their core promoters. By computational processing of identified elements, we generated extended TATA box and Y Patch nucleotide frequency matrices capable of predicting these motifs in plants with a high degree of confidence.
Collapse
Affiliation(s)
- P Civán
- Department of Genetics, Faculty of Natural Sciences, Comenius University, Mlynská dolina B-1, 842 15 Bratislava, Slovakia.
| | | |
Collapse
|
22
|
Alexandrov NN, Brover VV, Freidin S, Troukhan ME, Tatarinova TV, Zhang H, Swaller TJ, Lu YP, Bouck J, Flavell RB, Feldmann KA. Insights into corn genes derived from large-scale cDNA sequencing. PLANT MOLECULAR BIOLOGY 2009; 69:179-94. [PMID: 18937034 PMCID: PMC2709227 DOI: 10.1007/s11103-008-9415-4] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/15/2008] [Accepted: 10/01/2008] [Indexed: 05/19/2023]
Abstract
We present a large portion of the transcriptome of Zea mays, including ESTs representing 484,032 cDNA clones from 53 libraries and 36,565 fully sequenced cDNA clones, out of which 31,552 clones are non-redundant. These and other previously sequenced transcripts have been aligned with available genome sequences and have provided new insights into the characteristics of gene structures and promoters within this major crop species. We found that although the average number of introns per gene is about the same in corn and Arabidopsis, corn genes have more alternatively spliced isoforms. Examination of the nucleotide composition of coding regions reveals that corn genes, as well as genes of other Poaceae (Grass family), can be divided into two classes according to the GC content at the third position in the amino acid encoding codons. Many of the transcripts that have lower GC content at the third position have dicot homologs but the high GC content transcripts tend to be more specific to the grasses. The high GC content class is also enriched with intronless genes. Together this suggests that an identifiable class of genes in plants is associated with the Poaceae divergence. Furthermore, because many of these genes appear to be derived from ancestral genes that do not contain introns, this evolutionary divergence may be the result of horizontal gene transfer from species not only with different codon usage but possibly that did not have introns, perhaps outside of the plant kingdom. By comparing the cDNAs described herein with the non-redundant set of corn mRNAs in GenBank, we estimate that there are about 50,000 different protein coding genes in Zea. All of the sequence data from this study have been submitted to DDBJ/GenBank/EMBL under accession numbers EU940701-EU977132 (FLI cDNA) and FK944382-FL482108 (EST).
Collapse
|
23
|
Yamamoto YY, Ichida H, Abe T, Suzuki Y, Sugano S, Obokata J. Differentiation of core promoter architecture between plants and mammals revealed by LDSS analysis. Nucleic Acids Res 2007; 35:6219-26. [PMID: 17855401 PMCID: PMC2094075 DOI: 10.1093/nar/gkm685] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Mammalian promoters are categorized into TATA and CpG-related groups, and they have complementary roles associated with differentiated transcriptional characteristics. While the TATA box is also found in plant promoters, it is not known if CpG-type promoters exist in plants. Plant promoters contain Y Patches (pyrimidine patches) in the core promoter region, and the ubiquity of these beyond higher plants is not understood as well. Sets of promoter sequences were utilized for the analysis of local distribution of short sequences (LDSS), and approximately one thousand octamer sequences have been identified as promoter constituents from Arabidopsis, rice, human and mouse, respectively. Based on their localization profiles, the identified octamer sequences were classified into several major groups, REG (Regulatory Element Group), TATA box, Inr (Initiator), Kozak, CpG and Y Patch. Comparison of the four species has revealed three categories: (i) shared groups found in both plants and mammals (TATA box), (ii) common groups found in both kingdoms but the utilized sequence is differentiated (REG, Inr and Kozak) and (iii) specific groups found in either plants or mammals (CpG and Y Patch). Our comparative LDSS analysis has identified conservation and differentiation of promoter architectures between higher plants and mammals.
Collapse
|
24
|
Yamamoto YY, Ichida H, Matsui M, Obokata J, Sakurai T, Satou M, Seki M, Shinozaki K, Abe T. Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics 2007; 8:67. [PMID: 17346352 PMCID: PMC1832190 DOI: 10.1186/1471-2164-8-67] [Citation(s) in RCA: 110] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2006] [Accepted: 03/08/2007] [Indexed: 11/20/2022] Open
Abstract
Background Plant promoter architecture is important for understanding regulation and evolution of the promoters, but our current knowledge about plant promoter structure, especially with respect to the core promoter, is insufficient. Several promoter elements including TATA box, and several types of transcriptional regulatory elements have been found to show local distribution within promoters, and this feature has been successfully utilized for extraction of promoter constituents from human genome. Results LDSS (Local Distribution of Short Sequences) profiles of short sequences along the plant promoter have been analyzed in silico, and hundreds of hexamer and octamer sequences have been identified as having localized distributions within promoters of Arabidopsis thaliana and rice. Based on their localization patterns, the identified sequences could be classified into three groups, pyrimidine patch (Y Patch), TATA box, and REG (Regulatory Element Group). Sequences of the TATA box group are consistent with the ones reported in previous studies. The REG group includes more than 200 sequences, and half of them correspond to known cis-elements. The other REG subgroups, together with about a hundred uncategorized sequences, are suggested to be novel cis-regulatory elements. Comparison of LDSS-positive sequences between Arabidopsis and rice has revealed moderate conservation of elements and common promoter architecture. In addition, a dimer motif named the YR Rule (C/T A/G) has been identified at the transcription start site (-1/+1). This rule also fits both Arabidopsis and rice promoters. Conclusion LDSS was successfully applied to plant genomes and hundreds of putative promoter elements have been extracted as LDSS-positive octamers. Identified promoter architecture of monocot and dicot are well conserved, but there are moderate variations in the utilized sequences.
Collapse
Affiliation(s)
- Yoshiharu Y Yamamoto
- Application and Development Group, RIKEN FRS, Hirosawa 2-1, Wako, Saitama 351-0198, Japan
- Center for Gene Research, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi 464-8602, Japan
| | - Hiroyuki Ichida
- Application and Development Group, RIKEN FRS, Hirosawa 2-1, Wako, Saitama 351-0198, Japan
- Graduate School of Science and Technology, Chiba University, Matsudo 648, Matsudo, Chiba 271-8510, Japan
| | - Minami Matsui
- RIKEN Genomic Sciences Center, Suehirocho 1-7-22, Tsurumiku, Yokohama, Kanagawa 230-0045, Japan
| | - Junichi Obokata
- Center for Gene Research, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi 464-8602, Japan
| | - Tetsuya Sakurai
- RIKEN Plant Science Center, Suehirocho 1-7-22, Tsurumiku, Yokohama, Kanagawa 230-0045, Japan
| | - Masakazu Satou
- Graduate School of Science and Technology, Chiba University, Matsudo 648, Matsudo, Chiba 271-8510, Japan
| | - Motoaki Seki
- Graduate School of Science and Technology, Chiba University, Matsudo 648, Matsudo, Chiba 271-8510, Japan
| | - Kazuo Shinozaki
- RIKEN Plant Science Center, Suehirocho 1-7-22, Tsurumiku, Yokohama, Kanagawa 230-0045, Japan
| | - Tomoko Abe
- Application and Development Group, RIKEN FRS, Hirosawa 2-1, Wako, Saitama 351-0198, Japan
| |
Collapse
|
25
|
Paz A, Mester D, Nevo E, Korol A. Looking for organization patterns of highly expressed genes: purine-pyrimidine composition of precursor mRNAs. J Mol Evol 2007; 64:248-60. [PMID: 17211550 DOI: 10.1007/s00239-006-0135-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2006] [Accepted: 11/19/2006] [Indexed: 01/05/2023]
Abstract
We analyzed precursor messenger RNAs (pre-mRNAs) of 12 eukaryotic species. In each species, three groups of highly expressed genes, ribosomal proteins, heat shock proteins, and amino-acyl tRNA synthetases, were compared with a control group (randomly selected genes). The purine-pyrimidine (R-Y) composition of pre-mRNAs of the three targeted gene groups proved to differ significantly from the control. The exons of the three groups tested have higher purine contents and R-tract abundance and lower abundance of Y-tracts compared to the control (R-tract-tract of sequential purines with Rn>or=5; Y-tract-tract of sequential pyrimidines with Yn>or=5). In species widely employing "intron definition" in the splicing process, the Y content of introns of the three targeted groups appeared to be higher compared to the control group. Furthermore, in all examined species, the introns of the targeted genes have a lower abundance of R-tracts compared to the control. We hypothesized that the R-Y composition of the targeted gene groups contributes to high rate and efficiency of both splicing and translation, in addition to the mRNA coding role. This is presumably achieved by (1) reducing the possibility of the formation of secondary structures in the mRNA, (2) using the R-tracts and R-biased sequences as exonic splicing enhancers, (3) lowering the amount of targets for pyrimidine tract binding protein in the exons, and (4) reducing the amount of target sequences for binding of serine/arginine-rich (SR) proteins in the introns, thereby allowing SR proteins to bind to proper (exonic) targets.
Collapse
Affiliation(s)
- A Paz
- Institute of Evolution, Haifa University, Mount Carmel, Haifa, 31905, Israel
| | | | | | | |
Collapse
|
26
|
Schneeberger RG, Zhang K, Tatarinova T, Troukhan M, Kwok SF, Drais J, Klinger K, Orejudos F, Macy K, Bhakta A, Burns J, Subramanian G, Donson J, Flavell R, Feldmann KA. Agrobacterium T-DNA integration in Arabidopsis is correlated with DNA sequence compositions that occur frequently in gene promoter regions. Funct Integr Genomics 2005; 5:240-53. [PMID: 15744539 DOI: 10.1007/s10142-005-0138-1] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2004] [Revised: 12/30/2004] [Accepted: 01/01/2005] [Indexed: 10/25/2022]
Abstract
Mobile insertion elements such as transposons and T-DNA generate useful genetic variation and are important tools for functional genomics studies in plants and animals. The spectrum of mutations obtained in different systems can be highly influenced by target site preferences inherent in the mechanism of DNA integration. We investigated the target site preferences of Agrobacterium T-DNA insertions in the chromosomes of the model plant Arabidopsis thaliana. The relative frequencies of insertions in genic and intergenic regions of the genome were calculated and DNA composition features associated with the insertion site flanking sequences were identified. Insertion frequencies across the genome indicate that T-strand integration is suppressed near centromeres and rDNA loci, progressively increases towards telomeres, and is highly correlated with gene density. At the gene level, T-DNA integration events show a statistically significant preference for insertion in the 5' and 3' flanking regions of protein coding sequences as well as the promoter region of RNA polymerase I transcribed rRNA gene repeats. The increased insertion frequencies in 5' upstream regions compared to coding sequences are positively correlated with gene expression activity and DNA sequence composition. Analysis of the relationship between DNA sequence composition and gene activity further demonstrates that DNA sequences with high CG-skew ratios are consistently correlated with T-DNA insertion site preference and high gene expression. The results demonstrate genomic and gene-specific preferences for T-strand integration and suggest that DNA sequences with a pronounced transition in CG- and AT-skew ratios are preferred targets for T-DNA integration.
Collapse
|
27
|
Fujimori S, Washio T, Tomita M. GC-compositional strand bias around transcription start sites in plants and fungi. BMC Genomics 2005; 6:26. [PMID: 15733327 PMCID: PMC555766 DOI: 10.1186/1471-2164-6-26] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2004] [Accepted: 02/28/2005] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND A GC-compositional strand bias or GC-skew (=(C-G)/(C+G)), where C and G denote the numbers of cytosine and guanine residues, was recently reported near the transcription start sites (TSS) of Arabidopsis genes. However, it is unclear whether other eukaryotic species have equally prominent GC-skews, and the biological meaning of this trait remains unknown. RESULTS Our study confirmed a significant GC-skew (C > G) in the TSS of Oryza sativa (rice) genes. The full-length cDNAs and genomic sequences from Arabidopsis and rice were compared using statistical analyses. Despite marked differences in the G+C content around the TSS in the two plants, the degrees of bias were almost identical. Although slight GC-skew peaks, including opposite skews (C < G), were detected around the TSS of genes in human and Drosophila, they were qualitatively and quantitatively different from those identified in plants. However, plant-like GC-skew in regions upstream of the translation initiation sites (TIS) in some fungi was identified following analyses of the expressed sequence tags and/or genomic sequences from other species. On the basis of our dataset, we estimated that > 70 and 68% of Arabidopsis and rice genes, respectively, had a strong GC-skew (> 0.33) in a 100-bp window (that is, the number of C residues was more than double the number of G residues in a +/-100-bp window around the TSS). The mean GC-skew value in the TSS of highly-expressed genes in Arabidopsis was significantly greater than that of genes with low expression levels. Many of the GC-skew peaks were preferentially located near the TSS, so we examined the potential value of GC-skew as an index for TSS identification. Our results confirm that the GC-skew can be used to assist the TSS prediction in plant genomes. CONCLUSION The GC-skew (C > G) around the TSS is strictly conserved between monocot and eudicot plants (ie. angiosperms in general), and a similar skew has been observed in some fungi. Highly-expressed Arabidopsis genes had overall a more marked GC-skew in the TSS compared to genes with low expression levels. We therefore propose that the GC-skew around the TSS in some plants and fungi is related to transcription. It might be caused by mutations during transcription initiation or the frequent use of transcription factor-biding sites having a strand preference. In addition, GC-skew is a good candidate index for TSS prediction in plant genomes, where there is a lack of correlation among CpG islands and genes.
Collapse
Affiliation(s)
- Shigeo Fujimori
- Institute for Advanced Biosciences, Keio University, Tsuruoka, Yamagata 997-0035, Japan
- Graduate School of Media and Governance, Keio University, Fujisawa, Kanagawa 252-8520, Japan
| | - Takanori Washio
- Institute for Advanced Biosciences, Keio University, Tsuruoka, Yamagata 997-0035, Japan
- Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Nara 630-0192, Japan
| | - Masaru Tomita
- Institute for Advanced Biosciences, Keio University, Tsuruoka, Yamagata 997-0035, Japan
- Department of Environmental Information, Keio University, Fujisawa, Kanagawa 252-8520, Japan
| |
Collapse
|
28
|
Aerts S, Thijs G, Dabrowski M, Moreau Y, De Moor B. Comprehensive analysis of the base composition around the transcription start site in Metazoa. BMC Genomics 2004; 5:34. [PMID: 15171795 PMCID: PMC436054 DOI: 10.1186/1471-2164-5-34] [Citation(s) in RCA: 58] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2004] [Accepted: 06/01/2004] [Indexed: 11/29/2022] Open
Abstract
Background The transcription start site of a metazoan gene remains poorly understood, mostly because there is no clear signal present in all genes. Now that several sequenced metazoan genomes have been annotated, we have been able to compare the base composition around the transcription start site for all annotated genes across multiple genomes. Results The most prominent feature in the base compositions is a significant local variation in G+C content over a large region around the transcription start site. The change is present in all animal phyla but the extent of variation is different between distinct classes of vertebrates, and the shape of the variation is completely different between vertebrates and arthropods. Furthermore, the height of the variation correlates with CpG frequencies in vertebrates but not in invertebrates and it also correlates with gene expression, especially in mammals. We also detect GC and AT skews in all clades (where %G is not equal to %C or %A is not equal to %T respectively) but these occur in a more confined region around the transcription start site and in the coding region. Conclusions The dramatic changes in nucleotide composition in humans are a consequence of CpG nucleotide frequencies and of gene expression, the changes in Fugu could point to primordial CpG islands, and the changes in the fly are of a totally different kind and unrelated to dinucleotide frequencies.
Collapse
Affiliation(s)
- Stein Aerts
- Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Belgium
| | - Gert Thijs
- Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Belgium
| | - Michal Dabrowski
- Laboratory of Transcription Regulation, Nencki Institute, Warsaw, Poland
| | - Yves Moreau
- Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Belgium
- On leave at Center for Biological Sequence Analysis, BioCentrum, Technical University of Denmark, Lyngby, Denmark
| | - Bart De Moor
- Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Belgium
| |
Collapse
|