51
Nielsen MM, Tehler D, Vang S, Sudzina F, Hedegaard J, Nordentoft I, Ørntoft TF, Lund AH, Pedersen JS. Identification of expressed and conserved human noncoding RNAs. RNA 2014; 20:236-251. [PMID: 24344320 PMCID: PMC3895275 DOI: 10.1261/rna.038927.113] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 03/02/2013] [Accepted: 11/07/2013] [Indexed: 06/03/2023]
Abstract
The past decade has shown mammalian genomes to be pervasively transcribed and identified thousands of noncoding (nc) transcripts. It is currently unclear to what extent these transcripts are of functional importance, as experimental functional evidence exists for only a small fraction. Here, we characterize the expression and evolutionary conservation properties of 12,115 known and novel nc transcripts, including structural RNAs, long nc RNAs (lncRNAs), antisense RNAs, EvoFold predictions, ultraconserved elements, and expressed nc regions. Expression levels are evaluated across 12 human tissues using a custom-designed microarray, supplemented with RNAseq. Conservation levels are evaluated at both the base level and at the syntenic level. We combine these measures with epigenetic mark annotations to identify subsets of novel nc transcripts that show characteristics similar to known functional ncRNAs. Few novel nc transcripts show both high expression and conservation levels. However, overall, we observe a positive correlation between expression and both conservation and epigenetic annotations, suggesting that a subset of the expressed transcripts are under purifying selection and likely functional. The identified subsets of expressed and conserved novel nc transcripts may form the basis for further functional characterization.
Affiliation(s)
- Morten Muhlig Nielsen
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Skejby, DK-8200 Aarhus N, Denmark
- Disa Tehler
- Biotech Research and Innovation Centre, University of Copenhagen, DK-2200 Copenhagen, Denmark
- Søren Vang
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Skejby, DK-8200 Aarhus N, Denmark
- Frantisek Sudzina
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Skejby, DK-8200 Aarhus N, Denmark
- Jakob Hedegaard
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Skejby, DK-8200 Aarhus N, Denmark
- Iver Nordentoft
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Skejby, DK-8200 Aarhus N, Denmark
- Torben Falck Ørntoft
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Skejby, DK-8200 Aarhus N, Denmark
- Anders H. Lund
- Biotech Research and Innovation Centre, University of Copenhagen, DK-2200 Copenhagen, Denmark
- Jakob Skou Pedersen
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Skejby, DK-8200 Aarhus N, Denmark
52
Wong KC, Zhang Z. SNPdryad: predicting deleterious non-synonymous human SNPs using only orthologous protein sequences. Bioinformatics 2014; 30:1112-1119. [PMID: 24389653 DOI: 10.1093/bioinformatics/btt769] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Received: 08/06/2013] [Accepted: 12/13/2013] [Indexed: 11/12/2022]
Abstract
MOTIVATION: The recent advances in genome sequencing have revealed an abundance of non-synonymous polymorphisms among human individuals; subsequently, it is of immense interest and importance to predict whether such substitutions are functionally neutral or have deleterious effects. The accuracy of such prediction algorithms depends on the quality of the multiple-sequence alignment, which is used to infer how an amino acid substitution is tolerated at a given position. Because of the scarcity of orthologous protein sequences in the past, the existing prediction algorithms all include sequences of protein paralogs in the alignment, which can dilute the conservation signal and affect prediction accuracy. However, we believe that, with the sequencing of a large number of mammalian genomes, it is now feasible to include only protein orthologs in the alignment and improve the prediction performance. RESULTS: We have developed a novel prediction algorithm, named SNPdryad, which only includes protein orthologs in building a multiple sequence alignment. Among many other innovations, SNPdryad uses different conservation scoring schemes and uses Random Forest as a classifier. We have tested SNPdryad on several datasets. We found that SNPdryad consistently outperformed other methods in several performance metrics, which is attributed to the exclusion of paralogous sequences. We have run SNPdryad on the complete human proteome, generating prediction scores for all the possible amino acid substitutions. AVAILABILITY AND IMPLEMENTATION: The algorithm and the prediction results can be accessed from the Web site: http://snps.ccbr.utoronto.ca:8080/SNPdryad/ CONTACT: Zhaolei.Zhang@utoronto.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ka-Chun Wong
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3G4, The Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada M5S 3E1, Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada M5S 3E1 and Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada M5S 1A8
- Zhaolei Zhang
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3G4, The Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada M5S 3E1, Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada M5S 3E1 and Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada M5S 1A8
53
Affiliation(s)
- Robert J Weatheritt
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, UK
54
Stergachis AB, Haugen E, Shafer A, Fu W, Vernot B, Reynolds A, Raubitschek A, Ziegler S, LeProust EM, Akey JM, Stamatoyannopoulos JA. Exonic transcription factor binding directs codon choice and affects protein evolution. Science 2013; 342:1367-72. [PMID: 24337295 DOI: 10.1126/science.1243490] [Citation(s) in RCA: 208] [Impact Index Per Article: 17.3] [Indexed: 01/08/2023]
Abstract
Genomes contain both a genetic code specifying amino acids and a regulatory code specifying transcription factor (TF) recognition sequences. We used genomic deoxyribonuclease I footprinting to map nucleotide resolution TF occupancy across the human exome in 81 diverse cell types. We found that ~15% of human codons are dual-use codons ("duons") that simultaneously specify both amino acids and TF recognition sites. Duons are highly conserved and have shaped protein evolution, and TF-imposed constraint appears to be a major driver of codon usage bias. Conversely, the regulatory code has been selectively depleted of TFs that recognize stop codons. More than 17% of single-nucleotide variants within duons directly alter TF binding. Pervasive dual encoding of amino acid and regulatory information appears to be a fundamental feature of genome evolution.
Affiliation(s)
- Andrew B Stergachis
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
55
Smith MA, Gesell T, Stadler PF, Mattick JS. Widespread purifying selection on RNA structure in mammals. Nucleic Acids Res 2013; 41:8220-36. [PMID: 23847102 PMCID: PMC3783177 DOI: 10.1093/nar/gkt596] [Citation(s) in RCA: 130] [Impact Index Per Article: 10.8] [Received: 01/30/2013] [Revised: 05/29/2013] [Accepted: 06/16/2013] [Indexed: 12/14/2022]
Abstract
Evolutionarily conserved RNA secondary structures are a robust indicator of purifying selection and, consequently, molecular function. Evaluating their genome-wide occurrence through comparative genomics has consistently been plagued by high false-positive rates and divergent predictions. We present a novel benchmarking pipeline aimed at calibrating the precision of genome-wide scans for consensus RNA structure prediction. The benchmarking data obtained from two refined structure prediction algorithms, RNAz and SISSIz, were then analyzed to fine-tune the parameters of an optimized workflow for genomic sliding window screens. When applied to consistency-based multiple genome alignments of 35 mammals, our approach confidently identifies >4 million evolutionarily constrained RNA structures using a conservative sensitivity threshold that entails historically low false discovery rates for such analyses (5-22%). These predictions comprise 13.6% of the human genome, 88% of which fall outside any known sequence-constrained element, suggesting that a large proportion of the mammalian genome is functional. As an example, our findings identify both known and novel conserved RNA structure motifs in the long noncoding RNA MALAT1. This study provides an extensive set of functional transcriptomic annotations that will assist researchers in uncovering the precise mechanisms underlying the developmental ontologies of higher eukaryotes.
Affiliation(s)
- Martin A. Smith
- RNA Biology and Plasticity Laboratory, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst, Sydney, NSW 2010 Australia, Genomics and Computational Biology Division, Institute for Molecular Bioscience, 306 Carmody Rd, University of Queensland, Brisbane, 4067 Australia, Department of Structural and Computational Biology; and Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories (MFPL), University of Vienna, Medical University of Vienna, Dr. Bohr-Gasse 9, A-1030 Vienna, Austria, Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16–18, D-04107 Leipzig, Germany, Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany, Center for Non-coding RNA in Technology and Health, Department of Basic Veterinary and Animal Sciences, Faculty of Life Sciences University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C Denmark, Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, USA and St Vincent’s Clinical School, University of New South Wales, Level 5, de Lacy, Victoria St, St Vincent's Hospital, Sydney, NSW 2010 Australia
- Tanja Gesell
- RNA Biology and Plasticity Laboratory, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst, Sydney, NSW 2010 Australia, Genomics and Computational Biology Division, Institute for Molecular Bioscience, 306 Carmody Rd, University of Queensland, Brisbane, 4067 Australia, Department of Structural and Computational Biology; and Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories (MFPL), University of Vienna, Medical University of Vienna, Dr. Bohr-Gasse 9, A-1030 Vienna, Austria, Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16–18, D-04107 Leipzig, Germany, Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany, Center for Non-coding RNA in Technology and Health, Department of Basic Veterinary and Animal Sciences, Faculty of Life Sciences University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C Denmark, Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, USA and St Vincent’s Clinical School, University of New South Wales, Level 5, de Lacy, Victoria St, St Vincent's Hospital, Sydney, NSW 2010 Australia
- Peter F. Stadler
- RNA Biology and Plasticity Laboratory, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst, Sydney, NSW 2010 Australia, Genomics and Computational Biology Division, Institute for Molecular Bioscience, 306 Carmody Rd, University of Queensland, Brisbane, 4067 Australia, Department of Structural and Computational Biology; and Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories (MFPL), University of Vienna, Medical University of Vienna, Dr. Bohr-Gasse 9, A-1030 Vienna, Austria, Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16–18, D-04107 Leipzig, Germany, Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany, Center for Non-coding RNA in Technology and Health, Department of Basic Veterinary and Animal Sciences, Faculty of Life Sciences University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C Denmark, Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, USA and St Vincent’s Clinical School, University of New South Wales, Level 5, de Lacy, Victoria St, St Vincent's Hospital, Sydney, NSW 2010 Australia
- John S. Mattick
- RNA Biology and Plasticity Laboratory, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst, Sydney, NSW 2010 Australia, Genomics and Computational Biology Division, Institute for Molecular Bioscience, 306 Carmody Rd, University of Queensland, Brisbane, 4067 Australia, Department of Structural and Computational Biology; and Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories (MFPL), University of Vienna, Medical University of Vienna, Dr. Bohr-Gasse 9, A-1030 Vienna, Austria, Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16–18, D-04107 Leipzig, Germany, Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany, Center for Non-coding RNA in Technology and Health, Department of Basic Veterinary and Animal Sciences, Faculty of Life Sciences University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C Denmark, Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, USA and St Vincent’s Clinical School, University of New South Wales, Level 5, de Lacy, Victoria St, St Vincent's Hospital, Sydney, NSW 2010 Australia
56
Viral proteins originated de novo by overprinting can be identified by codon usage: application to the "gene nursery" of Deltaretroviruses. PLoS Comput Biol 2013; 9:e1003162. [PMID: 23966842 PMCID: PMC3744397 DOI: 10.1371/journal.pcbi.1003162] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Received: 12/26/2012] [Accepted: 06/13/2013] [Indexed: 12/24/2022]
Abstract
A well-known mechanism through which new protein-coding genes originate is by modification of pre-existing genes, e.g. by duplication or horizontal transfer. In contrast, many viruses generate protein-coding genes de novo, via the overprinting of a new reading frame onto an existing (“ancestral”) frame. This mechanism is thought to play an important role in viral pathogenicity, but has been poorly explored, perhaps because identifying the de novo frames is very challenging. Therefore, a new approach to detect them was needed. We assembled a reference set of overlapping genes for which we could reliably determine the ancestral frames, and found that their codon usage was significantly closer to that of the rest of the viral genome than the codon usage of de novo frames. Based on this observation, we designed a method that allowed the identification of de novo frames based on their codon usage with a very good specificity, but intermediate sensitivity. Using our method, we predicted that the Rex gene of deltaretroviruses has originated de novo by overprinting the Tax gene. Intriguingly, several genes in the same genomic region have also originated de novo and encode proteins that regulate the functions of Tax. Such “gene nurseries” may be common in viral genomes. Finally, our results confirm that the genomic GC content is not the only determinant of codon usage in viruses and suggest that a constraint linked to translation must influence codon usage.
How does novelty originate in nature? It is commonly thought that new genes are generated mainly by modifications of existing genes (the “tinkering” model). In contrast, we have shown recently that in viruses, numerous genes are generated entirely de novo (“from scratch”). The role of these genes remains underexplored, however, because they are difficult to identify. We have therefore developed a new method to detect genes originated de novo in viral genomes, based on the observation that each viral genome has a unique “signature”, which genes originated de novo do not share. We applied this method to analyze the genes of Human T-Lymphotropic Virus 1 (HTLV1), a relative of the HIV virus and also a major human pathogen that infects about twenty million people worldwide. The life cycle of HTLV1 is finely regulated – it can stay dormant for long periods and can provoke blood cancers (leukemias) after a very long incubation. We discovered that several of the genes of HTLV1 have originated de novo. These novel genes play a key role in regulating the life cycle of HTLV1, and presumably its pathogenicity. Our investigations suggest that such “gene nurseries” may be common in viruses.
57
Identification of an overprinting gene in Merkel cell polyomavirus provides evolutionary insight into the birth of viral genes. Proc Natl Acad Sci U S A 2013; 110:12744-9. [PMID: 23847207 DOI: 10.1073/pnas.1303526110] [Citation(s) in RCA: 129] [Impact Index Per Article: 10.8] [Indexed: 12/31/2022]
Abstract
Many viruses use overprinting (alternate reading frame utilization) as a means to increase protein diversity in genomes severely constrained by size. However, the evolutionary steps that facilitate the de novo generation of a novel protein within an ancestral ORF have remained poorly characterized. Here, we describe the identification of an overprinting gene, expressed from an Alternate frame of the Large T Open reading frame (ALTO) in the early region of Merkel cell polyomavirus (MCPyV), the causative agent of most Merkel cell carcinomas. ALTO is expressed during, but not required for, replication of the MCPyV genome. Phylogenetic analysis reveals that ALTO is evolutionarily related to the middle T antigen of murine polyomavirus despite almost no sequence similarity. ALTO/MT arose de novo by overprinting of the second exon of T antigen in the common ancestor of a large clade of mammalian polyomaviruses. Taking advantage of the low evolutionary divergence and diverse sampling of polyomaviruses, we propose evolutionary transitions that likely gave birth to this protein. We suggest that two highly constrained regions of the large T antigen ORF provided a start codon and C-terminal hydrophobic motif necessary for cellular localization of ALTO. These two key features, together with stochastic erasure of intervening stop codons, resulted in a unique protein-coding capacity that has been preserved ever since its birth. Our study not only reveals a previously undefined protein encoded by several polyomaviruses including MCPyV, but also provides insight into de novo protein evolution.
58
Chursov A, Frishman D, Shneider A. Conservation of mRNA secondary structures may filter out mutations in Escherichia coli evolution. Nucleic Acids Res 2013; 41:7854-60. [PMID: 23783573 PMCID: PMC3763529 DOI: 10.1093/nar/gkt507] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 01/22/2023]
Abstract
Recent reports indicate that mutations in viral genomes tend to preserve RNA secondary structure, and those mutations that disrupt secondary structural elements may reduce gene expression levels, thereby serving as a functional knockout. In this article, we explore the conservation of secondary structures of mRNA coding regions, a previously unknown factor in bacterial evolution, by comparing the structural consequences of mutations in essential and nonessential Escherichia coli genes accumulated over 40 000 generations in the course of the ‘long-term evolution experiment’. We monitored the extent to which mutations influence minimum free energy (MFE) values, assuming that a substantial change in MFE is indicative of structural perturbation. Our principal finding is that purifying selection tends to eliminate those mutations in essential genes that lead to greater changes of MFE values and, therefore, may be more disruptive for the corresponding mRNA secondary structures. This effect implies that synonymous mutations disrupting mRNA secondary structures may directly affect the fitness of the organism. These results demonstrate that the need to maintain intact mRNA structures imposes additional evolutionary constraints on bacterial genomes, which go beyond preservation of structure and function of the encoded proteins.
Affiliation(s)
- Andrey Chursov
- Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftzentrum Weihenstephan, Maximus-von-Imhof-Forum 3, D-85354, Freising, Germany, Helmholtz Center Munich-German Research Center for Environmental Health (GmbH), Institute of Bioinformatics and Systems Biology, Ingolstädter Landstraße 1, D-85764 Neuherberg, Germany and Cure Lab, Inc., 43 Rybury Hillway, Needham, MA 02492, USA
59
Zhang Y, Ponty Y, Blanchette M, Lécuyer E, Waldispühl J. SPARCS: a web server to analyze (un)structured regions in coding RNA sequences. Nucleic Acids Res 2013; 41:W480-5. [PMID: 23748952 PMCID: PMC3692110 DOI: 10.1093/nar/gkt461] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 12/21/2022]
Abstract
More than a simple carrier of the genetic information, messenger RNA (mRNA) coding regions can also harbor functional elements that evolved to control different post-transcriptional processes, such as mRNA splicing, localization and translation. Functional elements in RNA molecules are often encoded by secondary structure elements. In this article, we introduce Structural Profile Assignment of RNA Coding Sequences (SPARCS), an efficient method to analyze the (secondary) structure profile of protein-coding regions in mRNAs. First, we develop a novel algorithm that enables us to sample uniformly the sequence landscape preserving the dinucleotide frequency and the encoded amino acid sequence of the input mRNA. Then, we use this algorithm to generate a set of artificial sequences that is used to estimate the Z-score of classical structural metrics such as the sum of base pairing probabilities and the base pairing entropy. Finally, we use these metrics to predict structured and unstructured regions in the input mRNA sequence. We applied our methods to study the structural profile of the ASH1 genes and recovered key structural elements. A web server implementing this discovery pipeline is available at http://csb.cs.mcgill.ca/sparcs together with the source code of the sampling algorithm.
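The Z-score step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a structural metric (for example, a sum of base pairing probabilities) has already been computed for the native sequence and for each sampled artificial sequence; the function name `z_score` and the numeric values are hypothetical and are not taken from SPARCS itself.

```python
from statistics import mean, stdev


def z_score(observed: float, background: list[float]) -> float:
    """Standardize an observed metric against a background sample.

    `observed` is the metric for the native mRNA; `background` holds the
    same metric computed on the sampled artificial sequences (assumed input).
    """
    mu = mean(background)
    sigma = stdev(background)  # sample standard deviation
    if sigma == 0:
        raise ValueError("background has zero variance; Z-score is undefined")
    return (observed - mu) / sigma


# Hypothetical values, for illustration only.
native_metric = 16.0
sampled_metrics = [10.0, 12.0, 14.0]
print(z_score(native_metric, sampled_metrics))
```

A strongly positive or negative Z-score would suggest that the native sequence is more or less structured than sequences encoding the same protein with the same dinucleotide frequencies, which is the comparison the abstract describes.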
Affiliation(s)
- Yang Zhang
- School of Computer Science & McGill Centre for Bioinformatics, McGill University, Montréal, H3A 0C6 QC, Canada
60
Gu W, Wang X, Zhai C, Zhou T, Xie X. Biological basis of miRNA action when their targets are located in human protein coding region. PLoS One 2013; 8:e63403. [PMID: 23671676 PMCID: PMC3646042 DOI: 10.1371/journal.pone.0063403] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 02/05/2013] [Accepted: 03/30/2013] [Indexed: 01/09/2023]
Abstract
Recent analyses have revealed many functional microRNA (miRNA) targets in mammalian protein coding regions. However, the mechanisms that ensure miRNA function when their target sites are located in protein coding regions of mammalian mRNA transcripts are largely unknown. In this paper, we investigate some potential biological factors, such as target site accessibility and local translation efficiency. We computationally analyze these two factors using experimentally identified miRNA targets in human protein coding regions. We find that site accessibility is significantly increased in miRNA target regions to facilitate miRNA binding. At the same time, local translation efficiency is selectively decreased near miRNA target regions. GC-poor codons are preferred in the flanking regions of miRNA target sites to ease access to miRNA targets. Within-genome analysis shows substantial variation in site accessibility and local translation efficiency among different miRNA targets in the genome. Further analyses suggest that a target gene's GC content and conservation level could explain some of the differences in site accessibility. On the other hand, a target gene's functional importance and conservation level can affect local translation efficiency near the miRNA target region. We hence propose that both site accessibility and local translation efficiency are important in miRNA action when miRNA target sites are located in mammalian protein coding regions.
Affiliation(s)
- Wanjun Gu
- Research Center of Learning Sciences, Southeast University, Nanjing, Jiangsu, China
- * E-mail: (WG); (TZ); (XX)
- Xiaofei Wang
- Research Center of Learning Sciences, Southeast University, Nanjing, Jiangsu, China
- Chuanying Zhai
- Research Center of Learning Sciences, Southeast University, Nanjing, Jiangsu, China
- Tong Zhou
- Institute for Personalized Respiratory Medicine, The University of Illinois at Chicago, Chicago, Illinois, United States of America
- Section of Pulmonary, Critical Care, Sleep & Allergy, Department of Medicine, The University of Illinois at Chicago, Chicago, Illinois, United States of America
- * E-mail: (WG); (TZ); (XX)
- Xueying Xie
- Research Center of Learning Sciences, Southeast University, Nanjing, Jiangsu, China
- * E-mail: (WG); (TZ); (XX)
61
Woo YH, Li WH. DNA replication timing and selection shape the landscape of nucleotide variation in cancer genomes. Nat Commun 2012; 3:1004. [PMID: 22893128 DOI: 10.1038/ncomms1982] [Citation(s) in RCA: 96] [Impact Index Per Article: 8.0] [Received: 01/09/2012] [Accepted: 06/29/2012] [Indexed: 12/27/2022]
Abstract
Cancer cells evolve from normal cells by somatic mutations and natural selection. Comparing the evolution of cancer cells and that of organisms can elucidate the genetic basis of cancer. Here we analyse somatic mutations in >400 cancer genomes. We find that the frequency of somatic single-nucleotide variations increases with replication time during the S phase much more drastically than germ-line single-nucleotide variations and somatic large-scale structural alterations, including amplifications and deletions. The ratio of nonsynonymous to synonymous single-nucleotide variations is higher for cancer cells than for germ-line cells, suggesting weaker purifying selection against somatic mutations. Among genes with recurrent mutations only cancer driver genes show evidence of strong positive selection, and late-replicating regions are depleted of cancer driver genes, although enriched for recurrently mutated genes. These observations show that replication timing has a prominent role in shaping the single-nucleotide variation landscape of cancer cells.
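The nonsynonymous-to-synonymous comparison in this abstract can be sketched as follows. This is a simplified illustration that assumes each single-nucleotide variation is already expressed as a (reference codon, altered codon) pair and uses the standard genetic code; it reports a raw count ratio and does not normalize by the number of available synonymous and nonsynonymous sites, so it should not be read as the authors' procedure. The function names and example codons are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(ref_codon: str, alt_codon: str) -> bool:
    """Return True if both codons encode the same amino acid (or both are stops)."""
    return CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]


def nonsyn_to_syn_ratio(substitutions: list[tuple[str, str]]) -> float:
    """Raw count ratio of nonsynonymous to synonymous codon substitutions."""
    nonsyn = sum(1 for ref, alt in substitutions if not is_synonymous(ref, alt))
    syn = len(substitutions) - nonsyn
    if syn == 0:
        raise ValueError("no synonymous substitutions; ratio is undefined")
    return nonsyn / syn


# Hypothetical substitutions, for illustration only.
example = [
    ("GAA", "GAG"),  # Glu -> Glu, synonymous
    ("GAA", "GTA"),  # Glu -> Val, nonsynonymous
    ("CTG", "CTA"),  # Leu -> Leu, synonymous
    ("TGG", "TGA"),  # Trp -> stop, nonsynonymous
    ("AAA", "AGA"),  # Lys -> Arg, nonsynonymous
]
print(nonsyn_to_syn_ratio(example))
```

Comparing this ratio between somatic and germ-line variant sets is the kind of contrast the abstract uses to argue for weaker purifying selection against somatic mutations.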
Affiliation(s)
- Yong H Woo
- Department of Ecology and Evolution, The University of Chicago, 1101 East 57th Street, Chicago, IL 60637, USA
62
Shabalina SA, Spiridonov NA, Kashina A. Sounds of silence: synonymous nucleotides as a key to biological regulation and complexity. Nucleic Acids Res 2013; 41:2073-94. [PMID: 23293005 PMCID: PMC3575835 DOI: 10.1093/nar/gks1205] [Citation(s) in RCA: 187] [Impact Index Per Article: 15.6] [Indexed: 01/10/2023]
Abstract
Messenger RNA is a key component of an intricate regulatory network of its own. It accommodates numerous nucleotide signals that overlap protein coding sequences and are responsible for multiple levels of regulation and generation of biological complexity. A wealth of structural and regulatory information, which mRNA carries in addition to the encoded amino acid sequence, raises the question of how these signals and overlapping codes are delineated along non-synonymous and synonymous positions in protein coding regions, especially in eukaryotes. Silent or synonymous codon positions, which do not determine amino acid sequences of the encoded proteins, define mRNA secondary structure and stability and affect the rate of translation, folding and post-translational modifications of nascent polypeptides. The RNA level selection is acting on synonymous sites in both prokaryotes and eukaryotes and is more common than previously thought. Selection pressure on the coding gene regions follows three-nucleotide periodic pattern of nucleotide base-pairing in mRNA, which is imposed by the genetic code. Synonymous positions of the coding regions have a higher level of hybridization potential relative to non-synonymous positions, and are multifunctional in their regulatory and structural roles. Recent experimental evidence and analysis of mRNA structure and interspecies conservation suggest that there is an evolutionary tradeoff between selective pressure acting at the RNA and protein levels. Here we provide a comprehensive overview of the studies that define the role of silent positions in regulating RNA structure and processing that exert downstream effects on proteins and their functions.
Affiliation(s)
- Svetlana A Shabalina
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20984, USA.
|
63
|
Hiller M, Schaar BT, Bejerano G. Hundreds of conserved non-coding genomic regions are independently lost in mammals. Nucleic Acids Res 2012; 40:11463-76. [PMID: 23042682 PMCID: PMC3526296 DOI: 10.1093/nar/gks905] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Indexed: 01/10/2023]
Abstract
Conserved non-protein-coding DNA elements (CNEs) often encode cis-regulatory elements and are rarely lost during evolution. However, CNE losses that do occur can be associated with phenotypic changes, exemplified by pelvic spine loss in sticklebacks. Using a computational strategy to detect complete loss of CNEs in mammalian genomes while strictly controlling for artifacts, we find >600 CNEs that are independently lost in at least two mammalian lineages, including a spinal cord enhancer near GDF11. We observed several genomic regions where multiple independent CNE loss events happened; the most extreme is the DIAPH2 locus. We show that CNE losses often involve deletions and that CNE loss frequencies are non-uniform. Similar to less pleiotropic enhancers, we find that independently lost CNEs are shorter, slightly less constrained and evolutionarily younger than CNEs without detected losses. This suggests that independently lost CNEs are less pleiotropic and that pleiotropic constraints contribute to non-uniform CNE loss frequencies. We also detected 35 CNEs that are independently lost in the human lineage and in other mammals. Our study uncovers an interesting aspect of the evolution of functional DNA in mammalian genomes. Experiments are necessary to test if these independently lost CNEs are associated with parallel phenotype changes in mammals.
Affiliation(s)
- Michael Hiller
- Department of Developmental Biology, Stanford University, Stanford, California 94305, USA.
|
64
|
Gu W, Zhai C, Wang X, Xie X, Parinandi G, Zhou T. Translation Efficiency in Upstream Region of microRNA Targets in Arabidopsis thaliana. Evol Bioinform Online 2012; 8:565-74. [PMID: 23071387 PMCID: PMC3469488 DOI: 10.4137/ebo.s10362] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/05/2023]
Abstract
With respect to upstream regions of microRNA (miRNA) target sites located in protein coding sequences, experimental studies have suggested rare codons, rather than frequent codons, are important for miRNA function, because they slow down the local translational process. But, whether there is a trend of reduced translation efficiency near miRNA targets is still unknown. Using Arabidopsis thaliana, we perform genome-wide analysis of synonymous codon usage in upstream regions of miRNA target sites. At the whole genome level, we find no significant selection signals for decreased translational efficiency. However, the same genome analyses do show substantial variations of translation efficiency reduction among miRNA targets. We find that miRNA conservation level, gene codon usage bias, and the mechanism of miRNA action can account for the differences in translation efficiency. But gene's GC content, gene expression level, and miRNA target's conservation level have no effect on local translation efficiency of miRNA targets. Although local translation efficiency in the upstream region of miRNA targets is related to miRNA function in A. thaliana, the selection signal of rare codon usage in that region is weak. We propose some other biological factors are more important than local translation efficiency in miRNA action when miRNA targets are located in protein coding sequences.
Affiliation(s)
- Wanjun Gu
- Key Laboratory of Child Development and Learning Science of Ministry of Education of China, Southeast University, Nanjing, Jiangsu 210096, China
|
65
|
Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, Pappalardo Z, Clarke SL, Wenger AM, Nguyen L, Gurrieri F, Everman DB, Schwartz CE, Birk OS, Bejerano G, Lomvardas S, Ahituv N. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res 2012; 22:1059-68. [PMID: 22442009 PMCID: PMC3371700 DOI: 10.1101/gr.133546.111] [Citation(s) in RCA: 166] [Impact Index Per Article: 12.8] [Received: 10/19/2011] [Accepted: 03/19/2012] [Indexed: 01/17/2023]
Abstract
Enhancers are essential gene regulatory elements whose alteration can lead to morphological differences between species, developmental abnormalities, and human disease. Current strategies to identify enhancers focus primarily on noncoding sequences and tend to exclude protein coding sequences. Here, we analyzed 25 available ChIP-seq data sets that identify enhancers in an unbiased manner (H3K4me1, H3K27ac, and EP300) for peaks that overlap exons. We find that, on average, 7% of all ChIP-seq peaks overlap coding exons (after excluding for peaks that overlap with first exons). By using mouse and zebrafish enhancer assays, we demonstrate that several of these exonic enhancer (eExons) candidates can function as enhancers of their neighboring genes and that the exonic sequence is necessary for enhancer activity. Using ChIP, 3C, and DNA FISH, we further show that one of these exonic limb enhancers, Dync1i1 exon 15, has active enhancer marks and physically interacts with Dlx5/6 promoter regions 900 kb away. In addition, its removal by chromosomal abnormalities in humans could cause split hand and foot malformation 1 (SHFM1), a disorder associated with DLX5/6. These results demonstrate that DNA sequences can have a dual function, operating as coding exons in one tissue and enhancers of nearby gene(s) in another tissue, suggesting that phenotypes resulting from coding mutations could be caused not only by protein alteration but also by disrupting the regulation of another gene.
Affiliation(s)
- Ramon Y. Birnbaum
- Department of Bioengineering and Therapeutic Sciences
- Institute for Human Genetics
- E. Josephine Clowney
- Department of Anatomy
- Program in Biomedical Sciences, University of California, San Francisco, California 94143, USA
- Orly Agamy
- The Morris Kahn Laboratory of Human Genetics, NIBN, Ben-Gurion University, Beer-Sheva 84105, Israel
- Mee J. Kim
- Department of Bioengineering and Therapeutic Sciences
- Institute for Human Genetics
- Jingjing Zhao
- Department of Bioengineering and Therapeutic Sciences
- Institute for Human Genetics
- Key Laboratory of Advanced Control and Optimization for Chemical Processes of the Ministry of Education, East China University of Science and Technology, Shanghai 200237, China
- Takayuki Yamanaka
- Department of Bioengineering and Therapeutic Sciences
- Institute for Human Genetics
- Zachary Pappalardo
- Department of Bioengineering and Therapeutic Sciences
- Institute for Human Genetics
- Aaron M. Wenger
- Department of Computer Science, Stanford University, Stanford, California 94305-5329, USA
- Loan Nguyen
- Department of Bioengineering and Therapeutic Sciences
- Institute for Human Genetics
- Fiorella Gurrieri
- Istituto di Genetica Medica, Università Cattolica S. Cuore, Rome 00168, Italy
- David B. Everman
- JC Self Research Institute, Greenwood Genetic Center, Greenwood, South Carolina 29646, USA
- Charles E. Schwartz
- JC Self Research Institute, Greenwood Genetic Center, Greenwood, South Carolina 29646, USA
- Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina 29634, USA
- Ohad S. Birk
- The Morris Kahn Laboratory of Human Genetics, NIBN, Ben-Gurion University, Beer-Sheva 84105, Israel
- Gill Bejerano
- Department of Computer Science, Stanford University, Stanford, California 94305-5329, USA
- Department of Developmental Biology, Stanford University, Stanford, California 94305-5329, USA
- Nadav Ahituv
- Department of Bioengineering and Therapeutic Sciences
- Institute for Human Genetics
|
66
|
Michel AM, Choudhury KR, Firth AE, Ingolia NT, Atkins JF, Baranov PV. Observation of dually decoded regions of the human genome using ribosome profiling data. Genome Res 2012; 22:2219-29. [PMID: 22593554 PMCID: PMC3483551 DOI: 10.1101/gr.133249.111] [Citation(s) in RCA: 151] [Impact Index Per Article: 11.6] [Indexed: 12/17/2022]
Abstract
The recently developed ribosome profiling technique (Ribo-Seq) allows mapping of the locations of translating ribosomes on mRNAs with subcodon precision. When ribosome protected fragments (RPFs) are aligned to mRNA, a characteristic triplet periodicity pattern is revealed. We utilized the triplet periodicity of RPFs to develop a computational method for detecting transitions between reading frames that occur during programmed ribosomal frameshifting or in dual coding regions where the same nucleotide sequence codes for multiple proteins in different reading frames. Application of this method to ribosome profiling data obtained for human cells allowed us to detect several human genes where the same genomic segment is translated in more than one reading frame (from different transcripts as well as from the same mRNA) and revealed the translation of hitherto unpredicted coding open reading frames.
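The triplet periodicity described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the function names, the fixed-size windows, and the "dominant frame per window" rule are all assumptions made for illustration. The sketch assigns each read's 5' end to one of three frames relative to an annotated CDS start and reports which frame dominates in successive windows; a change in the dominant frame between windows is the kind of signal a frame-transition detector would look for.

```python
from collections import Counter


def frame_fractions(read_starts, cds_start):
    """Fraction of read 5' ends falling in frame 0, 1 and 2 relative to cds_start."""
    counts = Counter((pos - cds_start) % 3 for pos in read_starts)
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts.get(frame, 0) / total for frame in range(3)]


def dominant_frames(read_starts, cds_start, cds_end, window=30):
    """Dominant frame in each consecutive window (hypothetical helper).

    Returns one entry per window: 0, 1 or 2, or None if the window has no reads.
    """
    result = []
    for w_start in range(cds_start, cds_end, window):
        w_end = min(w_start + window, cds_end)
        in_window = [p for p in read_starts if w_start <= p < w_end]
        if not in_window:
            result.append(None)
            continue
        fractions = frame_fractions(in_window, cds_start)
        result.append(max(range(3), key=lambda frame: fractions[frame]))
    return result


# Toy data (made up): reads in frame 0 early, then in frame 1 later on.
toy_reads = [0, 3, 6, 9, 12, 31, 34, 37, 40]
print(dominant_frames(toy_reads, cds_start=0, cds_end=60, window=30))
```

A real analysis would also need per-read-length offsets to the ribosomal P-site and a statistical test for the frame change, neither of which is attempted here.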
|
67
|
CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences. BMC Bioinformatics 2012; 13:32. [PMID: 22333114 PMCID: PMC3298695 DOI: 10.1186/1471-2105-13-32] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 09/07/2011] [Accepted: 02/14/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding for protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to solve this problem based on codon-shuffling have been limited to exploring only an infinitesimal fraction of the sequence space and by their use of parametric approximations. RESULTS: We present a novel O(N(log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various size, including ChIP-seq data for the transcription factors NRSF and GABP. CONCLUSIONS: CodingMotif provides a theoretically and empirically-demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar.
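To make concrete the quantity this abstract refers to (the distribution of motif counts over all coding sequences that encode the same peptide), here is a minimal Python sketch. It is a brute-force enumeration, not the CodingMotif dynamic programming algorithm, and it is only feasible for very short peptides. It assumes uniform codon usage with no dinucleotide bias, whereas the abstract describes a background model for both. The codon table is a small subset of the standard genetic code, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Subset of the standard genetic code, enough for the toy peptides below.
SYNONYMOUS = {
    "M": ["ATG"],
    "W": ["TGG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
    "D": ["GAT", "GAC"],
    "E": ["GAA", "GAG"],
}


def count_overlapping(seq, motif):
    """Number of (possibly overlapping) occurrences of motif in seq."""
    return sum(1 for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i))


def motif_count_distribution(peptide, motif):
    """Exact distribution of motif counts over all synonymous encodings.

    Assumption: every synonymous codon is equally likely (uniform background).
    """
    counts = Counter()
    n_sequences = 0
    for codons in product(*(SYNONYMOUS[aa] for aa in peptide)):
        counts[count_overlapping("".join(codons), motif)] += 1
        n_sequences += 1
    return {k: v / n_sequences for k, v in sorted(counts.items())}


def upper_tail_p(distribution, observed):
    """P(count >= observed) under the enumerated distribution."""
    return sum(p for k, p in distribution.items() if k >= observed)


dist = motif_count_distribution("KK", "AAA")
print(dist)
print(upper_tail_p(dist, 3))
```

The number of synonymous encodings grows exponentially with peptide length, which is why enumeration like this does not scale and why the abstract emphasizes a dynamic programming formulation.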
|
68
|
|
69
|
Ahn SJ, Vogel H, Heckel DG. Comparative analysis of the UDP-glycosyltransferase multigene family in insects. Insect Biochemistry and Molecular Biology 2012; 42:133-147. [PMID: 22155036 DOI: 10.1016/j.ibmb.2011.11.006] [Citation(s) in RCA: 169] [Impact Index Per Article: 13.0] [Received: 06/30/2011] [Revised: 11/26/2011] [Accepted: 11/28/2011] [Indexed: 05/31/2023]
Abstract
UDP-glycosyltransferases (UGT) catalyze the conjugation of a range of diverse small lipophilic compounds with sugars to produce glycosides, playing an important role in the detoxification of xenobiotics and in the regulation of endobiotics in insects. Recent progress in genome sequencing has enabled an assessment of the extent of the UGT multigene family in insects. Here we report over 310 putative UGT genes identified from genomic databases of eight different insect species together with a transcript database from the lepidopteran Helicoverpa armigera. Phylogenetic analysis of the insect UGTs showed Order-specific gene diversification and inter-species conservation of this multigene family. Only one family (UGT50) is found in all insect species surveyed (except the pea aphid) and may be homologous to mammalian UGT8. Three families (UGT31, UGT32, and UGT305) related to Lepidopteran UGTs are unique to baculoviruses. A lepidopteran sub-tree constructed with 40 H. armigera UGTs and 44 Bombyx mori UGTs revealed that lineage-specific expansions of some families in both species appear to be driven by diversification in the N-terminal substrate binding domain, increasing the range of compounds that could be detoxified or regulated by glycosylation. By comparison of the deduced protein sequences, several important domains were predicted, including the N-terminal signal peptide, UGT signature motif, and C-terminal transmembrane domain. Furthermore, several conserved residues putatively involved in sugar donor binding and catalytic mechanism were also identified by comparison with human UGTs. Many UGTs were expressed in fat body, midgut, and Malpighian tubules, consistent with functions in detoxification, and some were expressed in antennae, suggesting a role in pheromone deactivation. Transcript variants derived from alternative splicing, exon skipping, or intron retention produced additional UGT diversity. 
These findings from this comparative study of two lepidopteran UGTs as well as other insects reveal a diversity comparable to this gene family in vertebrates, plants and fungi and show the magnitude of the task ahead, to determine biochemical function and physiological relevance of each UGT enzyme.
Affiliation(s)
- Seung-Joon Ahn
- Department of Entomology, Max Planck Institute for Chemical Ecology, Jena 07745, Germany
|
70
|
Abstract
New studies show that novel long-range enhancers of developmental genes can emerge by exaptation of protein-coding sequences with no previous regulatory function.
Affiliation(s)
- David Fredman
- Institute of Clinical Sciences, Imperial College London and MRC Clinical Sciences Centre, Hammersmith Hospital Campus, Du Cane Road, London, UK
|
71
|
|
72
|
A high-resolution map of human evolutionary constraint using 29 mammals. Nature 2011; 478:476-82. [PMID: 21993624 PMCID: PMC3207357 DOI: 10.1038/nature10530] [Citation(s) in RCA: 811] [Impact Index Per Article: 57.9] [Received: 01/25/2011] [Accepted: 09/05/2011] [Indexed: 01/05/2023]
Abstract
Comparison of related genomes has emerged as a powerful lens for genome interpretation. Here, we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and report constrained elements covering ~4.2% of the genome. We use evolutionary signatures and comparison with experimental datasets to suggest candidate functions for ~60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events, and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements, and ~1,000 primate- and human-accelerated elements. Overlap with disease-associated variants suggests our findings will be relevant for studies of human biology and health.
|